Music emotion recognition method and device based on graph convolution network, equipment and medium
This graph convolutional network-based method obtains label vectors from an emotion tag graph and extracts audio features with a two-dimensional convolutional neural network, addressing the low accuracy of music emotion recognition in existing technologies.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- PING AN TECH (SHENZHEN) CO LTD
- Filing Date
- 2023-05-19
- Publication Date
- 2026-06-19
AI Technical Summary
Current technologies for music emotion recognition have low accuracy and rely on time-consuming and labor-intensive manual annotation. Existing algorithms, such as support vector machine classifiers, have insufficient accuracy.
A graph convolutional network-based approach is adopted. An initial vector is obtained by pre-setting sentiment label encoding, a sentiment label graph is constructed and graph convolution processing is performed, audio features are extracted by combining a two-dimensional convolutional neural network, and a sentiment classifier is used for recognition.
It improves the accuracy and precision of music emotion recognition, reduces reliance on manual annotation, and lowers recognition costs.
Abstract
Description
Technical Field
[0001] This invention relates to the field of emotion analysis technology, and in particular to a method, apparatus, device, and medium for music emotion recognition based on graph convolutional networks.
Background Technology
[0002] With the continuous development of technology, the music market has developed rapidly, and people can access a vast amount of music resources through various channels. Music has always been a way to express and convey emotions; therefore, accurately identifying the emotions in music is extremely important.
[0003] In existing technologies, music emotion classification and labeling are often done through manual listening and identification. However, such classification depends heavily on subjective human feelings and is affected by many external factors. Furthermore, manual labeling is time-consuming and labor-intensive, resulting in high costs for music emotion classification and labeling. Another approach is to use a support vector machine classifier to identify and classify music emotions. However, a classic support vector machine is only a binary classifier, which results in low accuracy in music emotion recognition.
Summary of the Invention
[0004] This invention provides a method, apparatus, device, and medium for music emotion recognition based on graph convolutional networks, in order to solve the problem of low accuracy in music emotion recognition in the prior art.
[0005] A music emotion recognition method based on graph convolutional networks includes:
[0006] Obtain the initial vector corresponding to each preset sentiment label. The initial vector is obtained by performing graph convolution processing on the sentiment label graph through a preset graph convolution network. The sentiment label graph is constructed and generated based on the initial word vectors obtained after encoding the preset sentiment labels.
[0007] At least one piece of music to be identified is obtained, and features are extracted from all the pieces of music to be identified to obtain audio features corresponding to each piece of music to be identified.
[0008] An emotion classifier is used to perform emotion recognition on all the audio features and all the initial vectors to obtain at least one target emotion label corresponding to the music to be identified.
[0009] A music emotion recognition device based on graph convolutional networks includes:
[0010] The vector acquisition module is used to acquire the initial vector corresponding to each preset sentiment label. The initial vector is obtained by performing graph convolution processing on the sentiment label graph through a preset graph convolution network. The sentiment label graph is constructed and generated based on the initial word vectors obtained after encoding the preset sentiment labels.
[0011] The feature extraction module is used to acquire at least one piece of music to be identified, extract features from all the pieces of music to be identified, and obtain audio features corresponding to each piece of music to be identified.
[0012] The emotion recognition module is used to perform emotion recognition on all the audio features and all the initial vectors through an emotion classifier to obtain at least one target emotion label corresponding to the music to be recognized.
[0013] A computer device includes a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the computer program, implements the above-described music emotion recognition method based on graph convolutional networks.
[0014] A computer-readable storage medium storing a computer program that, when executed by a processor, implements the above-described music emotion recognition method based on graph convolutional networks.
[0015] This invention provides a method, apparatus, device, and medium for music emotion recognition based on graph convolutional networks. The method obtains initial word vectors by encoding preset emotion tags. By using the initial word vectors as nodes and the probability of two preset emotion tags appearing simultaneously as edge weights, an emotion tag graph is obtained. Furthermore, a preset graph convolutional network is used to perform graph convolution on the emotion tag graph, thereby obtaining the initial vector and improving the accuracy of subsequent music emotion recognition. A two-dimensional convolutional neural network is used to extract features from the Mel spectrum of the music to be recognized, thus obtaining audio features. An emotion classifier is used to perform emotion recognition on the audio features and the initial vector, thereby determining the target emotion tag and improving the precision and accuracy of music emotion recognition.
Attached Figure Description
[0016] To more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings used in the description of the embodiments of the present invention will be briefly introduced below. Obviously, the drawings described below are only some embodiments of the present invention. For those skilled in the art, other drawings can be obtained based on these drawings without creative effort.
[0017] Figure 1 is a schematic diagram illustrating the application environment of a music emotion recognition method based on graph convolutional networks in one embodiment of the present invention;
[0018] Figure 2 is a flowchart of a music emotion recognition method based on graph convolutional networks in one embodiment of the present invention;
[0019] Figure 3 is a flowchart of step S10 of a music emotion recognition method based on graph convolutional networks in an embodiment of the present invention;
[0020] Figure 4 is a flowchart of step S104 of a music emotion recognition method based on graph convolutional networks in an embodiment of the present invention;
[0021] Figure 5 is a schematic diagram of a music emotion recognition device based on graph convolutional networks according to an embodiment of the present invention;
[0022] Figure 6 is a schematic diagram of a computer device according to an embodiment of the present invention.
Detailed Implementation
[0023] The technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some, not all, of the embodiments of the present invention. Based on the embodiments of the present invention, all other embodiments obtained by those skilled in the art without creative effort are within the scope of protection of the present invention.
[0024] The music emotion recognition method based on graph convolutional networks provided in this invention can be applied in the application environment shown in Figure 1. Specifically, the method is applied in a music emotion recognition device based on graph convolutional networks, which includes the client and server shown in Figure 1. The client and server communicate over a network to address the low accuracy of music emotion recognition in existing technologies. The server can be a standalone server or a cloud server providing basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, content delivery networks (CDNs), and big data and artificial intelligence platforms. The client, also known as the user terminal, is the program that corresponds to the server and provides categorization services to customers. The client can be installed on, but is not limited to, various personal computers, laptops, smartphones, tablets, and portable wearable devices.
[0025] In one embodiment, as shown in Figure 2, a music emotion recognition method based on graph convolutional networks is provided. Taking its application to the server in Figure 1 as an example, the method includes the following steps:
[0026] S10: Obtain the initial vector corresponding to each preset sentiment tag. The initial vector is obtained by performing graph convolution processing on the sentiment tag graph through a preset graph convolution network. The sentiment tag graph is generated based on the initial word vector obtained after encoding the preset sentiment tags.
[0027] Understandably, the preset sentiment labels are words or characters used to represent emotions, such as happy, joyful, and cheerful. The initial vectors are obtained by performing graph convolution processing on the sentiment label graph using a preset graph convolutional network, and are used to represent the preset sentiment labels. The sentiment label graph is constructed and generated based on the initial word vectors obtained after encoding each of the preset sentiment labels.
[0028] Specifically, the initial vector can be collected from different databases or sent from the client to the server; in other words, the initial vector is prepared in advance. All pre-collected preset sentiment tags are retrieved and input into a preset encoding model. The input layer of the preset encoding model segments each sentiment tag in turn, adds flags and position vectors, and adds the word vector of each sentiment tag to the corresponding position vector, thus obtaining the input vector. The attention layer of the preset encoding model then performs attention processing on the input vector: multiple attention mechanisms process the input vector separately, and their results are concatenated to obtain the attention vector. Finally, the fully connected layer of the preset encoding model performs prediction: the attention vector is first passed through the fully connected layer and then through the softmax function, thus obtaining the initial word vector corresponding to each preset sentiment tag. The initial word vectors are used as nodes, and the probability of two preset sentiment tags appearing simultaneously is used as the edge weight, to obtain the sentiment tag graph. The sentiment tag graph is then processed by the preset graph convolutional network, that is, passed through two graph convolutional layers, to obtain the initial vector.
[0029] S20: Obtain at least one piece of music to be identified, extract features from all the pieces of music to be identified, and obtain audio features corresponding to each piece of music to be identified.
[0030] Understandably, the music to be identified is music for which emotion recognition is required. Audio features are features used to characterize the music to be identified, i.e., latent vectors extracted from the music to be identified.
[0031] Specifically, the music to be identified can be collected from different websites or databases using web scraping techniques, or it can be sent from the client to the server. At this point the music is stored as a one-dimensional time-series signal. The librosa library is invoked, and functions within librosa convert the music into a Mel spectrum. This transforms the one-dimensional time-series signal, which is difficult to process, into two-dimensional frequency-domain data that is easier to process and richer in information. Then, a two-dimensional convolutional neural network is used to extract features from the Mel spectrum, thereby obtaining the audio features. In this way, the audio features corresponding to each piece of music to be identified are obtained.
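The conversion of a one-dimensional time-series signal into two-dimensional frequency-domain data can be illustrated with a minimal sketch. The patent performs this step with librosa; the version below is an assumption-laden stand-in that uses only NumPy and a plain short-time Fourier transform, so it does not apply a mel filter bank and will not reproduce librosa's output. The function name `magnitude_spectrogram` and the frame and hop sizes are hypothetical choices made for illustration.

```python
import numpy as np

def magnitude_spectrogram(signal, frame_length=256, hop_length=128):
    """Turn a 1-D signal into a 2-D (frequency bins x frames) magnitude array."""
    window = np.hanning(frame_length)
    n_frames = 1 + (len(signal) - frame_length) // hop_length
    frames = np.stack([
        signal[i * hop_length : i * hop_length + frame_length] * window
        for i in range(n_frames)
    ])
    # rfft keeps the non-negative frequencies: frame_length // 2 + 1 bins per frame.
    return np.abs(np.fft.rfft(frames, axis=1)).T

sample_rate = 8000
t = np.arange(sample_rate) / sample_rate      # one second of audio
tone = np.sin(2 * np.pi * 1000 * t)           # a 1 kHz test tone (hypothetical input)
spec = magnitude_spectrogram(tone)
peak_bin = int(spec.mean(axis=1).argmax())    # frequency bin holding most energy
```

The resulting two-dimensional array is the kind of input a two-dimensional convolutional neural network can consume in the same way it consumes an image.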
[0032] S30: Perform emotion recognition on all the audio features and all the initial vectors using an emotion classifier to obtain at least one target emotion label corresponding to the music to be identified.
[0033] Understandably, a sentiment classifier is a model used for sentiment classification, such as a KNN classifier, a Bayesian classifier, or a VSM classifier. The target sentiment label is a preset sentiment label predicted by the sentiment classifier that corresponds to the music to be identified.
[0034] Specifically, after obtaining the initial vectors, all extracted audio features and all initial vectors are input into the sentiment classifier. That is, the dot product of each audio feature with every initial vector is computed, and the results are input into the sentiment classifier. The sentiment classifier then makes its prediction from these results by calculating the distance between each audio feature and all initial vectors and selecting target vectors based on these distances. Finally, all target vectors are sorted in descending order, a predetermined number of them (e.g., 2, 3, or 4) is selected, and the preset sentiment labels corresponding to the selected target vectors are determined as the target sentiment labels.
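The scoring and selection described in this step can be sketched as follows. This is a simplified illustration, not the classifier of the invention: it assumes that the dot product alone serves as the score, and the function name `top_k_labels` and the toy two-dimensional vectors are hypothetical.

```python
import numpy as np

def top_k_labels(audio_feature, label_vectors, label_names, k=3):
    """Score every label by its dot product with the audio feature; return the k best."""
    scores = label_vectors @ audio_feature       # shape: (num_labels,)
    order = np.argsort(scores)[::-1][:k]         # indices in descending score order
    return [(label_names[i], float(scores[i])) for i in order]

labels = ["happy", "joyful", "sad", "angry"]
label_vectors = np.array([[1.0, 0.0],            # toy stand-ins for the initial vectors
                          [0.8, 0.2],
                          [-1.0, 0.1],
                          [0.0, -1.0]])
audio_feature = np.array([0.9, 0.1])             # toy stand-in for one audio feature
result = top_k_labels(audio_feature, label_vectors, labels, k=2)
```

Returning several labels rather than one matches the text, which selects a predetermined number of target vectors for each piece of music.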
[0035] This invention discloses a music emotion recognition method based on graph convolutional networks. The method obtains initial word vectors by encoding preset emotion tags. By using the initial word vectors as nodes and the probability of two preset emotion tags appearing simultaneously as edge weights, an emotion tag graph is obtained. Further, a preset graph convolutional network is used to perform graph convolution on the emotion tag graph, thereby obtaining the initial vectors and improving the accuracy of subsequent music emotion recognition. A two-dimensional convolutional neural network is used to extract features from the Mel spectrum of the music to be recognized, thus obtaining audio features. An emotion classifier is used to perform emotion recognition on the audio features and the initial vectors, thereby determining the target emotion tag and improving the precision and accuracy of music emotion recognition.
[0036] In one embodiment, as shown in Figure 3, step S10 includes, before obtaining the initial vector corresponding to each preset sentiment label:
[0037] S101, Obtain a preset encoding model, and preprocess all the preset emotion tags through the input layer of the preset encoding model to obtain the input vector corresponding to each preset emotion tag.
[0038] Understandably, the input vector is obtained by processing the preset sentiment labels by the input layer of the preset encoding model. The preset encoding model is a pre-trained model used to encode the preset sentiment labels, such as the BERT model.
[0039] Specifically, a pre-defined encoding model is invoked, and all pre-defined sentiment tags are input into it. The input layer of the pre-defined encoding model performs word segmentation on each pre-defined sentiment tag. A CLS flag is added before all words of a given sentiment tag as a start identifier, and a SEP flag is added after all words of all given sentiment tags as a separator. Each word, CLS flag, and SEP flag are vectorized to obtain word vectors, flag vectors corresponding to CLS flags, and flag vectors corresponding to SEP flags. The pre-defined encoding model learns to add a corresponding position vector to each word vector. Finally, all word vectors, position vectors, and flag vectors corresponding to the same pre-defined sentiment tag are concatenated to obtain the input vector corresponding to the pre-defined sentiment tag.
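A minimal sketch of this input-layer preprocessing is given below. It assumes a BERT-style scheme in which each token vector is summed with its position vector; the vocabulary, the embedding size, and the random matrices standing in for learned embeddings are all hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
EMBED_DIM = 8   # hypothetical embedding size

def build_input(words, vocab, token_emb, pos_emb):
    """Wrap the words in [CLS] ... [SEP] and add a position vector to each token vector."""
    tokens = ["[CLS]"] + words + ["[SEP]"]
    ids = [vocab[t] for t in tokens]
    vectors = token_emb[ids] + pos_emb[: len(ids)]
    return tokens, vectors

vocab = {"[CLS]": 0, "[SEP]": 1, "very": 2, "happy": 3}
token_emb = rng.normal(size=(len(vocab), EMBED_DIM))   # random stand-in for learned token embeddings
pos_emb = rng.normal(size=(16, EMBED_DIM))             # random stand-in for learned position embeddings
tokens, input_vectors = build_input(["very", "happy"], vocab, token_emb, pos_emb)
```

In a pre-trained encoder these embedding tables are learned parameters, so a real system would load them rather than sample them.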
[0040] S102, attention processing is performed on all the input vectors through the attention layer of the preset encoding model to obtain the attention vector corresponding to each preset sentiment label.
[0041] S103, all attention vectors are predicted through the fully connected layer of the preset encoding model to obtain initial word vectors corresponding to each preset sentiment tag.
[0042] Understandably, the attention vector is obtained by performing attention processing on the input vector. The initial word vector is a vectorized representation of a predefined sentiment label.
[0043] Specifically, after obtaining the input vectors, attention processing is performed on all input vectors through multiple attention mechanisms. Each attention mechanism computes the Q, K, and V vectors from the input vectors. The dot product method is used to calculate the correlation score between the Q and K vectors; that is, the dot product is calculated between each vector in Q and each vector in K, and the resulting correlation scores are normalized. Then, the softmax function transforms the scores into a probability distribution whose values lie in [0, 1]. The V vectors are weighted by this probability distribution to obtain the attention result. Finally, the attention results from the different attention mechanisms are concatenated to obtain the attention vector. Further, the attention vector is non-linearly transformed using a feed-forward neural network in the fully connected layer to obtain a transformed vector. This transformed vector is passed through an activation function to obtain an activation vector. The activation vector then undergoes residual processing, and the process is repeated for a predetermined number of encoding layers (e.g., 12 or 24) to obtain the initial word vector corresponding to each preset sentiment label.
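The attention computation described above can be sketched in NumPy. This is an illustrative sketch under stated assumptions: the scores are scaled by the square root of the per-head dimension (the standard scaled dot-product form, which the text only describes as normalization), and the projection matrices `w_q`, `w_k`, and `w_v` are random stand-ins for learned weights.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)      # subtract the max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(x, w_q, w_k, w_v, num_heads):
    """x: (seq_len, dim). Returns the concatenated per-head attention results."""
    seq_len, dim = x.shape
    head_dim = dim // num_heads
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    heads = []
    for h in range(num_heads):
        s = slice(h * head_dim, (h + 1) * head_dim)
        scores = q[:, s] @ k[:, s].T / np.sqrt(head_dim)   # scaled dot-product scores
        weights = softmax(scores)                           # each row sums to 1
        heads.append(weights @ v[:, s])                     # weight the V vectors
    return np.concatenate(heads, axis=1)

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))                                 # four input vectors of size 8
w_q, w_k, w_v = (rng.normal(size=(8, 8)) for _ in range(3))
attention_vector = multi_head_attention(x, w_q, w_k, w_v, num_heads=2)
```

The feed-forward transformation, activation, and residual steps that follow in the text are omitted here to keep the sketch focused on the attention result.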
[0044] S104, construct a sentiment tag graph based on all the initial word vectors, and perform graph convolution processing on the sentiment tag graph through the preset graph convolutional network to obtain the initial vector corresponding to each of the initial word vectors.
[0045] Understandably, the sentiment tag graph is a graph constructed from preset sentiment labels of the same sentiment category. Sentiment categories include joy, anger, sorrow, happiness, and fear. The preset graph convolutional network is a pre-trained graph convolutional neural network. The initial vector is obtained by performing graph convolution on the preset sentiment labels in the sentiment tag graph.
[0046] Specifically, after obtaining the initial word vectors, the initial word vectors corresponding to each preset sentiment tag are used as nodes. Preset sentiment tags within the same sentiment category are connected to construct edges in the graph. The probability of two preset sentiment tags appearing simultaneously is determined as the weight value of the edge, thus constructing a sentiment tag graph based on all edges and all weight values. Further, a preset graph convolutional network is invoked to predict the relevance of the sentiment tag graph; that is, information is passed between nodes in the sentiment tag graph through the transmission layer in the preset graph convolutional network to obtain information vectors. Information is fused between the information vectors and nodes through the fusion layer in the preset graph convolutional network to obtain a fused vector. A nonlinear transformation is performed on the fused vector through the transform layer in the preset graph convolutional network to obtain a transformed vector. Thus, the initial vectors corresponding to each initial word vector can be obtained through just two graph convolutional layers.
[0047] This invention encodes preset emotion tags using a preset encoding model, thereby vectorizing each word within the preset emotion tags and acquiring the input vector. Then, attention processing and fully connected processing are applied to the input vector using the preset encoding model, resulting in the acquisition of attention vectors and initial word vectors, thus improving the accuracy of subsequent music emotion recognition.
[0048] In one embodiment, step S104, namely constructing a sentiment tag graph based on all the initial word vectors, includes:
[0049] S1041, construct nodes in the same sentiment category based on all the initial word vectors, and connect all the nodes in the same sentiment category to construct the first edge.
[0050] Understandably, an emotion category refers to a type of emotion, such as joy, anger, sorrow, happiness, and fear. A preset emotion label is a label representing a certain emotion, such as happiness, joy, worry, sadness, anger, resentment, fear, and terror. The first edge is a connection between two preset emotion labels within the same emotion category, such as happiness and joy, worry and sadness, or fear and terror. A node is a connection point, i.e., an initial word vector, which is used to represent a preset emotion label.
[0051] Specifically, after obtaining all initial word vectors, all preset sentiment tags in the database are retrieved. These preset sentiment tags are then divided according to sentiment category; that is, by calculating the similarity between a preset sentiment tag and a sentiment category, the preset sentiment tag is assigned to the sentiment category with the highest similarity. In this way, all preset sentiment tags are divided, and the initial word vectors corresponding to preset sentiment tags within the same sentiment category are determined as nodes within that sentiment category. Further, all nodes within the same sentiment category are connected to construct the first edge. Thus, by connecting all nodes within the same sentiment category, all first edges for each sentiment category are obtained.
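The category assignment and edge construction can be sketched as follows. The text does not name the similarity measure, so cosine similarity is an assumption here, as are the function name `build_first_edges` and the toy two-dimensional vectors.

```python
from itertools import combinations
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def build_first_edges(label_vecs, category_vecs):
    """Assign each label to its most similar category, then fully connect each category."""
    members = {c: [] for c in category_vecs}
    for label, vec in label_vecs.items():
        best = max(category_vecs, key=lambda c: cosine(vec, category_vecs[c]))
        members[best].append(label)
    edges = [pair for group in members.values() for pair in combinations(group, 2)]
    return members, edges

category_vecs = {"joy": np.array([1.0, 0.0]), "sorrow": np.array([0.0, 1.0])}
label_vecs = {
    "happy": np.array([0.9, 0.1]),
    "joyful": np.array([0.8, 0.3]),
    "worry": np.array([0.2, 0.9]),
    "sadness": np.array([0.1, 1.0]),
}
members, first_edges = build_first_edges(label_vecs, category_vecs)
```

Because edges are only created inside a category, labels from different categories stay disconnected in the resulting graph.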
[0052] S1042, obtain the probability values among all nodes in the same emotion category, and determine the first edge weight value of each first edge.
[0053] S1043, Construct a sentiment label graph based on all the first edges and all the weight values of the first edges.
[0054] Understandably, the probability value is the probability that two preset emotion labels appear simultaneously within the same emotion category. The first edge weight value is the weight assigned to a first edge. The historical dataset is constructed from all music that has undergone emotion recognition or has been manually labeled.
[0055] Specifically, after obtaining the first edges, the historical dataset is retrieved. Across all music tracks within the historical dataset, the number of times each single tag appears is counted, as is the number of times two tags (one of which is that single tag) appear simultaneously. The ratio of the number of simultaneous appearances of the two tags to the number of appearances of the single tag is calculated, and this ratio is determined as the probability value, thus obtaining the probability value between two nodes within the same sentiment category. For example, if the "happy" tag appears 1000 times across all music tracks in the historical dataset, and the "happy" and "joyful" tags appear simultaneously 800 times, the probability value between the "happy" and "joyful" tags is 0.8. This process is repeated to obtain the probability values between all nodes. Further, the probability values between all nodes within the same sentiment category are determined as the first edge weight values of the first edges. For example, the probability value of 0.8 between the "happy" and "joyful" tags gives a first edge weight value of 0.8 between the two nodes. This process yields the first edge weight value corresponding to each first edge. A sentiment label graph is then constructed based on all first edges and all first edge weight values; that is, the graph composed of all first edges and all first edge weight values is defined as the sentiment label graph.
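The edge-weight calculation can be sketched as a co-occurrence count. This sketch assumes each track's tags are available as a set and treats the weight as a directed conditional probability P(b | a); the text does not state which tag of a pair supplies the denominator, so that direction is an assumption, and the function name `cooccurrence_weights` is hypothetical. The toy history scales the example above down from 1000 and 800 appearances to 10 and 8.

```python
from collections import Counter
from itertools import permutations

def cooccurrence_weights(track_labels):
    """Return {(a, b): P(b | a)} estimated from per-track label sets."""
    single = Counter()
    pair = Counter()
    for labels in track_labels:
        unique = set(labels)
        single.update(unique)
        pair.update(permutations(unique, 2))   # ordered pairs: (a, b) and (b, a) are counted separately
    return {(a, b): n / single[a] for (a, b), n in pair.items()}

# Toy history: "happy" appears on 10 tracks, 8 of which are also tagged "joyful".
history = [{"happy", "joyful"}] * 8 + [{"happy"}] * 2 + [{"joyful"}] * 2
weights = cooccurrence_weights(history)
happy_to_joyful = weights[("happy", "joyful")]
```

An undirected graph would need one value per edge, for example by keeping only one direction or averaging the two; the text leaves that choice open.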
[0056] This invention constructs the first edge by connecting all nodes within the same emotion category. The weights of the first edge are determined by statistically analyzing the probability values between all nodes. An emotion labeling graph is constructed using all first edges and their weights, thus improving the accuracy of subsequent emotion recognition.
[0057] In one embodiment, as shown in Figure 4, in step S104, performing graph convolution processing on the sentiment tag graph through the preset graph convolutional network to obtain the initial vector corresponding to each of the initial word vectors includes:
[0058] S1044, information is passed between nodes in the sentiment tag graph through the transmission layer in the preset graph convolutional network to obtain information vectors.
[0059] Understandably, the information vector is the information contained in a node, used to characterize the features of each node.
[0060] Specifically, after obtaining the sentiment tag graph, it is input into the preset graph convolutional network. The transmission layer within this network transmits information between nodes in the sentiment tag graph; each node sends its own feature information as a vector to its neighboring nodes. In other words, each node aggregates its own features to form a message vector, which is then used as the information vector. In this way, the features of surrounding neighboring nodes are transmitted to the current node, resulting in the information vector corresponding to each node.
[0061] S1045, information is fused between the information vector and the node through the fusion layer in the preset graph convolutional network to obtain a fused vector.
[0062] Understandably, the fusion vector is obtained by fusing its own features with the features of its neighboring nodes.
[0063] Specifically, after obtaining the information vector, the information vector and nodes are fused using a fusion layer in a pre-defined graph convolutional network. This involves updating all nodes according to the node update function, i.e., updating the node vector at the current time step. The vector of the node at the current time step is combined with the features obtained from the information vector. Based on the weight values between different nodes, the features of all neighboring nodes related to that node are fused into the representation of that node, thus obtaining a fusion vector. In this way, by fusing the information vectors transmitted by neighboring nodes into the vector of that node based on the weight values between two nodes, a fusion vector corresponding to each node is obtained.
[0064] S1046, the fusion vector is nonlinearly transformed by the transformation layer in the preset graph convolutional network to obtain the initial vector corresponding to each of the initial word vectors.
[0065] Specifically, after obtaining the fused vectors, a nonlinear transformation is performed on all fused vectors through the transformation layer in the preset graph convolutional network. That is, the ReLU function is applied to each fused vector to produce a new feature, which can then be used for linear classification; this new feature is the transformed vector corresponding to the fused vector. In this way, the two graph convolutional layers in the preset graph convolutional network perform graph convolution processing on the nodes in the sentiment tag graph, thereby obtaining the initial vectors corresponding to each initial word vector.
[0066] This invention utilizes the transmission layer in the preset graph convolutional network to transmit information between nodes in the sentiment tag graph, thereby determining the information vector. The fusion layer in the same network fuses the information vector and the nodes, thus determining the fused vector. Finally, the transformation layer in the network performs a nonlinear transformation on the fused vector, determining the transformed vector and ultimately obtaining the initial vector, which improves the accuracy of subsequent sentiment recognition.
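The three stages of message passing, weighted fusion, and nonlinear transformation, stacked twice, can be sketched in NumPy. This is a minimal sketch under stated assumptions, not the trained network of the invention: it adds self-loops and uses row normalization (one common simplification), and the weight matrices `w1` and `w2` are random stand-ins for learned parameters.

```python
import numpy as np

def gcn_layer(adj, features, weight):
    """One graph convolution: pass messages, fuse by edge weight, apply ReLU."""
    adj_self = adj + np.eye(adj.shape[0])        # self-loop keeps each node's own features
    deg = adj_self.sum(axis=1, keepdims=True)
    fused = (adj_self / deg) @ features          # weighted average over the node and its neighbours
    return np.maximum(fused @ weight, 0.0)       # linear map followed by ReLU

def two_layer_gcn(adj, features, w1, w2):
    return gcn_layer(adj, gcn_layer(adj, features, w1), w2)

rng = np.random.default_rng(0)
# Three labels: 0 and 1 share a category (edge weight 0.8), 2 has no neighbours.
adj = np.array([[0.0, 0.8, 0.0],
                [0.8, 0.0, 0.0],
                [0.0, 0.0, 0.0]])
word_vectors = rng.normal(size=(3, 4))           # toy stand-ins for the initial word vectors
w1, w2 = rng.normal(size=(4, 6)), rng.normal(size=(6, 4))
initial_vectors = two_layer_gcn(adj, word_vectors, w1, w2)
```

With two layers, each node's output depends on neighbours up to two hops away, which is why a small number of layers is enough for a graph whose edges stay inside one category.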
[0067] In one embodiment, step S10 includes, before performing graph convolution processing on the sentiment tag graph using the preset graph convolutional network:
[0068] S105, Obtain a sample training dataset, wherein the sample training dataset includes at least one sample training data; one sample training data corresponds to one sample label.
[0069] Understandably, the sample training data consists of pre-defined emotional labels for music, obtained through manual annotation or other methods of emotional recognition. This sample training data can be collected from different websites or databases using web scraping techniques, or sent from the client to the server. A sample training dataset is then constructed based on all the acquired sample training data. Each sample training data point is assigned a sample label, which represents the true vector of that sample training data.
[0070] S106, Obtain a preset training model, and use the preset training model to predict the sample training data to obtain the predicted label.
[0071] Understandably, the preset training model is a model pre-set to predict sample training data. The predicted label is a predicted vector obtained by predicting the sample training data using the preset training model, used to represent the predicted vector corresponding to the sample training data.
[0072] Specifically, after obtaining the sample training data and sample labels, all sample training data are classified according to emotion categories, that is, all preset emotion labels for music are divided according to emotion categories, thus obtaining the historical preset emotion labels corresponding to each emotion category. Based on all historical preset emotion labels corresponding to the same emotion category, historical nodes for that emotion category are constructed, and all historical nodes within the same emotion category are connected to construct the first historical edge. Historical probability values between all historical nodes within the same emotion category are obtained, and these historical probability values are determined as the weight values of the first historical edge. A historical emotion label graph is constructed based on all first historical edges and all first historical edge weight values.
[0073] Furthermore, all historical emotion label graphs are input into the preset training model. The model makes predictions on each historical emotion label graph by passing information between the historical nodes in the graph according to the message function within the model. This involves aggregating the features of each historical node to form a message vector, which is then passed to its neighboring historical nodes. The node update function in the model updates all nodes, specifically updating the representation of the current historical emotion category node and combining it with the features obtained from the message vector. Finally, the transformation module in the model performs emotion recognition on the updated historical emotion category nodes and the updated historical nodes to obtain the predicted label corresponding to each sample training data.
[0074] S107, determine the prediction loss value of the preset training model based on the sample label and the prediction label corresponding to the same sample training data.
[0075] Understandably, the prediction loss value is produced when the preset training model predicts labels for the sample training data, and it characterizes the difference between the sample label and the predicted label.
[0076] Specifically, after obtaining the predicted labels, all sample labels corresponding to the sample training data are arranged according to the order of the sample training data in the sample training dataset. Then, the predicted labels associated with the sample training data are compared with the sample labels of the sample training data with the same sequence. That is, according to the sample training data, the sample label corresponding to the first sample training data is compared with the predicted label corresponding to the first sample training data, and the loss value between the sample label and the predicted label is determined by the loss function. Then, the sample label corresponding to the second sample training data is compared with the predicted label corresponding to the second sample training data, until all sample labels and predicted labels have been compared, and the prediction loss value of the preset training model can be determined.
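The pairwise comparison can be sketched as below. The text does not name the loss function, so mean squared error is used purely as a stand-in, and the function name `prediction_loss` is hypothetical.

```python
def prediction_loss(sample_labels, predicted_labels):
    """Average per-sample loss, pairing labels by their position in the dataset.

    Both arguments are lists of equal length; element i of each list is the
    label vector belonging to the i-th sample training data.
    """
    if len(sample_labels) != len(predicted_labels):
        raise ValueError("each sample label needs exactly one predicted label")
    if not sample_labels:
        raise ValueError("at least one sample is required")

    total = 0.0
    for target, predicted in zip(sample_labels, predicted_labels):
        # Stand-in loss (assumption): mean squared error between the two vectors.
        total += sum((t - p) ** 2 for t, p in zip(target, predicted)) / len(target)
    return total / len(sample_labels)
```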
[0077] S108, when the predicted loss value does not reach the preset convergence condition, iteratively update the initial parameters in the preset training model until the predicted loss value reaches the convergence condition, and record the converged preset training model as a preset graph convolutional network.
[0078] Understandably, the convergence condition can be that the predicted loss value is less than a set threshold, that is, training stops when the predicted loss value falls below the threshold; the convergence condition can also be that, after 500 iterations, the predicted loss value is very small and no longer decreases, at which point training stops.
[0079] Specifically, after determining the predicted loss value of the preset training model, if the predicted loss value does not reach the preset convergence condition, the initial parameters of the preset training model are adjusted based on the predicted loss value. All sample training data are then re-input into the preset training model with adjusted initial parameters, and that model is trained on the sample training data to obtain its corresponding predicted loss value. If this predicted loss value still does not reach the preset convergence condition, the initial parameters of the preset training model are adjusted again based on the predicted loss value, and the process repeats until the predicted loss value reaches the preset convergence condition. In this way, the output of the preset training model continuously approaches the accurate result and the prediction accuracy becomes higher and higher. The converged preset training model is then recorded as the preset graph convolutional network.
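The two convergence conditions can be sketched as a generic training loop. This is only an illustration: `train_until_converged` and the `step` callable are hypothetical names, and reading the second condition as "no decrease for 500 consecutive iterations" is an assumption about the text.

```python
def train_until_converged(step, threshold=1e-3, patience=500, max_iterations=100_000):
    """Call step() repeatedly until one of the convergence conditions holds.

    step: callable that performs one parameter update and returns the loss.
    Returns (number of iterations run, last loss value).
    """
    best = float("inf")
    loss = float("inf")
    stale = 0  # consecutive iterations without improvement
    for iteration in range(1, max_iterations + 1):
        loss = step()
        # Condition 1: the loss fell below the set threshold.
        if loss < threshold:
            return iteration, loss
        if loss < best:
            best, stale = loss, 0
        else:
            stale += 1
            # Condition 2 (assumed reading): the loss has stopped decreasing.
            if stale >= patience:
                return iteration, loss
    return max_iterations, loss
```

The `max_iterations` cap is an added safeguard so the loop always terminates, even if neither condition is met.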
[0080] This invention trains a pre-defined training model using a large amount of sample training data and determines the prediction loss value between the predicted label and the sample label using a pre-defined loss function. The initial parameters of the pre-defined training model are adjusted based on the prediction loss value until the model converges, thereby obtaining a pre-defined graph convolutional network and ensuring that the pre-defined graph convolutional network has a high prediction accuracy.
[0081] In one embodiment, step S20, namely, extracting features from all the music pieces to be identified to obtain audio features corresponding to each music piece to be identified, includes:
[0082] S201, Perform spectral analysis on all the music to be identified to obtain the Mel spectrum corresponding to each of the music to be identified.
[0083] Understandably, the Mel spectrum is based on converting the Fourier transform spectrum into a spectrum that is more consistent with human hearing using a non-linear Mel scale.
[0084] Specifically, after obtaining the music to be identified, functions from the librosa library are used to convert the audio of the music into a corresponding Fourier transform spectrum. The WAV file of the music to be identified is read to obtain the audio time series and its sampling rate. The audio is then resampled from the original sampling rate to the target sampling rate, with the length of the resampled signal adjusted accordingly, resulting in a resampled audio array. The duration (in seconds) and the sampling rate of the audio file are read. Based on the duration and sampling rate, the time series is written out as an audio file. The zero-crossing rate of the audio time series is calculated, and the waveform of the audio file is plotted. The waveform is then transformed using a short-time Fourier transform to obtain the short-time Fourier matrix. An inverse short-time Fourier transform is performed on the short-time Fourier matrix to obtain the time-domain signal. Amplitude and power conversions are then performed on the time-domain signal to obtain the Fourier transform spectrum. The Fourier transform spectrum is processed by a Mel filter, which converts the linear natural spectrum into a Mel spectrum that reflects the characteristics of human hearing, thereby obtaining the Mel spectrum corresponding to the music to be identified.
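The Mel filtering step can be illustrated without any audio library. The sketch below is not the librosa implementation referred to above: it uses the common formula mel = 2595 · log10(1 + f / 700) and plain triangular filters, and the function names and parameters are assumptions chosen for the example. A library's own defaults may differ.

```python
import math


def hz_to_mel(hz):
    # Common Mel-scale formula; other variants exist.
    return 2595.0 * math.log10(1.0 + hz / 700.0)


def mel_to_hz(mel):
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def mel_filterbank(n_mels, n_fft, sample_rate):
    """Return n_mels triangular filters, each with n_fft // 2 + 1 weights."""
    n_bins = n_fft // 2 + 1
    # n_mels + 2 edge frequencies, equally spaced on the Mel scale up to Nyquist.
    top = hz_to_mel(sample_rate / 2.0)
    edges_hz = [mel_to_hz(top * i / (n_mels + 1)) for i in range(n_mels + 2)]
    bin_hz = [k * sample_rate / n_fft for k in range(n_bins)]

    bank = []
    for m in range(1, n_mels + 1):
        left, centre, right = edges_hz[m - 1], edges_hz[m], edges_hz[m + 1]
        row = []
        for f in bin_hz:
            rising = (f - left) / (centre - left)
            falling = (right - f) / (right - centre)
            row.append(max(0.0, min(rising, falling)))
        bank.append(row)
    return bank


def apply_filterbank(bank, power_spectrum):
    """Map one frame of a linear power spectrum onto the Mel bands."""
    return [sum(w * p for w, p in zip(row, power_spectrum)) for row in bank]
```

Applying `apply_filterbank` to every frame of a short-time power spectrum yields a Mel spectrum with one row of band energies per frame.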
[0085] S202, Obtain a two-dimensional convolutional neural network, and extract features from the Mel spectrum through the two-dimensional convolutional neural network to obtain audio features.
[0086] Understandably, a two-dimensional convolutional neural network is a convolutional neural network (CNN) whose convolution kernels slide over two dimensions of the input. Audio features are features used to characterize the music to be identified.
[0087] Specifically, after obtaining the Mel spectrum, a two-dimensional convolutional neural network (CNN) is retrieved from the database. All Mel spectra are input into the CNN, and feature extraction is performed on each Mel spectrum individually. This involves convolution processing of the Mel spectra using the first convolutional layer of the CNN, specifically by applying a convolution kernel to the Mel spectra to obtain the first convolution result. The first convolution result is then activated using the ReLU nonlinear function to obtain the first convolutional feature. Finally, the first convolutional feature is pooled using a first pooling layer, specifically by using pooling matrices to pool the first convolutional feature. The maximum or average value in each pooling matrix is then used as the pooling result, thus obtaining the first pooled feature.
[0088] Furthermore, the first pooled feature is convolved through a second convolutional layer in the two-dimensional convolutional neural network to obtain a second convolution result. This second convolution result is then activated using the ReLU nonlinear function to obtain the second convolutional feature. The second convolutional feature is then pooled through a second pooling layer to obtain the second pooled feature, which is input into the fully connected layers. The hidden units in the first fully connected layer predict on the second pooled feature based on different first weights, resulting in the output of the first hidden layer. The hidden units in the second fully connected layer then predict on the output of the first hidden layer based on different second weights, thus obtaining the audio feature. In this way, the audio features corresponding to each piece of music to be identified are obtained.
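The two convolution, activation, and pooling stages followed by two fully connected layers can be sketched in plain Python. This is a toy, single-channel illustration under several assumptions: one kernel per layer, no bias terms, no padding, max pooling with a 2×2 window, and hypothetical function names. A practical network would use a deep learning framework with many channels.

```python
def conv2d(image, kernel):
    """Valid (no padding) sliding-window product, as used in CNN layers."""
    kh, kw = len(kernel), len(kernel[0])
    out_h, out_w = len(image) - kh + 1, len(image[0]) - kw + 1
    return [[sum(image[r + i][c + j] * kernel[i][j]
                 for i in range(kh) for j in range(kw))
             for c in range(out_w)]
            for r in range(out_h)]


def relu(feature_map):
    return [[max(0.0, v) for v in row] for row in feature_map]


def max_pool(feature_map, size=2):
    """Keep the maximum of each non-overlapping size x size window."""
    rows, cols = len(feature_map) // size, len(feature_map[0]) // size
    return [[max(feature_map[r * size + i][c * size + j]
                 for i in range(size) for j in range(size))
             for c in range(cols)]
            for r in range(rows)]


def dense(vector, weights):
    """Fully connected layer without bias: one output per weight row."""
    return [sum(w * x for w, x in zip(row, vector)) for row in weights]


def extract_audio_feature(mel, kernel1, kernel2, fc1, fc2):
    x = max_pool(relu(conv2d(mel, kernel1)))   # first conv, ReLU, pooling
    x = max_pool(relu(conv2d(x, kernel2)))     # second conv, ReLU, pooling
    flat = [v for row in x for v in row]       # flatten for the dense layers
    return dense(dense(flat, fc1), fc2)        # two fully connected layers
```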
[0089] This invention achieves the acquisition of the Mel spectrum corresponding to each piece of music by performing spectral analysis on the music to be identified. Feature extraction of the Mel spectrum is then performed using a two-dimensional convolutional neural network, thereby extracting audio features and facilitating subsequent emotion recognition in the music.
[0090] In one embodiment, step S30, namely, performing emotion recognition on all the audio features and all the initial vectors using an emotion classifier to obtain at least one target emotion tag corresponding to the music to be identified, includes:
[0091] S301, based on all the said audio features and all the said initial vectors, determine the target vector corresponding to each of the said audio features.
[0092] S302, the target vector is filtered to obtain at least one target emotion tag corresponding to the music to be identified.
[0093] Understandably, the target vector is an initial vector associated with the audio features. The target sentiment label is the sentiment label of the music to be identified.
[0094] Specifically, after obtaining the audio features and the initial vectors, all audio features and initial vectors are input into the sentiment classifier. The sentiment classifier calculates the Euclidean distance or cosine similarity between each audio feature and every initial vector. All initial vectors are sorted according to their Euclidean distance or cosine similarity. Then, using box plot analysis, initial vectors with Euclidean distances or cosine similarities less than the maximum value of the interval are deleted, retaining those with Euclidean distances or cosine similarities greater than or equal to the maximum value of the interval. These retained initial vectors are then designated as target vectors, thus obtaining the target vectors corresponding to each audio feature. Further, the target vectors are sorted in descending order of their Euclidean distance or cosine similarity, and the preset sentiment labels corresponding to a predetermined number (e.g., 2, 3, or 4) of the target vectors are selected as the target sentiment tags corresponding to the music to be identified. In this way, at least one target sentiment tag corresponding to each piece of music to be identified is obtained.
[0095] This invention determines all target vectors from all audio features and all initial vectors. By filtering all target vectors, it obtains the target emotion tags, thereby improving the accuracy and precision of music emotion recognition.
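The filtering and ranking steps can be sketched for the cosine-similarity case. This example rests on assumptions: the names `cosine_similarity` and `select_emotion_tags` are hypothetical, and because the text does not define the box plot interval precisely, the cut-off is exposed as a `threshold` parameter whose default (the upper quartile of the scores) is only one possible reading.

```python
import math
import statistics


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def select_emotion_tags(audio_feature, label_vectors, top_k=3, threshold=None):
    """Return up to top_k labels whose vectors are most similar to the audio feature.

    label_vectors: dict mapping each preset emotion label to its vector.
    """
    scores = {label: cosine_similarity(audio_feature, vector)
              for label, vector in label_vectors.items()}
    if not scores:
        return []
    if threshold is None:
        # Assumed reading of the box-plot step: cut at the upper quartile.
        values = sorted(scores.values())
        threshold = (statistics.quantiles(values, n=4, method="inclusive")[2]
                     if len(values) >= 2 else values[0])

    # Filter: keep only vectors at or above the cut-off (the target vectors).
    kept = [(label, score) for label, score in scores.items() if score >= threshold]
    # Rank: highest similarity first, then take the preset number of labels.
    kept.sort(key=lambda item: item[1], reverse=True)
    return [label for label, _ in kept[:top_k]]
```

If Euclidean distance were used instead, smaller values would indicate closer vectors, so the comparison and sort direction would need to be chosen accordingly.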
[0096] It should be understood that the sequence number of each step in the above embodiments does not imply the order of execution. The execution order of each process should be determined by its function and internal logic, and should not constitute any limitation on the implementation process of the embodiments of the present invention.
[0097] In one embodiment, a music emotion recognition device based on graph convolutional networks is provided, which corresponds one-to-one with the music emotion recognition method based on graph convolutional networks described in the above embodiments. As shown in Figure 5, the music emotion recognition device based on graph convolutional networks includes a vector acquisition module 11, a feature extraction module 12, and an emotion recognition module 13. Detailed descriptions of each functional module are as follows:
[0098] The vector acquisition module 11 is used to acquire the initial vector corresponding to each preset sentiment label. The initial vector is obtained by performing graph convolution processing on the sentiment label graph through a preset graph convolution network. The sentiment label graph is constructed and generated based on the initial word vector obtained after encoding the preset sentiment labels.
[0099] Feature extraction module 12 is used to acquire at least one piece of music to be identified, extract features from all the pieces of music to be identified, and obtain audio features corresponding to each piece of music to be identified.
[0100] The emotion recognition module 13 is used to perform emotion recognition on all the audio features and all the initial vectors through an emotion classifier to obtain at least one target emotion label corresponding to the music to be recognized.
[0101] In one embodiment, the feature extraction module 12 includes:
[0102] The spectrum analysis unit is used to perform spectrum analysis on all the music to be identified, and obtain the Mel spectrum corresponding to each of the music to be identified.
[0103] The feature extraction unit is used to obtain a two-dimensional convolutional neural network, and to extract features from the Mel spectrum through the two-dimensional convolutional neural network to obtain audio features.
[0104] In one embodiment, the vector acquisition module 11 includes:
[0105] An input vector unit is used to obtain a preset encoding model, and to preprocess all the preset sentiment tags through the input layer of the preset encoding model to obtain an input vector corresponding to each preset sentiment tag;
[0106] An attention vector unit is used to perform attention processing on all the input vectors through the attention layer of the preset encoding model to obtain an attention vector corresponding to each preset sentiment label;
[0107] The word vector unit is used to predict all the attention vectors through the fully connected layer of the preset encoding model to obtain the initial word vectors corresponding to each preset sentiment label;
[0108] The graph convolutional unit is used to construct a sentiment label graph based on all the initial word vectors, and to perform graph convolution processing on the sentiment label graph through the preset graph convolutional network to obtain the initial vector corresponding to each of the initial word vectors.
[0109] In one embodiment, the graph convolution unit includes:
[0110] The first edge construction unit is used to construct nodes in the same sentiment category based on all the initial word vectors, and connect all the nodes in the same sentiment category to construct the first edge;
[0111] The weight value unit is used to obtain the probability values between all nodes in the same sentiment category and determine them as the first edge weight values of the first edge.
[0112] The sentiment label graph unit is used to construct a sentiment label graph based on all the first edges and all the weight values of the first edges.
[0113] In one embodiment, the graph convolution unit further includes:
[0114] The information transmission unit is used to transmit information to nodes in the sentiment tag graph through the transmission layer in the preset graph convolutional network to obtain an information vector;
[0115] An information fusion unit is used to fuse the information vector and the node through a fusion layer in the preset graph convolutional network to obtain a fused vector.
[0116] The vector transformation unit is used to perform a nonlinear transformation on the fused vector through the transformation layer in the preset graph convolutional network to obtain an initial vector corresponding to each of the initial word vectors.
[0117] In one embodiment, the vector acquisition module 11 further includes:
[0118] A sample acquisition unit is used to acquire a sample training dataset, wherein the sample training dataset includes at least one sample training data; one sample training data corresponds to one sample label.
[0119] The label prediction unit is used to acquire a preset training model, and use the preset training model to predict the sample training data to obtain the predicted label;
[0120] The prediction loss value unit is used to determine the prediction loss value of the preset training model based on the sample label and the prediction label corresponding to the same sample training data.
[0121] The model convergence unit is used to iteratively update the initial parameters in the preset training model when the predicted loss value does not reach the preset convergence condition, until the predicted loss value reaches the convergence condition, and then record the converged preset training model as a preset graph convolutional network.
[0122] In one embodiment, the emotion recognition module 13 includes:
[0123] The target vector determination unit is used to determine the target vector corresponding to each of the audio features based on all the audio features and all the initial vectors;
[0124] The tag filtering unit is used to filter the target vector to obtain at least one target emotion tag corresponding to the music to be identified.
[0125] Specific limitations regarding the music emotion recognition device based on graph convolutional networks can be found in the limitations of the music emotion recognition method based on graph convolutional networks mentioned above, and will not be repeated here. Each module in the aforementioned music emotion recognition device based on graph convolutional networks can be implemented entirely or partially through software, hardware, or a combination thereof. These modules can be embedded in or independent of the processor in a computer device, or stored in the memory of a computer device as software, so that the processor can call and execute the corresponding operations of each module.
[0126] In one embodiment, a computer device is provided, which may be a server, and its internal structure may be as shown in Figure 6. The computer device includes a processor, memory, network interface, and database connected via a system bus. The processor provides computational and control capabilities. The memory includes a non-volatile storage medium and internal memory. The non-volatile storage medium stores the operating system, computer programs, and the database. The internal memory provides an environment for the operation of the operating system and computer programs in the non-volatile storage medium. The database stores the data used in the music emotion recognition method based on graph convolutional networks described in the above embodiments. The network interface is used for communication with external terminals via a network connection. When the computer program is executed by the processor, it implements the music emotion recognition method based on graph convolutional networks.
[0127] In one embodiment, a computer device is provided, including a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor executes the computer program to implement the above-described music emotion recognition method based on graph convolutional networks.
[0128] In one embodiment, a computer-readable storage medium is provided, the computer-readable storage medium storing a computer program that, when executed by a processor, implements the above-described music emotion recognition method based on graph convolutional networks.
[0129] Those skilled in the art will understand that all or part of the processes in the methods of the above embodiments can be implemented by a computer program instructing related hardware. The computer program can be stored in a non-volatile computer-readable storage medium, and when executed, it can include the processes of the embodiments of the above methods. Any references to memory, storage, databases, or other media used in the embodiments provided in this application can include non-volatile and/or volatile memory. Non-volatile memory can include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory can include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in various forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), synchronous link DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM).
[0130] Those skilled in the art will clearly understand that, for the sake of convenience and brevity, the above-described division of functional units and modules is used as an example. In practical applications, the above functions can be assigned to different functional units and modules as needed, that is, the internal structure of the device can be divided into different functional units or modules to complete all or part of the functions described above.
[0131] The above-described embodiments are only used to illustrate the technical solutions of the present invention, and are not intended to limit it. Although the present invention has been described in detail with reference to the foregoing embodiments, those skilled in the art should understand that modifications can still be made to the technical solutions described in the foregoing embodiments, or equivalent substitutions can be made to some of the technical features. Such modifications or substitutions do not cause the essence of the corresponding technical solutions to deviate from the spirit and scope of the technical solutions of the embodiments of the present invention, and should all be included within the protection scope of the present invention.
Claims
1. A music emotion recognition method based on graph convolutional networks, characterized in that the method includes: An initial vector corresponding to each preset sentiment label is obtained. The initial vector is obtained by performing graph convolution on the sentiment label graph through a preset graph convolutional network. The sentiment label graph is constructed based on the initial word vectors obtained after encoding the preset sentiment labels. The initial word vectors corresponding to each preset sentiment label are used as nodes. Preset sentiment labels in the same sentiment category are connected to construct the edges of the graph. The probability of two preset sentiment labels appearing at the same time is determined as the weight value of the edge. The sentiment label graph is constructed based on all edges and all weight values. At least one piece of music to be identified is obtained, and features are extracted from all the pieces of music to be identified to obtain audio features corresponding to each piece of music to be identified. An emotion classifier is used to perform emotion recognition on all the audio features and all the initial vectors to obtain at least one target emotion tag corresponding to the music to be identified. Specifically, each audio feature and all initial vectors are combined by dot product multiplication and then input into the emotion classifier. The emotion classifier calculates the distance between each audio feature and all initial vectors, and then filters all initial vectors based on the distance to obtain target vectors. All target vectors are then sorted in descending order and a preset number of target vectors are selected. The preset emotion tag corresponding to the selected target vector is determined as the target emotion tag.
2. The music emotion recognition method based on graph convolutional networks as described in claim 1, characterized in that, Before obtaining the initial vector corresponding to each preset emotion tag, the process includes: Obtain a preset encoding model, and preprocess all the preset emotion tags through the input layer of the preset encoding model to obtain the input vector corresponding to each preset emotion tag; Attention processing is performed on all the input vectors through the attention layer of the preset encoding model to obtain attention vectors corresponding to each preset sentiment label; The attention vectors are predicted by the fully connected layer of the preset encoding model to obtain the initial word vectors corresponding to each preset sentiment tag. A sentiment tag graph is constructed based on all the initial word vectors, and the sentiment tag graph is processed by graph convolution through the preset graph convolutional network to obtain the initial vector corresponding to each of the initial word vectors.
3. The music emotion recognition method based on graph convolutional networks as described in claim 2, characterized in that, The construction of the sentiment label graph based on all the initial word vectors includes: Based on all the initial word vectors, nodes belonging to the same sentiment category are constructed, and all nodes belonging to the same sentiment category are connected to construct the first edge; Obtain the probability values among all nodes in the same sentiment category, and determine them as the first edge weight values of the first edge; Construct a sentiment label graph based on all first edges and all first edge weight values.
4. The music emotion recognition method based on graph convolutional networks as described in claim 2, characterized in that, The step of performing graph convolution processing on the sentiment tag graph through the preset graph convolutional network to obtain the initial vector corresponding to each of the initial word vectors includes: Information vectors are obtained by passing information between nodes in the sentiment tag graph through the transmission layer in the preset graph convolutional network. The information vector and the node are fused through the fusion layer in the preset graph convolutional network to obtain a fused vector; The fused vector is nonlinearly transformed by the transformation layer in the preset graph convolutional network to obtain the initial vector corresponding to each initial word vector.
5. The music emotion recognition method based on graph convolutional networks as described in claim 1, characterized in that, The step of extracting features from all the music pieces to be identified, to obtain audio features corresponding to each music piece to be identified, includes: Perform spectral analysis on all the music to be identified to obtain the Mel spectrum corresponding to each music to be identified; A two-dimensional convolutional neural network is obtained, and the features of the Mel spectrum are extracted through the two-dimensional convolutional neural network to obtain audio features.
6. The music emotion recognition method based on graph convolutional networks as described in claim 1, characterized in that, Before performing graph convolution processing on the sentiment tag graph using a preset graph convolutional network, the following steps are included: Obtain a sample training dataset, which includes at least one sample training data; each sample training data corresponds to one sample label. Obtain a preset training model, and use the preset training model to predict the sample training data to obtain the predicted label; The prediction loss value of the preset training model is determined based on the sample label and the prediction label corresponding to the same sample training data. When the predicted loss value does not reach the preset convergence condition, the initial parameters in the preset training model are iteratively updated until the predicted loss value reaches the convergence condition. Then, the converged preset training model is recorded as a preset graph convolutional network.
7. A music emotion recognition device based on graph convolutional networks, characterized in that the device includes: The vector acquisition module is used to acquire the initial vectors corresponding to each preset sentiment label. The initial vectors are obtained by performing graph convolution processing on the sentiment label graph through a preset graph convolutional network. The sentiment label graph is constructed based on the initial word vectors obtained after encoding the preset sentiment labels. Specifically, the initial word vectors corresponding to each preset sentiment label are used as nodes, preset sentiment labels in the same sentiment category are connected to construct the edges of the graph, the probability of two preset sentiment labels appearing at the same time is determined as the weight value of the edge, and the sentiment label graph is constructed based on all edges and all weight values. The feature extraction module is used to acquire at least one piece of music to be identified, extract features from all the pieces of music to be identified, and obtain audio features corresponding to each piece of music to be identified. The emotion recognition module is used to perform emotion recognition on all the audio features and all the initial vectors through an emotion classifier to obtain at least one target emotion tag corresponding to the music to be recognized. Specifically, each audio feature and all initial vectors are combined by dot product multiplication and then input into the emotion classifier. The emotion classifier calculates the distance between each audio feature and all initial vectors, and then filters all initial vectors based on the distance to obtain target vectors. All target vectors are then sorted in descending order and a preset number of target vectors are selected. The preset emotion tag corresponding to the selected target vector is determined as the target emotion tag.
8. A computer device comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, characterized in that, When the processor executes the computer program, it implements the music emotion recognition method based on graph convolutional networks as described in any one of claims 1 to 6.
9. A computer-readable storage medium storing a computer program, characterized in that, When the computer program is executed by the processor, it implements the music emotion recognition method based on graph convolutional networks as described in any one of claims 1 to 6.
Citation Information
Patent Citations
Cross-corpus emotion recognition method based on graph convolutional neural network
CN113112994A