A method, device and medium for intelligent control of light
By acquiring the Mel spectrum features of audio data in a virtual scene, extracting emotion scores using a convolutional neural network model, and converting them into polar coordinates to control lighting, the problem of inaccurate auditory-visual linkage and emotion classification in virtual scenes is solved, achieving highly accurate lighting control.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- TENCENT MUSIC ENTERTAINMENT TECH (SHENZHEN) CO LTD
- Filing Date
- 2023-10-20
- Publication Date
- 2026-06-16
AI Technical Summary
In existing virtual scenes, lighting colors are usually provided by designers, failing to achieve true linkage between auditory and visual senses. Furthermore, the accuracy of emotion classification based on the SVM model is not high, and the academic community lacks a unified standard for classifying emotional pleasure and activation, leading to inaccurate emotion classification.
By acquiring the Mel spectrum features of audio data and inputting them into a pre-trained convolutional neural network model, emotional pleasure and activation scores are extracted and converted into polar coordinates to control the color and brightness of lights. The polar angle and polar radius are used to reflect the proportion and amplitude of the emotional dimension scores for light control.
It improves the accuracy of emotion classification, achieves true linkage between auditory and visual senses, avoids the irrationality of using specific points to represent emotion categories, and reduces visual fatigue.
Figure CN117636911B_ABST
Abstract
Description
Technical Field
[0001] This invention relates to the field of computer technology, and in particular to a method, device and medium for intelligent lighting control.
Background Technology
[0002] Light and music provide sensory experiences through sight and hearing, respectively, and combining them can enrich the sensory experience. On the one hand, the lighting colors in existing virtual scenes are usually provided by designers, so true linkage between hearing and sight is not achieved. On the other hand, one existing method classifies the valence and arousal of emotions with an SVM (Support Vector Machine) model, marks the colors of key points, and then obtains the missing colors by linear interpolation. This method has the following drawbacks: (1) the accuracy of emotion classification based on the SVM model is not high; (2) there is no unified standard in academia for the specific valence and arousal values of a given emotion, so using specific points to represent an emotion category is not reasonable. Figure 1 is a diagram in which a point represents a certain emotion category.
[0003] Therefore, how to improve the accuracy of emotion classification in order to achieve true linkage between auditory and visual senses is an urgent problem to be solved in this field.
Summary of the Invention
[0004] In view of this, the purpose of this invention is to provide a method, device, and medium for intelligent lighting control, which can improve the accuracy of emotion classification and achieve true linkage between hearing and vision. The specific solution is as follows:
[0005] In a first aspect, this application discloses a method for intelligent lighting control, comprising:
[0006] Acquire audio data and extract the Mel-spectral features corresponding to the audio data;
[0007] The Mel spectrum features are input into a pre-trained target convolutional neural network model to obtain the first emotional dimension score and the second emotional dimension score output by the first output neuron and the second output neuron in the target convolutional neural network model, respectively; the first emotional dimension score is the score of emotional pleasure corresponding to the audio data, and the second emotional dimension score is the score of emotional activation corresponding to the audio data.
[0008] The rectangular coordinates formed by the first emotional dimension score and the second emotional dimension score are converted into corresponding polar coordinates, and the light color and light brightness of the light to be controlled are controlled based on the polar angle and polar radius in the polar coordinates, respectively.
[0009] Optionally, the light to be controlled includes lights in a virtual scene and / or lights in a physical scene.
[0010] Optionally, extracting the Mel-spectral features corresponding to the audio data includes:
[0011] The audio data is resampled based on a preset sampling rate to obtain resampled audio data.
[0012] The resampled audio data is divided into several audio segments according to a preset time interval, so as to generate Mel spectrum features corresponding to the audio data using the several audio segments.
[0013] Optionally, before inputting the Mel spectrum features into the pre-trained target convolutional neural network model, the method further includes:
[0014] Historical audio data is acquired, and the emotional pleasure and emotional activation of the historical audio data are scored to obtain the label information corresponding to the historical audio data.
[0015] Extract the Mel-spectral features corresponding to the historical audio data;
[0016] The Mel-spectral features corresponding to the historical audio data and the label information are input into the convolutional neural network model to be trained to obtain the target convolutional neural network model.
[0017] Optionally, the intelligent lighting control method further includes:
[0018] A convolutional layer, a first rectified linear unit (ReLU) activation function, a max pooling layer, a flattening layer, a fully connected layer, a second rectified linear unit (ReLU) activation function, a batch normalization layer, and a dropout layer are sequentially connected. The output of the dropout layer is connected to the inputs of the first output neuron and the second output neuron, respectively. The outputs of the first output neuron and the second output neuron are then each connected to a sigmoid activation function, to obtain the convolutional neural network model to be trained.
[0019] Optionally, controlling the light color and brightness of the light to be controlled based on the polar angle and polar radius in the polar coordinates respectively includes:
[0020] Find the target color corresponding to the polar angle in the polar coordinates from the pre-created target color wheel, and adjust the light color of the light to be controlled to the target color;
[0021] The target brightness corresponding to the polar radius in the polar coordinates is calculated based on the preset brightness calculation formula, and the brightness of the light to be controlled is adjusted to the target brightness.
[0022] Optionally, before searching for the target color corresponding to the polar angle in the polar coordinates from the pre-created target color wheel, the method further includes:
[0023] The range of the area to be filled is determined based on the value range of the emotional pleasure level and the emotional activation level;
[0024] Using the center point of the area to be filled as the center, divide the area to be filled at preset angle intervals to obtain several sub-regions to be filled;
[0025] The emotional pleasure and emotional activation levels corresponding to each sub-region to be filled are analyzed to determine the filling color corresponding to each sub-region to be filled.
[0026] The target color wheel is obtained by filling the corresponding sub-region with the fill color corresponding to each sub-region to be filled.
[0027] Optionally, the preset brightness calculation formula is:
[0028] $I = e^{x} \times I_{0}$;
[0029] Where $e$ is the natural constant, $x$ represents the polar radius in the polar coordinates, and $I_{0}$ is the brightness determined in advance based on empirical values.
[0030] Optionally, extracting the Mel-spectral features corresponding to the audio data includes:
[0031] Detect the audio segmentation points in the audio data;
[0032] The audio data is segmented based on the detected audio segmentation points, and the Mel spectrum features corresponding to each segmented audio segment are extracted respectively.
[0033] Accordingly, the step of inputting the Mel spectrum features into the pre-trained target convolutional neural network model to obtain the first emotional dimension score and the second emotional dimension score output by the first output neuron and the second output neuron in the target convolutional neural network model, respectively, includes:
[0034] The Mel-spectral features of each audio segment are input into a pre-trained target convolutional neural network model to obtain the first emotional dimension score and the second emotional dimension score corresponding to each audio segment.
[0035] Optionally, detecting the audio segmentation points in the audio data includes:
[0036] Determine several target lyrics and their corresponding time information from the audio data to obtain the corresponding lyrics information;
[0037] Determine the lyrics similarity matrix corresponding to the lyrics information; wherein, the matrix elements in the lyrics similarity matrix are the similarity between any two lines of the target lyrics;
[0038] Verse and chorus identification is performed on the audio data based on the lyrics similarity matrix, resulting in several verse and chorus segments;
[0039] The timestamp sequence of the verse and chorus segments is determined based on the time information of the target lyrics, and the audio segmentation points in the audio data are determined based on the timestamp sequence.
[0040] Optionally, the process of determining several target lyrics in the audio data includes:
[0041] Detect whether there are several adjacent target short lyrics in the audio data; wherein, the similarity between the spliced lyrics corresponding to the several adjacent target short lyrics and any target long lyrics in the audio data is greater than a first preset similarity threshold, and the similarity between each target short lyric and the target long lyrics is less than the first preset similarity threshold.
[0042] If the audio data contains several adjacent target short lyrics, then the spliced lyrics corresponding to the several adjacent target short lyrics are determined as one of the target lyrics in the audio data.
[0043] Optionally, the step of identifying the verses and choruses of the audio data based on the lyrics similarity matrix to obtain several verse and chorus segments includes:
[0044] The diagonal elements of the lyrics similarity matrix and the target elements whose similarity is less than the second preset similarity threshold are set to zero to obtain the zero-set similarity matrix.
[0045] A path search is performed on the zeroed-out similarity matrix to find a target path that meets a preset optimal path condition; the preset optimal path condition is a condition constructed based on a preset path length threshold.
[0046] Based on the target path, determine the corresponding verse and chorus segments in the audio data.
[0047] In a second aspect, this application discloses an electronic device, comprising:
[0048] Memory, used to store computer programs;
[0049] A processor is used to execute the computer program to implement the aforementioned intelligent lighting control method.
[0050] In a third aspect, this application discloses a computer-readable storage medium for storing a computer program; wherein, when the computer program is executed by a processor, it implements the aforementioned intelligent lighting control method.
[0051] As can be seen, this application proposes a method for intelligent lighting control, comprising: acquiring audio data and extracting the Mel spectrum features corresponding to the audio data; inputting the Mel spectrum features into a pre-trained target convolutional neural network model to obtain a first emotional dimension score and a second emotional dimension score output by the first output neuron and the second output neuron in the target convolutional neural network model, respectively; the first emotional dimension score is the score of emotional pleasure corresponding to the audio data, and the second emotional dimension score is the score of emotional activation corresponding to the audio data; converting the rectangular coordinates formed by the first emotional dimension score and the second emotional dimension score into corresponding polar coordinates, and controlling the light color and light brightness of the light to be controlled based on the polar angle and polar radius in the polar coordinates, respectively.
[0052] The beneficial effect of this application is that it inputs the Mel spectrum features of audio data into a pre-trained target convolutional neural network model so that the target convolutional neural network model can further extract the features of the audio data. Then, based on the extracted features, it obtains the first emotional dimension score of the emotional pleasure corresponding to the audio data and the second emotional dimension score of the emotional activation corresponding to the audio data. In this way, this application improves the accuracy of emotion classification based on the target convolutional neural network model. Furthermore, this application converts the rectangular coordinates formed by the first and second emotional dimension scores into corresponding polar coordinates, and controls the color and brightness of the light to be controlled based on the polar angle and polar radius in the polar coordinates, respectively. It should be noted that the polar angle in the polar coordinates can reflect the ratio of the first and second emotional dimension scores, and the polar radius in the polar coordinates can reflect the amplitude of the first and second emotional dimension scores. Therefore, this application does not simply represent a certain emotional category by a specific point formed by the first and second emotional dimension scores, but controls the color and brightness of the light based on the ratio and amplitude of the above two emotional dimension scores, avoiding the irrationality of representing a certain emotional category by a specific point, further improving the accuracy of emotional classification, and realizing true linkage between hearing and vision.
Description of the Drawings
[0053] To more clearly illustrate the technical solutions in the embodiments of the present invention or the prior art, the drawings used in the description of the embodiments or the prior art will be briefly introduced below. Obviously, the drawings described below are only embodiments of the present invention. For those skilled in the art, other drawings can be obtained based on the provided drawings without creative effort.
[0054] Figure 1 is a traditional diagram that maps emotions to colors;
[0055] Figure 2 is a flowchart of an intelligent lighting control method disclosed in this application;
[0056] Figure 3 is a schematic diagram of a convolutional neural network model to be trained disclosed in this application;
[0057] Figure 4 is a schematic diagram of a target color wheel disclosed in this application;
[0058] Figure 5 is a flowchart of another intelligent lighting control method disclosed in this application;
[0059] Figure 6 is a flowchart of a specific intelligent lighting control method disclosed in this application;
[0060] Figure 7 is a schematic diagram of the structure of an intelligent lighting control device disclosed in this application;
[0061] Figure 8 is a structural diagram of an electronic device disclosed in this application.
Detailed Implementation
[0062] The technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some embodiments of the present invention, and not all embodiments. Based on the embodiments of the present invention, all other embodiments obtained by those skilled in the art without creative effort are within the scope of protection of the present invention.
[0063] On the one hand, the lighting colors in existing virtual scenes are usually provided by designers, failing to achieve true linkage between auditory and visual senses. On the other hand, the accuracy of emotion classification based on SVM models is not high, and the academic community lacks a unified standard for the specific values of pleasure and activation for a certain emotion, and using specific points to represent a certain emotion category is unreasonable.
[0064] Therefore, this application proposes an intelligent lighting control scheme that can improve the accuracy of emotion classification, so as to achieve true linkage between hearing and vision.
[0065] This application discloses a method for intelligent lighting control. As shown in Figure 2, the method includes:
[0066] Step S11: Acquire audio data and extract the Mel spectrum features corresponding to the audio data.
[0067] The audio data may be song audio data, game sound effect audio data, audiobook audio data, etc. The audio data may be obtained by pre-recording, real-time acquisition, or downloading from the Internet, etc., without any restrictions.
[0068] In this embodiment, Mel-spectral features corresponding to the audio data are extracted so that the audio data can be processed accordingly based on the Mel-spectral features. Specifically, since the sampling rate of audio data obtained through different methods is different, this application first resamples the audio data based on a preset sampling rate after obtaining the audio data to obtain resampled audio data. In this way, the sampling rate is unified. Then, the resampled audio data is divided into several audio segments according to a preset time interval, so as to generate Mel-spectral features corresponding to the audio data using the several audio segments. In a specific implementation, the preset sampling rate is 16kHz and the preset time interval is 0.2 seconds.
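The segmentation step described above can be sketched as follows. This is a minimal illustration only: it assumes the audio has already been resampled to the preset 16 kHz rate and is available as a flat list of samples, and the function name `split_into_segments` and the choice to drop a trailing partial segment are assumptions, not details given in this application. Computing the Mel spectrum of each segment would normally be delegated to an audio library and is not shown.

```python
def split_into_segments(samples, sample_rate=16000, interval_s=0.2):
    """Split already-resampled mono samples into fixed-length segments."""
    # 16 kHz * 0.2 s = 3200 samples per segment
    seg_len = int(round(sample_rate * interval_s))
    # Assumption: a trailing partial segment is dropped; the source does not say.
    last_start = len(samples) - seg_len
    return [samples[i:i + seg_len] for i in range(0, last_start + 1, seg_len)]


# One second of silence at 16 kHz yields five 0.2-second segments.
one_second = [0.0] * 16000
segments = split_into_segments(one_second)
print(len(segments), len(segments[0]))
```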
[0069] Step S12: Input the Mel spectrum features into the pre-trained target convolutional neural network model to obtain the first emotional dimension score and the second emotional dimension score output by the first output neuron and the second output neuron in the target convolutional neural network model, respectively; the first emotional dimension score is the score of emotional pleasure corresponding to the audio data, and the second emotional dimension score is the score of emotional activation corresponding to the audio data.
[0070] It can be understood that the target convolutional neural network model must first be trained before it is used to perform emotion classification on the audio data. The training process of the target convolutional neural network model is explained in detail below:
[0071] First, historical audio data is acquired, and the emotional pleasure and emotional activation levels of the historical audio data are scored to obtain the corresponding label information. For example, -1 is set as the lowest score and 1 as the highest score. If a piece of historical audio data expresses a pleasant emotion and makes people very excited after listening to it, its corresponding score is (0.8, 1). If another song also expresses a pleasant emotion but makes people feel calm and peaceful after listening to it, its corresponding score is (0.8, -0.7).
[0072] Furthermore, the Mel spectrum features corresponding to the historical audio data are extracted, and then the Mel spectrum features corresponding to the historical audio data and the label information are input into the convolutional neural network model to be trained for training, so as to obtain the target convolutional neural network model.
[0073] In one specific implementation, the structure of the convolutional neural network model to be trained is shown in Figure 3. It includes a convolutional layer (3x3 kernel size, 32 kernels), a first rectified linear unit (ReLU) activation function, a max pooling layer (2x2 filter size), a flattening layer, a fully connected layer, a second rectified linear unit (ReLU) activation function, a batch normalization layer, and a dropout layer (dropout rate 0.5) connected sequentially. The output of the dropout layer is connected to the inputs of the first output neuron and the second output neuron, respectively. The outputs of the first output neuron and the second output neuron are each connected to a sigmoid activation function.
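To make the layer sequence more concrete, the following sketch traces how the feature-map shape would change through the convolutional, max pooling, and flattening layers. It is only an illustration under stated assumptions: the input size (64 Mel bands by 21 frames) is hypothetical, and the convolution is assumed to use no padding and stride 1, with a pooling stride equal to the 2x2 filter size. None of these values are specified in this application.

```python
def conv_pool_flatten_shape(n_mels, n_frames, n_kernels=32, kernel=3, pool=2):
    """Trace feature-map shapes through conv -> max pool -> flatten."""
    # Assumption: no padding and stride 1, so each side shrinks by kernel - 1.
    conv_h, conv_w = n_mels - kernel + 1, n_frames - kernel + 1
    # Assumption: pooling stride equals the pool size; incomplete windows are dropped.
    pool_h, pool_w = conv_h // pool, conv_w // pool
    return {
        "conv": (conv_h, conv_w, n_kernels),
        "pool": (pool_h, pool_w, n_kernels),
        "flatten": pool_h * pool_w * n_kernels,
    }


# Hypothetical input: 64 Mel bands by 21 frames.
shapes = conv_pool_flatten_shape(n_mels=64, n_frames=21)
print(shapes)
```

The flattened length is the input width of the fully connected layer, whose output then passes through the ReLU, batch normalization, and dropout layers before reaching the two output neurons.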
[0074] After obtaining the trained target convolutional neural network model, this embodiment inputs the Mel spectrum features corresponding to the audio data into the target convolutional neural network model to obtain the first emotional dimension score of emotional pleasure and the second emotional dimension score of emotional activation corresponding to the audio data, respectively output by the first output neuron and the second output neuron in the target convolutional neural network model.
[0075] Step S13: Convert the rectangular coordinates formed by the first emotional dimension score and the second emotional dimension score into corresponding polar coordinates, and control the light color and light brightness of the light to be controlled based on the polar angle and polar radius in the polar coordinates respectively.
[0076] In this embodiment, the rectangular coordinates formed by the first emotional dimension score and the second emotional dimension score are converted into corresponding polar coordinates. A target color corresponding to the polar angle in the polar coordinates is found from a pre-created target color wheel. The light color of the light to be controlled is then adjusted to the target color, thereby achieving control of the light color based on the polar angle in the polar coordinates. Further, a target brightness corresponding to the polar radius in the polar coordinates is calculated based on a preset brightness calculation formula, and the light brightness of the light to be controlled is adjusted to the target brightness, thereby achieving control of the light brightness based on the polar radius in the polar coordinates. It should be noted that the light to be controlled includes lights in virtual scenes and / or lights in physical scenes. Virtual scenes include, but are not limited to, games, virtual clubbing, etc., while physical scenes include, but are not limited to, KTVs, stages, etc.
[0077] Taking rectangular coordinates of (0.4, 0.5) formed by the first and second emotional dimension scores as an example, (0.4, 0.5) is converted to the polar coordinates (0.64, 51), where 0.64 is the polar radius and 51 degrees is the polar angle. Further, the color corresponding to 51 degrees is found in the target color wheel shown in Figure 4, and this color is determined as the light color of the light to be controlled. Further, the polar radius x = 0.64 is substituted into the preset brightness calculation formula shown below to calculate the light brightness of the light to be controlled:
[0078] $I = e^{x} \times I_{0}$;
[0079] Where $e$ is the natural constant, $x$ is the polar radius, and $I_{0}$ is the brightness determined in advance based on empirical values.
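The worked example above can be reproduced with a short sketch. The function names `to_polar` and `brightness` are illustrative, and normalizing the angle to the range 0 to 360 degrees is an assumption made so that the angle can index a full color wheel; the base brightness passed to `brightness` is a placeholder for the empirically chosen value.

```python
import math


def to_polar(pleasure, activation):
    """Convert (pleasure, activation) scores to (polar radius, polar angle in degrees)."""
    radius = math.hypot(pleasure, activation)
    # Assumption: angles are normalized to [0, 360) so they can index a full wheel.
    angle = math.degrees(math.atan2(activation, pleasure)) % 360.0
    return radius, angle


def brightness(radius, base_brightness):
    """Preset brightness formula I = e^x * I0, where x is the polar radius."""
    return math.exp(radius) * base_brightness


radius, angle = to_polar(0.4, 0.5)
print(round(radius, 2), round(angle))  # 0.64 51
```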
[0080] The target color wheel shown in Figure 4 is determined through the following steps: First, the range of the area to be filled is determined based on the value ranges of the emotional pleasure and emotional activation. Second, with the center point of the area to be filled as the center, the area to be filled is divided at preset angle intervals into several sub-areas to be filled. Third, the emotional pleasure and emotional activation corresponding to each sub-area to be filled are analyzed to determine the corresponding fill color. Fourth, each sub-area to be filled is filled with its corresponding fill color to obtain the target color wheel.
[0081] As shown in Figure 4, in one specific implementation, the range of the emotional pleasure level and the emotional activation level is determined to be (-1, 1). The horizontal coordinate of the area to be filled is the pleasure level, and the vertical coordinate of the area to be filled is the activation level. Then, the area to be filled is divided into 12 sub-areas. Further, the emotional pleasure level and emotional activation level corresponding to each sub-area are determined to obtain the fill color for each sub-area. For example, yellow represents happiness, and red represents anger or alertness. Therefore, areas closer to high pleasure levels are filled with yellow, and areas closer to high activation levels are filled with red.
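Looking up a color in a 12-sector wheel amounts to mapping the polar angle to a sector index. The sketch below assumes equal 30-degree sectors that start at 0 degrees on the pleasure axis and run counterclockwise; the palette is hypothetical, with only the yellow (high pleasure) and red (high activation) entries suggested by the description and all other entries being placeholders.

```python
def sector_index(angle_deg, n_sectors=12):
    """Map a polar angle in degrees to the index of its sector on the wheel."""
    width = 360.0 / n_sectors  # 30 degrees per sector for 12 sectors
    return int((angle_deg % 360.0) // width) % n_sectors


# Hypothetical palette: placeholders except the two colors named in the text.
PALETTE = [f"color-{k}" for k in range(12)]
PALETTE[0] = "yellow"  # sector starting at 0 degrees, toward high pleasure
PALETTE[3] = "red"     # sector starting at 90 degrees, toward high activation


def lookup_color(angle_deg, palette=PALETTE):
    return palette[sector_index(angle_deg, len(palette))]


print(sector_index(51), lookup_color(51))
```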
[0082] As can be seen, this application proposes a method for intelligent lighting control, comprising: acquiring audio data and extracting the Mel spectrum features corresponding to the audio data; inputting the Mel spectrum features into a pre-trained target convolutional neural network model to obtain a first emotional dimension score and a second emotional dimension score output by the first output neuron and the second output neuron in the target convolutional neural network model, respectively; the first emotional dimension score is the score of emotional pleasure corresponding to the audio data, and the second emotional dimension score is the score of emotional activation corresponding to the audio data; converting the rectangular coordinates formed by the first emotional dimension score and the second emotional dimension score into corresponding polar coordinates, and controlling the light color and light brightness of the light to be controlled based on the polar angle and polar radius in the polar coordinates, respectively.
[0083] The beneficial effect of this application is that it inputs the Mel spectrum features of audio data into a pre-trained target convolutional neural network model so that the target convolutional neural network model can further extract the features of the audio data. Then, based on the extracted features, it obtains the first emotional dimension score of the emotional pleasure corresponding to the audio data and the second emotional dimension score of the emotional activation corresponding to the audio data. In this way, this application improves the accuracy of emotion classification based on the target convolutional neural network model. Furthermore, this application converts the rectangular coordinates formed by the first and second emotional dimension scores into corresponding polar coordinates, and controls the color and brightness of the light to be controlled based on the polar angle and polar radius in the polar coordinates, respectively. It should be noted that the polar angle in the polar coordinates can reflect the ratio of the first and second emotional dimension scores, and the polar radius in the polar coordinates can reflect the amplitude of the first and second emotional dimension scores. Therefore, this application does not simply represent a certain emotional category by a specific point formed by the first and second emotional dimension scores, but controls the color and brightness of the light based on the ratio and amplitude of the above two emotional dimension scores, avoiding the irrationality of representing a certain emotional category by a specific point, further improving the accuracy of emotional classification, and realizing true linkage between hearing and vision.
[0084] Furthermore, to avoid visual fatigue caused by frequent color changes, this embodiment uses the transition between the verse and chorus as the time point for light color switching, so that the light color of the light to be controlled is switched at the transition between the verse and chorus. The identification of the transition between the verse and chorus is explained in detail below. As shown in Figure 5, it specifically includes:
[0085] Step S111: Detect the audio segmentation points in the audio data.
[0086] In this embodiment, the audio segmentation point is the connection between the verse and chorus in the audio data. The specific determination process includes: determining several target lyrics and corresponding time information in the audio data to obtain corresponding lyric information, and determining the lyric similarity matrix corresponding to the lyric information. The matrix elements in the lyric similarity matrix are the similarity between any two target lyrics. Further, the verse and chorus are identified in the audio data according to the lyric similarity matrix to obtain several verse and chorus segments. Finally, the timestamp sequence of the verse and chorus segments is determined according to the time information of the target lyrics, and the audio segmentation point in the audio data is determined based on the timestamp sequence.
[0087] Regarding the lyrics information and lyrics similarity matrix mentioned above, assume the audio data contains 5 target lyrics, the time information of the first target lyric is [13456, 2345], and the time information of the second target lyric is [16210, 3342]. Here, 13456 and 16210 represent the start times of the target lyrics, and 2345 and 3342 represent their durations. This yields the corresponding lyrics information. With 5 target lyrics, the lyrics similarity matrix has 5 rows and 5 columns, and the matrix element in the i-th row and j-th column represents the similarity between the i-th and j-th target lyrics.
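A lyrics similarity matrix of this kind can be sketched as follows. The application does not name a similarity measure, so the ratio from Python's `difflib.SequenceMatcher` is used here purely as a stand-in. The lyric texts and the third time entry are hypothetical; only the first two `[start, duration]` pairs come from the example above.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    # Stand-in measure (assumption): the source does not specify one.
    return SequenceMatcher(None, a, b).ratio()


def similarity_matrix(lines):
    """Build a symmetric matrix whose (i, j) entry compares line i with line j."""
    n = len(lines)
    matrix = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i, n):
            matrix[i][j] = matrix[j][i] = similarity(lines[i], lines[j])
    return matrix


# (text, start time, duration); the texts and the last time entry are hypothetical.
lyrics = [
    ("walking home alone", 13456, 2345),
    ("the night is cold", 16210, 3342),
    ("walking home alone", 19900, 2400),
]
matrix = similarity_matrix([text for text, _, _ in lyrics])
print(matrix[0][2], matrix[0][1] < 1.0)
```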
[0088] Based on the principle that choruses often have similar lyrical structures, this embodiment employs Music Structural Segmentation (MSS) to segment the audio data into audio segments according to the verses and choruses using the aforementioned lyric similarity matrix and lyric information. This yields a timestamp sequence of segment positions, which represents the audio segmentation points. Specifically, the diagonal elements of the lyric similarity matrix and target elements with similarity less than a second preset similarity threshold are zeroed out to obtain a zeroed-out similarity matrix. A path search is then performed on this zeroed-out similarity matrix to find target paths that satisfy preset optimal path conditions. These preset optimal path conditions are constructed based on a preset path length threshold. Finally, based on these target paths, several corresponding verse and chorus segments in the audio data are determined.
[0089] In this embodiment, the second preset similarity threshold can be set based on an empirical value or based on the lyrics similarity matrix, and is not limited here. For example, the fourth largest value in the lyrics similarity matrix can be set as the second similarity threshold. In this embodiment, the diagonal elements in the lyrics similarity matrix represent the similarity between each lyric and itself, so they need to be zeroed out. In addition, target elements in the lyrics similarity matrix with similarity less than the second preset similarity threshold also need to be zeroed out, thus obtaining a zeroed-out similarity matrix. Further, a path search is performed on the zeroed-out similarity matrix to search for target paths that meet preset optimal path conditions. The preset optimal path conditions are conditions constructed based on a preset path length threshold, such as the number of elements on the target path being no less than 3. It can be understood that the similarity of each element on the target path is greater than or equal to the second similarity threshold.
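The zeroing and path search steps can be illustrated with a small sketch. The application does not spell out the search procedure, so this example assumes that a path is a run of consecutive non-zero elements along a diagonal above the main diagonal (line i matches line j, line i+1 matches line j+1, and so on), and it uses the minimum path length of 3 mentioned above. The function names and the toy matrix are hypothetical.

```python
def zero_out(sim, threshold):
    """Zero the diagonal and every element below the similarity threshold."""
    n = len(sim)
    return [
        [0.0 if i == j or sim[i][j] < threshold else sim[i][j] for j in range(n)]
        for i in range(n)
    ]


def find_paths(zeroed, min_len=3):
    """Collect diagonal runs of non-zero elements with at least min_len elements."""
    n = len(zeroed)
    paths = []
    for offset in range(1, n):  # each diagonal j = i + offset above the main one
        run = []
        for i in range(n - offset):
            if zeroed[i][i + offset] > 0:
                run.append((i, i + offset))
            else:
                if len(run) >= min_len:
                    paths.append(run)
                run = []
        if len(run) >= min_len:
            paths.append(run)
    return paths


# Toy 8x8 matrix: lines 0-2 repeat as lines 4-6, everything else is dissimilar.
n = 8
sim = [[1.0 if i == j else 0.1 for j in range(n)] for i in range(n)]
for k in range(3):
    sim[k][k + 4] = sim[k + 4][k] = 0.95

paths = find_paths(zero_out(sim, threshold=0.9))
print(paths)  # [[(0, 4), (1, 5), (2, 6)]]
```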
[0090] In one specific implementation, assume that two similar paths are found. The elements of the first similar path are [22, 38], [23, 39], [24, 40], [25, 41], [26, 42], [27, 43], indicating that the similarity between the 22nd and 38th lines of lyrics is greater than or equal to the second similarity threshold, the similarity between the 23rd and 39th lines of lyrics is greater than or equal to the second similarity threshold, and so on. The elements of the second similar path are [10, 32], [11, 33], [12, 34], [13, 35], [14, 36], [15, 37]. Based on the first similar path, two audio segments [22, 27] and [38, 43] are obtained, and based on the second similar path, two audio segments [10, 15] and [32, 37] are obtained. In this way, the audio data is divided into several audio segments in this embodiment. As can be seen from the aforementioned disclosed embodiments, this application can obtain the time information of each line of lyrics. Therefore, this embodiment can determine the timestamp sequence of the segment positions of the above audio segments based on the time information of the lyrics, thereby obtaining the audio segmentation points.
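The zeroing and path search described above can be sketched as follows. This is only a minimal illustration under stated assumptions: the source does not specify the search algorithm, so the sketch assumes that a target path is a run of consecutive non-zero elements along a diagonal above the main diagonal (as in the example, where both indices increase by one at each step). The function names `zero_out`, `find_diagonal_paths`, and `path_to_segments` are hypothetical, and indices are 0-based rather than the 1-based line numbers used in the text.

```python
from typing import List, Tuple

Matrix = List[List[float]]
Path = List[Tuple[int, int]]


def zero_out(sim: Matrix, threshold: float) -> Matrix:
    """Zero the diagonal and every element whose similarity is below the threshold."""
    n = len(sim)
    return [
        [0.0 if i == j or sim[i][j] < threshold else sim[i][j] for j in range(n)]
        for i in range(n)
    ]


def find_diagonal_paths(sim: Matrix, min_len: int = 3) -> List[Path]:
    """Collect maximal runs of non-zero elements on each diagonal above the main one.

    Assumption: a run counts as a target path when it has at least min_len elements.
    """
    n = len(sim)
    paths: List[Path] = []
    for offset in range(1, n):
        run: Path = []
        for i in range(n - offset):
            j = i + offset
            if sim[i][j] > 0.0:
                run.append((i, j))
                continue
            if len(run) >= min_len:
                paths.append(run)
            run = []
        if len(run) >= min_len:
            paths.append(run)
    return paths


def path_to_segments(path: Path) -> Tuple[Tuple[int, int], Tuple[int, int]]:
    """Map a path to the two ranges of lyric lines that it pairs up."""
    rows = [i for i, _ in path]
    cols = [j for _, j in path]
    return (rows[0], rows[-1]), (cols[0], cols[-1])


# Toy 8-line example: lines 0-2 repeat as lines 4-6.
n = 8
sim = [[1.0 if i == j else 0.1 for j in range(n)] for i in range(n)]
for i, j in [(0, 4), (1, 5), (2, 6)]:
    sim[i][j] = sim[j][i] = 0.95

paths = find_diagonal_paths(zero_out(sim, threshold=0.9))
segments = [path_to_segments(p) for p in paths]
print(segments)  # → [((0, 2), (4, 6))]
```

Each returned pair of ranges plays the role of the segment pairs in the example, such as [22, 27] and [38, 43]; mapping the range boundaries to the time information of the lyrics would then give the timestamp sequence.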
[0091] It should be noted that, in the process of determining the target lyrics, the lyrics similarity matrix is often inaccurate because the same lyrics may be divided into lines differently. Therefore, this embodiment needs to correct the division of the lyrics. The specific correction method is as follows: detect whether there are several adjacent target short lyrics in the audio data, wherein the similarity between the spliced lyrics corresponding to the several adjacent target short lyrics and any target long lyrics in the audio data is greater than a first preset similarity threshold, and the similarity between each target short lyric and the target long lyrics is less than the first preset similarity threshold. If the audio data contains such adjacent target short lyrics, the spliced lyrics corresponding to them are determined as one line of the target lyrics in the audio data. For example, assume the first preset similarity threshold is 0.9, and the 29th, 30th, and 31st lines of lyrics are adjacent target short lyrics. Directly comparing the 30th line with the 39th line yields a similarity of 0.5 (0.5 < 0.9). Concatenating the 29th and 30th lines and comparing the result with the 39th line yields a similarity of 1 (1 > 0.9). Concatenating the 30th and 31st lines and comparing the result with the 39th line yields a similarity of 0.35 (0.35 < 0.9). The similarity obtained after concatenating the 29th and 30th lines therefore exceeds both the threshold and the similarity obtained from the 30th line alone, so the concatenation of the 29th and 30th lines is taken as a line of the target lyrics. After the 29th and 30th lines are concatenated, the row and column corresponding to the 30th line are deleted from the lyrics similarity matrix. In this way, the lyrics similarity matrix obtained from the corrected lyrics can improve the accuracy of verse and chorus identification.
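The correction rule can be illustrated with a short sketch. Several points here are assumptions rather than part of the disclosure: the source does not name a similarity measure, so the standard-library `difflib.SequenceMatcher` ratio stands in for it; only pairs of adjacent lines are spliced, whereas the text allows several; and the helper names `similarity` and `merge_short_lyrics` are hypothetical.

```python
from difflib import SequenceMatcher
from typing import List


def similarity(a: str, b: str) -> float:
    """Stand-in similarity measure (assumption; the source does not specify one)."""
    return SequenceMatcher(None, a, b).ratio()


def merge_short_lyrics(lines: List[str], threshold: float = 0.9) -> List[str]:
    """Splice two adjacent lines when only their concatenation matches another line."""
    merged: List[str] = []
    i = 0
    while i < len(lines):
        spliced = False
        if i + 1 < len(lines):
            joined = lines[i] + " " + lines[i + 1]
            for k, other in enumerate(lines):
                if k in (i, i + 1):
                    continue
                # The spliced lyrics must exceed the threshold while each
                # short line on its own stays below it.
                if (similarity(joined, other) > threshold
                        and similarity(lines[i], other) < threshold
                        and similarity(lines[i + 1], other) < threshold):
                    merged.append(joined)
                    spliced = True
                    break
        if spliced:
            i += 2
        else:
            merged.append(lines[i])
            i += 1
    return merged


lyrics = [
    "hold me close",
    "under the light",
    "tonight",
    "hold me close under the light",
]
print(merge_short_lyrics(lyrics))
```

In this toy input the first two lines are merged because only their concatenation matches the fourth line, so the corrected list has three lines and the similarity matrix built from it loses one row and one column.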
[0092] Step S112: Segment the audio data based on the detected audio segmentation points, and extract the Mel spectrum features corresponding to each segmented audio segment.
[0093] After the audio segmentation points are obtained according to the above steps, verse and chorus recognition of the audio data is achieved. As shown in Figure 6, this embodiment segments the audio data based on the audio segmentation points and extracts the Mel spectrum features corresponding to each audio segment. Then, the Mel spectrum features of each audio segment are input into the pre-trained target convolutional neural network model to obtain a first emotion dimension score and a second emotion dimension score for each audio segment. Further, the rectangular coordinates formed by the first and second emotion dimension scores are converted into corresponding polar coordinates, and the light to be controlled is controlled according to the polar coordinates. The specific control process is described in the aforementioned disclosed embodiments and is not repeated here.
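As a minimal sketch of the first step in this flow, the audio samples can be cut at the segmentation points before any feature extraction. The function name `split_by_points`, the representation of the audio as a flat sequence of samples, and the use of seconds for the segmentation points are assumptions made for illustration; Mel spectrum extraction and model inference are outside this sketch.

```python
from typing import List, Sequence


def split_by_points(samples: Sequence[float], sample_rate: int,
                    points_sec: Sequence[float]) -> List[Sequence[float]]:
    """Cut the samples at each segmentation point (given in seconds)."""
    cuts = [int(round(t * sample_rate)) for t in sorted(points_sec)]
    bounds = [0] + cuts + [len(samples)]
    # Skip empty pieces, e.g. when a point lies at or beyond the end of the audio.
    return [samples[a:b] for a, b in zip(bounds, bounds[1:]) if b > a]


# Ten samples at 2 samples per second, cut at 1.0 s and 3.5 s.
print(split_by_points(list(range(10)), 2, [1.0, 3.5]))
# → [[0, 1], [2, 3, 4, 5, 6], [7, 8, 9]]
```

Each returned piece would then be passed separately through feature extraction and the model, so that the light changes only at verse and chorus boundaries.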
[0094] In summary, this application identifies the verse and chorus of the audio data, uses the transition point between the verse and chorus as the audio segmentation point, and switches the light color based on the audio segmentation point, thus avoiding visual fatigue caused by frequent color changes and improving the user's comfort.
[0095] Accordingly, this application also discloses an intelligent lighting control device. As shown in Figure 7, the device includes:
[0096] The feature extraction module 11 is used to acquire audio data and extract the Mel spectrum features corresponding to the audio data;
[0097] The emotion dimension score acquisition module 12 is used to input the Mel spectrum features into the pre-trained target convolutional neural network model to obtain the first emotion dimension score and the second emotion dimension score output by the first output neuron and the second output neuron in the target convolutional neural network model, respectively; the first emotion dimension score is the score of emotional pleasure corresponding to the audio data, and the second emotion dimension score is the score of emotional activation corresponding to the audio data.
[0098] The lighting control module 13 is used to convert the rectangular coordinates formed by the first emotional dimension score and the second emotional dimension score into corresponding polar coordinates, and to control the light color and light brightness of the light to be controlled based on the polar angle and polar radius in the polar coordinates respectively.
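The coordinate conversion performed by the lighting control module 13 can be sketched as below. This is an illustration under assumptions: the name `to_polar` and the `center` parameter are hypothetical, and whether the scores are shifted before conversion (for example to centre sigmoid outputs that lie between 0 and 1) is not stated in this section, so the default leaves the origin unchanged.

```python
import math
from typing import Tuple


def to_polar(pleasure: float, activation: float,
             center: Tuple[float, float] = (0.0, 0.0)) -> Tuple[float, float]:
    """Return (polar angle in degrees within [0, 360), polar radius)."""
    x = pleasure - center[0]
    y = activation - center[1]
    angle = math.degrees(math.atan2(y, x)) % 360.0  # ratio of the two scores
    radius = math.hypot(x, y)                       # amplitude of the two scores
    return angle, radius


angle, radius = to_polar(0.9, 0.8, center=(0.5, 0.5))
print(round(angle, 2), round(radius, 3))
```

The polar angle would then select the light color and the polar radius the light brightness, as described for module 13.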
[0099] For more detailed information on the working process of each of the above modules, please refer to the relevant content disclosed in the foregoing embodiments, which will not be repeated here.
[0100] As can be seen, this application proposes an intelligent lighting control device, including: a feature extraction module 11, used to acquire audio data and extract the Mel spectrum features corresponding to the audio data; an emotion dimension score acquisition module 12, used to input the Mel spectrum features into a pre-trained target convolutional neural network model to obtain the first emotion dimension score and the second emotion dimension score output by the first output neuron and the second output neuron in the target convolutional neural network model, respectively, where the first emotion dimension score is the score of emotional pleasure corresponding to the audio data, and the second emotion dimension score is the score of emotional activation corresponding to the audio data; and a lighting control module 13, used to convert the rectangular coordinates formed by the first emotion dimension score and the second emotion dimension score into corresponding polar coordinates, and to control the light color and light brightness of the light to be controlled based on the polar angle and polar radius in the polar coordinates, respectively.
[0101] The beneficial effect of this application is that it inputs the Mel spectrum features of audio data into a pre-trained target convolutional neural network model so that the target convolutional neural network model can further extract the features of the audio data. Then, based on the extracted features, it obtains the first emotional dimension score of the emotional pleasure corresponding to the audio data and the second emotional dimension score of the emotional activation corresponding to the audio data. In this way, this application improves the accuracy of emotion classification based on the target convolutional neural network model. Furthermore, this application converts the rectangular coordinates formed by the first and second emotional dimension scores into corresponding polar coordinates, and controls the color and brightness of the light to be controlled based on the polar angle and polar radius in the polar coordinates, respectively. It should be noted that the polar angle in the polar coordinates can reflect the ratio of the first and second emotional dimension scores, and the polar radius in the polar coordinates can reflect the amplitude of the first and second emotional dimension scores. Therefore, this application does not simply represent a certain emotional category by a specific point formed by the first and second emotional dimension scores, but controls the color and brightness of the light based on the ratio and amplitude of the above two emotional dimension scores, avoiding the irrationality of representing a certain emotional category by a specific point, further improving the accuracy of emotional classification, and realizing true linkage between hearing and vision.
[0102] Furthermore, embodiments of this application also provide an electronic device. Figure 8 is a structural diagram of an electronic device 20 according to an exemplary embodiment. The content of the diagram should not be construed as limiting the scope of this application.
[0103] Figure 8 is a schematic diagram of the structure of an electronic device 20 provided in an embodiment of this application. Specifically, the electronic device 20 may include: at least one processor 21, at least one memory 22, a display screen 23, an input/output interface 24, a communication interface 25, a power supply 26, and a communication bus 27. The memory 22 stores a computer program, which is loaded and executed by the processor 21 to implement the relevant steps in the intelligent lighting control method disclosed in any of the foregoing embodiments. Furthermore, the electronic device 20 in this embodiment may specifically be an electronic computer.
[0104] In this embodiment, the power supply 26 is used to provide operating voltage for each hardware device on the electronic device 20; the communication interface 25 can create a data transmission channel between the electronic device 20 and external devices, and the communication protocol it follows can be any communication protocol applicable to the technical solution of this application, and is not specifically limited here; the input / output interface 24 is used to acquire external input data or output data to the outside world, and its specific interface type can be selected according to specific application needs, and is not specifically limited here.
[0105] Furthermore, the memory 22, as a carrier for resource storage, can be a read-only memory, random access memory, disk, or optical disk, etc. The resources stored thereon may include computer programs 221, and the storage method may be temporary storage or permanent storage. In addition to including computer programs capable of performing the intelligent lighting control method executed by the electronic device 20 as disclosed in any of the foregoing embodiments, the computer program 221 may further include computer programs capable of performing other specific tasks.
[0106] Furthermore, embodiments of this application also disclose a computer-readable storage medium for storing a computer program; wherein, when the computer program is executed by a processor, it implements the aforementioned intelligent lighting control method.
[0107] For the specific steps of this method, please refer to the relevant content disclosed in the foregoing embodiments, which will not be repeated here.
[0108] The various embodiments in this application are described in a progressive manner, with each embodiment focusing on the differences from other embodiments. For the same or similar parts between the various embodiments, refer to each other. As for the apparatus disclosed in the embodiments, since it corresponds to the method disclosed in the embodiments, the description is relatively simple, and relevant parts can be referred to in the method section.
[0109] Those skilled in the art will further recognize that the units and algorithm steps of the various examples described in conjunction with the embodiments disclosed herein can be implemented in electronic hardware, computer software, or a combination of both. To clearly illustrate the interchangeability of hardware and software, the components and steps of the various examples have been generally described in terms of functionality in the foregoing description. Whether these functions are implemented in hardware or software depends on the specific application and design constraints of the technical solution. Those skilled in the art can use different methods to implement the described functions for each specific application, but such implementation should not be considered beyond the scope of this application.
[0110] The steps of the methods or algorithms described in conjunction with the embodiments disclosed herein can be implemented directly by hardware, a software module executed by a processor, or a combination of both. The software module can be located in random access memory (RAM), main memory, read-only memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, registers, hard disk, removable disk, CD-ROM, or any other form of storage medium known in the art.
[0111] Finally, it should be noted that in this document, relational terms such as "first" and "second" are used only to distinguish one entity or operation from another, and do not necessarily require or imply any such actual relationship or order between these entities or operations. Furthermore, the terms "comprising," "including," or any other variations thereof are intended to cover non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a process, method, article, or apparatus. Without further limitations, an element defined by the phrase "comprising one..." does not exclude the presence of other identical elements in the process, method, article, or apparatus that includes said element.
[0112] The above provides a detailed description of the intelligent lighting control method, device, and storage medium provided in this application. Specific examples have been used to illustrate the principles and implementation methods of this application. The descriptions of the above embodiments are only for the purpose of helping to understand the method and core ideas of this application. At the same time, for those skilled in the art, there will be changes in the specific implementation methods and application scope based on the ideas of this application. Therefore, the content of this specification should not be construed as a limitation of this application.
Claims
1. A method for intelligent lighting control, characterized in that it comprises: acquiring audio data and extracting the Mel spectrum features corresponding to the audio data; inputting the Mel spectrum features into a pre-trained target convolutional neural network model to obtain the first emotional dimension score and the second emotional dimension score output by the first output neuron and the second output neuron in the target convolutional neural network model, respectively, wherein the first emotional dimension score is the score of emotional pleasure corresponding to the audio data, and the second emotional dimension score is the score of emotional activation corresponding to the audio data; and converting the rectangular coordinates formed by the first emotional dimension score and the second emotional dimension score into corresponding polar coordinates, and controlling the light color and light brightness of the light to be controlled based on the polar angle and polar radius in the polar coordinates, respectively.
2. The intelligent lighting control method according to claim 1, characterized in that the lights to be controlled include lights in a virtual scene and/or lights in a physical scene.
3. The intelligent lighting control method according to claim 1, characterized in that the extraction of Mel spectrum features corresponding to the audio data includes: resampling the audio data based on a preset sampling rate to obtain resampled audio data; and dividing the resampled audio data into several audio segments according to a preset time interval, so as to generate the Mel spectrum features corresponding to the audio data using the several audio segments.
4. The intelligent lighting control method according to claim 1, characterized in that, before inputting the Mel spectrum features into the pre-trained target convolutional neural network model, the method further includes: acquiring historical audio data, and scoring the emotional pleasure and emotional activation of the historical audio data to obtain the label information corresponding to the historical audio data; extracting the Mel spectrum features corresponding to the historical audio data; and inputting the Mel spectrum features corresponding to the historical audio data and the label information into the convolutional neural network model to be trained to obtain the target convolutional neural network model.
5. The intelligent lighting control method according to claim 4, characterized in that it further includes: sequentially connecting a convolutional layer, a first rectified linear unit (ReLU) activation function, a max pooling layer, a flattening layer, a fully connected layer, a second ReLU activation function, a batch normalization layer, and a dropout layer; connecting the output of the dropout layer to the inputs of the first output neuron and the second output neuron, respectively; and connecting the outputs of the first output neuron and the second output neuron to a sigmoid activation function, respectively, to obtain the convolutional neural network model to be trained.
6. The intelligent lighting control method according to claim 1, characterized in that the control of the light color and light brightness of the light to be controlled based on the polar angle and polar radius in the polar coordinates includes: finding the target color corresponding to the polar angle in the polar coordinates from a pre-created target color wheel, and adjusting the light color of the light to be controlled to the target color; and calculating the target brightness corresponding to the polar radius in the polar coordinates based on a preset brightness calculation formula, and adjusting the light brightness of the light to be controlled to the target brightness.
7. The intelligent lighting control method according to claim 6, characterized in that, before finding the target color corresponding to the polar angle in the polar coordinates from the pre-created target color wheel, the method further includes: determining the range of the area to be filled based on the value ranges of the emotional pleasure and the emotional activation; taking the center point of the area to be filled as the center, dividing the area to be filled into sub-regions at preset angle intervals to obtain several sub-regions to be filled; analyzing the emotional pleasure and emotional activation corresponding to each sub-region to be filled to determine the fill color corresponding to each sub-region to be filled; and filling each sub-region to be filled with its corresponding fill color to obtain the target color wheel.
8. The intelligent lighting control method according to claim 6, characterized in that the preset brightness calculation formula is: $I = e^{x} \times I_0$, where $e$ is the natural constant, $x$ represents the polar radius in the polar coordinates, and $I_0$ is a brightness determined in advance based on empirical values.
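For illustration only, the color lookup of claims 6 and 7 and the brightness formula of claim 8 can be sketched as follows. The sketch rests on assumptions: the claims determine each fill color by analyzing the emotional pleasure and activation of the sub-region, which is replaced here by a placeholder that assigns evenly spaced hues with the standard-library `colorsys` module; the names `build_color_wheel`, `lookup_color`, and `brightness` and the 30-degree interval are hypothetical.

```python
import colorsys
import math
from typing import List, Tuple

RGB = Tuple[int, int, int]


def build_color_wheel(step_deg: int = 30) -> List[RGB]:
    """One fill color per angular sub-region (placeholder hue assignment)."""
    if step_deg <= 0 or 360 % step_deg != 0:
        raise ValueError("step_deg must be a positive divisor of 360")
    wheel: List[RGB] = []
    for start in range(0, 360, step_deg):
        r, g, b = colorsys.hsv_to_rgb(start / 360.0, 1.0, 1.0)
        wheel.append((round(r * 255), round(g * 255), round(b * 255)))
    return wheel


def lookup_color(wheel: List[RGB], angle_deg: float) -> RGB:
    """Return the fill color of the sub-region that contains the polar angle."""
    step = 360.0 / len(wheel)
    index = int((angle_deg % 360.0) // step) % len(wheel)
    return wheel[index]


def brightness(radius: float, base: float) -> float:
    """Brightness formula of claim 8: I = e^x * I0."""
    return math.exp(radius) * base


wheel = build_color_wheel(30)
print(lookup_color(wheel, 45.0), brightness(0.5, 100.0))
```

A lookup table keeps the color constant within each sub-region, which matches the idea of filling sub-regions rather than interpolating between labelled points.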
9. The intelligent lighting control method according to any one of claims 1 to 8, characterized in that the extraction of Mel spectrum features corresponding to the audio data includes: detecting the audio segmentation points in the audio data; and segmenting the audio data based on the detected audio segmentation points, and extracting the Mel spectrum features corresponding to each segmented audio segment respectively; accordingly, the step of inputting the Mel spectrum features into the pre-trained target convolutional neural network model to obtain the first emotional dimension score and the second emotional dimension score output by the first output neuron and the second output neuron in the target convolutional neural network model, respectively, includes: inputting the Mel spectrum features of each audio segment into the pre-trained target convolutional neural network model to obtain the first emotional dimension score and the second emotional dimension score corresponding to each audio segment.
10. The intelligent lighting control method according to claim 9, characterized in that the detecting of audio segmentation points in the audio data includes: determining several target lyrics and their corresponding time information from the audio data to obtain the corresponding lyrics information; determining the lyrics similarity matrix corresponding to the lyrics information, wherein the matrix elements in the lyrics similarity matrix are the similarities between any two lines of the target lyrics; performing verse and chorus recognition on the audio data based on the lyrics similarity matrix to obtain several verse and chorus segments; and determining the timestamp sequence of the verse and chorus segments based on the time information of the target lyrics, and determining the audio segmentation points in the audio data based on the timestamp sequence.
11. The intelligent lighting control method according to claim 10, characterized in that the process of determining several target lyrics from the audio data includes: detecting whether there are several adjacent target short lyrics in the audio data, wherein the similarity between the spliced lyrics corresponding to the several adjacent target short lyrics and any target long lyrics in the audio data is greater than a first preset similarity threshold, and the similarity between each target short lyric and the target long lyrics is less than the first preset similarity threshold; and if the audio data contains the several adjacent target short lyrics, determining the spliced lyrics corresponding to the several adjacent target short lyrics as one line of the target lyrics in the audio data.
12. The intelligent lighting control method according to claim 11, characterized in that the step of performing verse and chorus recognition on the audio data based on the lyrics similarity matrix to obtain several verse and chorus segments includes: zeroing out the diagonal elements of the lyrics similarity matrix and the target elements whose similarity is less than a second preset similarity threshold to obtain a zeroed-out similarity matrix; performing a path search on the zeroed-out similarity matrix to find a target path that meets a preset optimal path condition, wherein the preset optimal path condition is a condition constructed based on a preset path length threshold; and determining, based on the target path, the corresponding verse and chorus segments in the audio data.
13. An electronic device, characterized in that it comprises: a memory, used to store a computer program; and a processor, used to execute the computer program to implement the intelligent lighting control method as described in any one of claims 1 to 12.
14. A computer-readable storage medium, characterized in that it is used to store a computer program, wherein, when the computer program is executed by a processor, it implements the intelligent lighting control method as described in any one of claims 1 to 12.