Music processing system, music processing program, and music processing method
By generating music with a learning model and then shaping it into musically harmonious content with a shaping unit, the difficulty that existing technologies have in generating new, harmonious pieces is addressed, and harmonious music can be generated from an original piece.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- TMIK INC
- Filing Date
- 2021-07-19
- Publication Date
- 2026-06-23
Figure CN116210049B_ABST
Abstract
Description
Technical Field
[0001] This invention relates to a music processing system, a music processing program, and a music processing method, which can be applied, for example, to the creation of new musical pieces.
Background Technology
[0002] Systems that make it easier for users without music-composition knowledge to generate music have previously been described, for example, in Patent Document 1.
[0003] Patent Document 1 describes a system that assists in arranging an original song by changing the degree of adaptation through user operation. In the system described in Patent Document 1, when defining the state of a sound using at least one of three attributes (pitch, duration, and volume) excluding the timing of sound production, multiple transition probability data sets are maintained, each with a probability of transitioning from one state to the next. Furthermore, in the system described in Patent Document 1, the degree of adaptation of the original song can be changed by selecting the available transition probability data, so even users with almost no music-related knowledge can arrange the original song by changing the degree of adaptation.
[0004] Existing technical documents
[0005] Patent documents
[0006] Patent Document 1: Japanese Patent Application Publication No. 2009-20323
Summary of the Invention
[0007] The problem the invention aims to solve
[0008] However, because the technology described in Patent Document 1 excludes the timing of sound production from the attributes of the transition probability data, its output is at most an arrangement of the original song and can hardly be called a new composition.
[0009] Therefore, there is a need for a music processing system, music processing program, and music processing method that can generate newly composed music from the original music as input.
[0010] Technical means to solve the problem
[0011] The music processing system of the present invention is characterized by having: (1) a music generation unit that uses a learning model to generate music, the learning model being obtained by machine learning based on input data containing music data and composition information, the music data describing the score of a music consisting of at least one melody and at least one chord, the composition information representing the attributes of the elements constituting the music data; and (2) a shaping unit that shapes the generated music generated by the music generation unit into musically harmonious content.
[0012] The music processing program of the second invention is characterized in that it causes a computer to function as (1) a music generation unit and (2) a shaping unit. The (1) music generation unit uses a learning model to generate music. The learning model is obtained by machine learning based on learning data with learning music data. The learning music data describes the score of a music consisting of a melody of one or more channels and a chord of one or more channels. The (2) shaping unit shapes the generated music generated by the music generation unit into musically harmonious content.
[0013] The third invention is a music processing method for use in a music processing system, characterized in that: (1) the music processing system has a music generation unit and a shaping unit; (2) the music generation unit uses a learning model to generate music, the learning model is obtained by machine learning based on learning data with learning music data, the learning music data describes the score of a music composed of at least one melody and at least one chord; (3) the shaping unit shapes the generated music generated by the music generation unit into musically harmonious content.
[0014] The effects of the invention
[0015] According to the present invention, a newly created song can be generated using the original song as input.
Attached Figure Description
[0016] Figure 1 is a block diagram illustrating the functional configuration of the music processing system according to the first embodiment.
[0017] Figure 2 is a block diagram illustrating an example of the configuration, during learning, of the AI used in the generation processing unit of the first embodiment.
[0018] Figure 3 is a block diagram illustrating an example of the configuration used for music generation in the generation processing unit of the first embodiment.
[0019] Figure 4 is a diagram showing an example of the input music of the first embodiment in the form of a full score.
[0020] Figure 5 is a diagram showing, in tabular form, the content obtained by converting the score of the melody channel in the input music example of the first embodiment into IDs (numerical values).
[0021] Figure 6 is a diagram showing, in tabular form, the content obtained by converting the score of the chord channel in the input music example of the first embodiment into IDs (numerical values).
[0022] Figure 7 is a diagram showing the conversion table used to convert each note of the melody channel to an ID in the first embodiment.
[0023] Figure 8 is a diagram showing the conversion table used to convert each chord of the chord channel to an ID in the first embodiment.
[0024] Figure 9 is a flowchart illustrating the flow of the shaping process performed by the shaping processing unit in the first embodiment.
[0025] Figure 10 is a diagram illustrating an example of the pre-shaping music (generated music) processed in the first embodiment.
[0026] Figure 11 is a diagram illustrating the chord progression in the pre-shaping music example processed in the first embodiment.
[0027] Figure 12 is a graph showing the counting results of each adjustment by the shaping processing unit in the first embodiment.
[0028] Figure 13 is a diagram illustrating an example of a piece of music after chord shaping processed in the first embodiment.
[0029] Figure 14 is a diagram illustrating an example of a piece of music after melody shaping processed in the first embodiment.
[0030] Figure 15 is a diagram illustrating an example of the music generated when [operation parameter = 0] is set in the first embodiment.
[0031] Figure 16 is a diagram illustrating an example of the music generated when [operation parameter = 10] is set in the first embodiment.
[0032] Figure 17 is a diagram illustrating an example of the music generated when [operation parameter = 20] is set in the first embodiment.
[0033] Figure 18 is a block diagram illustrating the functional configuration of the music processing system in the second embodiment.
[0034] Figure 19 is a block diagram illustrating an example of the configuration used for music generation in the generation processing unit of the second embodiment.
Detailed Implementation
[0035] (A) First Embodiment
[0036] Hereinafter, with reference to the accompanying drawings, a first embodiment of the music processing system, music processing program, and music processing method of the present invention will be described in detail.
[0037] (A-1) Configuration of the first embodiment
[0038] Figure 1 is a block diagram illustrating the overall configuration of the music processing system 10 according to this embodiment.
[0039] Music processing system 10 is a system for generating and outputting new musical pieces.
[0040] The music processing system 10 may be entirely composed of hardware (such as dedicated chips), or it may be partially or entirely composed of software (programs). For example, the music processing system 10 can be configured by installing a music processing program (containing the implementation method) on a computer equipped with a processor and memory. Furthermore, the number of computers constituting the music processing system 10 is not limited, and it can be implemented by distributing the program and data across multiple computers.
[0041] When the music processing system 10 receives data containing input music data and composition information (hereinafter also referred to as "input data") together with operation parameters, it performs processing to generate and output a new music using the input data. Hereinafter, the music data output by the music processing system 10 will be referred to as "output music data".
[0042] Next, the input music data and output music data will be explained.
[0043] In this embodiment, the data format of the input / output music data (the form of the input and output music data) is described as a standard MIDI file (Standard Musical Instrument Digital Interface File, hereinafter referred to as "SMF"). However, the data format used for input / output music data is not limited to the standard MIDI file format; various performance information (score data) formats can be used. Furthermore, in the music processing system 10, the data format used for input / output music data can also be a direct audio signal such as WAV or MP3, instead of a performance information format like SMF. When the input / output music data is in the form of an audio signal, the music processing system 10 converts the input music data into performance information data such as SMF before processing, and outputs, as output music data, the audio signal data obtained by converting the resulting performance information back into an audio signal. Various methods can be used in the music processing system 10 for converting audio signal data into performance information data and vice versa (music playback processing), so detailed descriptions are omitted.
[0044] Furthermore, in this embodiment, the input music data and the output music data are described in the same data format (SMF format), but of course, they can also be set to different formats.
[0045] In this embodiment, the unit of music processed by the music processing system 10 (e.g., length, number of channels (MIDI channels), etc.) is not limited. That is, the length of the unit of music processed by the music processing system 10 can be a fixed length (e.g., a predetermined number of measures) or a variable length. In this embodiment, the description assumes a fixed length of 8 measures in 4 / 4 time (32 beats; 16 measures in 2 / 2 time). Furthermore, the description assumes that the music processed by the music processing system 10 has a total of 2 channels: one melody channel and one chord channel (chord accompaniment to the melody). Hereinafter, the channel constituting the melody of the music will be referred to as the "melody channel," and the channel constituting the chords will be referred to as the "chord channel." Moreover, the music processed by the music processing system 10 may have multiple (two or more) melody channels and chord channels, respectively.
[0046] Next, the "composition information" will be explained.
[0047] The composition information refers to parameters representing the attributes (types) of elements in each section of the input music. In this embodiment, any attribute from the intro, section A, section B, or chorus is used as an element of the input music. The attributes of the elements that can be used are not limited to those mentioned above, and various forms (such as first theme, second theme, etc.) can be used.
[0048] The composition information can be constructed in a form different from the input music data, but it can also be achieved by embedding information corresponding to the composition information in the markers of the SMF. In the SMF, markers are prepared in the form of fields where users can write any text. Therefore, it can also be set up so that text corresponding to the composition information (such as intro, A section, B section, chorus, etc.) is written in the markers of the input music data (SMF data).
[0049] For example, in SMF data supplied as input music data, if the marker for the timing (position) at the beginning of measure 1 is set to "Intro" and the marker for the timing at the beginning of measure 5 is set to "Section A", the music processing system 10 identifies the interval from measure 1 to measure 4 as "Intro", and the interval from measure 5 onwards (measure 5 to measure 8) as Section A. Furthermore, for example, in SMF data supplied as input music data, if the marker for the timing (position) at the beginning of measure 1 is set to "Section A" and no other marker is set, the music processing system 10 identifies all intervals (measure 1 to measure 8) as "Section A". As described above, in SMF data, the attributes of the elements of each interval can be written into the marker.
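The marker rule described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `sections_from_markers` and the representation of markers as (start measure, label) pairs are hypothetical, and reading the markers out of an actual SMF file is left out.

```python
from typing import List, Tuple

def sections_from_markers(markers: List[Tuple[int, str]],
                          total_measures: int = 8) -> List[Tuple[int, int, str]]:
    """Turn (start_measure, label) markers into (first, last, label) intervals."""
    ordered = sorted(markers)
    intervals = []
    for idx, (start, label) in enumerate(ordered):
        # A section runs until the measure before the next marker,
        # or to the end of the piece if there is no later marker.
        if idx + 1 < len(ordered):
            end = ordered[idx + 1][0] - 1
        else:
            end = total_measures
        intervals.append((start, end, label))
    return intervals

# The two cases from the paragraph above.
print(sections_from_markers([(1, "Intro"), (5, "Section A")]))
print(sections_from_markers([(1, "Section A")]))
```

The first call yields measures 1 to 4 as "Intro" and measures 5 to 8 as "Section A"; the second yields measures 1 to 8 as "Section A".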
[0050] The specific form in which compositional information is described separately from SMF data is not limited, as long as the attributes of the elements in each section of the input music are described. For example, compositional information can be described using information strings that pair timing (position) and attributes representing elements (such as text or values corresponding to section A, section B, and the intro), just like SMF markings. Furthermore, when the same attributes are used for all sections of the input music, the compositional information can be set only with parameters and text corresponding to the attributes, without setting timing-related information.
[0051] Next, the "operation parameters" will be explained.
[0052] Operation parameters are parameters that serve as an interface through which the music processing system 10 accepts, from the user, operations on the characteristics of the music to be generated. In this embodiment, the operation parameter is described as being represented by a single numerical value (a one-dimensional parameter). However, the operation parameters are not limited to a single numerical value and can also be represented by multiple numerical values (a multi-dimensional parameter) or by a form other than numerical values (such as a flag like TRUE / FALSE). Details of the operation parameters will be described later.
[0053] Next, the internal structure of the music processing system 10 will be explained.
[0054] As shown in Figure 1, the music processing system 10 includes a vectorization processing unit 101, a generation processing unit 102, a shaping processing unit 103, and a restoration processing unit 104.
[0055] The vectorization processing unit 101 converts the data containing input music data and composition information into vector data (hereinafter referred to as "input music vector data") in a form suitable for processing in the subsequent generation processing unit 102. Then, the vectorization processing unit 101 supplies the acquired input music vector data to the generation processing unit 102. The specific form of the input music vector data will be described later.
[0056] The generation processing unit 102 generates and outputs vector data (data in the same form as the input music vector data, hereinafter referred to as "generated music vector data") corresponding to the new music (hereinafter referred to as "generated music") by processing the AI generation model based on the input music vector data and operation parameters. The generation processing unit 102 supplies the generated music vector data to the shaping processing unit 103. The detailed configuration of the generation processing unit 102 will be described later.
[0057] The shaping processing unit 103 shapes the music of the generated music vector data into musically harmonious content (for example, by unifying the overall key and adjusting the scale between melody and chords), and outputs the result as "shaped music vector data".
[0058] The music of the generated music vector data is the raw output of the AI (generation processing unit 102), so from a musical (music theory) perspective it may contain dissonances such as an inconsistent key or mismatched scales between melody and chords. Therefore, in the music processing system 10, a shaping processing unit 103 is provided to output a music that has been musically shaped. Furthermore, in cases where it is not necessary to shape the generated music vector data output from the AI (generation processing unit 102) (e.g., when it is not needed on the user side, or when the content is musically harmonious from the beginning), the shaping processing unit 103 can be omitted, and the generated music vector data can be directly supplied to the restoration processing unit 104.
[0059] The restoration processing unit 104 restores (converts) the shaped music vector data into music data in a specified form (SMF in this embodiment) and outputs it as "output music data".
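The data flow through units 101 to 104, including the optional shaping step, could be wired together roughly as follows. This is a sketch only: `run_music_pipeline` and its callable arguments are hypothetical stand-ins for the four units, not the implementation of the embodiment.

```python
from typing import Any, Callable, Optional

def run_music_pipeline(input_data: Any,
                       c: float,
                       vectorize: Callable[[Any], Any],
                       generate: Callable[[Any, float], Any],
                       restore: Callable[[Any], Any],
                       shape: Optional[Callable[[Any], Any]] = None) -> Any:
    """Chain the four processing stages; shaping is skipped when shape is None."""
    input_vec = vectorize(input_data)      # vectorization processing unit 101
    generated = generate(input_vec, c)     # generation processing unit 102
    if shape is not None:
        generated = shape(generated)       # shaping processing unit 103 (optional)
    return restore(generated)              # restoration processing unit 104
```

Passing `shape=None` corresponds to the case in which the generated music vector data is supplied directly to the restoration stage.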
[0060] Next, an example of the configuration of the AI used in the generation processing unit 102 will be described.
[0061] In the generation processing unit 102, AI is constructed based on a learning model obtained by machine learning based on deep learning. Specifically, in the generation processing unit 102, a learning model obtained by learning under the VAE (Variational AutoEncoder) architecture is acquired.
[0062] The platform (middleware) for the AI used in the generation processing unit 102 is not limited, and various platforms can be used. In this embodiment, the generation processing unit 102 is described below using Python (registered trademark) and its related libraries.
[0063] Figure 2 illustrates a configuration example of the AI used in the generation processing unit 102 during learning (when acquiring a learning model).
[0064] As shown in Figure 2, in the generation processing unit 102, at least the encoder 201, decoder 202, discriminator 203, and latent variable processing unit 204 operate during learning.
[0065] The encoder 201 obtains and outputs the mean vector μ of the latent variables and the variance vector σ representing the probability distribution based on the input music vector data.
[0066] During learning, the latent variable processing unit 204, according to the VAE architecture, obtains the value obtained by adding noise corresponding to the standard deviation σ to the mean vector μ as the latent variable z (a sample of the latent variable) and supplies it to the decoder 202. At this time, the latent variable processing unit 204 can obtain the latent variable z according to, for example, the following equation (1). For example, I = 1 can be set in equation (1).
[0067] z=μ+εσ(ε~N(0,I))…(1)
[0068] For example, when the latent variable z is a 256-dimensional vector, the source code (in the case of Python) used by the latent variable processing unit 204 to obtain the latent variable z during learning can be set as "μ+numpy.random.normal(loc=0.0, scale=1*σ,size=256)".
[0069] Decoder 202 outputs vector data obtained by restoring the latent variable z (hereinafter referred to as "restored music vector data"). In the VAE architecture shown in Figure 2, the generator consists of encoder 201 and decoder 202. Discriminator 203 identifies whether the restored music vector data was generated by the generator. In the generation processing unit 102, the generator learns so that the discriminator 203 cannot recognize the vector data as having been generated by the generator. During learning, the generator uses the discrimination results of the discriminator 203 and the LOSS (the difference between the input music vector data and the restored music vector data) for its learning processing; this feedback is omitted from Figure 2 for simplicity.
[0070] During the learning process of the generation processing unit 102, samples (input music vector data) for learning can be successively supplied from the vectorization processing unit 101. The number of samples used in the learning process of the generation processing unit 102 is not limited, and approximately 1000 to 3000 samples can be used. Furthermore, during the learning process of the generation processing unit 102, each sample (one input music vector data) can be put through approximately 1000 learning iterations (approximately 1000 repetitions of the process up to generating the restored music vector data from the latent variable z).
[0071] Furthermore, by changing the ratios of the genre (e.g., pop, jazz, etc.) and artist of the music used as the basis for learning in the generation processing unit 102 (input music vector data), the characteristics of the music generated by the generation program can also be altered. This is because, essentially, when AI learning is performed through a VAE architecture, the distribution range of the latent variable z corresponds to the samples used in the learning process.
[0072] In this implementation, the latent variable z has a fixed size of 256 dimensions, but the size of z is not limited to this. Ideally, the size of the latent variable z should vary according to the size of the vector data being processed (input music vector data / restored music vector data).
[0073] Figure 3 illustrates a configuration example of the generation processing unit 102 when it generates music vector data using the learned model of the AI (hereinafter referred to as "music generation time").
[0074] In Figure 3, parts identical or corresponding to those in Figure 2 above are marked with the same or corresponding reference symbols.
[0075] As shown in Figure 3, in the generation processing unit 102, at least the encoder 201, decoder 202 and latent variable processing unit 204 operate during music generation.
[0076] The encoder 201 and decoder 202 operate in the same way as during learning, so detailed descriptions are omitted.
[0077] Unlike during learning, the latent variable processing unit 204 mixes (adds) noise corresponding to the variance vector σ and the operation parameter c into the latent variable z during music generation.
[0078] Specifically, when generating a piece of music, the latent variable processing unit 204 sets I = c in equation (1) above, so that the noise mixed into the latent variable z reflects the operation parameter c. The range of the operation parameter c is not limited, and c can be made adjustable by the user. For example, the operation parameter c can be set within the range of 0 to 10, or it can be changed within the range of 0 to 50 in steps of a specified width (e.g., 10). Furthermore, the method by which the latent variable processing unit 204 receives the value of the operation parameter c from the user (e.g., the input receiving device, the configuration of the operation screen) is not limited.
[0079] For example, when the latent variable z is a 256-dimensional vector, the source code (in the case of Python) used by the latent variable processing unit 204 to obtain the latent variable z when generating the music can be set to "μ+numpy.random.normal(loc=0.0, scale=c*σ,size=256)".
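The two quoted expressions differ only in the noise scale (1 during learning, c during music generation). The sketch below shows that relationship with a single function. It is an illustration under assumptions: `sample_latent` and its `seed` argument are hypothetical names, and the standard-library `random` module is swapped in for `numpy.random.normal` so that the example has no third-party dependency.

```python
import random
from typing import List, Optional

def sample_latent(mu: List[float], sigma: List[float],
                  c: float = 1.0, seed: Optional[int] = None) -> List[float]:
    """Return z = mu + noise, where the noise of dimension i has scale c * sigma[i]."""
    rng = random.Random(seed)
    # Each dimension receives independent Gaussian noise centered on 0.
    return [m + rng.gauss(0.0, c * s) for m, s in zip(mu, sigma)]

mu = [0.0] * 256      # placeholder encoder outputs (assumed values)
sigma = [1.0] * 256

z_learning = sample_latent(mu, sigma, c=1.0, seed=0)   # scale = 1 * sigma
z_generate = sample_latent(mu, sigma, c=10.0, seed=0)  # scale = c * sigma
z_no_noise = sample_latent(mu, sigma, c=0.0)           # scale = 0, so z equals mu
```

With c = 0 the noise term vanishes and z is exactly μ; larger values of c spread the sampled z further from μ.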
[0080] (A-2) Operation of the first embodiment
[0081] Next, the operation of the music processing system 10 in the first embodiment having the above configuration (the music processing method of the embodiment) will be described.
[0082] First, the details of the processing by the vectorization processing unit 101 will be explained.
[0083] As described above, in the music processing system 10, music is processed in 8-bar (32-beat) units in 2 channels.
[0084] Figure 4 shows an example in which the input music is represented as a full score.
[0085] The input music shown in Figure 4 is a section from the "Polovtsian Dances" (also known as the "Dance of the Tartars") composed by Alexander Borodin.
[0086] In Figure 4, the scores of the melody channel and chord channel of the input music are presented in full score form. Furthermore, the instruments (instrument names on MIDI) for each channel are piano-type instruments.
[0087] First, the vectorization processing unit 101 encodes (serializes) the arrangement of notes in each channel of the input music data in 48th-note units (units of 1 / 12 of a beat). Here, the input music data is 8 measures (32 beats) long, so encoding each channel generates a string of 8 × 48 = 384 codes. Each code is represented by a single numerical value (hereinafter referred to as "ID").
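The slot arithmetic above can be checked with a short Python sketch. The helper names `total_slots` and `slot_index` are hypothetical, and a quarter-note beat in 4 / 4 time is assumed.

```python
SLOTS_PER_BEAT = 12  # one slot is a 48th note, i.e. 1/12 of a quarter-note beat

def total_slots(measures: int, beats_per_measure: int = 4) -> int:
    """Number of 48th-note slots in the given number of measures."""
    return measures * beats_per_measure * SLOTS_PER_BEAT

def slot_index(measure: int, beat: float, beats_per_measure: int = 4) -> int:
    """1-based slot number for a 1-based measure and a 1-based beat position."""
    beats_elapsed = (measure - 1) * beats_per_measure + (beat - 1)
    return round(beats_elapsed * SLOTS_PER_BEAT) + 1

print(total_slots(8))      # 8 measures x 48 slots = 384
print(slot_index(2, 1))    # first slot of measure 2 = 49
```

Twelve slots per beat divide evenly into halves, thirds, and quarters of a beat, so eighth notes, sixteenth notes, and triplets all start on a slot boundary.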
[0088] Figure 5 shows, in tabular form, the content obtained by converting (encoding) the score of the melody channel of the input music in Figure 4 into IDs.
[0089] Figure 6 shows, in tabular form, the content obtained by converting (encoding) the score of the chord channel of the input music in Figure 4 into IDs.
[0090] In the tables of Figures 5 and 6, one column contains the IDs for one measure (4 beats), that is, the IDs for 48 time slots.
[0091] Figure 7 shows an example of the conversion table used to convert each note in the melody channel to an ID.
[0092] As shown in Figures 5 and 7, in the melody channel, the slot (timing) of the notehead of each note, that is, the slot at which the note begins, is set to an ID corresponding to the pitch (an ID of 2 or greater), while the slot of the notehead of a rest is set to ID "1". Furthermore, as shown in Figures 5 and 7, slots other than the notehead timing of a note or rest are set to the code "0", which continues the previous state. Specifically, as shown in Figure 7, ID "0" means: "the state of the previous ID continues"; "when the previous ID is 0, the state before it continues"; "when the previous ID is 1, no sound is emitted"; and "in other cases, the sound of the corresponding pitch continues to ring."
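The onset / rest / continuation scheme for the melody channel can be sketched as follows. The pitch-to-ID entries in `PITCH_TO_ID` are hypothetical placeholders (the actual table is the one in Figure 7), and `encode_channel` is an illustrative name, not part of the embodiment.

```python
from typing import Dict, List, Optional, Tuple

HOLD_ID = 0   # the previous state continues
REST_ID = 1   # slot at which a rest begins
# Hypothetical excerpt of a pitch-to-ID table; real IDs come from Figure 7.
PITCH_TO_ID: Dict[str, int] = {"C4": 2, "D4": 3, "E4": 4}

def encode_channel(events: List[Tuple[Optional[str], int]], n_slots: int) -> List[int]:
    """Encode (pitch or None for a rest, duration in slots) events as an ID sequence."""
    ids: List[int] = []
    for pitch, duration in events:
        if duration < 1:
            raise ValueError("each event must last at least one slot")
        onset_id = REST_ID if pitch is None else PITCH_TO_ID[pitch]
        ids.append(onset_id)                     # onset slot carries the ID
        ids.extend([HOLD_ID] * (duration - 1))   # remaining slots continue the state
    if len(ids) != n_slots:
        raise ValueError("events do not fill exactly n_slots slots")
    return ids

# Quarter note C4 (12 slots), eighth rest (6 slots), eighth note D4 (6 slots).
melody_ids = encode_channel([("C4", 12), (None, 6), ("D4", 6)], n_slots=24)
```

In `melody_ids`, slots 1, 13, and 19 hold 2, 1, and 3 respectively, and every other slot holds 0.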
[0093] Figure 8 shows the conversion table used to convert each chord in the chord channel to an ID.
[0094] As shown in Figures 6 and 8, in the chord channel, the slot (timing) of the notehead of each chord is set to an ID (an ID of 2 or greater) corresponding to the chord type (combination of chords), and the slot of the notehead of a rest is set to ID "1". In the chord channel, the lowest pitch corresponds to C2 in MIDI, and the highest pitch corresponds to B5. Furthermore, as shown in Figures 6 and 8, slots other than the notehead timing of a note or rest are set to the code "0", which continues the previous state. Specifically, as shown in Figure 8, the code "0" means: "the state of the previous ID continues"; "when the previous ID is 0, the state before it continues"; "when the previous ID is 1, no sound is emitted"; and "in other cases, the sound of the corresponding chord continues to play."
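Reading such a sequence back requires resolving every "0" to the state it continues. A minimal sketch, assuming silence before the first onset and using the hypothetical name `resolve_holds`:

```python
from typing import List

HOLD_ID = 0   # the previous state continues
REST_ID = 1   # rest

def resolve_holds(ids: List[int]) -> List[int]:
    """Replace each 0 with the most recent non-zero ID, giving the state of every slot."""
    resolved: List[int] = []
    current = REST_ID  # assumption: nothing sounds before the first onset
    for value in ids:
        if value != HOLD_ID:
            current = value      # a new note, chord, or rest begins here
        resolved.append(current)
    return resolved

print(resolve_holds([5, 0, 0, 1, 0, 7, 0]))  # [5, 5, 5, 1, 1, 7, 7]
```

The resolved form shows what is sounding in each slot, but it no longer distinguishes a held note from the same note struck again, which is the information the "0" code preserves.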
[0095] As described above, the vectorization processing unit 101 converts the input music data into a sequence for each channel, thereby obtaining a one-hot vector (data in a format suitable for AI processing) corresponding to each channel. Hereinafter, the block of data obtained by encoding (ID-ization / sequence-ization / One-hot Vectorization) for each channel will be referred to as "encoded input music data". The encoded input music data includes a sequence for the melody channel (a string / code string / One-hot Vector of 384 IDs) and a sequence for the chord channel (a string / sequence / One-hot Vector of 384 IDs). Hereinafter, the values of the melody channel sequence will be represented as Mi (i represents the slot number from 1 to 384 (the order of the time sequence)), and the values of the chord channel sequence will be represented as Ci. Specifically, the melody channel sequence will be represented as M1, M2, M3, ..., M384, and the chord channel sequence will be represented as C1, C2, C3, ..., C384.
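For reference, turning an ID sequence into one-hot vectors can be sketched as below. The vocabulary size (the number of distinct IDs) is an assumption made for this example; in practice it would follow from the conversion tables in Figures 7 and 8. `one_hot` is a hypothetical helper name.

```python
from typing import List

def one_hot(ids: List[int], vocab_size: int) -> List[List[int]]:
    """Map each ID to a vector of length vocab_size with a single 1 at index ID."""
    rows: List[List[int]] = []
    for value in ids:
        if not 0 <= value < vocab_size:
            raise ValueError("ID outside the assumed vocabulary")
        row = [0] * vocab_size
        row[value] = 1
        rows.append(row)
    return rows

print(one_hot([0, 2, 1], vocab_size=3))  # [[1, 0, 0], [0, 0, 1], [0, 1, 0]]
```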
[0096] Next, the process by which the vectorization processing unit 101 encodes (converts into numerical values) the composition information will be explained.
[0097] As described above, in the vectorization processing unit 101, the attributes of the elements in each section of the input music (e.g., intro, section A, section B, chorus, etc.) can be grasped based on the composition information. Therefore, the vectorization processing unit 101 grasps the attributes of the elements corresponding to each time slot of the input music and obtains a sequence of numbers obtained by setting values (codes) corresponding to the elements (attributes of the elements) for each time slot.
[0098] The allocation of numerical values (codes) corresponding to the attributes of each element is not limited. In this embodiment, the numerical values corresponding to each element are allocated within the range of 0 to 50. Specifically, in the example of this embodiment, the numerical value corresponding to section A is allocated as any one of 10 to 19, the numerical value corresponding to section B is allocated as any one of 20 to 29, the numerical value corresponding to the chorus is allocated as any one of 30 to 39, and the numerical value corresponding to the intro is allocated as any one of 40 to 49. For example, in the vectorization processing unit 101, the numerical value corresponding to section A can be set to 10, the numerical value corresponding to section B can be set to 20, the numerical value corresponding to the chorus can be set to 30, and the numerical value corresponding to the intro can be set to 40. When the attributes of elements differ, leaving a certain gap between their numerical values makes it easier for the AI to distinguish the features determined by the attributes of the elements. In addition, when intervals of the same element occur more than once in the input music (for example, a sequence such as section A, section B, section A), the vectorization processing unit 101 can also set different numerical values for the repeated intervals. For example, in a sequence such as section A, section B, section A, the interval of the first section A can be set to 10, and the interval of the second section A can be set to 11, both within the range allocated to section A. Furthermore, in this embodiment, the parameters corresponding to the constituent elements are set to one dimension, but they can also be constructed in multiple dimensions. For example, assuming that three parameters F, G, and H are set as parameters corresponding to the constituent elements, section A can be defined as "F=1, G=0, H=0", section B as "F=0, G=1, H=0", and the chorus as "F=0, G=0, H=1".
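A per-slot composition sequence using the example base values 10, 20, 30, and 40 could be built as in the sketch below. The function name `composition_sequence`, the interval format, and the rule that a repeated section takes the next value within its own range are assumptions made for illustration.

```python
from typing import Dict, List, Tuple

SLOTS_PER_MEASURE = 48  # 48th-note slots in one 4/4 measure
# Base codes following the example allocation given in the text.
SECTION_BASE: Dict[str, int] = {"Section A": 10, "Section B": 20,
                                "Chorus": 30, "Intro": 40}

def composition_sequence(sections: List[Tuple[int, int, str]],
                         distinguish_repeats: bool = False) -> List[int]:
    """Expand (first_measure, last_measure, label) intervals into one code per slot."""
    seen: Dict[str, int] = {}
    codes: List[int] = []
    for first, last, label in sections:
        occurrence = seen.get(label, 0)
        seen[label] = occurrence + 1
        # Assumed rule: the n-th repeat of a section adds n to its base code.
        offset = occurrence if distinguish_repeats else 0
        if offset > 9:
            raise ValueError("more repeats than the 10-value range can hold")
        n_slots = (last - first + 1) * SLOTS_PER_MEASURE
        codes.extend([SECTION_BASE[label] + offset] * n_slots)
    return codes

e = composition_sequence([(1, 4, "Intro"), (5, 8, "Section A")])
print(len(e), e[0], e[192])  # 384 40 10
```

Slots 1 to 192 (measures 1 to 4) carry the intro code and slots 193 to 384 carry the section A code, giving one value per time slot.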
[0099] As described above, in the vectorization processing unit 101, values corresponding to the attributes of elements can be set for each time slot (384 time slots) of the input music based on the composition information. Hereinafter, each value of the sequence based on the composition information will be represented as Ei. Specifically, the sequence of each time slot based on the composition information will be represented as E1 to E384.
[0100] Subsequently, the vectorization processing unit 101 assembles the sequences constituting the encoded input music data (the melody channel and chord channel sequences) and the sequence based on the composition information into vector data (a matrix) suitable for AI processing.
[0101] In this case, the vectorization processing unit 101 can obtain, as the vector data, the matrix shown in equation (2). In the matrix of equation (2), each row holds the data of one time slot (one 48th note). That is, in equation (2), the i-th row (i is any integer from 1 to 384) is composed of (Mi, Ci, Ei).
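[0102] [Formula 1]
[0103] \begin{pmatrix} M_1 & C_1 & E_1 \\ M_2 & C_2 & E_2 \\ \vdots & \vdots & \vdots \\ M_{384} & C_{384} & E_{384} \end{pmatrix} \quad \cdots (2)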
[0104] Furthermore, in this embodiment, the vector data (input music vector data, restored music vector data, etc.) processed by the AI (generation processing unit 102) of the music processing system 10 are all in the form of equation (2). The form of the vector data is not limited to equation (2). As long as it is composed of the same sequences, the specific arrangement order and the composition of each row can take other forms (for example, a form in which one row corresponds to a 24th-note unit).
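Assembling the three sequences into rows of the form (Mi, Ci, Ei) can be sketched as follows. `assemble_rows` is a hypothetical helper, and the placeholder sequences are assumed values rather than an encoding of the Figure 4 example.

```python
from typing import List

def assemble_rows(melody: List[int], chords: List[int],
                  composition: List[int]) -> List[List[int]]:
    """Stack three equal-length sequences into rows [M_i, C_i, E_i], one per slot."""
    if not (len(melody) == len(chords) == len(composition)):
        raise ValueError("all three sequences must have the same number of slots")
    return [[m, c, e] for m, c, e in zip(melody, chords, composition)]

# Placeholder sequences of 384 slots each (assumed values for illustration).
M = [2] + [0] * 383
C = [5] + [0] * 383
E = [10] * 384

rows = assemble_rows(M, C, E)
print(len(rows), len(rows[0]), rows[0])  # 384 3 [2, 5, 10]
```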
[0105] Furthermore, as described above, in this embodiment, the sequences of the melody channel and chord channel in the input music vector data are serialized (ID-ized) in 48th note units, but in essence, it can be said to be data in the form of musical score (performance information), just like the original input music data (SMF data). Therefore, in the music processing system 10, the data obtained by serializing the data of the melody channel and chord channel can be accepted as input music data from the beginning. Furthermore, in the music processing system 10, the data can be accepted as input music vector data from the beginning. In this case, the vectorization processing unit 101 can be omitted from the music processing system 10.
[0106] Next, the details of the shaping process performed by the shaping processing unit 103 will be explained.
[0107] As described above, the shaping processing unit 103 shapes the generated music vector data and outputs the result as shaped music vector data. Hereinafter, the music corresponding to the generated music vector data is referred to as the "pre-shaping piece", and the music corresponding to the shaped music vector data is referred to as the "shaped piece".
[0108] In this embodiment, the pre-shaping piece is about 8 bars (32 beats) long, so the shaping processing unit 103 performs the shaping process on the premise that the key is unified across the entire pre-shaping piece. The shaping processing unit 103 may also divide the pre-shaping piece into multiple sections, determine a unified key for each section separately, and perform the subsequent shaping process section by section.
[0109] In this embodiment, the shaping processing unit 103 performs the shaping process while the data is still vector data (the generated music vector data). However, the order of the shaping processing unit 103 and the restoration processing unit 104 can be reversed so that the shaping process is performed after the data has been restored to SMF data.
[0110] Figure 9 is a flowchart illustrating the flow of the shaping process performed by the shaping processing unit 103.
[0111] First, the shaping processing unit 103 performs a process to infer the key that is suitable as a single key for the whole pre-shaping piece (hereinafter referred to as "key inference processing"), and determines that key (hereinafter referred to as the "unified key") according to the result of the key inference processing (S101).
[0112] Next, the shaping processing unit 103 shapes the chord channel of the pre-shaping piece so that it contains only chords that fit the unified key (hereinafter referred to as "chord shaping processing") (S102). Hereinafter, the pre-shaping piece after chord shaping processing is referred to as the "chord-shaped piece".
[0113] Next, the shaping processing unit 103 processes each note in the melody channel of the chord-shaped piece so that it harmonizes with the chords in the chord channel (hereinafter referred to as "melody shaping processing"), and obtains the shaped piece (S103). Specifically, the shaping processing unit 103 adjusts (shapes) the pitch of each note in the melody channel of the chord-shaped piece to match the scale (hereinafter referred to as "chord scale") of the chord (hereinafter referred to as "corresponding chord") that sounds at the same time (in the same time slots) in the chord channel.
[0114] Next, the details of the key inference processing in step S101 will be explained.
[0115] In the key inference processing, the shaping processing unit 103 infers which of the 24 keys (12 major keys + 12 minor keys) is suitable as the unified key for the pre-shaping piece.
[0116] In this embodiment, the shaping processing unit 103 counts, for each key, how many of the chords contained in the pre-shaping piece match the chords used in that key, and infers (determines) the key with the most matching chords to be the best unified key.
[0117] Figure 10 is a diagram showing an example of a pre-shaping piece (a generated piece) in score form.
[0118] Figure 11 is a diagram showing the chord progression of the pre-shaping piece shown in Figure 10.
[0119] In Figure 11, the 14 chords that constitute the pre-shaping piece are labeled C01 to C14 in order from the beginning. In Figure 10, the chord name of each of the chords C01 to C14 is also noted. As shown in Figure 10, the chords C01 to C14 are [DM7], [A7], [Am7], [E], [Bm7], [Esus4], [D7], [B7], [Am7], [E7], [Em7], [A7], [Em], and [Em7].
[0120] The shaping processing unit 103 counts, for each of the 24 keys, how many of the chords of the pre-shaping piece are diatonic chords of that key. The counting result is shown in Figure 12.
[0121] Figure 12 is a diagram showing the result of the shaping processing unit 103 counting, for each key, how many of the chords constituting the pre-shaping piece are diatonic chords of that key.
[0122] The table in Figure 12 shows, for each key, the number of chords that are diatonic chords of that key (hereinafter referred to as the "count").
[0123] For example, as shown in Figures 11 and 12, a total of 8 chords (C03[Am7], C05[Bm7], C07[D7], C08[B7], C09[Am7], C11[Em7], C13[Em], C14[Em7]) are diatonic chords of E minor, so the count for E minor is 8.
[0124] In this case, as shown in Figure 12, the count of 8 for E minor is the highest. Therefore, the shaping processing unit 103 infers that E minor is the best unified key for the pre-shaping piece.
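The counting described above can be sketched in Python as follows. The text does not define the diatonic chord table, so this sketch assumes one: triads and seventh chords built on each degree of the major scale for major keys, and of the natural and harmonic minor scales for minor keys. All function and variable names are hypothetical. Under that assumption, the 14 chords of Figure 10 yield a count of 8 for E minor, which agrees with the count stated above.

```python
PC_NAMES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]
NOTE_TO_PC = {name: pc for pc, name in enumerate(PC_NAMES)}

# Chord qualities needed for the example, as semitone offsets from the root.
QUALITY_INTERVALS = {
    "": (0, 4, 7), "m": (0, 3, 7), "sus4": (0, 5, 7),
    "7": (0, 4, 7, 10), "M7": (0, 4, 7, 11), "m7": (0, 3, 7, 10),
}

MAJOR = (0, 2, 4, 5, 7, 9, 11)
NATURAL_MINOR = (0, 2, 3, 5, 7, 8, 10)
HARMONIC_MINOR = (0, 2, 3, 5, 7, 8, 11)
# Assumed scale sources for the diatonic chords of each mode.
MODES = {"major": (MAJOR,), "minor": (NATURAL_MINOR, HARMONIC_MINOR)}


def parse_chord(name):
    """Return (root pitch class, set of pitch classes) for a name such as 'Am7'."""
    root_len = 2 if name[1:2] == "#" else 1
    root = NOTE_TO_PC[name[:root_len]]
    intervals = QUALITY_INTERVALS[name[root_len:]]
    return root, frozenset((root + i) % 12 for i in intervals)


def diatonic_chords(tonic, scales):
    """Triads and seventh chords stacked in thirds on every scale degree."""
    chords = set()
    for scale in scales:
        pcs = [(tonic + step) % 12 for step in scale]
        for degree in range(7):
            triad = [pcs[(degree + k) % 7] for k in (0, 2, 4)]
            seventh = triad + [pcs[(degree + 6) % 7]]
            chords.add((pcs[degree], frozenset(triad)))
            chords.add((pcs[degree], frozenset(seventh)))
    return chords


def infer_key(chord_names):
    """Count matching chords for all 24 keys and return the best key and all counts."""
    chords = [parse_chord(name) for name in chord_names]
    counts = {}
    for tonic in range(12):
        for mode, scales in MODES.items():
            table = diatonic_chords(tonic, scales)
            counts[f"{PC_NAMES[tonic]} {mode}"] = sum(c in table for c in chords)
    return max(counts, key=counts.get), counts


PIECE = ["DM7", "A7", "Am7", "E", "Bm7", "Esus4", "D7",
         "B7", "Am7", "E7", "Em7", "A7", "Em", "Em7"]
best, counts = infer_key(PIECE)
print(best, counts[best])  # E minor 8
```

Ties are resolved here by taking the first key with the maximum count; the text does not specify a tie-breaking rule, so that choice is also an assumption.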
[0125] Next, the details of the chord shaping process in step S102 will be explained.
[0126] As described above, in the chord shaping processing, the shaping processing unit 103 shapes the chord channel of the pre-shaping piece so that it contains only chords that fit the unified key, thereby generating the chord-shaped piece.
[0127] First, the shaping processing unit 103 determines whether each chord in the chord channel of the pre-shaping piece is a chord that matches the chords used in the unified key (hereinafter referred to as "consistent chord") or one that does not (hereinafter referred to as "inconsistent chord").
[0128] Subsequently, the shaping processing unit 103 adjusts (shapes) each inconsistent chord in the chord channel of the pre-shaping piece by changing it into a diatonic chord of the unified key (hereinafter referred to as "unified key chord").
[0129] The method by which the shaping processing unit 103 selects the unified key chord to which each inconsistent chord is changed (hereinafter referred to as "adjustment target chord") is not limited; for example, the following strategies can be used.
[0130] In general, the amount of change caused by shaping (the number of notes whose pitch changes) should ideally be as small as possible. The shaping processing unit 103 can therefore select the adjustment target chord according to the following strategies.
[0131] [Strategy 1]
[0132] For each inconsistent chord, select as the adjustment target chord the unified key chord that shares the most constituent notes with it.
[0133] [Strategy 2]
[0134] If multiple unified key chords satisfy Strategy 1 for an inconsistent chord, select from among them the unified key chord whose number of constituent notes differs least from that of the inconsistent chord (while keeping the largest number of matching constituent notes).
[0135] [Strategy 3]
[0136] If multiple unified key chords still satisfy Strategy 2 for an inconsistent chord, select the one with the smallest implementation index (e.g., the management number (ID number) assigned to each chord in the implementation). Alternatively, in this case, a chord selected at random from those unified key chords can be used as the adjustment target chord.
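The three strategies above can be read as a lexicographic ranking: most shared notes first, then smallest difference in the number of notes, then smallest index. The following Python sketch illustrates that reading only. The function names, the deliberately short candidate list, and the use of list position as the ID number are all hypothetical, and the interpretation of Strategy 2 as a difference in note count is an assumption.

```python
NOTE_TO_PC = {"C": 0, "C#": 1, "D": 2, "D#": 3, "E": 4, "F": 5,
              "F#": 6, "G": 7, "G#": 8, "A": 9, "A#": 10, "B": 11}
QUALITY_INTERVALS = {"": (0, 4, 7), "m": (0, 3, 7),
                     "7": (0, 4, 7, 10), "m7": (0, 3, 7, 10)}


def chord_notes(name):
    """Pitch classes of a chord name such as 'Em7' (sharps only, for brevity)."""
    root_len = 2 if name[1:2] == "#" else 1
    root = NOTE_TO_PC[name[:root_len]]
    return frozenset((root + i) % 12 for i in QUALITY_INTERVALS[name[root_len:]])


def select_target_chord(inconsistent, unified_key_chords):
    """Pick the adjustment target chord by Strategies 1, 2, and 3 in that order."""
    source = chord_notes(inconsistent)

    def rank(entry):
        index, name = entry
        notes = chord_notes(name)
        return (
            -len(source & notes),           # Strategy 1: most shared notes
            abs(len(source) - len(notes)),  # Strategy 2: closest note count
            index,                          # Strategy 3: smallest index
        )

    return min(enumerate(unified_key_chords), key=rank)[1]


# Hypothetical subset of E minor chords; list position stands in for the ID number.
UNIFIED_KEY_CHORDS = ["Em", "Em7", "Am7", "Bm7", "B7", "D7"]
print(select_target_chord("E", UNIFIED_KEY_CHORDS))   # Em
print(select_target_chord("E7", UNIFIED_KEY_CHORDS))  # Em7
```

For the chord E (E, G#, B), both Em and Em7 share two notes, and Strategy 2 then prefers the three-note chord Em. For E7, Em7 alone shares three notes, so Strategy 1 already decides.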
[0137] In the pre-shaping piece shown in Figure 10, if the unified key is E minor, the consistent chords are C03[Am7], C05[Bm7], C07[D7], C08[B7], C09[Am7], C11[Em7], C13[Em], and C14[Em7], while the inconsistent chords are C01[DM7], C02[A7], C04[E], C06[Esus4], C10[E7], and C12[A7].
[0138] Figure 13 is a diagram showing the result of chord shaping processing applied to the pre-shaping piece shown in Figure 10 (the chord-shaped piece).
[0139] Figure 13 shows the result of setting the unified key of the pre-shaping piece to E minor and changing the inconsistent chords C01[DM7], C02[A7], C04[E], C06[Esus4], C10[E7], and C12[A7] into unified key chords according to the above strategies.
[0140] For example, when the above strategies are applied to the chord C01[DM7] (constituent notes D, F#, A, C#) in the first measure, it is changed to F#m7b5 (constituent notes F#, A, C, E), the unified key chord that shares the most constituent notes with it among the diatonic chords of E minor.
[0141] Next, the details of the melody shaping process in step S103 will be explained.
[0142] As described above, in the melody shaping processing, the shaping processing unit 103 adjusts (shapes) each note in the melody channel of the chord-shaped piece so that it becomes a constituent note of the scale of the corresponding chord (hereinafter referred to as "corresponding chord scale"). Hereinafter, notes in the melody channel that are not constituent notes of the corresponding chord scale, and are therefore subject to adjustment, are referred to as "adjustment target notes".
[0143] The corresponding chord scale is basically the scale that corresponds to the corresponding chord (for example, if the corresponding chord is Am7, the corresponding chord scale is the A minor scale). However, when the corresponding chord is an add9 chord, the Lydian scale on the root of the corresponding chord can be regarded as the corresponding chord scale.
[0144] The method by which the shaping processing unit 103 adjusts the pitch of the adjustment target notes (each note in the melody channel) is not limited; for example, the following strategies can be applied. Furthermore, separate notes of the same pitch that are joined by a tie can be treated as a single note (adjustment target note) when applying the following strategies.
[0145] [Strategy 1]
[0146] The pitch of each adjustment target note is adjusted so that the note becomes a constituent note of the corresponding chord scale.
[0147] [Strategy 2]
[0148] For a note that spans the sections of multiple chords (hereinafter referred to as "multi-chord corresponding note"), the pitch is adjusted so that the note is common to the corresponding chord scales of all of those chords. For example, within the span of a multi-chord corresponding note, there are 2 corresponding chords when the chord changes once, and 3 corresponding chords when the chord changes twice.
[0149] [Strategy 3]
[0150] A multi-chord corresponding note that cannot satisfy Strategy 2 is split at each chord change (the moment the chord switches), and each of the resulting notes is then shaped anew as an individual adjustment target note (that is, Strategy 1 is applied to it).
[0151] [Strategy 4]
[0152] When adjusting the pitch of the target note, the relative pitch relationship between the target note and the previous note (hereinafter referred to as "previous note") and the next note (hereinafter referred to as "next note") in the original piece is maintained (any one of the three types: "pitch rising", "pitch falling", "same pitch").
[0153] Hereinafter, the pitch of the note to be adjusted will be denoted as PT, the pitch of the preceding note as PB, and the pitch of the following note as PA. For example, in the relationship between the pitch PB of the preceding note and the pitch PT of the note to be adjusted, there are types such as PB = PT (same pitch), PB > PT (pitch is falling), and PB < PT (pitch is rising). Furthermore, for example, in the relationship between the pitch PT of the note to be adjusted and the pitch PA of the following note, there are types such as PT = PA (same pitch), PT > PA (pitch is falling), and PT < PA (pitch is rising).
[0154] [Strategy 5]
[0155] If Strategy 4 cannot be satisfied by adjusting only the pitch of the adjustment target note, the pitch of the next note is adjusted as well so that Strategy 4 is satisfied.
[0156] [Strategy 6]
[0157] When adjusting the pitch of the note being adjusted, the difference in pitch before and after the adjustment should be within a specified range (e.g., within ±1 octave).
[0158] Ideally, the shaping processing unit 103 should adjust the notes to be adjusted using the strategies described above, in a manner consisting only of the constituent notes of the corresponding chord scale. Furthermore, if strategies 4 and 5 cannot be followed, these two strategies can be excluded during adjustment. Additionally, if following strategies 4 and 5 would prevent the fulfillment of strategy 6, either "excluding strategy 6" or "excluding strategy 4 or 5" can be used.
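A minimal Python sketch of Strategies 1, 4, and 6 for a single note is given below. It is illustrative only: pitches are assumed to be MIDI note numbers, the corresponding chord scale is assumed to be given as a set of pitch classes (here E natural minor), the neighbouring pitches in the usage example are hypothetical, and Strategies 2, 3, and 5 are not covered. When no candidate keeps the contour, the sketch falls back to dropping Strategy 4, as the paragraph above permits.

```python
def sign(x):
    """Return -1, 0, or 1: falling, same pitch, or rising."""
    return (x > 0) - (x < 0)


def adjust_note(pitch, scale, prev=None, nxt=None, max_shift=12):
    """Move `pitch` to the nearest scale note while keeping its contour.

    scale:     pitch classes (0-11) of the corresponding chord scale (Strategy 1)
    prev, nxt: pitches of the previous and next notes, if any (Strategy 4)
    max_shift: largest allowed change in semitones (Strategy 6)
    """
    if pitch % 12 in scale:
        return pitch  # already a constituent note of the scale

    def candidates():
        # Nearest candidates first; the lower one is tried first on a tie.
        for shift in range(1, max_shift + 1):
            for cand in (pitch - shift, pitch + shift):
                if cand % 12 in scale:
                    yield cand

    for cand in candidates():
        keeps_prev = prev is None or sign(cand - prev) == sign(pitch - prev)
        keeps_next = nxt is None or sign(nxt - cand) == sign(nxt - pitch)
        if keeps_prev and keeps_next:
            return cand

    # Strategy 4 cannot be met: drop it and take the nearest scale note.
    return next(candidates(), pitch)


# Assumed scale: E natural minor (E, F#, G, A, B, C, D) as pitch classes.
E_MINOR = {4, 6, 7, 9, 11, 0, 2}

# Hypothetical rising line E4 -> F4 -> G4 (MIDI 64, 65, 67).
print(adjust_note(65, E_MINOR, prev=64, nxt=67))  # 66, i.e. F#4
```

In this example F is equally close to E and F#, but moving it to E would turn the rising step from the previous note into a repeated pitch, so F# is chosen.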
[0159] Figure 14 is a diagram showing the result of melody shaping processing applied to the chord-shaped piece shown in Figure 13 (the melody-shaped piece).
[0160] As shown in Figure 14, through melody shaping processing, the note F in the first measure is changed to F# (a constituent note of the E minor scale whose pitch relationship with the preceding and following notes remains unchanged).
[0161] (A-3) Effects of the first embodiment
[0162] According to the first embodiment, the following effects can be achieved.
[0163] In the music processing system 10 of the first embodiment, a new piece of music can be generated from input music data (an original piece) by means of an AI-based generative model.
[0164] Furthermore, in the music processing system 10 of the first embodiment, noise corresponding to the operation parameter c can be mixed into (added to) the latent variable z. Since the operation parameter c can be set to any value, the user can generate multiple pieces by changing the value of c and select from among them, thereby obtaining content that can be regarded as newly created even though the input music data (the original piece) was used as input.
[0165] Next, specific examples of how the generated music changes as the operation parameter is changed will be described using Figures 15 to 17.
[0166] Figures 15 to 17 show, in staff notation, the music generated when the piece of Figure 4 is used as the input music and the value of the operation parameter is set to 0, 10, and 20, respectively.
[0167] As shown in Figure 15, when the operation parameter is set to 0, a piece is generated that can be inferred to be in the same key as the original piece (Figure 4), namely F# minor, or in its relative key (A major). Furthermore, the score in Figure 15 shows that, as in the original piece (Figure 4), the melody is composed of four measures.
[0168] As shown in Figure 16, when the operation parameter is set to 10, a piece is generated that is in the subdominant key (D major) of the original piece (Figure 4) rather than in the same key as the original piece.
[0169] As shown in Figure 17, when the operation parameter is set to 20, the generated piece not only differs greatly from the original piece (Figure 4) in key, but also uses completely different note types and rhythms.
[0170] As described above, in the music processing system 10 of the first embodiment, changing the value of the operation parameter c makes it possible to generate music that can be regarded as newly created content even though the input music data (the original piece) was used as input.
[0171] (B) Second embodiment
[0172] Hereinafter, with reference to the accompanying drawings, a second embodiment of the music processing system, music processing program, and music processing method of the present invention will be described in detail.
[0173] (B-1) Structure and operation of the second embodiment
[0174] Figure 18 is a block diagram illustrating the overall configuration of the music processing system 10A according to the second embodiment.
[0175] In Figure 18, parts identical or corresponding to those in Figure 1 above are given the same or corresponding symbols.
[0176] The differences between the second embodiment and the first embodiment will now be explained.
[0177] The music processing system 10A of the second embodiment differs from that of the first embodiment in that the generation processing unit 102 is replaced by the generation processing unit 102A.
[0178] In the generation processing unit 102A of the second embodiment, the configuration of AI learning is the same as that of the first embodiment, but the configuration of music generation is different.
[0179] Figure 19 is a diagram illustrating a configuration example for music generation in the generation processing unit 102A of the second embodiment.
[0180] In Figure 19, parts identical or corresponding to those in Figures 1 and 3 above are given the same or corresponding symbols.
[0181] As shown in Figure 19, in the generation processing unit 102A of the second embodiment, only the latent variable processing unit 204A and the decoder 202 operate during music generation.
[0182] The difference from the first embodiment is that, when generating a song, the latent variable processing unit 204A does not rely on data from the encoder 201, but generates latent variables z from values obtained by a predetermined method (such as random numbers) and supplies them to the decoder 202.
[0183] For example, in equation (1) above, a latent variable z based on a random number with a variance of 1 can be obtained by setting μ = 0, σ = 1, and I = 1. For example, when the latent variable z is a 256-dimensional vector, the source code (in the case of Python) used by the latent variable processing unit 204A to obtain the latent variable z can be set to "numpy.random.normal(loc = 0.0, scale = 1.0, size = 256)".
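The NumPy call quoted above draws 256 independent samples from a normal distribution with mean 0 and standard deviation 1. A dependency-free sketch of the same sampling using Python's standard library is shown below; the function name and the optional seed (added only for reproducibility) are assumptions and are not part of the embodiment.

```python
import random


def sample_latent(dim=256, mean=0.0, std=1.0, seed=None):
    """Draw a latent vector z with independent normally distributed components."""
    rng = random.Random(seed)
    return [rng.gauss(mean, std) for _ in range(dim)]


z = sample_latent(seed=0)
print(len(z))  # 256
```

Each call with a different seed (or no seed) gives a different z, and therefore a different generated piece once z is supplied to the decoder 202.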
[0184] Furthermore, in the second embodiment, the specific values set for σ and I are not limited to the examples described above, and various values can be used.
[0185] (B-2) Effects of the second embodiment
[0186] According to the second embodiment, the following effects can be obtained.
[0187] In the music processing system 10A of the second embodiment, music is generated using a latent variable z that the latent variable processing unit 204A obtains from random numbers, without relying on input music. Therefore, the music processing system 10A of the second embodiment can generate new music without any input music being supplied.
[0188] (C) Other embodiments
[0189] The present invention is not limited to the above embodiments, and can also include the following modified embodiments.
[0190] (C-1) The first embodiment described an operation mode in which the music processing system generates music based on input music vector data (input music data) and operation parameters (hereinafter referred to as "reference mode"). The second embodiment described an operation mode in which the music processing system generates music based on random numbers (hereinafter referred to as "random mode"). A music processing system that supports both operation modes and switches between them according to user operations can also be constructed.
[0191] (C-2) In the above embodiments, the generation processing unit 102 is described as having both the configuration used during learning and the configuration used during music generation. However, once learning has been completed, the components used only during learning (e.g., the discriminator 203) may be omitted.
[0192] Symbol Explanation
[0193] 10…Music processing system, 101…Vectorization processing unit, 102…Generation processing unit, 103…Shaping processing unit, 104…Restoration processing unit, 201…Encoder, 202…Decoder, 203…Discriminator, 204…Latent variable processing unit.
Claims
1. A music processing system, characterized in that the system includes a music generation unit that uses a learning model to generate music; the learning model is obtained through machine learning based on input data containing music data and compositional information; the music data describes the score of a piece of music consisting of at least one melody and at least one chord; the compositional information represents the attributes of the elements that constitute the music data; the music generation unit has: an encoder that uses the learning model to output, based on the input data, the mean vector and variance vector of the latent variables corresponding to the input data; a latent variable processing unit that processes the mean vector and variance vector to generate latent variables; and a decoder that uses the learning model to output output data, in the same form as the input data, corresponding to the latent variables generated by the latent variable processing unit; the music generation unit accepts, together with the input data, the input of operating parameters for operating on the properties of the music to be generated; and the latent variable processing unit mixes noise corresponding to the combination of the variance vector and the operating parameters into the latent variables.
2. The music processing system according to claim 1, characterized in that it also has a shaping unit that shapes the generated music produced by the music generation unit into musically harmonious content.
3. A music processing program product, comprising a music processing program, characterized in that the music processing program enables a computer to function as a music generation unit; the music generation unit uses a learning model to generate music; the learning model is obtained through machine learning based on learning music data and compositional information; the learning music data describes the score of a piece of music consisting of at least one melody and at least one chord; the compositional information represents the attributes of the elements that constitute the music data; the music generation unit has: an encoder that uses the learning model to output, based on the input data, the mean vector and variance vector of the latent variables corresponding to the input data; a latent variable processing unit that processes the mean vector and variance vector to generate latent variables; and a decoder that uses the learning model to output output data, in the same form as the input data, corresponding to the latent variables generated by the latent variable processing unit; the music generation unit accepts, together with the input data, the input of operating parameters for operating on the properties of the music to be generated; and the latent variable processing unit mixes noise corresponding to the combination of the variance vector and the operating parameters into the latent variables.
4. A music processing method for use by a music processing system, characterized in that the music processing system includes a music generation unit; the music generation unit uses a learning model to generate music; the learning model is obtained through machine learning based on learning music data and compositional information; the learning music data describes the score of a piece of music consisting of at least one melody and at least one chord; the compositional information represents the attributes of the elements that constitute the music data; the music generation unit has: an encoder that uses the learning model to output, based on the input data, the mean vector and variance vector of the latent variables corresponding to the input data; a latent variable processing unit that processes the mean vector and variance vector to generate latent variables; and a decoder that uses the learning model to output output data, in the same form as the input data, corresponding to the latent variables generated by the latent variable processing unit; the music generation unit accepts, together with the input data, the input of operating parameters for operating on the properties of the music to be generated; and the latent variable processing unit mixes noise corresponding to the combination of the variance vector and the operating parameters into the latent variables.