Real-time audio-to-digital note conversion
By using machine learning models to analyze the spectrogram of audio signals, MIDI format musical score data can be generated in real time, solving the problem of non-real-time audio signal conversion in existing technologies and improving the efficiency and quality of music creation.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Applications(China)
- Current Assignee / Owner
- MACDOUGAL STREET TECHNOLOGY INC
- Filing Date
- 2024-11-14
- Publication Date
- 2026-06-19
Figure CN122249850A_ABST
Abstract
Description
Technical Field
[0001] This invention relates to the field of audio processing, and more particularly to the conversion of real-time audio to digital musical notes (e.g., MIDI format).
Background Technology
[0002] The methods described in this section are methods that could be pursued, but not necessarily methods that have been previously conceived or adopted. Therefore, unless otherwise stated, no method described in this section should be assumed to constitute prior art simply by virtue of its inclusion in this section.
[0003] Computer systems are widely used for audio processing. The acquisition, editing, encoding, storage, decoding, and reproduction of audio are key functions performed by computers today. Computer tools that perform these and other audio processing functions have greatly improved the quality of music production and consumption.
[0004] While most audio processing capabilities have improved significantly with the advancement of computer-related technologies, the transformation and generation functions of audio processing remain limited in scope. These functions primarily focus on improving the quality of existing music recordings or blending existing music sources. Therefore, digital audio processing lacks tools for creating new music, or at least for assisting in the creation of new music.
[0005] The main obstacle to computer systems generating music audio is the lack of tools that enable them to perform in-depth analysis of the acquired music audio signals. When composing music, musicians transcribe it into a score and iterate on it. It is for this reason that music education includes ear training courses, in which perceived music audio is broken down into the appropriate notes. The scores created by musicians are then used by performers to accurately reproduce the music audio on various instruments.
[0006] However, in some cases, musicians (in fact, many famous musicians) may lack the ability to properly transcribe music into notes. Such musicians may have to rely on the recording studio or its staff to transcribe and iterate on their music, which can interrupt the creative process, thus delaying the production of the music and reducing its quality. The situation is even worse for young musicians, who may also lack a proper musical education. Such young musicians may have musical talent and enjoy composing music (e.g., improvising), but may not have enough funds to hire people to perform this task, significantly reducing their chances of success.
[0007] The Musical Instrument Digital Interface (MIDI) format digitizes musical scores by completely describing musical audio using musical notes. MIDI is the standard for digital music representation in computing systems and describes the communication protocols, digital interfaces, and electrical connectors that connect various electronic musical instruments, computers, and related audio devices for playing, editing, and recording music. The MIDI standard includes textual notations that represent various events similar to notes in a musical score.
[0008] Therefore, in order for a computer to generate music, it needs to capture the music audio as MIDI or a similar format. One way to capture music audio as a MIDI representation is to use specialized hardware. Such hardware uses physical sensors to detect various movements of the instrument (e.g., the vibration of strings), and thus detects events of a specific note / pitch being played. However, such hardware solutions are expensive and impractical for vocal and / or multi-instrument performances.
Description of the Attached Figures
[0009] In the figures, the same reference numerals refer to corresponding parts throughout:
Figure 1 is a block diagram depicting the data flow used to generate music notation data in the implementation scheme;
Figure 2 is a block diagram depicting an example of a frame and its window set of probability tuples in the implementation scheme;
Figure 3 is a block diagram depicting the process used to generate note events in the implementation scheme;
Figure 4 is an example of a filtered window set;
Figure 5 is a block diagram depicting the process of determining the digital note representation for the next window set in the implementation scheme;
Figure 6 is a block diagram depicting an example of a sequential window set;
Figure 7 is a block diagram of the basic software system in one or more implementation schemes;
Figure 8 is a block diagram illustrating a computer system that can implement the present invention.
Detailed Implementation Plan
[0010] In the following description, numerous specific details are set forth for purposes of explanation in order to provide a thorough understanding of the invention. It will be apparent, however, that the invention can be practiced without these specific details. In other instances, structures and devices are shown in block diagram form to avoid unnecessarily obscuring the invention.
General Overview
[0011] This paper describes a technique for converting audio into digital note representations in real time. Although the examples and implementations in this paper refer to the Musical Instrument Digital Interface (MIDI) format as the digital note representation, the exact format used to digitally represent notes is not critical to the technique described herein.
[0012] To accurately represent captured audio data in digital note representation, large segments of audio data are processed in the implementation. The larger the segment, the more accurate the conversion from audio data to its frequency domain and the generation of notes based on the frequency-transformed data. However, the larger the segment, the more latency is introduced, and therefore, accurate transformation introduces so much latency between the audio signal and the resulting digital representation that it can no longer be considered real-time.
[0013] However, when real-time audio signals are processed in smaller portions (referred to as "windows" in this paper), no significant delay is introduced. Although each window may not correspond to enough audio samples to be accurately converted to digital music notation, the middle portion of the window is accurate when converted to digital music notation data. In the implementation, the next window is selected such that its filtered middle portion is a temporal continuation of the filtered middle portion of the previous window. Therefore, real-time audio signals can be processed in clusters of overlapping windows of frames. The term "frame" refers to a sequence of audio samples and their transformations. While processing the current window, audio samples for the next or subsequent windows are collected in real time through acquisition.
[0014] In the implementation, the sequence of audio samples for each frame is converted into a corresponding frequency domain frame, thereby generating a set of frequency domain frames for the window. Each set of frequency domain frames is transformed into a corresponding set of note event probability values. The probability values of the filtered set of frames for each window are converted into note-on and note-off events. The term "note-on event" refers to the event that detects the playing of a note that was not previously played in the audio signal. The term "note-off event" refers to the event that detects the end of the playing of a previously played note. Based on the probabilities determined for a frame, the technique determines, for each note, whether the frame contains a note-on event, a note-off event, or neither.
Data Flow Overview
[0015] Figure 1 is a block diagram depicting the data flow used to generate musical score data in the implementation scheme. At block 110, an audio signal 100 is acquired in real time by sampling the signal at a predetermined sampling frequency. After acquiring at least one frame of audio samples of the audio signal 100, a frame of sampled audio data 120 is generated at the audio signal acquisition block 110. The duration of the frame can be pre-configured to be at least the minimum duration necessary to acquire enough audio samples to capture the spectrum of the audio signal. A non-limiting example is a frame of 256 samples of the audio signal 100 acquired at a sampling frequency of 44.1 kHz, with each frame lasting approximately 0.0058 seconds. Each frame generated in the audio signal acquisition block 110 can contain a sequence of amplitude values for the acquired audio signal 100 as part of the sampled audio data 120.
[0016] At frequency domain transformation box 130, the process performs a frequency domain transformation for each frame of the sampled audio data 120, thereby generating spectrogram data 140 for each frame. This transformation converts the sequence of sampled audio data of each frame in the time domain into frequency component values in the frequency domain, thus producing spectrogram data 140. Frequency domain transformation box 130 can use any frequency transformation method, including but not limited to the constant Q transform (CQT).
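The per-frame transformation can be illustrated with a minimal sketch. It is not the transformation of the described implementation: a plain discrete Fourier transform with a Hann window is swapped in for the CQT so that the example needs only the standard library, and the names `frame_spectrum` and `spectrogram` are hypothetical. The frame size and sampling frequency are taken from the non-limiting example above.

```python
import cmath
import math

SAMPLE_RATE = 44_100  # Hz, from the example in the text
FRAME_SIZE = 256      # samples per frame, from the example in the text


def frame_spectrum(frame):
    """Magnitude spectrum of one frame (naive DFT used as a stand-in for CQT)."""
    n = len(frame)
    # Hann window reduces leakage caused by cutting the signal into frames.
    windowed = [x * (0.5 - 0.5 * math.cos(2 * math.pi * i / (n - 1)))
                for i, x in enumerate(frame)]
    return [abs(sum(x * cmath.exp(-2j * math.pi * k * i / n)
                    for i, x in enumerate(windowed)))
            for k in range(n // 2 + 1)]


def spectrogram(samples):
    """Split samples into consecutive frames and transform each one."""
    n_frames = len(samples) // FRAME_SIZE
    return [frame_spectrum(samples[f * FRAME_SIZE:(f + 1) * FRAME_SIZE])
            for f in range(n_frames)]


# Four frames of a 1 kHz test tone (assumed input, for illustration only).
tone = [math.sin(2 * math.pi * 1000.0 * i / SAMPLE_RATE)
        for i in range(FRAME_SIZE * 4)]
spec = spectrogram(tone)
peak_bin = max(range(len(spec[0])), key=spec[0].__getitem__)
```

Each entry of `spec` is one spectrogram frame. With 256 samples at 44.1 kHz the bins are about 172 Hz apart, which is coarse for low notes; this is one reason a transform with logarithmically spaced bins such as the CQT may be preferred in practice.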
[0017] In the implementation, when a specific number of spectrogram frames are generated, a time series set of these spectrograms is selected to generate the note event probability at box 150. The term "window set" refers to such a time series set where the members of the set are associated with the duration of a window. The size of the window can be configured using either the number of members or the duration itself. For example, the size of the window set for the spectrogram frames can be configured to contain 100 frames.
[0018] Because subsequent processing of audio-based data occurs on a per-window basis, the window size at least partially determines the latency of real-time audio processing. The larger the window size, the longer it takes to generate the next set of notes. Therefore, real-time conversion from audio to digital notes is performed by keeping the window size below a few seconds (e.g., 2 seconds).
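The relationship between frame size, window size, and latency can be checked with a short calculation. The sketch below assumes that the 256-sample frame at 44.1 kHz and the 100-frame window from the examples above are used together; the text does not state that these examples belong to the same configuration.

```python
SAMPLE_RATE_HZ = 44_100   # from the frame example in the text
SAMPLES_PER_FRAME = 256   # from the frame example in the text
FRAMES_PER_WINDOW = 100   # from the window example in the text

frame_seconds = SAMPLES_PER_FRAME / SAMPLE_RATE_HZ
window_seconds = FRAMES_PER_WINDOW * frame_seconds

print(f"{frame_seconds:.4f} s per frame, {window_seconds:.3f} s per window")
# prints "0.0058 s per frame, 0.580 s per window"
```

Under these assumptions a window spans roughly 0.58 seconds, well below the 2-second bound mentioned above.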
Event Probability Generation
[0019] In the implementation, the event probability generation box 150 receives a window set of spectrogram frames from the spectrogram data 140 as input and uses one or more statistical / predictive algorithms to determine the event probability of each note in each frame of the window. The event probabilities can be arranged as frames of probability tuples 160. The probability tuples 160 can contain different probabilities, such as a note-start probability (indicating the probability that the playing of the corresponding note begins in the corresponding frame, i.e., that the note transitions to the note-on state) and / or a note-on probability (indicating the probability that the corresponding note is being played in the corresponding frame, i.e., that the note is in the note-on state). Therefore, the event probability generation box 150 generates an output window set of frames corresponding to the input window set of frames, each frame of the output window set including a tuple of probability values for each note.
[0020] In the implementation, to generate probability tuples, the event probability generation box 150 includes a machine learning model, which is generated by training a corresponding machine learning algorithm using a training dataset of known notes for the spectrogram data. For example, one or more convolutional neural network (CNN) models are used in the event probability generation box 150 to generate note event probabilities. Such a CNN may include one or more convolutional layers, pooling layers, and / or fully connected layers. The convolutional layers use kernels to perform convolutions, where the dimension of the kernel can be a hyperparameter, and its weights are model artifacts of the CNN. Convolution is performed by sliding the kernel across the input tensor of the spectrogram data 140 and calculating the dot product between its weights and the area covered by the input tensor.
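The sliding dot product described above can be sketched for a single row of the input tensor. This is an illustration only: the function name `convolve_valid`, the averaging kernel, and the input values are assumptions, and the operation shown is the unflipped form (cross-correlation) that CNN layers conventionally call convolution.

```python
def convolve_valid(signal, kernel):
    """Slide the kernel across the signal; each output value is the dot
    product of the kernel weights and the region they currently cover."""
    width = len(kernel)
    return [sum(w * x for w, x in zip(kernel, signal[i:i + width]))
            for i in range(len(signal) - width + 1)]


row = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0]  # one frequency row, 7 frames (made up)
kernel = [0.2] * 5                          # width-5 kernel; weights are made up
out = convolve_valid(row, kernel)
```

Without padding, a kernel of width 5 produces 4 fewer outputs than inputs, because positions near either end do not have a full neighbourhood to cover.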
[0021] The selection of hyperparameter values for CNNs and other machine learning techniques is described in the sections “Machine Learning Algorithms and Domains” and “Hyperparameters, Cross-Validation, and Algorithm Selection”. In one or more implementations, the CNN of the event probability generation box 150 can be trained according to the techniques described in “Training Machine Learning Models”.
[0022] In the implementation, the initial input tensor of the CNN model is arranged such that each frame of the spectrogram data 140 in the window is a column in the initial input tensor, and the rows correspond to frequency values. The frequency values in the tensor can be normalized for the CNN to perform processing more efficiently.
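A minimal sketch of this arrangement follows. The text says only that the values can be normalized; scaling by the global maximum is an assumption made here for illustration, as are the function name `build_input_tensor` and the sample values.

```python
def build_input_tensor(spectrogram_frames):
    """Arrange frames as columns and frequency bins as rows, scaled to [0, 1].

    Normalizing by the global maximum is an assumed scheme, not one
    prescribed by the text.
    """
    peak = max(max(frame) for frame in spectrogram_frames) or 1.0
    n_bins = len(spectrogram_frames[0])
    return [[frame[b] / peak for frame in spectrogram_frames]
            for b in range(n_bins)]


# Two frames with three frequency bins each (made-up magnitudes).
frames = [[1.0, 2.0, 4.0], [2.0, 4.0, 8.0]]
tensor = build_input_tensor(frames)
# tensor[b][t] is the normalized magnitude of bin b in frame t
```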
[0023] The output of the event probability generation box 150 can be a window set of frames of probability tuples 160. For each input frame in the window set of spectrogram data 140, box 150 generates a corresponding output frame in probability tuples 160. In the output frame, the probability tuples are arranged such that each probability tuple corresponds to a specific note and indicates the probability value of that note being played in the received audio signal 100 at the time corresponding to the frame.
[0024] Figure 2 is a block diagram depicting an example of a frame and its window set for probability tuples in the implementation scheme. FRAME_T0 210 is the frame corresponding to duration T0. Based at least on the spectrogram data of the corresponding frame at T0, the event probability generation box 150 has generated the probability tuples for FRAME_T0 210. FRAME_T0 210 contains probability tuples for various notes (A0, G0...C8). FRAME_T0 210 can contain probability tuple values for all possible notes for a piano or any other instrument. Each example probability tuple contains a note-on probability (indicating whether the corresponding note is being played during duration T0) and a note-start probability value (indicating whether the corresponding note begins playing during duration T0). FRAME_T0 210 indicates that the probability of the G0 note being started and played is high during frame duration T0.
[0025] Although Figure 2 depicts only FRAME_T0, window set 200 also contains frames for the other consecutive durations (T0-T99) of the example window. Therefore, window set 200 contains a set of probability tuples for notes arranged from FRAME_T0 210 to FRAME_T99 299 for duration T99.
Generating the Musical Score
[0026] Continuing with Figure 1, in the implementation, the score generation box 170 uses a window set of frames of probability tuples 160 to generate an event (if any) for each note of each frame.
[0027] Figure 3 is a block diagram depicting the process for generating note events in the implementation scheme. At step 300, the process receives a first window set of frames containing probability tuples of notes. The process can begin processing the first window set of frames while the probability tuples of the first window set are being generated and subsequent frames of the audio signal (for the next window set) are still being sampled in real time. Because this is the first window set of the audio signal captured in real time, no notes have yet been played, and therefore the process initializes the set of previously played notes to the empty set.
[0028] At step 302, in the implementation scheme, the process filters out inaccurately generated probability tuples from the window set in edge frames. The event probability generation box 150 may introduce inaccuracies in frames that have less information about their neighboring frames. Such frames are referred to herein as "edge frames." At step 302, edge frames are filtered out from the window set of frames containing the probability tuples to generate a filtered window set.
[0029] For example, when a CNN model is used in the event probability generation box 150, each convolution introduces inaccuracies that depend on the width of the kernel used in the convolution operation (corresponding to the size of the frames in the input window set). For instance, in a CNN model that convolves with a kernel of width 5, followed by a kernel of width 3, then a kernel of width 5, then a kernel of width 7, then a kernel of width 7, then a kernel of width 7, and then a kernel of width 3, the number of inaccurate frames can be determined as follows: (5-1)+(3-1)+(5-1)+(7-1)+(7-1)+(7-1)+(3-1)=30 frames. Therefore, for such an example CNN, the edge frames would be configured as 15 leading edge frames and 15 trailing edge frames.
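The edge frame count for the example CNN can be reproduced with a few lines. The function name `edge_frames` is hypothetical, and the even split between leading and trailing edges follows the example above rather than a stated general rule.

```python
def edge_frames(kernel_widths):
    """Each kernel of width w leaves (w - 1) frames without full context."""
    inaccurate = sum(w - 1 for w in kernel_widths)
    leading = inaccurate // 2
    trailing = inaccurate - leading
    return inaccurate, leading, trailing


# Kernel widths of the example CNN, in order.
total, leading, trailing = edge_frames([5, 3, 5, 7, 7, 7, 3])
print(total, leading, trailing)  # prints "30 15 15"
```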
[0030] Figure 4 is an example of a filtered window set. In Figure 4, window set 200 of Figure 2 is filtered to exclude the leading and trailing edge frames determined for the example CNN above. Therefore, FRAME_T0 210 to FRAME_T14 414 are identified as the 15 leading edge frames of window set 200, and FRAME_T85 495 to FRAME_T99 299 are identified as the 15 trailing edge frames of window set 200. These frames are then filtered out to obtain the filtered window set 400.
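Filtering amounts to slicing off the edge frames. In the sketch below the frames are represented by their labels only, and `filter_window` is a hypothetical helper name.

```python
def filter_window(window, leading, trailing):
    """Drop the leading and trailing edge frames of a window set."""
    return window[leading:len(window) - trailing]


window_200 = [f"FRAME_T{i}" for i in range(100)]  # FRAME_T0 .. FRAME_T99
filtered_400 = filter_window(window_200, leading=15, trailing=15)
print(filtered_400[0], filtered_400[-1], len(filtered_400))
# prints "FRAME_T15 FRAME_T84 70"
```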
[0031] Continuing with Figure 3, the process iterates through the probability tuples of each note in each frame to determine whether any note event has occurred. At step 305, a frame is selected from the filtered window set of frames, and at step 310, the probability tuple of a note in the selected frame is selected.
[0032] At step 315, the process evaluates whether the selected note meets the criteria for being on (in the note-on state). In the implementation, the process selects a note-on probability from a selected probability tuple, which indicates the probability that the selected note is being played, and compares this probability value with a pre-configured threshold. If the probability value is higher than the pre-configured threshold for note on, the process determines that the selected note is currently being played, i.e., in the note-on state. If the probability value is lower than the pre-configured threshold for note on, the process determines that the selected note is not currently being played, i.e., in the note-off state.
[0033] In other implementations, the probability value of the selected note is determined by the probability value of the selected frame and additionally by the probability values of one or more adjacent frames. Multiple probability values can be aggregated using one or more aggregation functions (e.g., weighted averages), where probability values of frames that are closer in time are assigned higher weights than those of frames that are farther in time. Alternatively, probability values of other notes used for the same or different frames can be used. Probability values can be aggregated using one or more aggregation functions (e.g., weighted averages), where probability values of more similar notes are assigned higher weights than those of less similar notes.
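The temporal aggregation followed by the threshold comparison of step 315 might look like the sketch below. The weights, the threshold of 0.5, the renormalization at the ends of the sequence, and the function names are all assumptions chosen for illustration.

```python
def aggregated_probability(probs, index, weights=(0.25, 0.5, 0.25)):
    """Weighted average of a frame's probability and its neighbours'.

    The centre frame gets the highest weight; weights of neighbours that
    fall outside the sequence are dropped and the rest renormalized.
    """
    half = len(weights) // 2
    total = 0.0
    weight_sum = 0.0
    for offset, w in enumerate(weights, start=-half):
        j = index + offset
        if 0 <= j < len(probs):
            total += w * probs[j]
            weight_sum += w
    return total / weight_sum


def is_note_on(probs, index, threshold=0.5):
    """Note-on criterion: aggregated probability above the threshold."""
    return aggregated_probability(probs, index) > threshold


on_probs = [0.1, 0.9, 0.2]  # note-on probabilities of one note over 3 frames
```

With these made-up values, the middle frame aggregates to 0.525 and passes the criterion, while the first frame aggregates to about 0.37 and does not.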
[0034] Alternatively, pre-configured thresholds can be configured based on the type of audio source. Specifically, different thresholds can be assigned to different instruments and vocal sources for different notes. In such implementations, to determine whether a note-on criterion is met, the process can obtain the type of audio used for the window set and the corresponding threshold for that audio type and / or for a specific selected note.
[0035] Continuing with Figure 3, if the process determines that the selected note for the selected frame meets the note-on criteria, the process proceeds to step 320. At step 320, the process identifies the previously determined note-on state of the selected note. The process maintains a set of notes that were in a note-on state in the previous iteration. The term "previous note-on set" refers to this set of notes, and this set can change with each iteration of the frame / note.
[0036] For the first frame of the first filtered window set, the previous note-on set is empty because it can be assumed that no music was played before the first frame of the first filtered window set. Otherwise, the previous note-on set contains notes played in the previous frame as determined using the techniques described herein.
[0037] If, at step 320, the process identifies that the selected note in the selected frame was previously off, then the process determines that the note's state has changed from off to on.
[0038] In one implementation, the process determines at step 343 whether the note remains in the note-on state for a minimum number of frames. To avoid generating note-on states that are too short-lived, in such implementations, the process retrieves a minimum note-on frame threshold. The process then retrieves that number (or one less) of future frames for the selected note. If the selected note meets the note-on criteria described herein in those future frames, the process generates a note-on event at step 345 and stores it in MIDI format. The process also adds the selected note to the previous note-on set and / or increments a counter for the frames in which the selected note is in the note-on state.
[0039] The pre-configured threshold for the minimum number of frames in a note-on state can be configured based on the type of audio source. Specifically, different thresholds can be assigned to different instruments and vocal sources for different notes. In such implementations, to determine whether the note-on criterion is met, the process can obtain the type of audio for the window set and the corresponding threshold for that audio type and / or for a specific selected note.
[0040] Otherwise, if the minimum note-on state threshold is not met at step 343, in such an implementation, the process skips generating the note-on event and proceeds to step 305 and / or 310 to select the next note / frame.
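The minimum-duration check of step 343 can be sketched as a look-ahead over future frames. The function name, the threshold value, and the choice to reject a note when the window ends before enough frames are available are assumptions.

```python
def sustained_note_on(on_probs, frame_index, min_frames, threshold=0.5):
    """True if the note meets the note-on criterion in the selected frame
    and in the following frames, min_frames frames in total."""
    future = on_probs[frame_index:frame_index + min_frames]
    # Assumption: too few remaining frames counts as "not sustained".
    return len(future) == min_frames and all(p > threshold for p in future)


on_probs = [0.9, 0.8, 0.2, 0.9, 0.9, 0.9]  # made-up note-on probabilities
print(sustained_note_on(on_probs, 0, min_frames=3))  # prints "False"
print(sustained_note_on(on_probs, 3, min_frames=3))  # prints "True"
```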
[0041] On the other hand, the selected note may have already been played in a previous frame, i.e., it may already be in a note-on state. Therefore, if at step 320 the process identifies that the selected note is in the previous note-on set, the process proceeds to step 325. At step 325, the process determines whether this is a new start of the same note or a continuation of the same note-on state. To do so, the process evaluates whether the note-start probability of the selected probability tuple meets the note-start criterion.
[0042] In the implementation, the process selects a note-start probability from the selected probability tuple, indicating the probability that the selected note is (re)started in the selected frame. The process compares the note-start probability value with a pre-configured threshold for note start. If the probability value is higher than the pre-configured threshold, the process proceeds to step 330 and determines that the selected note is being replayed in the current frame. The selected note should therefore transition to a note-off state and then back to a note-on state.
[0043] If the probability value is lower than the pre-configured threshold for note start, the process determines that the selected note is played continuously; that is, the selected note remains in the note-on state. Therefore, at step 325, the note-start criterion is not met, and no event is generated because the note is already in the note-on state. The process can increment the frame count for the note-on state of the selected note.
[0044] In other implementations, the note-start probability value of the selected note is determined by the probability value of the selected frame and additionally by the probability values of one or more adjacent frames. Multiple probability values can be aggregated using one or more aggregation functions (e.g., weighted averages), where probability values of frames that are closer in time are assigned higher weights than those of frames that are farther in time. Alternatively, probability values of other notes from the same or different time frames can be used. Probability values can be aggregated using one or more aggregation functions (e.g., weighted averages), where probability values of more similar notes are assigned higher weights than those of less similar notes.
[0045] Alternatively, the pre-configured threshold for note start can be configured based on the type of audio source. Specifically, different thresholds can be assigned to different instruments and vocal sources for different notes. In such implementations, to determine whether the note-start criterion is met, the process can obtain the type of audio used for the window set and the corresponding threshold for that audio type and / or for a specific selected note.
[0046] Continuing with Figure 3, when the process determines that the selected note in the selected frame has been replayed, it proceeds to step 330. At step 330, in one implementation, the process determines whether the note has been in the note-on state for a minimum number of frames. To avoid a note-on state that lasts too short a time before turning off, in such implementations, the process tracks the number of frames in which notes from the previous note-on set have been in the note-on state. At step 340, the process retrieves a minimum note-on state threshold and compares it to the count obtained for the selected note. If the frame count is below the threshold, no new event needs to be generated. Since the note continues to be in the note-on state, the note-on state frame count can be incremented for that note.
[0047] Otherwise, if the count exceeds the minimum note-on threshold, a note-off event is generated in step 340. The note is removed from the previous note-on set, and / or the frame count for consecutive note-on states is reset for that note. Alternatively, the process may further generate note-on events and / or wait for the process to determine if a possible note-on event exists in the next frame. The generated event data may be stored in MIDI format.
[0048] Alternatively, at step 315, the selected note in the selected frame may fail to meet the note-on criterion. In that case, the process determines that the selected note is in a note-off state for the selected frame. The process proceeds to step 335 and evaluates whether the selected note was previously in the same note-off state. If the note is not in the previous note-on set, no event is needed because the selected note continues to remain in the note-off state. Otherwise, at step 350, the process generates a note-off event for the selected note and stores the event in MIDI format.
[0049] The steps of Figure 3 can be iterated for each note in the selected frame and for each frame of the filtered window set.
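The per-note, per-frame iteration can be condensed into a small state machine. The sketch below is a simplification under stated assumptions: it omits the minimum-duration checks of steps 330, 340, and 343, uses single-frame probabilities with thresholds of 0.5, and represents each frame as a dictionary from note name to an (on probability, start probability) tuple. None of these names or data structures come from the text.

```python
def detect_events(frames, previous_on=None, on_threshold=0.5, start_threshold=0.5):
    """Return (events, notes still on) for a filtered window set of frames."""
    previous_on = set() if previous_on is None else set(previous_on)
    events = []
    for index, frame in enumerate(frames):                 # step 305
        for note, (on_prob, start_prob) in frame.items():  # step 310
            if on_prob > on_threshold:                     # step 315
                if note not in previous_on:                # step 320: off -> on
                    events.append((index, note, "note_on"))
                    previous_on.add(note)
                elif start_prob > start_threshold:         # step 325: replayed
                    events.append((index, note, "note_off"))
                    events.append((index, note, "note_on"))
                # otherwise the note simply continues; no event
            elif note in previous_on:                      # step 335: on -> off
                events.append((index, note, "note_off"))
                previous_on.discard(note)
    return events, previous_on


frames = [
    {"C4": (0.9, 0.9)},  # note begins
    {"C4": (0.9, 0.1)},  # note continues
    {"C4": (0.9, 0.8)},  # note is struck again
    {"C4": (0.1, 0.0)},  # note ends
]
events, still_on = detect_events(frames)
```

For the made-up input, `events` holds a note-on at frame 0, a note-off and note-on pair at frame 2, and a note-off at frame 3. Returning the set of notes still on allows it to be passed as `previous_on` when the next filtered window set is processed.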
Processing Subsequent Window Sets
[0050] Figure 5 is a block diagram depicting the process of determining the digital note representation for the next window set in the implementation scheme.
[0051] At step 500, the process receives the next window set of frames containing probability tuples. Simultaneously, subsequent frames of the audio signal are being sampled in real time to generate the following window set. In this implementation, the frames of the next window set of probability tuples overlap in time with the frames of the previously processed window set. As described above, to generate accurate note events in real time, only a subset of the window set of frames, i.e., the filtered window set, is used. For the filtered window sets to be consecutive, the frames at the trailing edge of the previous window set overlap with frames of the next filtered window set, and the frames at the leading edge of the next window set overlap with frames of the previous filtered window set.
[0052] Figure 6 is a block diagram depicting an example of a sequential window set. Window set 600 is the next window set in time after window set 200 (also depicted in Figure 2 and Figure 4). Window set 600 begins with the frame corresponding to duration T70 instead of T100 (the frame after the last frame of the previous window set 200). By selecting the already processed frames T70 to T99 for the next window set 600, the process ensures that the next filtered window set 610 of window set 600 immediately follows the filtered window set 400 of the previous window set 200. This selection of the next window set also reduces the time delay in real-time processing from 100 frames to 70 frames, which is the size of the filtered window set.
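The starting frame of the next window set follows from the window size and the edge frame counts. The sketch below reproduces the numbers of this example; the helper names are hypothetical.

```python
def next_window_start(previous_start, window_size, leading_edge, trailing_edge):
    """Advance by the size of the filtered window set so that consecutive
    filtered window sets are contiguous in time."""
    return previous_start + window_size - leading_edge - trailing_edge


def filtered_range(start, window_size, leading_edge, trailing_edge):
    """First and last frame index kept after edge frames are removed."""
    return start + leading_edge, start + window_size - trailing_edge - 1


start_600 = next_window_start(0, 100, 15, 15)
print(start_600)                               # prints "70"
print(filtered_range(0, 100, 15, 15))          # prints "(15, 84)"
print(filtered_range(start_600, 100, 15, 15))  # prints "(85, 154)"
```

The previous filtered window set ends at T84 and the next one starts at T85, so no frame is skipped or processed twice.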
[0053] Continuing with Figure 5, at step 510, the frames of the next window set of frames of probability tuples are filtered using the technique described for step 302 of Figure 3.
[0054] At step 520, the process determines the notes that are in the note-on state after the previous filtered window set has been processed. In the implementation, the previous note-on set updated by the last iteration of the previous filtered window set is initialized as the previous note-on set for the first frame of the next filtered window set.
[0055] After determining the previous note-on set for the first frame of the next filtered window set, the process transitions to step 305 of Figure 3 and performs event detection and generation for each note of each frame in the next filtered window set.
Storing the Generated Musical Score
[0056] At the end of processing each filtered window set, the set of events detected for the frames of the filtered window set is obtained. The corresponding event data is stored on a medium in MIDI format. Continuing with Figure 1, MIDI event data 180 is generated by adding a timestamp to each generated event, corresponding to the frame for which the event was generated. The timestamp can be calculated based on the frame number and the configured frame duration. The timestamp can be the start timestamp of the frame, the end timestamp of the frame, or any other timestamp corresponding to the duration of the frame.
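A timestamp derived from the frame number and the frame duration can be sketched as follows. The function name and the `position` parameter are hypothetical, and the default frame size and sampling frequency are taken from the earlier non-limiting example. The result is in seconds; converting it to the tick-based delta times of a MIDI file is outside the scope of this sketch.

```python
def frame_timestamp(frame_number, samples_per_frame=256,
                    sample_rate_hz=44_100, position="start"):
    """Time in seconds of the start, middle, or end of a frame."""
    frame_seconds = samples_per_frame / sample_rate_hz
    offset = {"start": 0.0, "middle": 0.5, "end": 1.0}[position]
    return (frame_number + offset) * frame_seconds


onset = frame_timestamp(100)                     # start of frame 100
release = frame_timestamp(99, position="end")    # end of frame 99
```

The end of frame 99 and the start of frame 100 coincide at roughly 0.58 seconds, so either convention yields a consistent timeline as long as it is applied uniformly.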
Training Machine Learning Models
[0057] Machine learning techniques involve applying machine learning algorithms to a training dataset where the results are known, with initialized parameters whose values are modified in each training iteration to more accurately produce the known results (referred to as "labels" in this paper). Based on such applications, this technique generates machine learning models with known parameters. Therefore, a machine learning model comprises a model data representation or model parameters. Model parameters consist of parameter values that are applied to the input by the machine learning algorithm to generate predicted outputs. Training a machine learning model requires determining the parameter values of the model parameters. The structure and organization of the parameter values depend on the machine learning algorithm.
[0058] Therefore, the term "machine learning algorithm" (or simply "algorithm") in this paper refers to the set of procedures or rules to be followed in computation, where the model parameters (including one or more parameters) used for computation are unknown. The term "machine learning model" (or simply "model") in this paper refers to the set of procedures or rules to be followed in computation, where the model parameters (including one or more parameters) are known and have been derived based on training the corresponding machine learning algorithm using one or more training datasets. Once training is complete, the input is applied to the machine learning model to make predictions, which may also be referred to as the prediction result or output in this paper.
[0059] In supervised training, training data is used by a supervised training algorithm to train a machine learning model. Training data includes inputs and "known" outputs, i.e., labels. In implementations, the supervised training algorithm is an iterative process. In each iteration, the machine learning algorithm applies the model parameters and inputs to generate a predicted output. The error or variance between the predicted output and the known output is calculated using an objective function. In effect, the output of the objective function indicates the accuracy of the machine learning model based on a specific state of the model parameters during the iteration. The parameter values of the model are adjusted by applying an optimization algorithm based on the objective function. Iterations can be repeated until the desired accuracy is achieved or some other criterion is met.
[0060] In an implementation, to iteratively train the algorithm to generate a trained model, a training dataset can be arranged such that each row of the dataset is an input to the machine learning algorithm and further stores the corresponding actual result, i.e., the label value, for that row. For example, each row of the adult income dataset represents a specific adult for whom the result is known, such as whether the adult's total income exceeds $500,000. Each column of the adult training dataset contains a numerical representation of a specific adult characteristic (e.g., whether the adult has a college degree, the adult's age, etc.). Based on this, once the algorithm is trained, it can accurately predict whether the total income of any adult (even one not described in the training dataset) exceeds $500,000.
[0061] The row values of the training dataset can be fed as input to a machine learning algorithm and modified based on one or more parameters of the algorithm to produce a predicted result. The predicted result for a row is compared to the label value for that row, and an error value is calculated based on the difference. One or more error values of a batch of rows are combined by a statistical aggregation function to calculate the batch error value. The term "loss" refers to the error value of a batch of rows.
[0062] In each training iteration, a corresponding loss value is calculated based on one or more predicted values. For the next training iteration, one or more parameters are modified based on the current loss to reduce the loss. Any number of iterations can be performed on the training dataset to reduce the loss. Training iterations using the training dataset can be stopped when the loss variation between iterations is within a threshold. In other words, iteration stops when the loss across different iterations is substantially the same.
[0063] After training iterations, the generated machine learning model includes a machine learning algorithm with model parameters that produce the minimum loss.
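The iterative procedure of the preceding paragraphs can be sketched as follows. This is only an illustration under stated assumptions: it fits a one-feature linear model by plain gradient descent on a mean-squared-error loss, none of which is prescribed by the text, and the names `train_linear`, `lr`, `tol`, and `max_iters` are hypothetical.

```python
def train_linear(rows, labels, lr=0.05, tol=1e-9, max_iters=10_000):
    """Fit y ~ w * x + b; stop when the loss no longer changes materially."""
    if not rows or len(rows) != len(labels):
        raise ValueError("rows and labels must be non-empty and equally long")

    w, b = 0.0, 0.0              # initialized parameter values
    n = len(rows)
    prev_loss = float("inf")
    loss = float("inf")

    for _ in range(max_iters):
        # Apply the current parameters to every row to get predictions.
        errors = [(w * x + b) - y for x, y in zip(rows, labels)]

        # Aggregate the per-row errors into a single batch loss (MSE).
        loss = sum(e * e for e in errors) / n

        # Stop once the loss variation between iterations is within tol.
        if abs(prev_loss - loss) < tol:
            break
        prev_loss = loss

        # Modify the parameters in the direction that reduces the loss.
        grad_w = 2.0 * sum(e * x for e, x in zip(errors, rows)) / n
        grad_b = 2.0 * sum(errors) / n
        w -= lr * grad_w
        b -= lr * grad_b

    return w, b, loss
```

Called on rows `[0, 1, 2, 3, 4]` with labels `[1, 3, 5, 7, 9]`, the function should return parameter values close to `w = 2` and `b = 1`, since the labels follow y = 2x + 1 exactly.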
[0064] For example, the aforementioned adult income dataset can be iteratively trained using a Support Vector Machine (SVM) algorithm to train an SVM-based model for the adult income dataset. Each row of the adult dataset is fed as input to the SVM algorithm, and the result of the SVM algorithm, i.e., the predicted result, is compared with the actual result for that row to determine the loss. Based on the loss, the parameters of the SVM are modified. The next row is then fed to the SVM algorithm with the modified parameters to produce the predicted result for the next row. This process can be repeated until the difference between the loss values of the previous iteration and the current iteration is below a predefined threshold, or in some implementations, until the difference between the minimum loss value reached and the loss of the current iteration is below a predefined threshold.
[0065] Once the machine learning model for the machine learning algorithm is determined, new datasets with unknown results can be used as input to the model to compute predictions for the new dataset.
[0066] In a software implementation, when a machine learning model is referred to as receiving input, executing, and/or generating output or a prediction, the computer system process executing the machine learning algorithm applies model parameters to the input to generate the predicted output. The computer system process executes the machine learning algorithm by executing software configured to cause the algorithm to run.
Machine learning algorithms and domains
[0067] Machine learning algorithms can be selected based on the domain of the problem and the type of output required. Non-limiting examples of algorithm output types are discrete values for problems in the classification domain, continuous values for problems in the regression domain, and anomaly detection for problems in the clustering domain.
[0068] However, even for a specific domain, there are many algorithms from which to select the most accurate one for a given problem. As non-limiting examples, in the classification domain, support vector machines (SVM), random forests (RF), decision trees (DT), Bayesian networks (BN), randomized algorithms such as genetic algorithms (GA), or connectionist topologies such as artificial neural networks (ANN) can be used.
[0069] Machine learning implementations can rely on matrices, symbolic models, and hierarchical and/or associative data structures. Parameterized (i.e., configurable) implementations of best-practice machine learning algorithms can be found in open-source libraries, such as Google's TensorFlow for Python and C++, or Georgia Institute of Technology's MLPack for C++. Shogun is an open-source C++ ML library that supports multiple programming languages, including C#, Ruby, Lua, Java, MatLab, R, and Python.
Hyperparameters, cross-validation, and algorithm selection
[0070] A single machine learning algorithm type can have an infinite number of variations based on one or more hyperparameters. The term "hyperparameter" refers to a parameter among the model parameters that is set before the machine learning model is trained and remains unchanged during training. In other words, hyperparameters are constant values that influence (or control) the generated trained model, independent of the training dataset. Machine learning models with only hyperparameter values set are referred to herein as "variants of machine learning algorithms," or simply "variants." Accordingly, different hyperparameter values for the same type of machine learning algorithm can produce significantly different loss values on the same training dataset during model training.
[0071] For example, the SVM machine learning algorithm includes two hyperparameters: "C" and "gamma". The "C" hyperparameter can be set to any value from 10^-3 to 10^5, while the "gamma" hyperparameter can be set to any value from 10^-5 to 10^3. Therefore, the "C" and "gamma" hyperparameters have an infinite number of permutations, which may produce different loss values when training on the same adult income training dataset.
[0072] Therefore, in order to select an algorithm type, or further, to select the best-performing algorithm variant, various hyperparameter selection techniques are used to generate different sets of hyperparameter values. Non-limiting examples of hyperparameter value selection techniques include: Bayesian optimization (e.g., a Gaussian process used for hyperparameter value selection), random search, gradient-based search, grid search, manual tuning techniques, and techniques based on tree-structured Parzen estimators (TPEs).
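Two of the listed selection techniques, grid search and random (stochastic) search, can be sketched for the SVM example as follows. The exponent ranges are those given for "C" and "gamma" in the SVM example; the choice of one grid point per power of ten, the log-uniform sampling, and all function names are assumptions made for illustration.

```python
import itertools
import random


def log_grid(lo_exp, hi_exp):
    """Powers of ten from 10**lo_exp to 10**hi_exp, inclusive."""
    return [10.0 ** e for e in range(lo_exp, hi_exp + 1)]


def grid_search_candidates(c_exps=(-3, 5), gamma_exps=(-5, 3)):
    """Every combination of the log-spaced C and gamma values."""
    c_values = log_grid(*c_exps)
    gamma_values = log_grid(*gamma_exps)
    return [
        {"C": c, "gamma": g}
        for c, g in itertools.product(c_values, gamma_values)
    ]


def random_search_candidates(n, c_exps=(-3, 5), gamma_exps=(-5, 3), seed=0):
    """n hyperparameter sets drawn log-uniformly from the same ranges."""
    rng = random.Random(seed)  # seeded so that trials are reproducible
    return [
        {
            "C": 10.0 ** rng.uniform(*c_exps),
            "gamma": 10.0 ** rng.uniform(*gamma_exps),
        }
        for _ in range(n)
    ]
```

Each returned dictionary is one set of hyperparameter values, that is, one candidate variant to be trained and scored.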
[0073] Each machine learning algorithm variant is trained on a training dataset using a different set of hyperparameter values selected by one or more of these techniques. A test dataset is used as input to the trained model to compute predicted values. The predicted values are compared to their corresponding label values to determine a performance score. The performance score can be calculated based on the error rate of the predicted values relative to their corresponding labels. For example, in a classification domain, if the predictions for only 9,000 out of 10,000 inputs to the model match the labels for those inputs, the performance score is calculated as 90%. In non-classification domains, the performance score can instead be based on a statistical aggregation of the differences between the label values and the predicted values.
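The two kinds of performance score mentioned above can be illustrated with a short sketch. The function names are hypothetical, and mean absolute error is used here only as one possible statistical aggregation of the differences; the text does not name a specific one.

```python
def accuracy(predictions, labels):
    """Fraction of predictions that match their labels (classification)."""
    if not labels or len(predictions) != len(labels):
        raise ValueError("predictions and labels must be non-empty and equally long")
    matches = sum(1 for p, y in zip(predictions, labels) if p == y)
    return matches / len(labels)


def mean_absolute_error(predictions, labels):
    """Average absolute difference between predictions and labels."""
    if not labels or len(predictions) != len(labels):
        raise ValueError("predictions and labels must be non-empty and equally long")
    return sum(abs(p - y) for p, y in zip(predictions, labels)) / len(labels)
```

For the example in the text, 9,000 matching predictions out of 10,000 inputs give `accuracy(...) == 0.9`, i.e., a performance score of 90%.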
[0074] The term "trial" herein refers to training a machine learning algorithm using a different set of hyperparameter values and testing the algorithm using at least one test dataset. In an implementation, cross-validation techniques, such as k-fold cross-validation, are used to create multiple pairs of training and test datasets from the original training dataset. Each pair of datasets together contains the original training dataset, but the pairs partition the original dataset between the training and test datasets in different ways. For each pair of datasets, the training dataset is used to train the model based on the selected set of hyperparameters, and the corresponding test dataset is used to compute prediction values with the trained model. Based on inputting the test dataset into the trained machine learning model, a performance score for that pair (or fold) is calculated. If there is more than one pair (i.e., fold), the performance scores are statistically aggregated (e.g., mean, minimum, maximum) to produce a final performance score for the variant of the machine learning algorithm.
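A trial with k-fold cross-validation, as described above, might be sketched as follows. The interleaved fold assignment and the names `k_fold_pairs`, `run_trial`, `train_fn`, and `score_fn` are assumptions for illustration; the caller supplies the training and scoring functions.

```python
from statistics import mean


def k_fold_pairs(rows, k):
    """Yield k (train, test) pairs that each cover all of rows."""
    if k < 2 or k > len(rows):
        raise ValueError("k must be between 2 and len(rows)")
    folds = [rows[i::k] for i in range(k)]  # interleaved partition
    for i in range(k):
        test = folds[i]
        train = [r for j, fold in enumerate(folds) if j != i for r in fold]
        yield train, test


def run_trial(rows, k, hyperparams, train_fn, score_fn, aggregate=mean):
    """Train and score one variant on every fold, then aggregate.

    train_fn(train_rows, hyperparams) returns a model;
    score_fn(model, test_rows) returns a performance score.
    """
    scores = []
    for train, test in k_fold_pairs(rows, k):
        model = train_fn(train, hyperparams)
        scores.append(score_fn(model, test))
    return aggregate(scores)
```

Passing `aggregate=min` or `aggregate=max` selects the other aggregations named in the text instead of the mean.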
[0075] Each trial is computationally very expensive because it involves multiple training iterations of variants of the machine learning algorithm to generate performance scores for a set of different hyperparameter values. Therefore, reducing the number of trials can significantly reduce the computational resources (e.g., processor time and cycles) required for tuning.
[0076] Furthermore, since the performance score is generated to select the most accurate algorithm variant, the more accurate the performance score itself, the more accurate the generated model's predictions will be relative to other variants. In fact, once the machine learning algorithm and its hyperparameter-based variant are selected, the machine learning model is trained by applying the algorithm variant to the full training dataset using the techniques described above. It is expected that this generated machine learning model will predict outcomes more accurately than machine learning models of any other variant of the algorithm.
[0077] The accuracy of the performance score itself depends on how many computational resources are spent tuning the hyperparameters of the algorithm. Computational resources may be wasted on testing sets of hyperparameter values that cannot produce the accuracy expected of the final model.
[0078] Similarly, for algorithms that are likely to be less accurate than other types of algorithms, fewer (or no) computational resources can be spent tuning their hyperparameters. Therefore, the number of trials for the hyperparameters of such discounted algorithms can be reduced or eliminated, thereby significantly improving the performance of the computer system.
Software Overview
[0079] Figure 7 is a block diagram of a basic software system 700 that can be used to control the operation of the computing system 800 of Figure 8. The software system 700 and its components, including their connections, relationships, and functions, are intended only as examples and are not meant to limit the implementation of the example scheme. Other software systems suitable for implementing the example scheme may have different components, including components with different connections, relationships, and functions.
[0080] A software system 700 is provided to direct the operation of the computing system 800. The software system 700, which may be stored in system memory (RAM) 806 and on fixed storage (e.g., hard disk or flash memory) 810, includes a kernel or operating system (OS) 710.
[0081] OS 710 manages the low-level aspects of computer operation, including managing process execution, memory allocation, file input and output (I/O), and device I/O. One or more applications (represented as 702A, 702B, 702C...702N) can be "loaded" (e.g., transferred from fixed storage 810 into memory 806) for execution by system 700. Applications or other software intended for use on computer system 800 can also be stored as a set of downloadable computer-executable instructions, for example, for downloading and installing from an Internet location (e.g., a web server, app store, or other online service).
[0082] Software system 700 includes a graphical user interface (GUI) 715 for receiving user commands and data graphically (e.g., by "click" or "touch gestures"). These inputs can then be acted upon by system 700 according to instructions from operating system 710 and/or application 702. GUI 715 also displays the results of operations from OS 710 and application 702, after which the user can provide additional input or terminate the session (e.g., log off).
[0083] OS 710 can be executed directly on the bare hardware 720 of computer system 800 (e.g., processor 804). Alternatively, a hypervisor or virtual machine monitor (VMM) 730 can be inserted between the bare hardware 720 and OS 710. In this configuration, VMM 730 acts as a software "buffer" or virtualization layer between OS 710 and bare hardware 720 of computer system 800.
[0084] VMM 730 instantiates and runs one or more virtual machine instances (“guests”). Each guest includes a “guest” operating system (e.g., OS 710) and one or more applications designed to execute on the guest operating system (e.g., application 702). VMM 730 presents a virtual operating platform to the guest operating system and manages the execution of the guest operating system.
[0085] In certain situations, VMM 730 can allow a guest operating system to run as if it were running directly on the bare hardware 720 of the computer system 800. In these cases, the same version of the guest operating system configured to run directly on the bare hardware 720 can also run on VMM 730 without modification or reconfiguration. In other words, in some situations, VMM 730 can provide complete hardware and CPU virtualization to the guest operating system.
[0086] In other cases, the guest operating system can be specifically designed or configured to run on the VMM 730 for improved efficiency. In these cases, the guest operating system is "aware" that it is running on the virtual machine monitor. In other words, under certain circumstances, the VMM 730 can provide paravirtualization to the guest operating system.
[0087] A computer system process comprises an allocation of hardware processor time and an allocation of memory (physical and/or virtual). The memory allocation is used to store instructions executed by the hardware processor, to store data generated by the execution of those instructions, and/or to store hardware processor state (e.g., register contents) between hardware processor time allocations when the computer system process is not running. Computer system processes run under the control of the operating system and can also run under the control of other programs executing on the computer system.
[0088] Multiple threads can run within a single process. Each thread also includes an allocation of hardware processing time, but shares access to the memory allocated to that process. When a thread is not running, the memory is used to store processor state between allocations. The term "thread" can also be used to refer to a computer system process in which multiple threads are not running.
Cloud computing
[0089] The term "cloud computing" is generally used herein to describe a computing model that enables on-demand access to a shared pool of computing resources, such as computer networks, servers, software applications, and services, and allows for the rapid provisioning and release of resources with minimal management effort or service provider interaction.
[0090] Cloud computing environments (sometimes called cloud environments or the cloud) can be implemented in various ways to best meet different needs. For example, in a public cloud environment, the underlying computing infrastructure is owned by an organization that makes its cloud services available to other organizations or the general public. In contrast, private cloud environments are typically designed for use by or within a single organization. Community clouds are designed to be shared by several organizations within a community; while hybrid clouds include two or more types of clouds (e.g., private, community, or public) that are bound together by data and application portability.
[0091] Typically, a cloud computing model shifts responsibilities that were previously handled by an organization's own IT department to a cloud environment, where they are delivered to consumers as service layers (the cloud environment may be internal or external to the organization, depending on the public/private nature of the cloud). The precise definition of the components or functions provided by or within each cloud service layer can vary depending on the specific implementation, but common examples include: Software as a Service (SaaS), where consumers use software applications running on cloud infrastructure, while the SaaS provider manages or controls the underlying cloud infrastructure and applications; Platform as a Service (PaaS), where consumers can use software programming languages and development tools supported by the PaaS provider to develop, deploy, and otherwise control their own applications, while the PaaS provider manages or controls other aspects of the cloud environment (i.e., everything below the runtime execution environment); Infrastructure as a Service (IaaS), where consumers can deploy and run arbitrary software applications and/or configure processing, storage, networking, and other basic computing resources, while the IaaS provider manages or controls the underlying physical cloud infrastructure (i.e., everything below the operating system layer); and Database as a Service (DBaaS), where consumers use database servers or database management systems running on cloud infrastructure, while the DBaaS provider manages or controls the underlying cloud infrastructure, applications, and servers, including one or more database servers. In a cloud computing environment, deep insights into applications or application data are not readily available. For planned operations requiring disconnection, the techniques discussed herein allow the session to be released and then rebalanced later without interrupting the application.
[0092] The basic computer hardware and software, as well as the cloud computing environment described above, are presented to illustrate the basic underlying computer components that can be used to implement the example implementations. However, the example implementations are not necessarily limited to any particular computing environment or computing device configuration. Rather, those skilled in the art will understand from this disclosure that the example implementations can be implemented in any type of system architecture or processing environment capable of supporting the functionality and features of the example implementations described herein.
Hardware Overview
[0093] According to one implementation, the techniques described herein are implemented by one or more dedicated computing devices. These dedicated computing devices may be hardwired to execute these techniques, or may include digital electronic devices, such as one or more application-specific integrated circuits (ASICs) or field-programmable gate arrays (FPGAs), persistently programmed to execute these techniques, or may include one or more general-purpose hardware processors programmed to execute these techniques according to program instructions in firmware, memory, other storage, or a combination thereof. Such dedicated computing devices may also combine custom hardwired logic, ASICs, or FPGAs with custom programming to implement these techniques. Dedicated computing devices may be desktop computer systems, portable computer systems, handheld devices, network devices, or any other device that includes hardwired and/or program logic to implement these techniques.
[0094] For example, Figure 8 is a block diagram illustrating a computer system 800 upon which an implementation of the present invention may be implemented. The computer system 800 includes a bus 802 or other communication mechanism for communicating information, and a hardware processor 804 coupled to the bus 802 for processing information. The hardware processor 804 may be, for example, a general-purpose microprocessor.
[0095] Computer system 800 also includes main memory 806, such as random access memory (RAM) or other dynamic storage device, coupled to bus 802, for storing information and instructions to be executed by processor 804. Main memory 806 can also be used to store temporary variables or other intermediate information during the execution of instructions to be executed by processor 804. Such instructions, when stored in a non-transitory storage medium accessible to processor 804, render computer system 800 a special-purpose machine customized to perform the operations specified in the instructions.
[0096] The computer system 800 further includes a read-only memory (ROM) 808 or other static storage device coupled to a bus 802 for storing static information and instructions for the processor 804. A storage device 810, such as a magnetic disk or optical disk, is provided and coupled to the bus 802 for storing information and instructions.
[0097] Computer system 800 can be coupled to display 812, such as a cathode ray tube (CRT), via bus 802 for displaying information to the computer user. Input device 814, including alphanumeric keys and other keys, is coupled to bus 802 for communicating information and command selection to processor 804. Another type of user input device is cursor controller 816, such as a mouse, trackball, or cursor arrow keys, for communicating directional information and command selection to processor 804, and for controlling cursor movement on display 812. This input device typically has two degrees of freedom on two axes, a first axis (e.g., x) and a second axis (e.g., y), which allows the device to specify a position in a plane.
[0098] Computer system 800 may implement the techniques described herein using custom hardwired logic, one or more ASICs or FPGAs, firmware, and/or program logic. These, in combination with the computer system, cause or program the computer system 800 to be a special-purpose machine. According to one implementation, the techniques of this invention are executed by computer system 800 in response to processor 804 executing one or more sequences of one or more instructions contained in main memory 806. Such instructions may be read into main memory 806 from another storage medium, such as storage device 810. Executing the sequences of instructions contained in main memory 806 causes processor 804 to perform the process steps described herein. In alternative implementations, hardwired circuitry may be used in place of or in combination with software instructions.
[0099] As used herein, the term "storage media" refers to any non-transitory medium that stores data and/or instructions that cause a machine to operate in a particular manner. Such storage media can include non-volatile media and/or volatile media. Non-volatile media include, for example, optical discs or magnetic disks, such as storage device 810. Volatile media include dynamic memory, such as main memory 806. Common forms of storage media include, for example, floppy disks, flexible disks, hard disks, solid-state drives, magnetic tape or any other magnetic data storage media, CD-ROMs, any other optical data storage media, any physical media with patterns of holes, RAM, PROMs and EPROMs, FLASH-EPROMs, NVRAMs, and any other memory chips or cartridges.
[0100] Storage media differ from transmission media, but can be used in conjunction with them. Transmission media participate in the transfer of information between storage media. For example, transmission media include coaxial cables, copper wires, and optical fibers, including the lines that constitute bus 802. Transmission media can also take the form of sound waves or light waves, such as those generated during radio wave and infrared data communication.
[0101] Various forms of media may be involved in transmitting one or more sequences of one or more instructions to processor 804 for execution. For example, instructions may initially be carried on a disk or solid-state drive of a remote computer. The remote computer may load the instructions into its dynamic memory and transmit them over a telephone line using a modem. A modem local to computer system 800 may receive data over the telephone line and convert the data into an infrared signal using an infrared transmitter. An infrared detector may receive the data carried in the infrared signal, and appropriate circuitry may place the data on bus 802. Bus 802 transmits the data to main memory 806, from which processor 804 retrieves and executes the instructions. Instructions received in main memory 806 may be selectively stored on storage device 810 before or after execution by processor 804.
[0102] Computer system 800 also includes a communication interface 818 coupled to bus 802. Communication interface 818 provides bidirectional data communication coupling to a network link 820 connected to a local network 822. For example, communication interface 818 may be an Integrated Services Digital Network (ISDN) card, cable modem, satellite modem, or modem to provide data communication connectivity to a corresponding type of telephone line. As another example, communication interface 818 may be a Local Area Network (LAN) card to provide data communication connectivity to a compatible LAN. A wireless link may also be implemented. In any such implementation, communication interface 818 transmits and receives electrical, electromagnetic, or optical signals carrying digital data streams representing various types of information.
[0103] Network link 820 typically provides data communication to other data devices via one or more networks. For example, network link 820 may provide connectivity to host 824 or to data devices operated by Internet Service Provider (ISP) 826 via local network 822. ISP 826, in turn, provides data communication services via a global packet data communication network now commonly referred to as the "Internet" 828. Both local network 822 and Internet 828 use electrical, electromagnetic, or optical signals that carry digital data streams. Signals through various networks, as well as signals on network link 820 and through communication interface 818 (which transmits digital data to and from computer system 800), are example forms of transmission media.
[0104] Computer system 800 can send messages and receive data, including program code, via a network, network link 820, and communication interface 818. In the Internet example, server 830 can transmit requested code for an application via the Internet 828, ISP 826, local network 822, and communication interface 818.
[0105] The received code can be executed by processor 804 upon receipt and/or stored in storage device 810 or other non-volatile memory for later execution.
Compute nodes and clusters
[0106] A compute node is a combination of one or more hardware processors, each sharing access to byte-addressable memory. Each hardware processor is electrically coupled to registers on the same chip as the hardware processor and is capable of executing instructions that reference memory addresses in the addressable memory, causing the hardware processor to load data at that memory address into any register. Additionally, one or more hardware processors can access their own dedicated memory, which cannot be accessed by other processors. One or more hardware processors can operate under the control of the same operating system.
[0107] A hardware processor may include multiple core processors on the same chip, each core processor (“core”) capable of executing machine code instructions independently within the same clock cycle as another of the multiple cores. Each core processor may be electrically coupled to a temporary memory that cannot be accessed by any of the other core processors in the multi-core processor.
[0108] A cluster comprises compute nodes that communicate with each other over a network. Each node in the cluster can be coupled to a network card or a network integrated circuit on the same board as the compute node. Network communication between any two nodes occurs through the network card or network integrated circuit on one node and the network card or network integrated circuit on the other node. The network can be configured to support remote direct memory access.
[0109] In the foregoing specification, implementations have been described with reference to numerous specific details that may vary from implementation to implementation. Therefore, the specification and drawings should be considered illustrative rather than restrictive. The sole and exclusive indicator of the scope of this invention, and what the applicant intends to be the scope of this invention, is the literal and equivalent scope of the claims set forth in this application, in the specific form adopted by the claims, including any subsequent amendments.
Claims
1. A computer-implemented method, comprising: receiving a first sample sequence of an audio stream; while receiving subsequent samples of the audio stream that are later in time than the first sample sequence: generating a first window set of note event probability values based, at least in part, on the first sample sequence; excluding, from the first window set of the note event probability values, a first leading set of note event probability values corresponding to a first leading edge of samples of the first sample sequence and a first trailing set of note event probability values corresponding to a first trailing edge of samples of the first sample sequence, thereby generating a first filtered window set of note event probability values, wherein the first leading edge of samples of the first sample sequence includes a plurality of initial samples of the first sample sequence, and the first trailing edge of samples of the first sample sequence includes a plurality of last samples of the first sample sequence; and determining a first sequence set of note events based, at least in part, on the first filtered window set of the note event probability values.
2. The method according to claim 1, further comprising: For each frame of the first window set of the note event probability values, it is determined, at least in part, based on one or more previous frames for a particular note in the first window set of the note event probability values, whether a note-on event or a note-off event for the particular note has been detected.
3. The method according to claim 1, further comprising: For a specific frame of the first window set of the note event probability values, a note-on event of a specific note is determined to be detected, based at least in part on the fact that the probability value for note-on has met the criteria for note-on state and that the specific note has a note-off state in the previous frame of the first window set of the note event probability values.
4. The method of claim 3, further comprising: for the specific frame of the first window set of the note event probability values, determining that the note-on event of the specific note is detected based, at least in part, on the probability values for note-on in the next one or more frames of the first window set having met the criteria for the note-on state.
5. The method of claim 1, further comprising: for a specific frame of the first window set of note event probability values, determining that a note-off event of a specific note is detected, based at least in part on: a) the probability value for note activation having met the criterion for the note activation state; b) the probability value for note start having met the criterion for note start; and c) a minimum number of frames preceding the specific frame having a note-on state.
6. The method of claim 1, further comprising: receiving a second sample sequence of the audio stream; and, while receiving next samples of the audio stream that are later in time than the second sample sequence: generating a second window set of note event probability values based at least in part on the second sample sequence and the plurality of last samples of the first sample sequence; excluding, from the second window set of note event probability values, a second leading set of note event probability values corresponding to samples that precede the plurality of last samples of the first sample sequence and a second trailing set of note event probability values corresponding to a trailing edge of samples of the second sample sequence; including, from the second window set of note event probability values, the note event probability values corresponding to the trailing edge of samples of the first sample sequence, thereby generating a second filtered window set of note event probability values; and determining a second sequence set of note events based at least in part on the second filtered window set of note event probability values.
7. The method of claim 1, wherein generating the first window set of note event probability values based at least in part on the first sample sequence comprises: for each sample frame of the first sample sequence in the time domain, transforming the sample frame into a frame of corresponding frequency component values in the frequency domain, thereby generating a first sequence of frames of frequency component values; and generating the first window set of note event probability values based at least in part on the first sequence of frames of frequency component values.
8. The method according to claim 1, wherein generating the first window set of note event probability values based at least in part on the first sample sequence comprises: using one or more machine learning (ML) models to generate the first window set of note event probability values.
9. The method according to claim 8, wherein the one or more ML models are calibrated at least in part based on one or more of the following: the number of frames in the window set, the number of samples in a frame, the number of trailing sample sets, or the number of leading sample sets.
10. The method of claim 1, further comprising: determining, based at least in part on the first filtered window set of note event probability values, one or more notes that are in the note-on state in the last frame of the first filtered window set; generating a second filtered window set of note event probability values based at least in part on the second sample sequence and the plurality of last samples of the first sample sequence; determining, based on the first frame of the second filtered window set of note event probability values, that at least one note among the one or more notes that were in the note-on state in the last frame of the first filtered window set is in the note-off state; and generating a note-off event for the at least one note for the first frame of the second filtered window set, based at least in part on determining that the at least one note is in the note-off state.
11. One or more non-transitory storage media storing instructions that, when executed by one or more hardware processors, cause the one or more hardware processors to perform the method of any one of claims 1-10.
12. An apparatus comprising means for performing the method according to any one of claims 1-10.
13. A system comprising: a processor; and a memory coupled to the processor and including instructions stored thereon which, when executed by the processor, cause the processor to perform the method of any one of claims 1-10.
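
The edge-trimmed, overlapping windows recited in claims 1, 6, and 10 can be illustrated with a minimal Python sketch. It is not the claimed implementation. Several things are assumptions made for illustration only: the function names (`trim_window`, `detect_events`, `transcribe`), the window, leading, and trailing sizes, and the single 0.5 threshold used in place of the note-on and note-off criteria of claims 3 to 5. Slicing a precomputed probability matrix stands in for the output of the ML model on each window.

```python
import numpy as np


def trim_window(probs, lead, trail):
    """Drop `lead` leading and `trail` trailing frames of a (frames, notes) array."""
    return probs[lead:probs.shape[0] - trail]


def detect_events(probs, active, frame_offset, threshold=0.5):
    """Turn per-frame note probabilities into note-on / note-off events.

    `active` is the set of notes left in the note-on state by the previous
    filtered window. A single threshold is an assumed, simplified criterion.
    """
    active = set(active)
    events = []
    for i, frame in enumerate(probs):
        for note, p in enumerate(frame):
            if p >= threshold and note not in active:
                active.add(note)
                events.append((frame_offset + i, "note_on", note))
            elif p < threshold and note in active:
                active.discard(note)
                events.append((frame_offset + i, "note_off", note))
    return events, active


def transcribe(stream_probs, window=6, lead=1, trail=1):
    """Slide overlapping windows so the kept (filtered) frames tile the stream."""
    hop = window - lead - trail  # consecutive windows overlap by lead + trail frames
    events, active = [], set()
    for start in range(0, stream_probs.shape[0] - window + 1, hop):
        # Stand-in for running the model on this window of samples.
        win = stream_probs[start:start + window]
        kept = trim_window(win, lead, trail)
        new, active = detect_events(kept, active, frame_offset=start + lead)
        events.extend(new)
    return events


# Hypothetical data: 12 frames, 2 notes.
probs = np.full((12, 2), 0.1)
probs[2:5, 0] = 0.9  # note 0 sounds in frames 2-4
probs[6:8, 1] = 0.9  # note 1 sounds in frames 6-7
events = transcribe(probs)
```

With these sizes the first filtered window keeps frames 1 to 4 and the second keeps frames 5 to 8, so the frames corresponding to the trailing edge of the first window are emitted by the second window rather than lost. Note 0 is still in the note-on state in the last frame of the first filtered window, and its note-off event is produced at the first frame of the second, which is the hand-over described in claim 10.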