Systems, apparatuses, and methods for performing adaptive audio mixing are disclosed. A trained neural network dynamically selects and mixes pre-recorded, human-composed music stems that are composed as mutually compatible sets. Stem and track selection, volume mixing, filtering, dynamic compression, acoustical/reverberant characteristics, segues, tempo, beat-matching and crossfading parameters generated by the neural network are inferred from the game scene characteristics and other dynamically changing factors. The trained neural network selects an artist's pre-recorded stems and mixes the stems in real-time in unique ways to dynamically adjust and modify background music based on factors such as game scenario, the unique storyline of the player, scene elements, the player's profile, interest, and performance, adjustments made to game controls (e.g., music volume), number of viewers, received comments, player's popularity, player's native language, player's presence, and/or other factors. The trained neural network creates unique music that dynamically varies according to real-time circumstances.