Open-ended audio tracking system via audio foundation models and large language models

JP2026100833A · Pending · Publication Date: 2026-06-19 · ROBERT BOSCH GMBH

Patent Information

Authority / Receiving Office: JP · JP
Patent Type: Application
Current Assignee / Owner: ROBERT BOSCH GMBH
Filing Date: 2025-12-09
Publication Date: 2026-06-19


Abstract

A method for implementing an open-ended audio tracking system is disclosed. [Solution] A feedback loop between an Audio Foundation Model (AFM) and a Large Language Model (LLM) enables both real-time detection of low-level sound events and detection of high-level acoustic scenes, which are used to generate additional text-based event descriptions that are applied in subsequent iteration cycles of the system. The AFM is typically analogous to a Contrastive Language-Audio Pre-training (CLAP) model configured to detect sound events, while the LLM receives the sound events detected by the AFM and categorizes these events into acoustic scene categories that describe the environmental context of the sound events.

Description

Technical Field

[0001] The present disclosure relates to a method and system for applying machine learning techniques to enable an audio tracking system.

Background Art

[0002] Identifying the source of acoustic content from a recording device has previously relied on a predefined, closed set of audio classes. This severely limits the capabilities of the algorithm, because a given acoustic scene classifier is restricted to sound events that fall within the boundaries of the dataset it was trained on. Considering the diversity of sound events that occur across the various acoustic scenes that a person or machine may encounter, such classifiers quickly become impractical.

Summary of the Invention

Means for Solving the Problems

[0003] In one embodiment, a method is provided for operating an open-ended audio tracking system. The method includes: providing an audio segment and text-based descriptions to an Audio Foundation Model (AFM), each text-based description corresponding to a description of an audio event detectable by the AFM; running the AFM to detect a subset of the audio events that are present in the audio segment; providing the corresponding subset of the text-based descriptions to a Large Language Model (LLM); running the LLM, which includes classifying the audio segment into an acoustic scene category based on the detected subset of audio events and generating additional text-based descriptions of other audio events associated with that acoustic scene category; and providing the additional text-based descriptions for use in further iterations of the open-ended audio tracking system.

[0004] In other embodiments, a system comprises a processor and a memory containing instructions that, when executed by the processor, cause the processor to perform these steps.

[0005] In other embodiments, a non-transitory computer-readable medium includes instructions that, when executed by a processor, cause the processor to perform these steps.

Brief Description of the Drawings

[0006]
[Figure 1] This figure shows a system for training and utilizing machine learning models, such as convolutional neural networks, according to several embodiments.
[Figure 2] This figure shows a computer-implemented method for training and utilizing machine learning models, according to several embodiments.
[Figure 3] This is a schematic diagram of an open-ended audio tracking system according to several embodiments.
[Figure 4] This is a schematic diagram of a Contrastive Language-Audio Pre-training (CLAP) model for an open-ended audio tracking system, according to several embodiments.
[Figure 5A] This figure shows an exemplary first iteration of an open-ended audio tracking system, according to several embodiments.
[Figure 5B] This figure shows an exemplary second iteration of the open-ended audio tracking system introduced in Figure 5A, according to several embodiments.
[Figure 6] This flowchart illustrates the process of running an open-ended audio tracking system according to several embodiments.
[Figure 7] This is a schematic diagram illustrating the interaction between a computer-controlled machine and a control system in several embodiments.
[Figure 8] This is a schematic diagram of a control system configured to control the amplifier and speaker of a hearing aid device, according to several embodiments.

Modes for Carrying Out the Invention

[0007] Embodiments of the present disclosure are described herein. However, it should be understood that the disclosed embodiments are merely examples, and other embodiments may take various alternative forms. The drawings are not necessarily to scale, and some features may be exaggerated or minimized to show details of certain components. Accordingly, the specific structural and functional details disclosed herein should not be construed as limitations, but merely as a representative basis for teaching those skilled in the art how to use the embodiments in various ways. As those skilled in the art will understand, various features illustrated and described with reference to any one of the drawings can be combined with features shown in one or more other drawings to produce embodiments not expressly illustrated or described. Combinations of illustrated features provide representative embodiments for typical uses. However, various combinations and modifications of features consistent with the teachings of the present disclosure may be desired for specific uses or implementations.

[0008] As used herein, “a,” “an,” and “the” refer to both singular and plural objects unless otherwise explicitly indicated by the context. For example, “processor” programmed to perform various functions refers to either a single processor programmed to perform any and all functions, or two or more processors collectively programmed to perform each of various functions.

[0009] Two main technologies exist for identifying the source of audio content. The first technology is low-level sound event detection (SED), which aims to track basic sound elements over time, such as sirens, human speech, and dog barks. Practical applications of SED include automatically detecting and raising alerts for specific incidents, such as gunshots or assaults, in security camera systems. The second technology is acoustic scene classification (ASC), which focuses on a higher-level understanding of a more comprehensive acoustic environment that may consist of multiple overlapping sounds. Practical applications of ASC include contextual awareness in smart devices and scene analysis in smart homes and cities.

[0010] However, past implementations of machine learning and modeling approaches for SED and ASC systems have relied on a predefined, closed set of audio classes. For example, a classifier built with 10 predefined classes cannot handle any sound events outside of this set. This limitation makes these models ineffective for managing the dynamic and complex audio environments encountered in the real world (e.g., scene transitions from indoors to outdoors). A major challenge is that scaling these models requires a substantial amount of labeled data and additional retraining or adaptation to achieve effective performance whenever new sound events need to be introduced. Consequently, this full machine learning iteration loop cannot respond to practical, business, or commercial needs in anything close to real time.

[0011] To overcome these challenges, this disclosure designs a real-time, open-ended audio tracking system utilizing audio and language foundation models. These models enable more universal and generalized performance across various downstream tasks. More specifically, audio foundation models (AFMs) such as CLAP enable zero-shot audio classification or retrieval via intuitive free-form natural language queries without requiring a predefined closed set. Furthermore, LLMs such as GPT-4 enable high-level inference, question answering, and knowledge summarization.

[0012] Previous AFMs were limited to processing basic acoustic concepts such as individual sound events and therefore lacked the ability to perform complex scene inference and summarization. This disclosure instead applies a cascaded architecture of an AFM and an LLM, thereby enabling the open-ended audio tracking system to cooperatively perform both low-level audio signal perception and high-level acoustic scene inference. Thus, the open-ended audio tracking system is not limited in terms of either the audio classes or the acoustic scenes on which it can operate. Furthermore, the model is configured to operate in real time, thus enabling the open-ended audio tracking system to be incorporated into smart hearing aid devices and the like.

[0013] The following description continues with a general introduction to machine learning techniques related to methods for utilizing machine learning models such as those described herein. Next, various embodiments of architectures and process flows for cascading AFM and LLM for open-end audio tracking systems are described. Subsequently, this disclosure demonstrates the versatility of the methods and systems described herein for integration into hearing aid devices.

[0014] Figure 1 shows a system 100 for training and utilizing machine learning models such as convolutional neural networks, according to several embodiments.

[0015] With respect to Figures 1 and 2, the exemplary embodiments given in the following paragraphs of this specification refer to convolutional neural networks, but it should be understood that additional embodiments of Figures 1 and 2 may be applied to any other type of neural network-based or non-neural network-based machine learning model configured to be developed, trained, fine-tuned and / or performed for various applications of audio tracking and interpretation, which are further described herein.

[0016] Furthermore, Figures 1 and 2 relate to an earlier point in time than Figures 3 through 8, which show fully trained components such as the open-ended audio tracking system 300, the AFM 306, the open-ended audio tracking system 500, and the open-ended audio tracking subsystem 714. The following paragraphs describe the training process of machine learning models such as AFMs and LLMs so as to provide context for, for example, the trained AFM 306 and LLM 310. In particular, the encoders used within the architecture of the AFMs described herein are flexible and can be configured to utilize different types of neural architectures, such as transformers or convolutional neural networks.

[0017] In some embodiments, system 100 may include an input interface for accessing a training data set 102 for a convolutional neural network. For example, as shown in FIG. 1, the input interface may be constituted by a data storage interface 104 that can access the training data 102 from the data storage 106. For example, the data storage interface 104 may be a memory interface or a persistent storage interface, such as a hard disk or SSD interface, but may also be a personal, local or wide area network interface such as a Bluetooth®, ZigBee or Wi-Fi interface or an Ethernet or fiber optic interface. The data storage 106 may be an internal data storage of the system 100 such as a hard drive or SSD, but may also be an external data storage, such as a network-accessible data storage.

[0018] In some embodiments, the data storage 106 may further include a data representation 108 of an untrained version of the model (e.g., a version of a machine learning model that has not yet been trained) that can be accessed by the system 100 from the data storage 106. However, it will be understood that the training data 102 and data representation 108 of the pre-trained convolutional neural network may also be accessed from different data storages, respectively, via different subsystems of the data storage interface 104. Each subsystem may be of the type described above for the data storage interface 104. In other embodiments, the data representation 108 of the pre-trained convolutional neural network may be generated internally by the system 100 based on the design parameters of the neural network and therefore may not be explicitly stored in the data storage 106. The system 100 may further include a processor subsystem 110 that can be configured to provide an iteration function in place of a stack of layers of the convolutional neural network being trained during the operation of the system 100. Here, each layer of the stack of layers to be replaced may have mutually shared weights and may receive, as input, the output of the previous layer, or, for the first layer of the stack of layers, the initial activation and a portion of the input of the stack of layers. The processor subsystem 110 may be further configured to iteratively train and / or fine-tune the convolutional neural network using the training data 102 (e.g., to generate an updated version of the machine learning model with respect to a first “pre-trained” version of the model). Here, the iterations of training by the processor subsystem 110 may include a forward propagation portion and a backward or generative propagation portion.

[0019] System 100 may further include an output interface for outputting a data representation 112 of a trained convolutional neural network, which data may also be referred to as trained model data 112. For example, as also shown in FIG. 1, the output interface may be constituted by a data storage interface 104, which in these embodiments is an input / output (“IO”) interface through which the trained model data 112 may be stored in a data storage 106. For example, the data representation 108 defining a “pre-trained” convolutional neural network may be at least partially replaced by the data representation 112 of the trained neural network during or after training in that the parameters of the convolutional neural network, such as weights, hyperparameters, and other types of parameters of the convolutional neural network, may be adapted to reflect training on the training data 102. This is also shown in FIG. 1 by reference numerals 108 and 112 referring to the same data record on the data storage 106. In other embodiments, the data representation 112 may be stored separately from the data representation 108 defining the “pre-trained” convolutional neural network. In some embodiments, the output interface may be separate from the data storage interface 104, but generally may be of the type described above for the data storage interface 104.

[0020] The system 100 shown in FIG. 1 is an example of a system that may be used to train and then execute a trained machine learning model described herein.

[0021] Figure 2 shows computer implementations for training and utilizing convolutional neural networks according to several embodiments. System 200 may include at least one computing system 202. Computing system 202 may include at least one processor 204 operably connected to a memory unit 208. Processor 204 may include one or more integrated circuits that implement the functions of a central processing unit (CPU) 206 and, in some embodiments, a graphics processing unit (GPU). CPU 206 may be a commercially available processing unit implementing an instruction set such as one of the x86, ARM, Power, or MIPS instruction set families. During operation, CPU 206 can execute stored program instructions retrieved from the memory unit 208. Stored program instructions may include software that controls the operation of CPU 206 to perform the operations described herein. In some examples, processor 204 may be a system-on-a-chip (SoC) that integrates the functions of CPU 206, memory unit 208, network interface, and input / output interface into a single integrated device. The computing system 202 can implement an operating system for managing various modes of operation.

[0022] The memory unit 208 may include volatile and non-volatile memory for storing instructions and data. Non-volatile memory may include solid-state memory such as NAND flash memory, magnetic and optical storage media, or any other suitable data storage device that retains data when the computing system 202 is deactivated or loses power. Volatile memory may include static and dynamic random access memory (RAM) for storing program instructions and data. For example, the memory unit 208 may store a machine learning model 210 or algorithm, a training dataset 212 for the machine learning model 210, a raw source dataset 214, and so on.

[0023] The computing system 202 may include a network interface device 220 configured to provide communication with external systems and devices. For example, the network interface device 220 may include a wired and / or wireless Ethernet interface as defined by the Institute of Electrical and Electronics Engineers (IEEE) 802.11 standard family. The network interface device 220 may also include a cellular communication interface for communicating with cellular networks (e.g., 3G, 4G, 5G). The network interface device 220 may be further configured to provide a communication interface to an external network 222 or the cloud.

[0024] The external network 222 may be referred to as the World Wide Web or the Internet. The external network 222 may establish standard communication protocols between computing devices. The external network 222 may enable the easy exchange of information and data between computing devices and the network. One or more servers 224 may communicate with the external network 222.

[0025] The computing system 202 may include an input / output (I / O) interface 218 which can be configured to provide digital and / or analog inputs and outputs. The I / O interface 218 may also include an additional serial interface (e.g., a Universal Serial Bus (USB) interface) for communicating with external devices.

[0026] The computing system 202 may include a human-machine interface (HMI) device 216, which may include any device that enables system 200 to receive control inputs. Examples of input devices may include human interface inputs such as keyboards, mice, touchscreens, voice input devices, and other similar devices. The computing system 202 may include a display device 226. The computing system 202 may include hardware and software for outputting graphics and text information to the display device 226. The display device 226 may include an electronic display screen, a projector, a printer, or other suitable device for displaying information to a user or operator. The computing system 202 may further be configured to enable interaction with remote HMIs and remote display devices via a network interface device 220.

[0027] System 200 can be implemented using one or more computing systems. While this example shows a single computing system 202 implementing all the described features, the various features and functions may be separated and implemented by multiple computing units communicating with each other. The specific system architecture chosen may depend on various factors.

[0028] System 200 may implement a machine learning algorithm 210 configured to analyze a raw source dataset 214. The raw source dataset 214 may include raw or unprocessed sensor data that can represent an input dataset for the machine learning system. In some examples, the machine learning algorithm 210 may be a convolutional neural network algorithm designed to perform a predetermined function. For example, the neural network algorithm may be configured to receive audio segments and text-based event descriptions, as in the case of the AFM 306, which is further described below.

[0029] The computer system 200 can store a training dataset 212 for the machine learning algorithm 210. The training dataset 212 may represent a previously constructed set of data for training the machine learning algorithm 210. The training dataset 212 may be used by the machine learning algorithm 210 to learn the weighting coefficients associated with the convolutional neural network algorithm. The training dataset 212 may also include a set of source data with corresponding results that the machine learning algorithm 210 attempts to replicate through the learning process.

[0030] The machine learning algorithm 210 can operate in training mode using the training dataset 212 as input. The machine learning algorithm 210 may be run over several iterations using data from the training dataset 212. In each iteration, the machine learning algorithm 210 may update its internal weight coefficients based on the results achieved. For example, the machine learning algorithm 210 may compare its output results (e.g., annotations) with those contained in the training dataset 212. Since the training dataset 212 contains expected results, the machine learning algorithm 210 can determine when the performance is acceptable. After the machine learning algorithm 210 has achieved a predetermined level of performance (e.g., 100% agreement with the results associated with the training dataset 212), the machine learning algorithm 210 may be run using data not present in the training dataset 212. The trained machine learning algorithm 210 may be applied to a new dataset to generate annotated data.

[0031] The machine learning algorithm 210 may be configured to identify specific features in the raw source data 214. The raw source data 214 may include multiple instances or input datasets for which annotation results are desired. The machine learning algorithm 210 may be programmed to process the raw source data 214 to identify the presence of specific features. The machine learning algorithm 210 may be configured to identify features in the raw source data 214 as predetermined features. The raw source data 214 may be derived from various sources. For example, the raw source data 214 may be actual input data collected by a machine learning system. The raw source data 214 may be machine-generated data for testing the system. As an example, the raw source data 214 may include audio segments and text-based event descriptions related to a nearby audio environment.

[0032] In this example, the machine learning algorithm 210 can then process the raw source data 214 and output an indication of which of the text-based event descriptions are supported by the audio signals in the audio segment. The machine learning algorithm 210 can generate a confidence level or coefficient for each output produced. For example, a confidence value above a predetermined high confidence threshold can indicate that the machine learning algorithm 210 is confident that the identified feature corresponds to a particular feature. A confidence value below a low confidence threshold can indicate that the machine learning algorithm 210 has some uncertainty about the existence of a particular feature.
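The two-threshold reading of a confidence value described above can be illustrated with a minimal sketch. The threshold defaults (0.8 and 0.3), the function name, and the returned labels are assumptions made for illustration only; the text states only that a high and a low confidence threshold exist, and it does not say how values between the two are treated.

```python
def interpret_confidence(value: float, high: float = 0.8, low: float = 0.3) -> str:
    """Map a confidence value to a coarse reading using two thresholds.

    The defaults and labels are illustrative assumptions, not values
    taken from the disclosure.
    """
    if low > high:
        raise ValueError("low threshold must not exceed high threshold")
    if value > high:
        return "confident"    # above the high confidence threshold
    if value < low:
        return "uncertain"    # below the low confidence threshold
    return "unspecified"      # middle band: not described in the text
```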

[0033] Figure 3 shows schematic diagrams of open-ended audio tracking systems according to several embodiments.

[0034] As shown in the open-ended audio tracking system 300, the framework includes two main modules, the first feeding the second: (1) an AFM 306 that receives audio segments 304 and text-based event descriptions from a database 316 and outputs detected sound events 308; and (2) an LLM 310 to which the detected sound events 308 are then provided. The LLM 310 is configured to process the detected sound events 308 and perform high-level acoustic scene inference in order to extend the text-based event description database 316 with additional text-based event descriptions for the next iteration cycle of the open-ended audio tracking system 300. With the output of the AFM 306 fed to the LLM 310, and the output of the LLM 310 in turn fed back to the AFM 306, the open-ended audio tracking system 300 is configured to operate in at least near real time. Furthermore, even if the audio scene changes over time (for example, from an indoor kitchen setting to an outdoor baseball game setting), no additional retraining of the already trained AFM 306 or LLM 310 is performed; the system relies, at least in part, on the described feedback loop instead. Because the architecture shown in Figure 3 is training-free, the open-ended audio tracking system 300 is configured to iteratively and dynamically adapt to the relevant acoustic scenes and sound events as each incoming audio segment is received.
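The iteration cycle described above can be sketched as a plain loop. This is a minimal illustration under stated assumptions, not the disclosed implementation: the function names, the callable signatures standing in for the AFM and the LLM, and the toy stand-ins (which treat an "audio segment" as a set of sound names) are all hypothetical.

```python
from typing import Any, Callable, Iterable, List, Tuple

# Hypothetical interfaces: the text does not define a programming API.
DetectFn = Callable[[Any, List[str]], List[str]]        # AFM stand-in
ReasonFn = Callable[[List[str]], Tuple[str, List[str]]]  # LLM stand-in


def run_tracking(segments: Iterable[Any], seed: List[str],
                 detect: DetectFn, reason: ReasonFn):
    """Run one AFM -> LLM -> database cycle per incoming audio segment."""
    database = list(dict.fromkeys(seed))  # ordered, without duplicates
    history = []
    for segment in segments:
        detected = detect(segment, database)   # low-level sound events
        scene, extra = reason(detected)        # scene + new descriptions
        for description in extra:              # extend for the next cycle
            if description not in database:
                database.append(description)
        history.append((scene, detected))
    return database, history


# Toy stand-ins so the loop can be exercised without any model.
def toy_detect(segment, descriptions):
    return [d for d in descriptions if d in segment]


def toy_reason(detected):
    if "children's laughter" in detected:
        return "neighborhood park", ["dog barking", "swinging"]
    return "unknown", []
```

In this sketch no model weights change between cycles; only the list of candidate descriptions grows, which mirrors the training-free behavior described in the paragraph.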

[0035] In some embodiments, the AFM 306 may be implemented as a CLAP module, also referred to herein as CLAP4SED, that is, a CLAP module for sound event detection (SED). CLAP can be defined as a type of contrastive learning model that compares text-based data samples with audio-based data samples. As an audio-language model, CLAP is pre-trained on audio data and corresponding language descriptions, such as audio captions, tags, or titles, using a contrastive objective, also referred to herein as the objective loss (see also the description of Figures 1 and 2 provided above).

[0036] As will be further detailed below, the LLM 310 provides the AFM 306 with an initial seed of text-based event descriptions 302 during the first use of the open-ended audio tracking system 300, and then expands these descriptions with each subsequent iteration cycle of the open-ended audio tracking system 300. Thus, the methods and systems described herein for utilizing both the AFM 306 and the LLM 310 for audio tracking do not restrict the audio tracking system to a limited language and / or audio environment. Rather, they reduce the time and resources required to train the AFM 306 for different, specific audio environments while drawing on the open-ended capabilities of the LLM 310.

[0037] When implemented as a CLAP model, the AFM 306 may additionally be defined by the following equation, where the objective loss L is defined over an audio embedding E_a and a text embedding E_t produced by modality-specific neural encoders. The objective is to align the latent spaces and establish meaningful connections between them. In some embodiments,

[Equation not reproduced in the source text]

[0038]

[Equation not reproduced in the source text]
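The equations for paragraphs [0037] and [0038] are not legible in this text. Purely as an illustration of what a contrastive objective over paired audio and text embeddings generally looks like, the sketch below computes a symmetric cross-entropy loss over a similarity matrix, a form widely used for contrastive pre-training. Whether it matches the equation in this filing is an assumption, as are the function names and the temperature value of 0.07.

```python
import math
from typing import List, Sequence


def _dot(u: Sequence[float], v: Sequence[float]) -> float:
    return sum(x * y for x, y in zip(u, v))


def _mean_log_softmax_diagonal(logits: List[List[float]]) -> float:
    """Average over rows i of log(softmax(row_i)[i]), computed stably."""
    total = 0.0
    for i, row in enumerate(logits):
        peak = max(row)
        log_norm = peak + math.log(sum(math.exp(v - peak) for v in row))
        total += row[i] - log_norm
    return total / len(logits)


def symmetric_contrastive_loss(audio_embeddings, text_embeddings,
                               temperature: float = 0.07) -> float:
    """Illustrative loss: pair i of audio and text should match best."""
    if not audio_embeddings or len(audio_embeddings) != len(text_embeddings):
        raise ValueError("need equally sized, non-empty batches")
    # Rows index text items, columns index audio items.
    logits = [[_dot(t, a) / temperature for a in audio_embeddings]
              for t in text_embeddings]
    transposed = [list(column) for column in zip(*logits)]
    return -0.5 * (_mean_log_softmax_diagonal(logits)
                   + _mean_log_softmax_diagonal(transposed))
```

The loss is small when each text embedding is most similar to its own audio embedding and large when the pairs are mismatched, which is the sense in which such an objective aligns the two latent spaces.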

[0039] CLAP further enables multi-directional interaction within the model. For example, given an audio signal as input, CLAP can perform audio tagging and captioning, a mode also referred to as "audio input, text output." Conversely, when queried using audio and free-form natural language prompts, CLAP can be configured to perform tasks such as audio retrieval and zero-shot audio classification, a mode also referred to as "audio and text input, recognition output."

[0040] Therefore, the CLAP4SED module is configured to leverage the CLAP model for sound event detection (SED) and to track the activity of specific sound events over time. Unlike traditional classification tasks, SED identifies the start and end of each event's active time period. This is achieved by processing the real-time audio stream in small portions, or "chunks," each of which may span a window of a few seconds, rather than receiving the entire audio segment at once. The CLAP model is then configured to shift the window forward in time by a small delta buffer amount, that is, a window hop size of approximately 100-500 ms, so that the audio is streamed as a sequence of overlapping chunks.
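The chunked streaming described above can be sketched as a sliding-window generator. The function name and the default values (a 2-second window and a 250 ms hop) are assumptions chosen to fall within the ranges mentioned in the paragraph; the text gives only "a few seconds" and "approximately 100-500 ms."

```python
from typing import Iterator, Tuple


def iter_windows(num_samples: int, sample_rate: int,
                 window_s: float = 2.0,
                 hop_s: float = 0.25) -> Iterator[Tuple[float, float]]:
    """Yield (start, end) times in seconds for each full analysis window."""
    window = int(round(window_s * sample_rate))
    hop = int(round(hop_s * sample_rate))
    if window <= 0 or hop <= 0:
        raise ValueError("window and hop must cover at least one sample")
    start = 0
    while start + window <= num_samples:
        yield start / sample_rate, (start + window) / sample_rate
        start += hop  # shift forward by the hop size, not a full window
```

Because the hop is shorter than the window, consecutive chunks overlap, so an event boundary can be located to within roughly one hop rather than one full window.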

[0041] CLAP is an encoded text-based description (

[Remainder of paragraph and two equations not reproduced in the source text]

[0042] Once the cosine similarity between the encoded text-based description and the portion of the audio segment is calculated, the AFM 306 is configured to output the subset of sound events actually present within the given audio portion or segment. For example, the sound event detection block 308 in Figure 3 shows that event 1 was detected by the CLAP model of the AFM 306 over a given start and end time, and event 2 was detected over a longer start and end time within the total time length defined by the given portion of the audio segment 304 analyzed by the AFM 306. Furthermore, event 3 was not detected at any point in the time length defined by the given portion of the audio segment 304. In some embodiments, events 1 and 2 may be referred to as "active" events because their calculated cosine similarity exceeds a given threshold. On the other hand, event 3 may be referred to as an "inactive" event because its calculated cosine similarity falls below a given threshold.
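The active/inactive decision described above can be sketched with a plain cosine similarity and a threshold. The function names, the threshold value of 0.5, and the toy two-dimensional embeddings are assumptions for illustration; in practice the embeddings would come from the audio and text encoders.

```python
import math
from typing import Dict, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def active_events(audio_embedding: Sequence[float],
                  text_embeddings: Dict[str, Sequence[float]],
                  threshold: float = 0.5) -> Dict[str, bool]:
    """Mark each description active if its similarity exceeds the threshold."""
    return {
        label: cosine_similarity(audio_embedding, embedding) > threshold
        for label, embedding in text_embeddings.items()
    }
```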

[0043] The analysis information within the sound event detection block 308 is then provided to the LLM 310. In some embodiments, a subset of text-based descriptions corresponding to active events present within the audio portion or segment is provided to the LLM 310 in a structured or natural text format. For example, continuing the specific example shown in the sound event detection block 308, the analysis may be provided to the LLM 310 in a JSON format such as [{"label":"event1","start":1.2,"end":4.5},{"label":"event2","start":3.0,"end":6.5}]. This structured format indicates that the AFM 306 detected sound for event 1, which starts at 1.2 seconds and ends at 4.5 seconds, and sound for event 2, which starts at 3.0 seconds and ends at 6.5 seconds. This process enables low-level sound event tracking based on text-based event descriptions that expand over time, as further described in the following paragraphs.
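One way to arrive at the structured format described above is to merge consecutive active windows into intervals and then serialize them. This is a minimal sketch under stated assumptions: the merging rule and both function names are hypothetical, and only the label/start/end record layout comes from the example in the paragraph.

```python
import json
from typing import Dict, List, Sequence, Tuple

Interval = Tuple[float, float]


def merge_active_windows(windows: Sequence[Interval],
                         flags: Sequence[bool]) -> List[Interval]:
    """Merge runs of consecutive active windows into (start, end) spans."""
    intervals: List[Interval] = []
    current = None
    for (start, end), active in zip(windows, flags):
        if active:
            current = (start, end) if current is None else (current[0], end)
        elif current is not None:
            intervals.append(current)
            current = None
    if current is not None:
        intervals.append(current)
    return intervals


def events_to_json(events: Dict[str, List[Interval]]) -> str:
    """Serialize detected events as a JSON list of label/start/end records."""
    records = [
        {"label": label, "start": start, "end": end}
        for label, spans in events.items()
        for start, end in spans
    ]
    records.sort(key=lambda record: record["start"])
    return json.dumps(records)
```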

[0044] As additionally shown in Figure 3, the LLM 310 is configured to receive a subset of text-based descriptions corresponding to active sound events and then (1) output an acoustic scene classification 314 and (2) generate additional text-based event descriptions 312. In some embodiments, the LLM 310 may be implemented as a Generative Pre-trained Transformer (GPT) LLM such as ChatGPT. However, in other embodiments, the LLM 310 may be implemented as any other large language model configured to receive a subset of text-based descriptions corresponding to active sound events and then (1) output an acoustic scene classification 314 and (2) generate additional text-based event descriptions 312.

[0045] The first of the two outputs provides a high-level acoustic scene summary that the LLM 310 infers from the subset of text-based descriptions. This classification defines an acoustic scene category such as "kitchen," "neighborhood park," or some other descriptive language, based on the sound event detection block 308.

[0046] Acoustic scene categories may include any high-level description of the local environment. Additional examples of acoustic scene category classifications are provided in Figures 5A and 5B and in the relevant descriptions herein.

[0047] In some embodiments, one or more prompts can be generated and provided to the LLM 310 to perform this function. For example, a first prompt might ask the LLM to determine the likely local environment of an audio segment based on the detected subset of sound events provided to the LLM. A second prompt or instruction might ask the LLM to expand on the detected subset of sound events, as further described in the following paragraphs.
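The two prompts described above could be assembled from the detected events along the following lines. The prompt wording, the function names, and the `count` parameter are hypothetical; the text specifies only the intent of each prompt, not its phrasing.

```python
from typing import List


def build_scene_prompt(detected_events: List[str]) -> str:
    """First prompt: ask for the likely local environment (wording assumed)."""
    listing = ", ".join(detected_events)
    return (
        "The following sound events were detected in an audio segment: "
        f"{listing}. What is the most likely local environment?"
    )


def build_expansion_prompt(detected_events: List[str], scene: str,
                           count: int = 5) -> str:
    """Second prompt: ask for further events typical of the scene (assumed)."""
    listing = ", ".join(detected_events)
    return (
        f"The acoustic scene is '{scene}' and these sound events were "
        f"detected: {listing}. List {count} other sound events that are "
        "likely to occur in this scene."
    )
```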

[0048] The second of the two outputs, namely the additional text-based event descriptions 312, refers to an extension of the existing list of text-based event descriptions already stored in the database 316. For example, if the database 316 already has several text-based descriptions relating to neighborhood park sound scenes, such as "children's laughter" and "ice cream truck," the LLM 310 may be configured to extend the list with further sound events, such as "dog barking" and "swinging," that the AFM 306 could identify in audio segments of neighborhood park sound scenes. Additional examples of extending text-based event descriptions are provided in Figures 5A and 5B and in the relevant descriptions herein.

[0049] These additional text-based event descriptions 312 and acoustic scene classifications 314 are then stored in the text-based event description database 316 and subsequently provided to the AFM 306 during the next iteration cycle of the open-ended audio tracking system 300. Thus, from one iteration cycle to the next, the open-ended audio tracking system 300 acts as a self-adaptive method that yields an audio tracking framework that is both more comprehensive and more detailed.

[0050] In some embodiments, the additional text-based event descriptions 312 may be labeled with the acoustic scene category 314 in which they occur before being stored in the text-based event description database 316. For example, continuing the example described above, "dog barking" and "swinging" may be labeled as occurring in the neighborhood park acoustic scene.
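
By way of a minimal sketch, and assuming a simple in-memory store rather than any particular database technology, the labeling described above might look as follows in Python. The class name `EventDescriptionDatabase` and its methods are hypothetical and serve only to illustrate the park example.

```python
class EventDescriptionDatabase:
    """In-memory stand-in for the text-based event description database."""

    def __init__(self, seed=()):
        # Maps each description to the set of scene categories it is labeled with.
        self._scenes_by_description = {d: set() for d in seed}

    def add(self, descriptions, scene=None):
        """Store descriptions, optionally labeling each with a scene category."""
        for description in descriptions:
            labels = self._scenes_by_description.setdefault(description, set())
            if scene is not None:
                labels.add(scene)

    def descriptions(self):
        """Return all stored descriptions in insertion order."""
        return list(self._scenes_by_description)

    def scenes_for(self, description):
        """Return the scene labels attached to one description."""
        return sorted(self._scenes_by_description.get(description, ()))


db = EventDescriptionDatabase(seed=["children's laughter", "ice cream truck"])
db.add(["dog barking", "swinging"], scene="neighborhood park")
```

Keeping a set of labels per description allows the same sound event, for example "dog barking," to be associated with more than one acoustic scene over later iterations.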

[0051] Furthermore, according to some embodiments, the LLM 310 may additionally be configured to detect false positives that arise when the AFM 306 detects sound events present in an audio segment. In particular, the LLM 310 may determine that a given sound event does not fit the aggregated local environment implied by the other sound events in the subset of sound event detections it receives. In such cases, the false positive is removed from the subset of sound events before any additional sound events are stored in the text-based event description database 316. For example, continuing the example described above, if “children’s laughter,” “ice cream truck,” and “kitchen blender” are detected by the AFM 306 and provided to the LLM 310 as text-based descriptions from the sound event detection block 308, the LLM 310 may determine that “kitchen blender” is a false positive in the subset because the aggregated local environment of the other sound events indicates that the audio segment 304 points to a neighborhood park.
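
The removal step can be sketched in Python as below. The sketch assumes, purely for illustration, that the LLM has been instructed to reply in a small JSON format with the keys `scene` and `false_positives`; neither the reply format nor the function name `apply_llm_review` is specified by the disclosure.

```python
import json


def apply_llm_review(detected, llm_reply_json):
    """Drop the events the LLM flagged as false positives and return the scene.

    The reply format ({"scene": ..., "false_positives": [...]}) is an assumed
    convention for this sketch, not a format defined in the source.
    """
    reply = json.loads(llm_reply_json)
    flagged = set(reply.get("false_positives", []))
    kept = [event for event in detected if event not in flagged]
    return reply["scene"], kept


detected = ["children's laughter", "ice cream truck", "kitchen blender"]
reply = '{"scene": "neighborhood park", "false_positives": ["kitchen blender"]}'
scene, kept = apply_llm_review(detected, reply)
```

Only the events in `kept` would then be used when additional descriptions are stored for the classified scene.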

[0052] Figure 4 shows schematic diagrams of the CLAP model of an open-ended audio tracking system according to several embodiments.

[0053] In some embodiments, the AFM 306 may be implemented as a CLAP model and run using one or more components of system 200, such as a computing system 202. The AFM 306 receives a text-based data sample 400 and a portion of an audio segment 404 as paired inputs. Each text input can be a word, phrase, or sentence paired with a relevant audio signal that is expected to be present within the segment. For example, the text inputs could be "wind," "microwave," "people screaming," etc., which are text-based event descriptions that may be present within the current audio segment 304 or that may have been present in previously received audio segments.

[0054] The CLAP implementation of the AFM 306 leverages contrastive learning to generate a combined multimodal space for audio and text descriptions. Specifically, CLAP takes audio and text pairs, processes them through two separate encoders, namely text encoder 402 and audio encoder 406, and uses linear projection to bring the resulting linguistic and audio representations into the combined space. This method aims to enable zero-shot prediction without requiring predefined categories during model training or execution. The space is generally learned using contrastive learning, as shown in 408, with respect to the (dis)similarity of audio and text pairs in batches.

[0055] Generally, the contrastive learning shown in Figure 4 can be performed as follows. First, both text data 400 and audio data 404 are processed separately via dedicated encoders, yielding text embeddings and audio embeddings, respectively. These embeddings capture essential features or representations of the respective data. Some unrelated or different text phrases and audio segments can also be fed to the encoders. The embeddings are projected into a combined space using a learnable linear projection. This combined space is where the audio and text representations are compared and aligned. In the illustrated example, the text encoder 402 generates a text-based vector having the features T1, T2, T3, ..., TN, and the audio encoder 406 generates an audio-based vector having the features A1, A2, A3, ..., AN.

[0056] Once the embeddings are in the combined space, the model calculates the similarity between the audio and text embeddings. Similarity can be measured using various metrics such as cosine similarity or Euclidean distance. For example, the model might assess how close or far apart an audio representation and its corresponding text representation are within this combined space. Contrastive learning uses a loss function that encourages the model to bring similar pairs closer together while pushing dissimilar pairs further apart. The loss is calculated from the similarity of positive pairs (audio and text pairs that belong together) and negative pairs (pairs that do not correspond to each other), which encourages the model to learn representations in which matching pairs are clearly distinguishable from non-matching pairs. The diagonal of the matrix 408, which results from the dot product of the audio and text embeddings, holds the similarities of paired audio and text, while the off-diagonal entries hold those of unpaired text and audio features (e.g., the sound of a person shouting and the text sample stating "a person is whispering"). Therefore, the goal of the contrastive learning method of the AFM 306, when implemented as a CLAP model, is to minimize this contrastive loss by tuning model parameters such as those of the encoders and the projection layers. CLAP can then effectively learn to capture meaningful relationships between audio and text representations and associate relevant text descriptions with corresponding audio signals.
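
The loss described above is not given in closed form in this disclosure. As one possible reading, the following pure-Python sketch uses a symmetric cross-entropy over the similarity matrix, a formulation commonly used for contrastive audio-text and image-text models. The temperature value of 0.07, the function names, and the toy two-dimensional embeddings are assumptions for illustration, and the embeddings are taken to be already projected into the combined space.

```python
import math


def normalize(v):
    """Scale a non-zero vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def similarity_matrix(audio_embs, text_embs, temperature=0.07):
    """Entry [i][j] is the temperature-scaled cosine similarity of audio i and text j."""
    a = [normalize(v) for v in audio_embs]
    t = [normalize(v) for v in text_embs]
    return [[sum(x * y for x, y in zip(ai, tj)) / temperature for tj in t] for ai in a]


def cross_entropy_on_diagonal(logits):
    """Mean of -log softmax(row i)[i]; the matching pair is the target of each row."""
    total = 0.0
    for i, row in enumerate(logits):
        peak = max(row)  # subtract the row maximum for numerical stability
        log_sum = peak + math.log(sum(math.exp(x - peak) for x in row))
        total += log_sum - row[i]
    return total / len(logits)


def contrastive_loss(audio_embs, text_embs, temperature=0.07):
    """Average of the audio-to-text and text-to-audio cross-entropies."""
    logits = similarity_matrix(audio_embs, text_embs, temperature)
    transposed = [list(column) for column in zip(*logits)]
    return 0.5 * (cross_entropy_on_diagonal(logits) + cross_entropy_on_diagonal(transposed))


# Toy batch of two pairs: correctly paired versus deliberately swapped texts.
aligned = contrastive_loss([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.1], [0.1, 1.0]])
swapped = contrastive_loss([[1.0, 0.0], [0.0, 1.0]], [[0.1, 1.0], [1.0, 0.1]])
```

In this toy batch the correctly paired embeddings yield a lower loss than the swapped ones, which is the behavior the training procedure relies on when it tunes the encoders and projection layers.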

[0057] Figures 5A and 5B show, respectively, exemplary first and second iterative cycles that perform an open-ended audio tracking system according to several embodiments.

[0058] As described above with respect to Figure 3 and the open-ended audio tracking system 300, the open-ended audio tracking system 500 shows an additional embodiment that includes AFM 506 and LLM 510, which operate in a feedback loop with each other. In the following description, Figure 5A may be treated as the first iteration cycle of the open-ended audio tracking system 500, and the text-based event description database 516 is assumed to be empty immediately before the time shown in Figure 5A. Figure 5B may then be treated as the immediately following, or second, iteration cycle of the open-ended audio tracking system 500.

[0059] As shown in Figure 5A, an initial seed for text-based event descriptions 502 is provided to the text-based event description database 516 of the open-ended audio tracking system 500. In some embodiments, the initial seed 502 may include a small number of initial text-based event descriptions from which the open-ended audio tracking system 500 begins tracking. "Wind" and "microwave" are intended to be illustrative examples, and it should be understood that more or fewer initial text-based event descriptions may be used. Furthermore, the initial seed 502 may refer to a single word, phrase, or sentence describing an event sound in a variety of acoustic scenes. Moreover, text-based descriptions may refer to descriptions of sound events caused by humans, animals, machines, or other nature-based events (e.g., wind, thunder, rain, etc.).

[0060] Next, a first portion 504 of a given audio segment is provided to the AFM 506 along with the text-based event descriptions in the database 516. The AFM 506 encodes the first portion 504 of the audio segment into audio embeddings and the text-based event descriptions in the database 516 into text embeddings. In embodiments where the AFM 506 is implemented as a CLAP model, the cosine similarity between the audio embeddings and the text embeddings is calculated in order to determine whether any sound events corresponding to the text-based event descriptions in the database 516 are present within the first portion 504 of the given audio segment.
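
A minimal Python sketch of this detection step follows. It assumes that the encoders have already produced the embeddings and that a sound event counts as present when its cosine similarity meets a fixed threshold; the threshold of 0.5, the function names, and the toy three-dimensional vectors are illustrative assumptions and are not specified by the disclosure.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v)


def detect_events(audio_embedding, text_embeddings, threshold=0.5):
    """Return the descriptions whose text embedding is close to the audio embedding."""
    return [
        description
        for description, text_embedding in text_embeddings.items()
        if cosine_similarity(audio_embedding, text_embedding) >= threshold
    ]


# Toy embeddings standing in for the outputs of the text and audio encoders.
text_embeddings = {"wind": [0.9, 0.1, 0.0], "microwave": [0.0, 0.2, 0.9]}
audio_embedding = [0.1, 0.3, 0.8]
detected = detect_events(audio_embedding, text_embeddings)
```

With these toy values only "microwave" passes the threshold, since its similarity to the audio embedding is high while that of "wind" is low.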

[0061] As shown in a specific embodiment of the sound event detection block 508 in Figure 5A, “microwave” was detected with a given start and end time, but “wind” was not detected at all for the duration of the first portion 504 of the given audio segment.
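
The disclosure does not state how the start and end times are obtained. One plausible approach, sketched below under the assumption that the audio portion is scored in consecutive, non-overlapping windows of equal length, is to merge runs of above-threshold windows into intervals. The function name, the one-second window, the threshold, and the score values are all hypothetical.

```python
def intervals_from_scores(scores, hop_seconds, threshold=0.5):
    """Merge consecutive above-threshold windows into (start, end) times in seconds."""
    intervals = []
    start = None
    for index, score in enumerate(scores):
        if score >= threshold and start is None:
            start = index * hop_seconds  # a run of detections begins
        elif score < threshold and start is not None:
            intervals.append((start, index * hop_seconds))  # the run ends
            start = None
    if start is not None:
        # The run continues to the end of the scored portion.
        intervals.append((start, len(scores) * hop_seconds))
    return intervals


# Hypothetical per-window similarity scores for two descriptions.
microwave_scores = [0.1, 0.2, 0.8, 0.9, 0.7, 0.2]
wind_scores = [0.1, 0.0, 0.2, 0.1, 0.1, 0.0]
microwave_times = intervals_from_scores(microwave_scores, hop_seconds=1.0)
wind_times = intervals_from_scores(wind_scores, hop_seconds=1.0)
```

An empty list, as obtained here for "wind," corresponds to a description that was not detected at all within the portion.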

[0062] Next, the start and end times of "microwave" are provided to the LLM 510 along with the text-based event description "microwave" itself. The execution of the LLM 510 then includes determining that the acoustic scene classification 514 of the first portion 504 of the given audio segment is "kitchen", based on the information that the sound of a microwave has been detected.

[0063] The execution of the LLM 510 additionally includes generating various additional text-based event descriptions 512 corresponding to other sound events that may also exist within the “kitchen” acoustic scene classification 514. As shown in the specific embodiment indicated in the additional text-based event descriptions 512, “cooking,” “frying,” and “laundry” are generated and output by the LLM 510.

[0064] The additional text-based event descriptions 512 are then stored in the text-based event description database 516 together with labels for the “kitchen” acoustic scene classification.

[0065] Thus, the first iteration cycle of the open-ended audio tracking system 500 is completed, and the system 500 continues the loop, providing another round of text-based event descriptions and a second portion of a given audio segment to the AFM 506, as shown in Figure 5B.

[0066] In Figure 5B, a second portion 550 of the given audio segment is provided to the AFM 506 along with the text-based event descriptions in database 558, which is an updated version of database 516 in which the additional text-based event descriptions 512 are already stored. The AFM 506 then encodes the second portion 550 of the given audio segment into audio embeddings and the text-based event descriptions in database 558 into text embeddings. In embodiments where the AFM 506 is implemented as a CLAP model, the cosine similarity between the audio embeddings and the text embeddings is calculated in order to determine whether any sound events corresponding to the text-based event descriptions in database 558 are present in the second portion 550 of the given audio segment.

[0067] As shown in a particular embodiment shown in the sound event detection block 552 of Figure 5B, “microwave” was detected for a given start and end time, and “frying” was detected for another given start and end time, but “wind” was not detected at all for the length of the second portion 550 of a given audio segment, nor was “cooking” or “laundry.”

[0068] Next, the start and end times of "microwave" are provided to the LLM 510 along with the text-based event description "microwave" itself, and the start and end times of "frying" are provided along with the text-based event description "frying" itself. The execution of the LLM 510 then includes determining that the acoustic scene classification 556 of the second portion 550 of the given audio segment is still "kitchen," based on the information that the sounds of a microwave and of frying have been detected.

[0069] The execution of the LLM 510 additionally includes generating further additional text-based event descriptions 554 corresponding to more sound events that may also exist within the “kitchen” acoustic scene classification 556. As shown in the specific embodiment indicated in the additional text-based event descriptions 554, “coffee machine” and “meal” are generated and output by the LLM 510.

[0070] The additional text-based event descriptions 554 are then stored in the text-based event description database 558 together with labels for the “kitchen” acoustic scene classification.

[0071] In this way, the second iteration cycle of the open-ended audio tracking system 500 is completed, and the system 500 continues the loop providing another round of text-based event descriptions and a third portion of a given audio segment to the AFM 506, and so on.

[0072] At a later point, if a sound event matching the text-based event description "wind" is detected using the AFM 506, the LLM 510 can change the acoustic scene classification to "city street" or some other outdoor scene classification. The corresponding additional text-based event descriptions may then include sound events associated with "city street," such as "dog barking," "car passing," or "birdsong."

[0073] Because AFM and LLM are already pre-trained, the open-ended audio tracking system 500 is configured to dynamically adapt to various scenarios when they are introduced, even when acoustic scene classifications change dramatically (e.g., from "kitchen" to "city street"). No additional retraining is required, and the open-ended audio tracking system 500 is self-contained (e.g., without human intervention).

[0074] Figure 6 is a flowchart illustrating the process of executing an open-ended audio tracking system according to several embodiments. In some embodiments, process 600 may be used to describe a given iterative cycle of the open-ended audio tracking system 300. Process 600 may then be repeated as indicated by the arrow between block 650 and block 610, and as further described above with respect to iterations #1 and #2 of the open-ended audio tracking system 500.

[0075] In block 610, an audio segment or a portion of an audio segment is provided to an AFM such as CLAP, along with text-based descriptions from a text-based sound event description database. Each text-based description corresponds to a sound event that CLAP either detects or does not detect using cosine similarity calculations.

[0076] In block 620, the AFM is run to detect one or more sound events present within the audio segment, drawn from the set of text-based descriptions described in block 610.

[0077] In block 630, a subset of text-based descriptions is provided to the LLM, which is configured to classify the audio segments into acoustic scene categories, as shown in block 640, and to generate additional text-based descriptions related to descriptions of other potential sound events that may occur within those acoustic scene categories.

[0078] In block 650, additional text-based descriptions are stored in the sound event database and accessed for future iterations when providing the text-based descriptions and audio segments to the AFM for further iteration cycles of the open-ended audio tracking system.
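
One iteration of process 600 can be sketched in Python as follows. The AFM and the LLM are replaced by hard-coded stand-in functions so that the control flow of blocks 610 to 650 can be followed without any model; `run_iteration`, `stub_afm`, and `stub_llm` are hypothetical names, and the stub outputs simply replay the kitchen example of Figures 5A and 5B.

```python
def run_iteration(audio_segment, database, run_afm, run_llm):
    """One pass through blocks 610 to 650; returns (scene, detected events)."""
    descriptions = list(database)                    # block 610: gather inputs
    detected = run_afm(audio_segment, descriptions)  # block 620: detect events
    scene, additional = run_llm(detected)            # blocks 630 and 640
    for description in additional:                   # block 650: store new ones
        if description not in database:
            database.append(description)
    return scene, detected


def stub_afm(audio_segment, descriptions):
    # Stand-in for the AFM: the "segment" is just the set of sounds it contains.
    return [d for d in descriptions if d in audio_segment]


def stub_llm(detected):
    # Stand-in for the LLM, hard-coded to the kitchen example.
    if "frying" in detected:
        return "kitchen", ["coffee machine", "meal"]
    if "microwave" in detected:
        return "kitchen", ["cooking", "frying", "laundry"]
    return "unknown", []


database = ["wind", "microwave"]  # initial seed
first = run_iteration({"microwave"}, database, stub_afm, stub_llm)
second = run_iteration({"microwave", "frying"}, database, stub_afm, stub_llm)
```

In this sketch "frying" can only be detected in the second iteration because the first iteration added it to the database, which is the feedback effect indicated by the arrow from block 650 back to block 610.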

[0079] Figure 7 shows schematic diagrams of the interaction between a computer-controlled machine and a control system in several embodiments.

[0080] The methods and systems disclosed herein can be used in many different applications. This section provides some practical applications of the proposed systems.

[0081] As a first example, an open-ended audio tracking system may be implemented in a context-aware smart device. An open-ended acoustic scene detection system may also be integrated into an existing edge hardware device, thus providing additional context-aware capabilities to facilitate automated smart decisions. For example, hearing aid devices often require users to manually adjust microphone settings to achieve the best experience [2]. However, this ad-hoc adjustment can present further challenges for elderly or child users who may struggle to remember and manage different configurations. An integrated open-ended acoustic scene detection system can automatically adjust preset configurations based on detected scenes, thereby providing an optimized user experience.

[0082] As a first example, an open-ended audio tracking system can track both low-level and high-level audio content in near real-time or real-time, providing a comprehensive audio analytics solution. In a given example, the open-ended audio tracking system allows querying audio tracking results using LLM for tasks such as audio-based question answering to identify specific events, inferences about the sequence of events, or retrieval of information about anomalies over time. It can also be used on security cameras to monitor critical events such as gunshots and attacks.

[0083] As a second example, an open-ended audio tracking system can be implemented in a context-aware smart device. The open-ended acoustic scene detection system may also be integrated into existing edge hardware devices, thus providing additional context-aware capabilities to facilitate automated smart decisions. For example, previous implementations of hearing aid devices often required users to manually adjust microphone settings to achieve the best experience when moving between different types of acoustic scenes. However, this ad-hoc adjustment can present additional challenges for users who are elderly or children, as they may struggle to remember and manage different configurations in a timely manner, especially to avoid missing cues from different environments. An integrated open-ended acoustic scene detection system, on the other hand, can automatically adjust preset configurations based on detected acoustic scenes, thereby providing a more optimized user experience. Hearing aid device 800 and the following description provide additional examples of such integrations.

[0084] Figure 7 shows a schematic diagram of the interaction between the computer-controlled machine 700 and the control system 702. The computer-controlled machine 700 includes an actuator 704 and a sensor 706. The actuator 704 may include one or more actuators, and the sensor 706 may include one or more sensors. The sensor 706 is configured to detect the state of the computer-controlled machine 700. The sensor 706 may be configured to detect in-distribution (ID) and/or out-of-distribution (OOD) data, and the corresponding processor may be configured to determine whether the data is ID or OOD, in accordance with the teachings herein. The sensor 706 may be configured to encode the detected state into a sensor signal 708 and transmit the sensor signal 708 to the control system 702. Non-limiting examples of the sensor 706 include a microphone, a camera, a video sensor, a light sensor, and the like. In one embodiment, the sensor 706 is a microphone configured to receive audio signals from the environment adjacent to the computer-controlled machine 700.

[0085] The control system 702 is configured to receive a sensor signal 708 from the computer-controlled machine 700. As described below, the control system 702 may be further configured to calculate an actuator control command 710 in response to the sensor signal and to transmit the actuator control command 710 to the actuator 704 of the computer-controlled machine 700.

[0086] As shown in Figure 7, the control system 702 includes a receiving unit 712. The receiving unit 712 may be configured to receive a sensor signal 708 from the sensor 706 and convert the sensor signal 708 into an input signal x. In an alternative embodiment, the sensor signal 708 is received directly as an input signal x without the receiving unit 712. Each input signal x may be a part of each sensor signal 708. The receiving unit 712 may be configured to process each sensor signal 708 to generate each input signal x. The input signals x may include data corresponding to an image recorded by the sensor 706. For example, image-based data samples and text-based data samples may be received by the receiving unit 712.

[0087] The control system 702 includes an open-ended audio tracking subsystem 714. The open-ended audio tracking subsystem 714 may be configured to detect sound events in an audio signal received by the sensor 706. The open-ended audio tracking subsystem 714 is configured to be parameterized by parameters such as those described above (e.g., parameter θ). Parameter θ may be stored in and provided by non-volatile storage 716. The open-ended audio tracking subsystem 714 is configured to determine an output signal y from an input signal x. Each output signal y includes information that assigns one or more labels to each input signal x. The open-ended audio tracking subsystem 714 may transmit the output signal y to a conversion unit 718. The conversion unit 718 is configured to convert the output signal y into an actuator control command 710. The control system 702 is configured to transmit the actuator control command 710 to an actuator 704 configured to operate a computer-controlled machine 700 in response to the actuator control command 710. In other embodiments, the actuator 704 is configured to operate the computer-controlled machine 700 based directly on an output signal y.

[0088] When actuator 704 receives an actuator control command 710, actuator 704 is configured to perform an action corresponding to the associated actuator control command 710. Actuator 704 may include control logic configured to translate actuator control command 710 into a second actuator control command used to control actuator 704. In one or more embodiments, actuator control command 710 may be used to control a display instead of, or in addition to, the actuator.

[0089] In other embodiments, the control system 702 includes the sensor 706 in place of, or in addition to, the computer-controlled machine 700 including the sensor 706. The control system 702 may also include the actuator 704 in place of, or in addition to, the computer-controlled machine 700 including the actuator 704.

[0090] As shown in Figure 7, the control system 702 also includes a processor 720 and memory 722. The processor 720 may include one or more processors. The memory 722 may include one or more memory devices. One or more open-ended audio tracking subsystems 714 of an embodiment may be implemented by the control system 702, which includes non-volatile storage 716, a processor 720, and memory 722.

[0091] Non-volatile storage 716 may include one or more persistent data storage devices, such as hard drives, optical drives, tape drives, non-volatile solid-state devices, cloud storage, or any other devices capable of permanently storing information. Processor 720 may include one or more devices selected from a high-performance computing (HPC) system, including high-performance cores, microprocessors, microcontrollers, digital signal processors, microcomputers, central processing units, field-programmable gate arrays, programmable logic devices, state machines, logic circuits, analog circuits, digital circuits, or any other devices that manipulate signals (analog or digital) based on computer-executable instructions present in memory 722. Memory 722 may include a single memory device or several memory devices, including but not limited to random-access memory (RAM), volatile memory, non-volatile memory, static random-access memory (SRAM), dynamic random-access memory (DRAM), flash memory, cache memory, or any other devices capable of storing information. Furthermore, the processor 720 and memory 722 may be configured to provide the collected data to one or more other computing devices configured to run the open-ended audio tracking subsystem in a domain-specific embodiment, also shown in Figure 8. Such collected data may be used to generate training and validation datasets for various stages in preparing and running machine learning models for industrial-grade applications. In the context described herein with respect to running the open-ended audio tracking system, the processor 720 and memory 722 may be coupled to or, in some cases, remotely connected to a computing device, which can then perform audio tracking processes such as those described above.

[0092] The processor 720 may be configured to execute computer-executable instructions that reside in non-volatile storage 716, are loaded into memory 722, and embody one or more machine learning algorithms and/or methodologies of one or more embodiments. The non-volatile storage 716 may include one or more operating systems and applications. The non-volatile storage 716 may store compiled and/or interpreted computer programs written using a variety of programming languages and/or techniques, including but not limited to Java, C, C++, C#, Objective-C, Fortran, Pascal, JavaScript, Python, Perl, and PL/SQL, either alone or in combination.

[0093] When executed by the processor 720, the computer-executable instructions of the non-volatile storage 716 can cause the control system 702 to implement one or more machine learning algorithms and/or methodologies as disclosed herein. The non-volatile storage 716 may also include machine learning data (including data parameters) that support the functions, features and processes of one or more embodiments described herein.

[0094] Program code embodying the algorithms and/or methodologies described herein may be distributed individually or collectively as various different forms of program products. The program code may be distributed using a computer-readable storage medium having computer-readable program instructions for causing a processor to execute an aspect of one or more embodiments. A computer-readable storage medium, which is inherently non-transitory, may include volatile and non-volatile, removable and non-removable tangible media implemented by any method or technique for storing information such as computer-readable instructions, data structures, program modules, or other data. A computer-readable storage medium may further include RAM, ROM, erasable programmable read-only memory (EPROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other solid-state memory technology, portable compact disk read-only memory (CD-ROM) or other optical storage, magnetic cassette, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium that can be used to store desired information and can be read by a computer. Computer-readable program instructions can be downloaded from a computer-readable storage medium to a computer, another type of programmable data processing device, or another device, or via a network to an external computer or external storage device.

[0095] Computer-readable program instructions stored on a computer-readable medium may be used to direct a computer, other types of programmable data processing devices, or other devices to function in a particular manner, such that the instructions stored on the medium produce an article of manufacture containing instructions that implement the functions, actions, and/or operations specified in a flowchart or diagram. In certain alternative embodiments, the functions, actions, and/or operations specified in the flowchart and diagram may be reordered, processed sequentially, and/or processed simultaneously, in accordance with one or more embodiments. Furthermore, any flowchart or diagram may contain more or fewer nodes or blocks than those shown, in accordance with one or more embodiments.

[0096] The process, method, or algorithm may be implemented, in whole or in part, using appropriate hardware components such as application-specific integrated circuits (ASICs), field-programmable gate arrays (FPGAs), state machines, controllers, or other hardware components or devices, or combinations of hardware, software, and firmware components.

[0097] Figure 8 shows a schematic diagram of the control system of Figure 7, configured to control the amplifier and speaker of a hearing aid device, according to several embodiments.

[0098] In some embodiments, the open-ended audio tracking subsystem 714 may be integrated into the hearing aid device 800. As shown in Figure 8, the hearing aid device 800 may include sensors, such as a microphone 802, configured to detect audio signals from the environment surrounding the hearing aid device 800. The detected audio signals are then provided to the open-ended audio tracking subsystem 714 of the control system 702, and the audio segments of the audio signals, along with various text-based event descriptions, are provided to the AFM 812. The AFM 812 is then run to detect a subset of sound events present within a given audio segment.

[0099] The subset of sound events is then provided to the LLM 814, which, based on the detected subset of sound events, classifies the audio segments into acoustic scene categories and generates additional text-based descriptions corresponding to other descriptions of sound events associated with the classified acoustic scene categories. These additional text-based descriptions are then stored in the sound event description database 816.

[0100] In some embodiments, the control system 702 may be configured to use the acoustic scene classification to retrieve predefined parameters from the device's memory, namely from the acoustic scene-specific parameters 808 relating to use of the hearing aid device in an environment matching the acoustic scene category. The control system 702 then provides these predefined parameters to the receiver of the hearing aid device 800, e.g., the amplifier 804 and, by extension, the speaker 806.
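
As an illustrative sketch only, the retrieval of scene-specific parameters might be implemented as a lookup table keyed by acoustic scene category. The parameter names `gain_db` and `noise_reduction`, their values, and the fallback to a neutral default are assumptions; the disclosure does not specify the contents of the acoustic scene-specific parameters 808.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AmplifierPreset:
    gain_db: float
    noise_reduction: bool


# Hypothetical contents of the acoustic scene-specific parameters 808.
SCENE_PRESETS = {
    "kitchen": AmplifierPreset(gain_db=6.0, noise_reduction=True),
    "city street": AmplifierPreset(gain_db=3.0, noise_reduction=True),
}
DEFAULT_PRESET = AmplifierPreset(gain_db=0.0, noise_reduction=False)


def preset_for_scene(scene):
    """Return the stored preset for a scene, or a neutral default if none exists."""
    return SCENE_PRESETS.get(scene.strip().lower(), DEFAULT_PRESET)


kitchen_preset = preset_for_scene("Kitchen")
fallback_preset = preset_for_scene("concert hall")
```

Because the LLM produces scene categories as free text, a lookup of this kind would need some fallback for categories that have no stored preset, which is why the sketch returns a default rather than raising an error.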

[0101] In other embodiments, the control system 702 may then be configured to update the signal-to-noise ratio based on a detected subset of sound events and provide the updated signal-to-noise ratio to the amplifier 804 and the speaker 806.

[0102] While exemplary embodiments have been described above, these embodiments are not intended to describe all possible forms that are covered by the claims. Terms used herein are descriptive, not restrictive, and it should be understood that various modifications can be made without departing from the spirit and scope of this disclosure. As noted above, features of various embodiments can be combined to constitute further embodiments of the invention that are not expressly described or illustrated. Various embodiments may be described as offering advantages or being preferable to other embodiments or prior art implementations with respect to one or more desired characteristics, but those skilled in the art will recognize that one or more features or characteristics may be compromised to achieve desired overall system attributes that depend on a particular application and implementation. These attributes may include, but are not limited to, cost, strength, durability, lifecycle cost, marketability, appearance, packaging, size, maintainability, weight, manufacturability, and ease of assembly. Therefore, to the extent that any embodiment is described as being less desirable than other embodiments or prior art implementations with respect to one or more characteristics, these embodiments are not outside the scope of this disclosure and may be desirable for a particular application.

Claims

1. A hearing aid device comprising: a microphone configured to detect audio signals; a processor; and a memory storing program instructions, wherein the program instructions, when executed by the processor, cause the processor to: receive an audio signal from the microphone; provide an audio segment of the audio signal and text-based descriptions to an Audio Foundation Model (AFM), wherein the text-based descriptions are stored in the memory and correspond to descriptions of sound events detected by the AFM; run the AFM to detect a subset of the sound events present in the audio segment; provide a corresponding subset of the text-based descriptions to a Large Language Model (LLM); run the LLM, wherein running the LLM includes classifying the audio segment into an acoustic scene category based on the detected subset of the sound events, and generating additional text-based descriptions corresponding to descriptions of other sound events associated with the acoustic scene category; and provide the additional text-based descriptions for use in performing a further iteration of the AFM using a different audio segment.

2. The aforementioned program instruction is given to the processor, The signal-to-noise ratio is updated based on the detected subset of the aforementioned sound events. To provide the updated signal-to-noise ratio to the speaker of the hearing aid device, This is to further implement the following: The hearing aid device according to claim 1.

3. The hearing aid device according to claim 1, wherein the program instructions further cause the processor to: retrieve, from the memory, predefined parameters related to use of the hearing aid device within the acoustic scene category; and provide the predefined parameters to a receiver of the hearing aid device.

4. The hearing aid device according to claim 1, wherein the program instructions further cause the processor to: provide the additional text-based descriptions to be stored in the memory; and in response to receiving another audio segment, provide the text-based descriptions, the additional text-based descriptions, and the other audio segment to the AFM for execution.

5. The hearing aid device according to claim 4, wherein, when providing the additional text-based descriptions to be stored in the memory, the program instructions further cause the processor to label the additional text-based descriptions as corresponding to presence in the acoustic scene category.

6. The hearing aid device according to claim 1, wherein the text-based descriptions corresponding to descriptions of sound events include a description of a sound caused by a human, an animal, or a machine.

7. The hearing aid device according to claim 1, wherein the acoustic scene category includes a high-level description of a local environment of the hearing aid device over the duration of the audio segment.

8. A computer-implemented method for performing an open-ended audio tracking system, the method comprising: providing an audio segment and text-based descriptions to an Audio Foundation Model (AFM), wherein the text-based descriptions correspond to descriptions of sound events detected by the AFM; executing the AFM to detect a subset of the sound events present in the audio segment; providing a corresponding subset of the text-based descriptions to a Large Language Model (LLM); executing the LLM, wherein executing the LLM includes classifying the audio segment into an acoustic scene category based on the detected subset of the sound events, and generating additional text-based descriptions corresponding to other descriptions of sound events associated with the acoustic scene category; and providing the additional text-based descriptions for use in further iterations of performing the open-ended audio tracking system.

9. The computer-implemented method according to claim 8, further comprising: providing the additional text-based descriptions to be stored in an event description database; and in response to receiving another audio segment, providing the text-based descriptions, the additional text-based descriptions, and the other audio segment to the AFM for execution.

10. The computer-implemented method according to claim 9, further comprising: before providing the additional text-based descriptions to be stored in the event description database, labeling the additional text-based descriptions as corresponding to presence in the acoustic scene category.

11. The computer-implemented method according to claim 8, wherein executing the AFM includes: encoding the text-based descriptions; encoding a portion of the audio segment; calculating a cosine similarity between each encoded text-based description and the encoded portion of the audio segment; and determining that a given sound event is present in the audio segment when the corresponding cosine similarity exceeds a threshold.

12. The computer-implemented method according to claim 11, wherein executing the AFM further includes: determining a start time and an end time for the given sound event; and additionally providing the start time and the end time to the LLM for execution.

13. The computer-implemented method according to claim 8, wherein the AFM is a Contrastive Language-Audio Pretraining (CLAP) model.

14. The computer-implemented method according to claim 8, further comprising generating a prompt to provide to the LLM for execution, wherein the prompt includes: a first instruction to determine a likely local environment of the audio segment based on the provided detected subset of the sound events; and a second instruction to expand the detected subset of the sound events.

15. The computer-implemented method according to claim 8, wherein executing the LLM further includes: detecting a false positive in the subset of the sound events based on a determination that the false-positive sound event is unlikely to correspond to the aggregated local environment of the other sound events within the subset; and removing the false positive before storing the subset of the sound events in an event description database.

16. The computer-implemented method according to claim 8, wherein the LLM is a generative pre-trained transformer (GPT) LLM.

17. A non-transitory computer-readable medium storing program instructions that, when executed on or across a processor, cause the processor to: provide an audio segment and text-based descriptions to an Audio Foundation Model (AFM), wherein the text-based descriptions correspond to descriptions of sound events detected by the AFM; execute the AFM to detect a subset of the sound events present in the audio segment; provide a corresponding subset of the text-based descriptions to a Large Language Model (LLM); execute the LLM, wherein executing the LLM includes classifying the audio segment into an acoustic scene category based on the detected subset of the sound events, and generating additional text-based descriptions corresponding to other descriptions of sound events associated with the acoustic scene category; and provide the additional text-based descriptions for use in further iterations of executing the AFM using a different audio segment.

18. The non-transitory computer-readable medium according to claim 17, wherein, to execute the AFM, the program instructions cause the processor to: encode the text-based descriptions; encode a portion of the audio segment; calculate a cosine similarity between each encoded text-based description and the encoded portion of the audio segment; and determine that a given sound event is present in the audio segment when the corresponding cosine similarity exceeds a threshold.

19. The non-transitory computer-readable medium according to claim 18, wherein, to execute the AFM, the program instructions further cause the processor to: determine a start time and an end time for the given sound event; and additionally provide the start time and the end time to the LLM for execution.

20. The non-transitory computer-readable medium according to claim 17, wherein, to execute the LLM, the program instructions further cause the processor to: detect a false positive in the subset of the sound events based on a determination that the false-positive sound event is unlikely to correspond to the aggregated local environment of the other sound events within the subset; and remove the false positive before storing the subset of the sound events in an event description database.
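
Illustrative example (not part of the claims). The detection step recited in claims 11, 12, 18, and 19 and the prompt recited in claim 14 can be sketched as follows. This is a minimal sketch under stated assumptions, not the disclosed implementation: the embedding vectors are hand-written stand-ins for the outputs of the AFM's text and audio encoders, and the names `cosine_similarity`, `detect_events`, and `build_prompt`, the one-second frame length, the threshold of 0.7, and the prompt wording are all hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def detect_events(text_embeddings, frame_embeddings, frame_seconds, threshold):
    """Return {description: (start_s, end_s)} for each description whose
    similarity to at least one audio frame exceeds the threshold.

    Simplification (assumption): the reported interval spans from the first
    matching frame to the last matching frame, even if frames in between
    fall below the threshold.
    """
    detected = {}
    for description, text_vec in text_embeddings.items():
        hits = [
            i for i, frame_vec in enumerate(frame_embeddings)
            if cosine_similarity(text_vec, frame_vec) > threshold
        ]
        if hits:
            detected[description] = (hits[0] * frame_seconds,
                                     (hits[-1] + 1) * frame_seconds)
    return detected


def build_prompt(detected):
    """Build a two-instruction prompt: infer the environment, then expand the event list."""
    lines = [
        f"- {desc} ({start:.1f}s to {end:.1f}s)"
        for desc, (start, end) in detected.items()
    ]
    return (
        "Detected sound events:\n" + "\n".join(lines) + "\n"
        "1. Determine the likely local environment of this audio segment.\n"
        "2. List other sound events likely to occur in that environment."
    )


# Hypothetical stand-in embeddings; a real system would obtain these from the AFM encoders.
text_embeddings = {
    "dog barking": [1.0, 0.0, 0.0],
    "car horn": [0.0, 1.0, 0.0],
    "kettle whistling": [0.0, 0.0, 1.0],
}
frame_embeddings = [[0.9, 0.1, 0.0], [0.8, 0.6, 0.0], [0.1, 0.9, 0.0]]

detected = detect_events(text_embeddings, frame_embeddings,
                         frame_seconds=1.0, threshold=0.7)
prompt = build_prompt(detected)
print(prompt)
```

With these stand-in vectors, "dog barking" and "car horn" exceed the threshold and "kettle whistling" does not, so only the first two descriptions and their time spans would be passed to the LLM. The additional descriptions returned by the LLM would then be added to the candidate descriptions for the next audio segment.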