Animal call automatic labeling method and system based on spatio-temporal adaptive threshold
By employing a spatiotemporal adaptive thresholding method, utilizing a multi-classification model and frequency band filtering technology, the accuracy and robustness issues of automated labeling in marine passive acoustic monitoring were resolved, enabling precise capture and false alarm suppression of marine mammal calls.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- ZHEJIANG LAB
- Filing Date
- 2026-04-17
- Publication Date
- 2026-06-19
AI Technical Summary
In existing marine passive acoustic monitoring, automated annotation technology struggles to accurately capture long-envelope weak signals and suppress false alarms when facing complex marine environments, resulting in low annotation accuracy and inaccurate boundary positioning.
A spatiotemporal adaptive threshold-based method is adopted. By statistically analyzing the confidence scores of background noise and positive samples through a pre-trained multi-classification model, trigger thresholds and maintenance thresholds are set. Combined with hysteresis triggering judgment logic and multi-band filtering, accurate annotation of marine mammal calls is achieved.
The method prevents long chanting signals from being split into fragments, extends each annotation as far as possible toward the weak end of the signal, improves annotation accuracy, eliminates false alarms, and ensures the physical integrity and robustness of the annotation interval.
Figure CN122050396B_ABST
Description
Technical Field
[0001] This invention relates to the interdisciplinary fields of bioacoustics and artificial intelligence, specifically to an automated annotation method and system for animal calls based on spatiotemporal adaptive thresholds.
Background Technology
[0002] Passive acoustic monitoring (PAM) is a core method for acquiring information on the population distribution, migration patterns, and behavioral characteristics of marine mammals (such as whales, dolphins, and seals). However, in actual marine PAM scenarios, automated labeling faces extremely complex acoustic environmental challenges: on the one hand, background noise (such as wind and waves, biological choruses, and geological activity) and the self-noise of different hydrophone models vary significantly across different sea areas, resulting in highly uneven spatial distribution of acoustic background noise. On the other hand, marine mammal calls typically possess significant characteristics of "continuous chanting" and "long-distance communication." Due to the long-distance propagation of sound waves underwater, the receiver signal often exhibits significant reverberation and frequency spread, and the signal energy gradually attenuates as the propagation distance increases or the duration of the call extends, resulting in a significant decrease in the signal-to-noise ratio (SNR) over time.
[0003] Existing automated annotation techniques primarily rely on fixed threshold scores from a single model. This single-threshold method suffers from low annotation accuracy and extremely inaccurate call boundary localization when faced with the complex conditions described above: if the threshold is set too high, it fails to capture the weak energy region at the end of the call, resulting in missing or fragmented annotation boxes; if the threshold is set too low, it is highly susceptible to interference from ambient background noise or transient impulse noise, generating numerous false alarms. Therefore, how to achieve complete capture and false alarm suppression of long-envelope, weak-edge signals through adaptive mechanisms is an urgent need in current marine acoustics research.
Summary of the Invention
[0004] The purpose of this invention is to address the shortcomings of existing technologies by providing an automated annotation method and system for animal vocalizations based on spatiotemporal adaptive thresholds. This invention is applicable to the vocalizations of marine mammals and can improve the accuracy and robustness of automated annotation in complex marine environments.
[0005] The objective of this invention is achieved through the following technical solution: The first aspect of this invention provides an automated annotation method for animal vocalizations based on spatiotemporal adaptive thresholds, comprising the following steps:
[0006] S1. Audio stream segmentation and preprocessing: Obtain the hydrophone recording file of the specified monitoring station, read it in segments to obtain audio stream fragments, and segment it into audio temporal feature frames;
[0007] S2. Perform site feature calibration: Based on audio temporal feature frames, manually extract background noise segments and positive sample segments of various calls, input them into the pre-trained multi-classification model, and obtain the confidence score statistics of background noise and of positive samples;
[0008] S3. Set adaptive thresholds: Use the confidence score statistics of positive samples to set trigger thresholds for different call categories, and use the confidence score statistics of background noise to set maintenance thresholds for different call categories.
[0009] S4. Perform preliminary target detection: Perform initial inference on each audio temporal feature frame through a pre-trained multi-classification model, output the corresponding confidence vector, and execute hysteresis triggering judgment logic based on trigger threshold and maintenance threshold to obtain the preliminary temporal interval of each call category.
[0010] S5. Perform multi-band filtering: For the preliminary time interval of each call category, retrieve the corresponding physical frequency band parameters and perform multi-band filtering;
[0011] S6. Perform temporal annotation interval boundary correction: After re-inputting each filtered audio temporal feature frame into the pre-trained multi-classification model for secondary inference, the hysteresis trigger judgment logic is executed again to correct the temporal boundary of each call category, while eliminating false alarms;
[0012] S7. Automated generation of structured annotation files: Outputs structured annotation files containing call categories, call temporal boundaries, and confidence scores.
[0013] Furthermore, the pre-trained multi-classification model is a convolutional neural network model whose input is the Mel time-frequency map corresponding to the audio temporal feature frame and whose output is a C-dimensional confidence vector, where each dimension corresponds to the confidence score of one category and takes a probability value in [0, 1].
[0014] Furthermore, the confidence score statistics for the background noise include the mean background score and the standard deviation of the background score, and the confidence score statistics for the positive samples include the mean species score and the standard deviation of the species score.
[0015] The statistical acquisition of confidence scores for background noise and positive samples specifically includes: randomly selecting positive sample segments and background noise segments for each call category i; inputting each positive sample segment into the pre-trained multi-classification model, which outputs a C-dimensional confidence vector whose i-th component $p_i$ represents the model's confidence that the positive sample segment belongs to category i; inputting each background noise segment into the pre-trained multi-classification model, which outputs a C-dimensional confidence vector whose i-th component $b_i$ represents the model's confidence that the background noise segment belongs to category i; calculating the mean and standard deviation of the confidence scores of all positive samples of category i to obtain the species mean score $\mu_{\text{sp},i}$ and the species score standard deviation $\sigma_{\text{sp},i}$; and calculating the mean and standard deviation of the confidence scores of all background noise segments for category i to obtain the background mean score $\mu_{\text{bg},i}$ and the background score standard deviation $\sigma_{\text{bg},i}$.
[0016] Furthermore, the positive sample segment specifically refers to an audio segment containing clear calls of the target species, which is manually extracted from the historical audio temporal feature frames of the monitoring station.
[0017] The background noise segment specifically refers to an audio segment that is manually extracted from the historical audio time-series feature frames of the monitoring station and verified to contain no calls of any target species.
[0018] Furthermore, the formula for calculating the trigger threshold is as follows:
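[0019] $$T_{\text{trig},i} = \mu_{\text{sp},i} - \alpha_i \, \sigma_{\text{sp},i}$$
[0020] In the formula, $T_{\text{trig},i}$ is the trigger threshold for category i; $\mu_{\text{sp},i}$ is the species mean score for category i; $\sigma_{\text{sp},i}$ is the species score standard deviation for category i; and $\alpha_i$ is the first fine-tuning coefficient corresponding to category i;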
[0021] The formula for calculating the maintenance threshold is as follows:
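[0022] $$T_{\text{keep},i} = \mu_{\text{bg},i} + \beta_i \, \sigma_{\text{bg},i}$$
[0023] In the formula, $T_{\text{keep},i}$ is the maintenance threshold for category i; $\mu_{\text{bg},i}$ is the background mean score for category i; $\sigma_{\text{bg},i}$ is the background score standard deviation for category i; and $\beta_i$ is the second fine-tuning coefficient corresponding to category i;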
[0024] Here, the trigger threshold is greater than the maintenance threshold.
[0025] Furthermore, the hysteresis trigger determination logic specifically includes:
[0026] When the initial-inference confidence score $s_i$ for call category i is greater than or equal to the trigger threshold $T_{\text{trig},i}$, the preliminary call interval for that category is determined to be open. While the interval is open, if $s_i$ remains greater than or equal to the maintenance threshold $T_{\text{keep},i}$, the interval stays open; once $s_i$ falls below $T_{\text{keep},i}$, the interval is determined to be closed, thereby obtaining the preliminary temporal interval containing call category i. Multiple preliminary temporal intervals of different call categories are allowed to coexist in the same time period, and audio temporal feature frames not included in any preliminary temporal interval are marked as background noise.
[0027] Furthermore, the execution of multi-band filtering specifically includes:
[0028] For the audio temporal feature frames corresponding to the preliminary temporal interval of each call category, the physical frequency band parameters corresponding to category i are retrieved, and the corresponding high-pass, low-pass, or band-pass filtering is applied according to the cutoff frequency or passband range defined by those parameters; audio temporal feature frames marked as background noise are not filtered.
[0029] Furthermore, the correction of the call timing boundaries for each call category specifically includes:
[0030] When the secondary-inference confidence score $s'_i$ corresponding to category i is greater than or equal to the trigger threshold $T_{\text{trig},i}$, the call interval for that category is determined to be open and the start timestamp $t_{\text{start},i}$ is recorded. While the interval is open, if $s'_i$ remains greater than or equal to the maintenance threshold $T_{\text{keep},i}$, the interval stays open; once $s'_i$ falls below $T_{\text{keep},i}$, the interval is determined to be closed and the end timestamp $t_{\text{end},i}$ is recorded. In this way the start and end timestamps of each call category are corrected and the final call temporal boundaries $[t_{\text{start},i}, t_{\text{end},i}]$ are obtained;
[0031] The false alarm is an audio temporal feature frame in which the confidence score of the secondary inference does not reach the trigger threshold.
[0032] A second aspect of the present invention provides an automated animal call annotation system based on spatiotemporal adaptive thresholds, comprising one or more processors and a memory, wherein the memory is coupled to the processors; wherein the memory is used to store program data, and the processor is used to execute the program data to implement the above-described automated animal call annotation method based on spatiotemporal adaptive thresholds.
[0033] A third aspect of the present invention provides a computer-readable storage medium having a program stored thereon, which, when executed by a processor, is used to implement the above-described method for automatically labeling animal calls based on spatiotemporal adaptive thresholds.
[0034] Compared with the prior art, the beneficial effects of the present invention are:
[0035] (1) Adapting to continuous chanting characteristics: The present invention uses a hysteresis triggering mechanism consisting of a trigger threshold calibrated by positive samples and a maintenance threshold based on background noise constraints to effectively prevent long chanting signals from breaking at energy fluctuations, thus ensuring the physical integrity of the calibrated interval.
[0036] (2) Addressing cross-distance communication: This invention uses background noise statistics as the lower limit constraint of the maintenance threshold, allowing the annotation logic to extend to the weak energy region at the end of the signal to the maximum extent while ensuring that it does not accidentally trigger the background noise, thus solving the problem of missed detection of weak signals.
[0037] (3) Dual verification of physical features: This invention introduces consistency verification based on category-related frequency band filtering, which provides physical-level filtering for broadband noise interference and greatly improves the accuracy of automated labeling.
Attached Figure Description
[0038] Figure 1 is a flowchart of the automatic annotation method for animal calls based on spatiotemporal adaptive thresholds according to the present invention;
[0039] Figure 2 is a statistical distribution diagram of the confidence scores of background noise at the monitoring station and of minke whales (Ba type) in one embodiment of the present invention;
[0040] Figure 3 is a scatter plot showing how the confidence score of a segment containing a minke whale call changes over time frames, according to one embodiment of the present invention;
[0041] Figure 4 is a time-frequency comparison diagram of the same recording before and after filtering in one embodiment of the present invention;
[0042] Figure 5 is a schematic diagram of an automated animal call annotation system based on spatiotemporal adaptive thresholds.
Detailed Implementation
[0043] Exemplary embodiments will now be described in detail, examples of which are illustrated in the accompanying drawings. When the following description relates to the drawings, unless otherwise indicated, the same numerals in different drawings denote the same or similar elements. The embodiments described in the following exemplary embodiments do not represent all embodiments consistent with the present invention. Rather, they are merely examples of apparatuses and methods consistent with some aspects of the invention as detailed in the appended claims.
[0044] The terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention. The singular forms "a," "an," and "the" used in this invention and the appended claims are also intended to include the plural forms unless the context clearly indicates otherwise. It should also be understood that the term "and/or" as used herein refers to and includes any or all possible combinations of one or more of the associated listed items.
[0045] It should be understood that although the terms first, second, third, etc., may be used in this invention to describe various information, this information should not be limited to these terms. These terms are only used to distinguish information of the same type from one another. For example, first information may also be referred to as second information without departing from the scope of this invention, and similarly, second information may also be referred to as first information. Depending on the context, the word "if" as used herein may be interpreted as "when," "upon," or "in response to a determination."
[0046] The present invention will now be described in detail with reference to the accompanying drawings. Unless otherwise specified, the features of the following embodiments and implementations can be combined with each other.
[0047] Referring to Figure 1, the automatic annotation method for animal calls based on spatiotemporal adaptive thresholds of the present invention specifically includes the following steps:
[0048] S1. Audio stream segmentation and preprocessing: Obtain the hydrophone recording file of the specified monitoring station, read the recording file in segments to obtain audio stream segments, and segment the audio stream segments into continuous audio temporal feature frames with temporal overlap according to the input time window length of the pre-trained multi-classification model and the preset sliding step size.
[0049] It's important to note that in real-world passive acoustic monitoring (PAM), hydrophone recordings are often long audio files, lasting several hours or even days (e.g., 24-hour continuous recordings), resulting in massive file sizes. Therefore, to accommodate computer memory limitations, the entire audio file isn't loaded into memory at once. Instead, a streaming processing mechanism is used, reading the recording file in segments to obtain audio stream fragments. For example, a fixed 1-minute (60-second) audio stream fragment is read into a buffer each time. After processing that segment, the next 1-minute segment is read, thus enabling continuous and uninterrupted processing of massive amounts of long-duration audio recordings.
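The streaming read described above can be sketched as follows. This is a minimal illustration that assumes the recording is an uncompressed WAV file readable by Python's standard `wave` module; the function name `iter_audio_chunks` and its parameters are hypothetical and do not come from the patent.

```python
import wave


def iter_audio_chunks(path, chunk_seconds=60.0):
    """Yield (start_time_s, pcm_bytes) for consecutive chunks of a WAV file."""
    with wave.open(path, "rb") as wav:
        rate = wav.getframerate()
        bytes_per_frame = wav.getsampwidth() * wav.getnchannels()
        frames_per_chunk = int(chunk_seconds * rate)
        frames_done = 0
        while True:
            # Only one chunk is held in memory at a time.
            data = wav.readframes(frames_per_chunk)
            if not data:
                break
            yield frames_done / rate, data
            frames_done += len(data) // bytes_per_frame
```

Because the function is a generator, a 24-hour recording is consumed one buffer at a time, and the last chunk is simply shorter when the file length is not a multiple of the chunk length.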
[0050] Furthermore, the pre-trained multi-class classification model is a convolutional neural network model. Its input is the Mel time-frequency map corresponding to the audio temporal feature frames, and its output is a C-dimensional confidence vector, where each dimension corresponds to the confidence score of a class, with values in [0, 1]. The pre-trained multi-class classification model can be an existing publicly available model, such as a multi-class classification model based on the EfficientNet-B0 architecture or Google's open-source google/multispecies-whale model. The time window length is strictly determined by the receptive field of the pre-trained multi-class classification model.
[0051] As an optional implementation, the pre-trained multi-class classification model is Google's open-source google/multispecies-whale model. This model accepts audio stream segments as input and outputs a 12-dimensional vector for every 5 seconds of audio, representing the confidence that the audio contains the calls of each of 11 marine mammals or background noise. The input time window length is therefore 5.0 seconds. To avoid missed detections caused by call signals being split at window edges, the google/multispecies-whale model allows a sliding step size to be set. For example, with a window length of 5 seconds and a step size of 1 second, adjacent frames overlap by 4 seconds, i.e., an overlap rate of 80%.
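The window and step arithmetic above can be checked with a short sketch. The helper names `frame_starts` and `overlap_ratio` are hypothetical, and the sketch assumes that only windows lying entirely inside a chunk are kept; how partial windows at chunk boundaries are handled is not stated in the text.

```python
def frame_starts(chunk_seconds, window_seconds=5.0, step_seconds=1.0):
    """Return the start time (s) of every full window that fits in a chunk."""
    if chunk_seconds < window_seconds:
        return []
    # Small epsilon guards against floating-point round-off in the division.
    count = int((chunk_seconds - window_seconds) / step_seconds + 1e-9) + 1
    return [i * step_seconds for i in range(count)]


def overlap_ratio(window_seconds, step_seconds):
    """Fraction of each window shared with the next one."""
    return (window_seconds - step_seconds) / window_seconds
```

With a 60-second chunk, a 5-second window, and a 1-second step, this gives 56 frames starting at 0 s through 55 s, and an overlap ratio of 0.8.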
[0052] S2. Perform site feature calibration: Based on audio temporal feature frames, manually extract background noise segments and positive sample segments of various marine mammal calls from the monitoring site, input them into the pre-trained multi-classification model, and statistically obtain the confidence score statistics of background noise and the confidence score statistics of positive samples.
[0053] Furthermore, the confidence score statistics for background noise include the mean background score and the standard deviation of the background score; the confidence score statistics for positive samples include the mean species score and the standard deviation of the species score.
[0054] Furthermore, the confidence score statistics for background noise and positive samples are obtained as follows: positive sample segments and background noise segments are randomly selected for each call category i; each positive sample segment is input into the pre-trained multi-classification model, which outputs a C-dimensional confidence vector whose i-th component $p_i$ represents the model's confidence that the positive sample segment belongs to category i; each background noise segment is input into the pre-trained multi-classification model, which outputs a C-dimensional confidence vector whose i-th component $b_i$ represents the model's confidence that the background noise segment belongs to category i; the mean and standard deviation of the confidence scores of all positive samples of category i are calculated to obtain the species mean score $\mu_{\text{sp},i}$ and the species score standard deviation $\sigma_{\text{sp},i}$; and the mean and standard deviation of the confidence scores of all background noise segments for category i are calculated to obtain the background mean score $\mu_{\text{bg},i}$ and the background score standard deviation $\sigma_{\text{bg},i}$.
[0055] Furthermore, positive sample segments are specifically defined as audio segments manually extracted from the historical audio time-series feature frames of the monitoring station that contain clear target species call categories (with minimal environmental interference). Background noise segments (i.e., negative sample segments) are specifically defined as audio segments manually extracted from the historical audio time-series feature frames of the monitoring station that have been verified to contain no target species call categories, and may include environmental background noise such as ship and equipment self-noise. In practice, 30-100 positive samples and 100-1000 negative samples are randomly selected for each marine mammal category; the larger the sample size, the more representative it is of the statistical regularity of the station, but the higher the corresponding manual cost.
[0056] It should be noted that the relationship between ocean background noise statistics and call categories needs clarification: although background noise is physically shared, the frequency bands and texture features on which the convolutional kernels of the pre-trained multi-class classification model focus differ from species to species. In the confidence vector output by the model, the component $b_i$ corresponding to category i represents the degree to which the model considers the current background noise to be category i. Therefore, the mean response $\mu_{\text{bg},i}$ and standard deviation $\sigma_{\text{bg},i}$ of the background noise in the i-th dimension must be calculated; only then can the noise-floor safety margin be set accurately for category i.
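The per-category calibration statistics of step S2 can be sketched as below. The sketch assumes the i-th confidence component has already been collected for every positive and background segment, and it uses the population standard deviation; the text does not say whether the population or sample form is intended. The function name and dictionary keys are hypothetical.

```python
from statistics import mean, pstdev


def site_statistics(pos_scores, bg_scores):
    """Summarize category-i confidence scores collected at one station.

    pos_scores: i-th component of the model output for each positive segment.
    bg_scores:  i-th component of the model output for each background segment.
    """
    return {
        "mu_sp": mean(pos_scores),       # species mean score
        "sigma_sp": pstdev(pos_scores),  # species score standard deviation
        "mu_bg": mean(bg_scores),        # background mean score
        "sigma_bg": pstdev(bg_scores),   # background score standard deviation
    }
```

The same function would be called once per call category, since the background response is measured separately in each output dimension.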
[0057] S3. Set adaptive thresholds: Use the confidence score statistics of positive samples to set the trigger threshold $T_{\text{trig},i}$ for each call category, and use the confidence score statistics of background noise to set the maintenance threshold $T_{\text{keep},i}$ for each call category, such that $T_{\text{trig},i} > T_{\text{keep},i}$, where i is the call category index.
[0058] Furthermore, the trigger threshold $T_{\text{trig},i}$ is calculated as:
[0059] $$T_{\text{trig},i} = \mu_{\text{sp},i} - \alpha_i \, \sigma_{\text{sp},i}$$
[0060] In the formula, $T_{\text{trig},i}$ is the trigger threshold for category i; $\mu_{\text{sp},i}$ is the species mean score for category i; $\sigma_{\text{sp},i}$ is the species score standard deviation for category i; and $\alpha_i$ is the first fine-tuning coefficient corresponding to category i, with a value range of [0.5, 3.0]. For species with small intra-class variance and highly stable vocal characteristics, a smaller $\alpha_i$ can be selected (e.g., 1.0) to ensure high-confidence triggering; for species with variable vocal behavior and large variance, $\alpha_i$ needs to be appropriately increased (e.g., 2.0-2.5) to expand the capture range of positive sample features.
[0061] Furthermore, the maintenance threshold $T_{\text{keep},i}$ is calculated as:
[0062] $$T_{\text{keep},i} = \mu_{\text{bg},i} + \beta_i \, \sigma_{\text{bg},i}$$
[0063] In the formula, $T_{\text{keep},i}$ is the maintenance threshold for category i; $\mu_{\text{bg},i}$ is the background mean score for category i; $\sigma_{\text{bg},i}$ is the background score standard deviation for category i; and $\beta_i$ is the second fine-tuning coefficient corresponding to category i, with a value range of [1.0, 10.0]. The specific value is set independently based on the signal-to-noise ratio of the target call category in the background noise and on the signal envelope attenuation characteristics. For example, if the station is located in open deep sea with stable background noise, $\beta_i$ can take a smaller value (e.g., 1.0-5.0); if the station is located in shallow near-shore waters and is severely affected by transient noise from shipping or waves, resulting in large fluctuations in the noise floor, $\beta_i$ needs to be increased (e.g., 6.0-10.0) to raise the safety margin and prevent environmental fluctuations from being mistaken for the tail of a signal.
[0064] Furthermore, when setting the first fine-tuning coefficient $\alpha_i$ and the second fine-tuning coefficient $\beta_i$, the two values are controlled jointly so that $T_{\text{trig},i} > T_{\text{keep},i}$ is satisfied. This establishes a safe "hysteresis buffer" between the target signal and the background noise.
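A minimal sketch of step S3 follows. It assumes the trigger threshold is the species mean minus a multiple of the species standard deviation and the maintenance threshold is the background mean plus a multiple of the background standard deviation, which is the reading suggested by the coefficient guidance above. The function name, argument names, and default coefficients are hypothetical.

```python
def adaptive_thresholds(mu_sp, sigma_sp, mu_bg, sigma_bg, alpha=1.0, beta=3.0):
    """Return (trigger, keep) thresholds for one call category.

    alpha: first fine-tuning coefficient (stated range 0.5 to 3.0).
    beta:  second fine-tuning coefficient (stated range 1.0 to 10.0).
    """
    trigger = mu_sp - alpha * sigma_sp  # assumed form, see lead-in
    keep = mu_bg + beta * sigma_bg      # assumed form, see lead-in
    if not trigger > keep:
        # The hysteresis buffer only exists when trigger exceeds keep.
        raise ValueError("trigger threshold must exceed maintenance threshold")
    return trigger, keep
```

Raising an error when the constraint fails makes a poor coefficient choice visible at calibration time instead of silently producing a degenerate detector.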
[0065] S4. Perform preliminary target detection: Perform initial inference on each audio temporal feature frame through a pre-trained multi-classification model, output the C-dimensional confidence vector corresponding to each call category, and execute hysteresis triggering judgment logic in parallel based on trigger threshold and maintenance threshold for each call category to obtain the preliminary temporal interval of each call category and locate the temporal boundary of each call category.
[0066] Furthermore, the hysteresis triggering judgment logic specifically includes: when the initial-inference confidence score $s_i$ of call category i is greater than or equal to the trigger threshold $T_{\text{trig},i}$, the preliminary call interval for that category is determined to be open. While the interval is open, if $s_i$ remains greater than or equal to the maintenance threshold $T_{\text{keep},i}$, the interval stays open; once $s_i$ falls below $T_{\text{keep},i}$, the interval is determined to be closed, thereby obtaining the preliminary temporal interval containing call category i. Multiple preliminary temporal intervals of different call categories are allowed to coexist in the same time period, and audio temporal feature frames not included in any preliminary temporal interval are marked as background noise.
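The hysteresis logic can be sketched for a single category as follows. The function name is hypothetical, intervals are returned as half-open frame-index ranges, and an interval still open at the end of the score sequence is closed there; that last choice is an assumption, since the text does not cover it.

```python
def hysteresis_intervals(scores, trigger, keep):
    """Return [start, end) frame-index intervals for one call category."""
    intervals = []
    start = None
    for idx, score in enumerate(scores):
        if start is None:
            if score >= trigger:  # open only on a confident frame
                start = idx
        elif score < keep:        # close only when below the lower threshold
            intervals.append((start, idx))
            start = None
    if start is not None:         # assumption: close at end of data
        intervals.append((start, len(scores)))
    return intervals
```

For example, with `scores = [0.1, 0.2, 0.85, 0.6, 0.3, 0.25, 0.1]`, `trigger = 0.8`, and `keep = 0.2`, the interval covers frames 2 through 5, whereas a single fixed threshold of 0.8 would keep only frame 2. Running the function independently on each dimension of the confidence vector gives the parallel per-category detection the step describes.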
[0067] S5. Perform multi-band filtering: For the preliminary time intervals of each call category obtained in step S4, retrieve the corresponding physical frequency band parameters and perform multi-band filtering.
[0068] Specifically, for the audio temporal feature frames corresponding to the preliminary temporal intervals of each call category obtained in step S4, the physical frequency band parameters corresponding to category i are retrieved, and the corresponding high-pass, low-pass, or band-pass filtering is applied according to the cutoff frequency or passband range defined by those parameters. Audio temporal feature frames marked as background noise are not filtered, which saves considerable computing power.
[0069] Furthermore, high-pass, low-pass, or band-pass filtering can be achieved by using different filters.
[0070] As one of the optional implementation schemes, the filter selection rules for different species are as follows: ① Extremely low frequency species (such as blue whales and fin whales): Their call fundamental frequency is usually between 10Hz and 100Hz. A low-pass filter (LPF) is suitable, with a cutoff frequency set at around 150Hz, to filter out high-frequency wind and wave noise and high-frequency mechanical noise from ships. ② Mid-to-high frequency pulse species (such as minke whales): Their call energy is often concentrated above 1000Hz. A high-pass filter (HPF) is suitable, with a cutoff frequency set at 800Hz to 1000Hz, to remove extremely severe low-frequency reverberation and dynamic noise from the ocean. ③ Broadband and ultra-high frequency species (such as various dolphins): Their whistles are usually distributed between 5kHz and 20kHz or even higher. A band-pass filter (BPF) is suitable, strictly limiting the passband to the core frequency range of their calls to eliminate other biological choruses and broadband pulses.
[0071] In addition, a Butterworth digital filter can be used because its passband response is extremely flat, which preserves the core energy of the call to the greatest extent. To keep the signal aligned on the time axis so that annotation boundaries can be positioned accurately, zero-phase filtering (such as the filtfilt forward-backward algorithm) can be used as an optional implementation, avoiding the phase distortion and time delay introduced by ordinary causal filters.
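The retrieval of per-category band parameters can be sketched as a lookup table. The category keys, the `BandSpec` type, and the helper `normalized_cutoffs` are hypothetical; the cutoff values are taken from the optional selection rules above, with 800 Hz chosen from the stated 800 Hz to 1000 Hz range. The sketch stops at producing cutoffs normalized to the Nyquist frequency, which is the form a Butterworth design routine typically accepts; the filtering itself is not shown.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class BandSpec:
    kind: str                      # "lowpass", "highpass", or "bandpass"
    low_hz: Optional[float] = None
    high_hz: Optional[float] = None


# Hypothetical category keys; cutoffs follow the optional rules in the text.
BAND_PARAMS = {
    "blue_whale": BandSpec("lowpass", high_hz=150.0),
    "fin_whale": BandSpec("lowpass", high_hz=150.0),
    "minke_whale": BandSpec("highpass", low_hz=800.0),
    "dolphin": BandSpec("bandpass", low_hz=5000.0, high_hz=20000.0),
}


def normalized_cutoffs(spec, sample_rate_hz):
    """Return the cutoff edge(s) as fractions of the Nyquist frequency."""
    nyquist = sample_rate_hz / 2.0
    edges = [f for f in (spec.low_hz, spec.high_hz) if f is not None]
    if any(f <= 0 or f >= nyquist for f in edges):
        raise ValueError("cutoff must lie strictly between 0 and Nyquist")
    return [f / nyquist for f in edges]
```

The Nyquist check matters in practice: a 5 kHz to 20 kHz passband cannot be realized on a recording sampled at 16 kHz, so such a category would need a higher sample rate.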
[0072] S6. Perform temporal annotation interval boundary correction: Each filtered audio temporal feature frame is re-input into the pre-trained multi-classification model for secondary inference, which outputs the corresponding C-dimensional confidence vector. Based on the set trigger and maintenance thresholds, the hysteresis trigger judgment logic is executed again on the secondary-inference confidence scores to correct the start and end timestamps of each call category, thereby locating and locking the final call temporal boundary $[t_{\text{start},i}, t_{\text{end},i}]$. At the same time, false alarms whose secondary-inference confidence scores fail to reach the trigger threshold are removed. Here $t_{\text{start},i}$ is the start timestamp of category i and $t_{\text{end},i}$ is the end timestamp of category i.
[0073] Furthermore, correcting the start and end timestamps of each call category specifically includes: when the secondary-inference confidence score $s'_i$ corresponding to category i is greater than or equal to the trigger threshold $T_{\text{trig},i}$, the call interval for that category is determined to be open and the start timestamp $t_{\text{start},i}$ is recorded. While the interval is open, if $s'_i$ remains greater than or equal to the maintenance threshold $T_{\text{keep},i}$, the interval stays open; once $s'_i$ falls below $T_{\text{keep},i}$, the interval is determined to be closed and the end timestamp $t_{\text{end},i}$ is recorded.
[0074] It should be noted that if the high confidence score in the initial inference was caused by spurious features from out-of-band broadband mechanical noise, these features are physically removed by filtering. The secondary-inference confidence score $s'_i$ then drops sharply and cannot reach the trigger threshold $T_{\text{trig},i}$, so the interval is automatically rejected as a false alarm.
[0075] Furthermore, the final call temporal boundary is determined through a boundary inward contraction mechanism. Specifically, the beginning and end of the preliminary temporal interval often contain long multipath reverberations caused by the marine environment. After band filtering, out-of-band energy masking is eliminated, and the secondary-inference confidence score curve shows a steeper descent at the true physical boundary of the target call than the initial-inference curve. Applying the same maintenance threshold $T_{\text{keep},i}$ to this steeper curve removes redundant silent and reverberant frames, so the final start and end timestamps shrink inward and lock onto the physical temporal boundaries of the call more precisely.
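The second pass of step S6 can be sketched as re-running the hysteresis logic inside each preliminary interval on the secondary-inference scores. The function names are hypothetical, intervals are half-open frame-index ranges, and the sketch assumes the second pass is restricted to frames inside the preliminary intervals, since only those frames are filtered.

```python
def _hysteresis(scores, trigger, keep):
    """Hysteresis detection over one score sequence, as [start, end) pairs."""
    found, start = [], None
    for idx, score in enumerate(scores):
        if start is None:
            if score >= trigger:
                start = idx
        elif score < keep:
            found.append((start, idx))
            start = None
    if start is not None:
        found.append((start, len(scores)))
    return found


def refine_intervals(preliminary, second_scores, trigger, keep):
    """Correct boundaries and drop false alarms using second-pass scores."""
    final = []
    for begin, end in preliminary:
        window = second_scores[begin:end]
        # An interval that never reaches the trigger threshold after
        # filtering yields no sub-interval and is therefore dropped.
        for s, e in _hysteresis(window, trigger, keep):
            final.append((begin + s, begin + e))
    return final
```

In the test case used to check this sketch, a preliminary interval of frames 2 to 8 shrinks to frames 3 to 6, and a second preliminary interval whose filtered scores peak at 0.5 is removed because it never reaches a trigger threshold of 0.8.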
[0076] S7. Automated generation of structured annotation files: Outputs a structured annotation file containing call categories, final call temporal boundaries, and confidence scores.
[0077] As an optional implementation, the system automatically generates a structured CSV or TXT annotation file containing the audio file name, call category i, the final call temporal boundaries $[t_{\text{start},i}, t_{\text{end},i}]$, and the corresponding confidence score.
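Writing the annotation file can be sketched with Python's standard `csv` module. The column names and the function name are hypothetical, and the sketch leaves open how the per-interval confidence score is computed, since the text does not specify it.

```python
import csv

# Hypothetical column names; the text only lists the fields to include.
FIELDS = ["audio_file", "category", "start_s", "end_s", "confidence"]


def write_annotations(rows, stream):
    """Write annotation rows (dicts keyed by FIELDS) as CSV to a text stream."""
    writer = csv.DictWriter(stream, fieldnames=FIELDS, lineterminator="\n")
    writer.writeheader()
    for row in rows:
        writer.writerow(row)
```

Passing an open file (for example `open("labels.csv", "w", newline="")`) or an in-memory `io.StringIO` as `stream` keeps the function usable both in batch processing and in tests.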
[0078] The embodiments of the present invention also provide an automated animal call annotation system based on spatiotemporal adaptive thresholds, which is used to implement the automated animal call annotation method based on spatiotemporal adaptive thresholds in the above embodiments. Figure 5 is a schematic diagram of this system. The system includes one or more processors and a memory coupled to the processors; the memory stores program data, and the processor executes the program data to implement the automated animal call annotation method based on spatiotemporal adaptive thresholds described in the above embodiments.
[0079] The embodiments of the automated animal call annotation system based on spatiotemporal adaptive thresholds of this invention can be applied to any device with data processing capabilities, such as a computer or a similar device or system. The system embodiments can be implemented in software, in hardware, or in a combination of both. Taking software implementation as an example, the system, as a logical entity, is formed when the processor of the data processing device on which it resides loads the corresponding computer program instructions from non-volatile memory into memory for execution. At the hardware level, Figure 5 shows the hardware structure of a device with data processing capabilities on which the system resides. In addition to the processor, memory, network interface (such as an audio acquisition module connecting a hydrophone array or local storage device), and non-volatile memory shown in Figure 5, the data processing device in the embodiment may also include other hardware, such as output interfaces and communication buses, depending on its actual function; these are not elaborated further. The memory stores computer program code implementing the aforementioned adaptive threshold annotation method. The processor calls this program via the bus to perform inference, statistical calibration, hysteresis determination, and frequency band filtering verification on massive continuous audio streams, and exports the final results through the output interface.
[0080] The implementation process of the functions and roles of each unit in the above system is detailed in the implementation process of the corresponding steps in the above method, and will not be repeated here.
[0081] For the system embodiments, since they basically correspond to the method embodiments, the relevant parts can be referred to in the description of the method embodiments. The device embodiments described above are merely illustrative. The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units, that is, they may be located in one place or distributed across multiple network units. Some or all of the modules can be selected to achieve the purpose of the present invention according to actual needs. Those skilled in the art can understand and implement this without creative effort.
[0082] This invention also provides a computer-readable storage medium storing a program thereon, which, when executed by a processor, implements the automatic annotation method for animal calls based on spatiotemporal adaptive thresholds described in the above embodiments.
[0083] The computer-readable storage medium can be an internal storage unit of any data processing device described in any of the foregoing embodiments, such as a hard disk or memory. The computer-readable storage medium can also be an external storage device of the data processing device, such as a plug-in hard disk, Smart Media Card (SMC), SD card, or flash card equipped on the device. Furthermore, the computer-readable storage medium can include both internal storage units and external storage devices of the data processing device. The computer-readable storage medium is used to store the computer program and other programs and data required by the data processing device, and can also be used to temporarily store data that has been output or will be output.
[0084] The following describes in detail the automatic annotation method and system for animal calls based on spatiotemporal adaptive thresholds of the present invention with reference to embodiments, which will make the purpose and effects of the present invention more apparent.
[0085] Example: Automatic labeling of minke whale (Ba type) calls
[0086] This embodiment describes the implementation process and beneficial effects of the present invention in detail, taking as an example the automated annotation of possible minke whale (model category label Ba) call intervals in the recordings of the Cross_A_02 monitoring station, drawn from the publicly available hydrophone recording dataset of a local passive acoustic monitoring network (PIPAN).
[0087] Recording files from the monitoring station were acquired, and station feature labeling was performed. The pre-trained multi-classification model was Google's open-source deep learning model google/multispecies-whale, which by default produces confidence scores over 5-second time windows. The multispecies-whale model can identify 12 call categories, including those of minke whales, humpback whales, and Bryde's whales. Sixty-three 5-second clips of minke whale calls were manually selected as positive samples, and 499 manually identified ocean background sound clips, containing noise from ships, equipment, and fish, were selected as negative samples. The confidence scores obtained by scoring the positive and negative samples with the multispecies-whale model are shown in Figure 2.
[0088] In Figure 2, the dots on the left show the confidence scores of the minke whale (Ba class) positive samples, from which the species mean score and the species score standard deviation can be calculated. The asterisks on the right show the confidence score distribution of the marine background noise at the current station, from which the background mean score and the background score standard deviation can be calculated. Figure 2 shows that the variance of the positive-sample confidence scores at some monitoring stations can be quite large, so a single fixed threshold is highly prone to misjudgment. Therefore, a first fine-tuning coefficient is set to obtain the trigger threshold, and a second fine-tuning coefficient is set to obtain the maintenance threshold.
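The threshold calibration can be sketched as below. The formulas themselves are not legible in this text, so the sketch makes loud assumptions: the trigger threshold is taken as the positive-sample mean minus the first coefficient times the positive-sample standard deviation, and the maintenance threshold as the background mean plus the second coefficient times the background standard deviation. The signs, the use of the population standard deviation, the coefficient values, and the sample scores are all hypothetical.

```python
from statistics import mean, pstdev
from typing import Sequence, Tuple


def adaptive_thresholds(positive_scores: Sequence[float],
                        background_scores: Sequence[float],
                        k_trigger: float,
                        k_maintain: float) -> Tuple[float, float]:
    """Return (trigger, maintain) thresholds for one call category at one station."""
    # ASSUMPTION: sign conventions below are illustrative, not taken from the source.
    trigger = mean(positive_scores) - k_trigger * pstdev(positive_scores)
    maintain = mean(background_scores) + k_maintain * pstdev(background_scores)
    # The source requires the trigger threshold to exceed the maintenance threshold.
    if trigger <= maintain:
        raise ValueError("trigger threshold must be greater than maintenance threshold")
    return trigger, maintain


# Illustrative scores and coefficients only.
trigger_ba, maintain_ba = adaptive_thresholds(
    positive_scores=[0.9, 0.7, 0.8, 0.6],
    background_scores=[0.05, 0.15, 0.10, 0.10],
    k_trigger=1.0,
    k_maintain=1.0,
)
```

Because the statistics are computed per station and per category, the same coefficients yield different thresholds wherever the background or positive-sample score distributions differ, which is the sense in which the thresholds adapt.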
[0089] Taking a 75-second audio stream segment from this station as an example, the trigger threshold is used to detect the start time frame of the minke whale call, and the maintenance threshold is used to locate its end time frame. Since the energy of the minke whale call is mainly concentrated in the frequency band above 1000 Hz, a 6th-order Butterworth high-pass filter with a cutoff frequency of 1000 Hz is used, and zero-phase bidirectional filtering is employed to avoid phase distortion. The recordings before and after filtering are then input into the pre-trained multi-classification model. The scatter plot in Figure 3 shows the change in confidence score over time frames before and after filtering, and demonstrates the validity of the trigger threshold and maintenance threshold settings.
[0090] The scatter plot in Figure 3 shows how the confidence score of an audio signal containing a minke whale call changes over time, including the initial confidence score before filtering and the confidence score after 1000 Hz high-pass filtering. When the score at a given time frame exceeds the upper line (the trigger threshold), annotation is activated; although the score fluctuates afterward, the annotation is maintained until the score finally falls below the lower line (the maintenance threshold). This figure visually demonstrates the effectiveness of the trigger threshold and maintenance threshold hysteresis settings for continuous detection of minke whale calls and for resistance to background noise fluctuations.
[0091] Figure 4 demonstrates the actual effects of multi-band filtering and of correcting the temporal interval boundaries. The upper half of Figure 4 is the Mel-scale time-frequency diagram of the original recording, and the lower half is the time-frequency diagram after applying a 1000 Hz high-pass filter matched to the frequency band of the minke whale call. The precise timing boundaries of the calls are marked with boxes.
[0092] In the original time-frequency diagram, the characteristic frequency bands of the minke whale are masked by a large amount of low-frequency background wind, wave, and ship noise. As a result, a model relying solely on full-band inference is susceptible to false alarms caused by broadband noise, and the temporal boundaries of the calls are blurred. By comparing the confidence scores before and after filtering, the system successfully eliminated high-scoring false alarms caused by low-frequency interference. Furthermore, for the three minke whale call segments marked by boxes in Figure 4 (temporal intervals 0, 1, and 2), the method described in this invention defines tighter and more accurate temporal boundaries on the time axis, which solves the problems of traditional methods in which "the frame is too large and covers the noise" or "the frame is too small and breaks up the calls".
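The filtering step can be illustrated with the dependency-free sketch below, which builds a 6th-order Butterworth high-pass filter as three cascaded second-order sections (bilinear transform with the cutoff pre-warped) and applies it forward and then backward to cancel the phase shift. The function names, the 10 kHz sample rate, and the test tones are assumptions; the source does not state the recording sample rate. In practice a library routine would normally be used instead, for example `scipy.signal.butter(..., output="sos")` together with `scipy.signal.sosfiltfilt`, which also pads the signal edges (this sketch does not).

```python
import math
from typing import List, Sequence, Tuple

Biquad = Tuple[float, float, float, float, float]  # b0, b1, b2, a1, a2 (a0 normalised to 1)


def butterworth_highpass_sections(order: int, cutoff_hz: float,
                                  sample_rate: float) -> List[Biquad]:
    """Even-order Butterworth high-pass filter as cascaded second-order sections."""
    if order <= 0 or order % 2 != 0:
        raise ValueError("this sketch supports positive even orders only")
    w0 = 2.0 * math.pi * cutoff_hz / sample_rate
    cos_w0, sin_w0 = math.cos(w0), math.sin(w0)
    sections: List[Biquad] = []
    for k in range(order // 2):
        # Quality factor of the k-th Butterworth pole pair.
        q = 1.0 / (2.0 * math.cos(math.pi * (2 * k + 1) / (2 * order)))
        alpha = sin_w0 / (2.0 * q)
        a0 = 1.0 + alpha
        b0 = (1.0 + cos_w0) / 2.0 / a0
        b1 = -(1.0 + cos_w0) / a0
        sections.append((b0, b1, b0, -2.0 * cos_w0 / a0, (1.0 - alpha) / a0))
    return sections


def _filter_once(sections: Sequence[Biquad], samples: Sequence[float]) -> List[float]:
    """Run the cascade once over the samples (transposed direct form II)."""
    out = list(samples)
    for b0, b1, b2, a1, a2 in sections:
        z1 = z2 = 0.0
        for n, x in enumerate(out):
            y = b0 * x + z1
            z1 = b1 * x - a1 * y + z2
            z2 = b2 * x - a2 * y
            out[n] = y
    return out


def zero_phase_highpass(samples: Sequence[float], cutoff_hz: float,
                        sample_rate: float, order: int = 6) -> List[float]:
    """Forward-backward filtering: phase shifts of the two passes cancel."""
    sections = butterworth_highpass_sections(order, cutoff_hz, sample_rate)
    forward = _filter_once(sections, samples)
    backward = _filter_once(sections, forward[::-1])
    return backward[::-1]


def rms(values: Sequence[float]) -> float:
    return math.sqrt(sum(v * v for v in values) / len(values))


# Hypothetical check: a 100 Hz tone should be suppressed, a 3000 Hz tone kept.
FS = 10000.0  # assumed sample rate in Hz
low_tone = [math.sin(2 * math.pi * 100 * i / FS) for i in range(4000)]
high_tone = [math.sin(2 * math.pi * 3000 * i / FS) for i in range(4000)]
low_out = zero_phase_highpass(low_tone, 1000.0, FS)
high_out = zero_phase_highpass(high_tone, 1000.0, FS)
```

Filtering in both directions squares the magnitude response, so the effective attenuation below the cutoff is steeper than that of a single 6th-order pass.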
[0093] In summary, the present invention, through a series of methods combining statistical and acoustic physical characteristics as demonstrated in the above embodiments, significantly overcomes the shortcomings of inaccurate single threshold labeling and susceptibility to noise interference, providing a robust technical solution for marine ecological monitoring.
[0094] The above embodiments are only used to illustrate the technical solutions of the present invention, and are not intended to limit it. Although the present invention has been described in detail with reference to the foregoing embodiments, those skilled in the art should understand that modifications can still be made to the technical solutions described in the foregoing embodiments, or equivalent substitutions can be made to some of the technical features. Such modifications or substitutions do not cause the essence of the corresponding technical solutions to deviate from the spirit and scope of the technical solutions of the embodiments of the present invention.
Claims
1. An automatic annotation method for animal calls based on spatiotemporal adaptive thresholds, characterized in that it includes the following steps:
S1. Obtain the hydrophone recording file of the specified monitoring station, read it in segments to obtain audio stream fragments, and divide them into audio temporal feature frames;
S2. Based on the audio temporal feature frames, manually extract background noise segments and positive sample segments of the various calls, input them into the pre-trained multi-classification model, and compute the confidence score statistics of the background noise and of the positive samples;
S3. Use the confidence score statistics of the positive samples to set a trigger threshold for each call category, and use the confidence score statistics of the background noise to set a maintenance threshold for each call category; the trigger threshold for category i is calculated by a formula from the species mean score of category i, the species score standard deviation of category i, and the first fine-tuning coefficient corresponding to category i; the maintenance threshold for category i is calculated by a formula from the background mean score of category i, the background score standard deviation of category i, and the second fine-tuning coefficient corresponding to category i; the trigger threshold is greater than the maintenance threshold;
S4. Perform initial inference on each audio temporal feature frame through the pre-trained multi-classification model, output the corresponding confidence vector, and execute the hysteresis trigger judgment logic based on the trigger threshold and the maintenance threshold to obtain the preliminary time interval of each call category;
S5. For the preliminary time interval of each call category, retrieve the corresponding physical frequency band parameters and perform multi-band filtering;
S6. Re-input each filtered audio temporal feature frame into the pre-trained multi-classification model for secondary inference, and then execute the hysteresis trigger judgment logic again to correct the temporal boundary of each call category while eliminating false alarms;
S7. Output a structured annotation file containing call categories, call temporal boundaries, and confidence scores.
2. The automatic annotation method for animal calls based on spatiotemporal adaptive thresholds according to claim 1, characterized in that the pre-trained multi-classification model is a convolutional neural network model; its input is the Mel time-frequency map corresponding to the audio temporal feature frame, and its output is a C-dimensional confidence vector in which each dimension corresponds to the confidence score of one category, with values in the range [0, 1].
3. The automatic annotation method for animal calls based on spatiotemporal adaptive thresholds according to claim 1, characterized in that the confidence score statistics of the background noise include the background mean score and the background score standard deviation, and the confidence score statistics of the positive samples include the species mean score and the species score standard deviation. Computing the confidence score statistics of the background noise and the positive samples specifically includes: randomly selecting positive sample segments and background noise segments for each call category i; inputting each positive sample segment into the pre-trained multi-classification model and outputting a C-dimensional confidence vector, whose i-th component is the confidence score with which the model considers the positive sample segment to belong to category i; inputting each background noise segment into the pre-trained multi-classification model and outputting a C-dimensional confidence vector, whose i-th component is the confidence score with which the model considers the background noise segment to belong to category i; calculating the mean and standard deviation of the confidence scores of all positive samples of category i to obtain the species mean score and the species score standard deviation; and calculating the mean and standard deviation of the confidence scores of all background noise segments for category i to obtain the background mean score and the background score standard deviation.
4. The automatic annotation method for animal calls based on spatiotemporal adaptive thresholds according to claim 3, characterized in that the positive sample segment is an audio segment containing clear calls of the target species, manually extracted from the historical audio temporal feature frames of the monitoring station; and the background noise segment is an audio segment manually extracted from the historical audio temporal feature frames of the monitoring station and verified to contain no calls of any target species.
5. The automatic annotation method for animal calls based on spatiotemporal adaptive thresholds according to claim 1, characterized in that the hysteresis trigger judgment logic specifically includes: when the initial-inference confidence score for call category i is greater than or equal to the trigger threshold for that category, the preliminary call interval for that category is determined to be open; while in the open state, the interval remains open as long as the confidence score stays greater than or equal to the maintenance threshold, and it is determined to be closed when the confidence score falls below the maintenance threshold, thereby obtaining the preliminary time interval containing call category i; multiple preliminary time intervals of different call categories are allowed to exist in the same time period, and audio temporal feature frames that are not included in any preliminary time interval are marked as background noise.
6. The automatic annotation method for animal calls based on spatiotemporal adaptive thresholds according to claim 1, characterized in that performing multi-band filtering specifically includes: for the audio temporal feature frames corresponding to the preliminary time interval of each call category, retrieving the physical frequency band parameters corresponding to category i, and applying the corresponding high-pass, low-pass, or band-pass filtering according to the cutoff frequency or passband range defined by those parameters; audio temporal feature frames marked as background noise are not filtered.
7. The automatic annotation method for animal calls based on spatiotemporal adaptive thresholds according to claim 1, characterized in that correcting the call temporal boundaries of each call category specifically includes: when the secondary-inference confidence score for category i is greater than or equal to the trigger threshold for that category, the call interval for that category is opened and the start timestamp is recorded; while in the open state, the interval remains open as long as the confidence score stays greater than or equal to the maintenance threshold, and when the confidence score falls below the maintenance threshold the interval is determined to be closed and the end timestamp is recorded, thereby correcting the start and end timestamps of each call category and obtaining the final call temporal boundaries; the false alarm is an audio temporal feature frame in which the confidence score of the secondary inference does not reach the trigger threshold.
8. An automated animal call annotation system based on spatiotemporal adaptive thresholds, comprising one or more processors and a memory, characterized in that the memory is coupled to the processor; the memory is used to store program data, and the processor is used to execute the program data to implement the automatic annotation method for animal calls based on spatiotemporal adaptive thresholds according to any one of claims 1-7.
9. A computer-readable storage medium, characterized in that it stores a program that, when executed by a processor, implements the automatic annotation method for animal calls based on spatiotemporal adaptive thresholds according to any one of claims 1-7.
Citation Information
Patent Citations
Pet sound time point positioning method based on deep learning and time-frequency characteristics
CN120260586A
Sound data processing method, electronic device, storage medium and computer program product
CN120977309A