An audio clustering method based on adaptive audio fingerprints
By using adaptive audio fingerprinting technology, which adjusts the sampling step size and feature buffer pool using spectral flux, the problems of data redundancy and low computational efficiency in existing audio fingerprinting technologies are solved, and efficient audio clustering is achieved.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Applications(China)
- Current Assignee / Owner
- LINKER
- Filing Date
- 2026-03-03
- Publication Date
- 2026-06-30
Figure CN122309801A_ABST
Abstract
Description
Technical Field
[0001] This invention relates to the fields of audio signal processing and computer information retrieval technology, and in particular to an audio clustering method based on adaptive audio fingerprints.
Background Technology
[0002] Audio fingerprinting technology plays a crucial role in the digital music industry, copyright monitoring, and short video content distribution. An audio fingerprint is a compact digital summary extracted from an audio signal using specific algorithms, representing the core acoustic features of that audio. Based on audio fingerprints, systems can achieve rapid audio identification (retrieval), deduplication, and cluster analysis.
[0003] Traditional audio fingerprinting algorithms (such as the classic Philips algorithm or Shazam algorithm) typically employ "fixed parameter" processing logic:
Fixed frame segmentation: Audio is segmented using a fixed frame length (e.g., 37ms) and a fixed overlap ratio (e.g., 31/32 overlap).
[0004] Constant output: Regardless of whether the audio content is a dramatic rock climax or a nearly still background sound, the algorithm generates fingerprint data at a constant rate.
[0005] This existing technology has the following significant technical problems when dealing with clustering tasks involving massive amounts of audio data: Significant data redundancy and high storage and computation costs: For stable segments in audio (such as long notes, silences, or repetitive rhythms), fixed high-density sampling generates a large number of highly similar or even duplicate fingerprints. This not only wastes server storage space but also leads to massive amounts of invalid computation in subsequent clustering comparisons, severely slowing down system processing speed.
[0006] The contradiction between feature capture capability and data volume: In order to capture transient details in audio (such as fast drum beats), an extremely high overlap rate (i.e., a very small step size) is usually required, but this will further exacerbate data inflation; conversely, if the step size is increased in order to reduce the amount of data, key transient features are easily missed, leading to a decrease in clustering accuracy.
[0007] Weak noise resistance: Existing algorithms often lack intelligent filtering mechanisms for silence or noise at the beginning and end of audio, causing the fingerprints generated by these non-core information to interfere with the clustering results and reduce the accuracy of classification.
[0008] Therefore, there is an urgent need for a technical solution that can adaptively adjust the sampling strategy according to the audio content, so as to significantly reduce the amount of fingerprint data while maintaining or even improving the feature representativeness and clustering accuracy.
Summary of the Invention
[0009] This invention aims to address the problems of excessive fingerprint data volume, low computational efficiency, and lack of flexibility in processing steady-state and transient signals in existing audio fingerprinting technologies. It proposes an audio clustering method based on adaptive audio fingerprints. This method achieves dynamic scaling of the sampling step size by introducing spectral flux as a control variable and realizes adaptive integration of fingerprint windows through a feature buffer pool mechanism. This scheme can significantly reduce the number of fingerprints and improve clustering efficiency while maintaining the discriminative power of audio fingerprints.
[0010] The present invention addresses the aforementioned technical problems primarily through the following technical solution: an audio clustering method based on adaptive audio fingerprints, comprising the following steps: S1: Adaptive framing processing is performed on the audio signal to obtain an audio frame sequence; the adaptive framing processing includes: calculating the spectral flux of the current analysis frame starting from a preset initial time point, and determining the sampling step size of the next frame based on the spectral flux of the current analysis frame; wherein the spectral flux of the current analysis frame and the sampling step size of the next frame are negatively correlated. Spectral flux is used to characterize the degree of change in an audio signal in the frequency domain. This invention establishes a negative correlation between spectral flux and sampling step size: a larger spectral flux indicates more drastic signal changes (such as transients, percussion, or pitch shifts), in which case a smaller sampling step size (increasing sampling density) is used to ensure fine capture of signal details; a smaller spectral flux indicates a more stable signal (such as sustained notes or background atmosphere), in which case a larger sampling step size (reducing sampling density) is used to reduce data redundancy. The core of adaptive framing processing lies in changing the traditional fixed step size mode.
[0011] S2: Based on the audio frame sequence, a fingerprint unit is generated. In order to further compress the data and extract representative "audio events", this invention does not directly generate a fingerprint for each frame, but introduces a feature cache pool mechanism. Specifically, a feature cache pool is established, and the feature vectors of consecutive audio frames in the audio frame sequence are stored in the feature cache pool in sequence. A preset window integration condition is set, which aims to determine whether the current audio segment is maintained within the same "acoustic event". When the feature cache pool meets the preset window integration condition, it means that the current acoustic event has ended or reached its maximum length. At this time, all feature vectors in the feature cache pool are aggregated and calculated (e.g., weighted average) to generate a fingerprint unit representing the time period, and the feature cache pool is cleared to prepare to receive data from the next event.
S3: Construct an audio fingerprint based on several generated fingerprint units, and calculate the similarity between audio signals based on the audio fingerprint to cluster the audio signals. The final generated sequence of fingerprint units constitutes the complete fingerprint of the audio signal, and clustering is achieved by calculating the distance between fingerprints (such as Hamming distance).
[0012] Preferably, the specific formula for calculating the spectral flux of the current analysis frame in step S1 is as follows: $SF_i = \sum_{k=1}^{N} \left(|X_i(k)| - |X_{i-1}(k)|\right)^2$; where $SF_i$ represents the spectral flux of the i-th current analysis frame, N represents the total number of frequency points, $|X_i(k)|$ represents the amplitude spectrum of the i-th frame at the k-th frequency point, and $|X_{i-1}(k)|$ represents the amplitude spectrum of the previous frame at the k-th frequency point.
[0013] It should be noted that although the above formula uses the squared difference form based on Euclidean distance (L2 norm), which is more sensitive to large transient changes and is the preferred method of the present invention, in other embodiments of the present invention, the spectral flux can also be calculated using the absolute difference, cosine distance, or half-wave rectification based on Manhattan distance (L1 norm). As long as the physical quantity can quantitatively characterize the degree of spectral change between adjacent frames, it is within the protection scope of the present invention.
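The spectral flux variants named above can be sketched as follows. This is a minimal illustration, not the patented implementation: the function name `spectral_flux` and the `mode` labels are hypothetical, the magnitude spectra are assumed to be precomputed lists of equal length, and no normalization is applied because the text does not specify one.

```python
from typing import Sequence


def spectral_flux(curr_mag: Sequence[float],
                  prev_mag: Sequence[float],
                  mode: str = "l2") -> float:
    """Spectral flux between two magnitude spectra with the same number of bins.

    mode="l2":  sum of squared differences (the preferred form in the text)
    mode="l1":  sum of absolute differences
    mode="hwr": half-wave rectified, counts only increases in magnitude
    """
    if len(curr_mag) != len(prev_mag):
        raise ValueError("spectra must have the same number of bins")
    diffs = [c - p for c, p in zip(curr_mag, prev_mag)]
    if mode == "l2":
        return sum(d * d for d in diffs)
    if mode == "l1":
        return sum(abs(d) for d in diffs)
    if mode == "hwr":
        return sum(d for d in diffs if d > 0)
    raise ValueError(f"unknown mode: {mode}")
```

The squared form weights one large change more heavily than several small ones, which matches the stated preference for sensitivity to transients.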
[0014] Preferably, the specific process of dynamically determining the sampling step size of the next frame based on the spectral flux in step S1 includes: presetting a minimum step size $H_{min}$ (e.g., 10ms), a maximum step size $H_{max}$ (e.g., 40ms-60ms), a high flux threshold $T_{high}$, and a low flux threshold $T_{low}$; obtaining the spectral flux $SF_i$ of the current analysis frame; if $SF_i > T_{high}$, indicating that the signal is in a region of rapid change, the sampling step size Step for the next frame is set to $H_{min}$; if $SF_i < T_{low}$, indicating that the signal is in the steady-state region, the sampling step size Step for the next frame is set to $H_{max}$; if $T_{low} \le SF_i \le T_{high}$, the sampling step size Step for the next frame is calculated according to the linear interpolation strategy (or sigmoid function mapping), and the calculation formula is: $Step = H_{min} + (H_{max} - H_{min})(T_{high} - SF_i) / (T_{high} - T_{low})$. This logic ensures that the step size transitions smoothly between the maximum and minimum values based on the signal content.
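The piecewise mapping from spectral flux to step size can be written as a small function. This is a sketch under stated assumptions: the name `next_step_ms` is hypothetical, the defaults use the example values 10ms and 40ms from the text, and only the linear interpolation branch is shown (not the sigmoid alternative).

```python
def next_step_ms(sf: float, t_low: float, t_high: float,
                 h_min: float = 10.0, h_max: float = 40.0) -> float:
    """Map the spectral flux of the current frame to the next sampling step (ms).

    High flux gives the minimum step, low flux gives the maximum step,
    and values in between are linearly interpolated.
    """
    if t_high <= t_low:
        raise ValueError("t_high must be greater than t_low")
    if sf > t_high:
        return h_min
    if sf < t_low:
        return h_max
    return h_min + (h_max - h_min) * (t_high - sf) / (t_high - t_low)
```

At `sf == t_low` the interpolation yields exactly `h_max`, and at `sf == t_high` it yields `h_min`, so the three branches join without a jump.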
[0015] The absolute value of spectral flux depends on the sampling depth and normalization method of the audio signal. Assuming the audio signal has been normalized to the [-1, 1] interval, the high flux threshold $T_{high}$ and the low flux threshold $T_{low}$ can be determined in the following two ways. 1. Empirical value setting method: In a typical implementation scenario, after extensive experimental verification, the low flux threshold $T_{low}$ can be set between 0.05 and 0.1 to filter out background noise or very subtle changes in ambient sound, and the high flux threshold $T_{high}$ can be set between 0.5 and 0.8; this value typically corresponds to moments of significant drum beats or abrupt melodic changes.
[0016] 2. Adaptive Statistical Setting Method (Preferred): To adapt to audio files of different volume levels, a dynamic threshold setting strategy based on statistical distribution is preferred. The specific steps are as follows: calculate in advance the mean $\mu_{sf}$ and standard deviation $\sigma_{sf}$ of the spectral flux of the current audio segment (e.g., the first 5 seconds); set $T_{low} = \mu_{sf} + 0.5\sigma_{sf}$; set $T_{high} = \mu_{sf} + 2\sigma_{sf}$. Using this statistical method, the algorithm can accurately identify regions of relatively drastic change and stable regions regardless of whether the input audio is loud or soft.
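The adaptive statistical setting can be illustrated with the Python standard library. The function name `adaptive_thresholds` is hypothetical, and the use of the population standard deviation is an assumption, since the text does not say whether the population or sample form is intended.

```python
import statistics
from typing import Sequence, Tuple


def adaptive_thresholds(flux_values: Sequence[float]) -> Tuple[float, float]:
    """Return (t_low, t_high) from flux values of a lead-in segment.

    t_low  = mean + 0.5 * std
    t_high = mean + 2.0 * std
    Assumption: population standard deviation (statistics.pstdev).
    """
    mu = statistics.mean(flux_values)
    sigma = statistics.pstdev(flux_values)
    return mu + 0.5 * sigma, mu + 2.0 * sigma
```

Because both thresholds scale with the measured mean and spread, a quieter recording with proportionally smaller flux values is assigned proportionally smaller thresholds.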
[0017] Preferably, in step S2, before storing the feature vectors of consecutive audio frames in the audio frame sequence into the feature cache pool, a feature extraction step is also included: A1: Perform windowing (such as Hanning window) and Fast Fourier Transform (FFT) on each frame in the audio frame sequence; A2: Divide the frequency band from 300Hz to 2000Hz (the area where the human ear is sensitive and where the main melody is distributed) into 33 sub-bands according to the Bark scale to simulate the auditory perception characteristics of the human ear; A3: Calculate the energy value of each subband and construct the 33-dimensional feature vector of the frame.
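Steps A2 and A3 can be sketched as below. The text does not state which Bark conversion is used, so this example assumes Traunmüller's approximation and equal-width bands on the Bark axis; the function names and the one-sided power spectrum layout (bins spanning 0 Hz to half the sample rate) are likewise illustrative assumptions.

```python
from typing import List, Sequence


def hz_to_bark(f_hz: float) -> float:
    """Traunmueller approximation of the Bark scale (an assumption here)."""
    return 26.81 * f_hz / (1960.0 + f_hz) - 0.53


def bark_to_hz(z: float) -> float:
    """Algebraic inverse of hz_to_bark."""
    return 1960.0 * (z + 0.53) / (26.28 - z)


def bark_band_edges(f_lo: float = 300.0, f_hi: float = 2000.0,
                    n_bands: int = 33) -> List[float]:
    """n_bands + 1 edge frequencies, equally spaced on the Bark axis."""
    z_lo, z_hi = hz_to_bark(f_lo), hz_to_bark(f_hi)
    return [bark_to_hz(z_lo + (z_hi - z_lo) * j / n_bands)
            for j in range(n_bands + 1)]


def band_energies(power_spectrum: Sequence[float], sample_rate: float,
                  edges: Sequence[float]) -> List[float]:
    """Sum the power of each bin into the band whose edges contain it."""
    hz_per_bin = (sample_rate / 2.0) / (len(power_spectrum) - 1)
    energies = [0.0] * (len(edges) - 1)
    for k, p in enumerate(power_spectrum):
        f = k * hz_per_bin
        for j in range(len(edges) - 1):
            if edges[j] <= f < edges[j + 1]:
                energies[j] += p
                break
    return energies
```

With 33 bands between 300Hz and 2000Hz the bands are narrow, so a short analysis frame may leave some bands without any FFT bin; zero-padding the FFT is one way to avoid empty bands.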
[0018] Preferably, after calculating the energy value of each subband, a sparsification step is also included: the energy values of the 33 subbands are sorted; only the energy data of the top M subbands with the largest energy values (e.g., Top-5) are retained, and the energy data of the remaining subbands are set to zero or discarded, where M is a positive integer less than 33; the subsequent operation of storing the feature vectors into the cache pool is performed based on the sparsified feature vectors.
[0019] This sparsification process not only further compresses the data, but more importantly, it improves the robustness of the fingerprint. This is because in real-world environments, lower-energy frequency bands are easily affected by background noise, while the highest-energy frequency bands typically represent the main components of the audio (such as the fundamental frequency and harmonics). Preserving these strong features can effectively resist noise interference.
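A minimal sketch of the Top-M sparsification follows. The name `sparsify_top_m` is hypothetical, and the example assumes non-negative linear energy values so that zero is a meaningful "no energy" marker; it implements the set-to-zero option rather than discarding.

```python
from typing import List, Sequence


def sparsify_top_m(energies: Sequence[float], m: int = 5) -> List[float]:
    """Keep the m largest sub-band energies in place and zero the rest."""
    if not 0 < m < len(energies):
        raise ValueError("m must be positive and smaller than the band count")
    # Indices of the m largest values; the stable sort keeps earlier bands on ties.
    keep = set(sorted(range(len(energies)),
                      key=lambda j: energies[j], reverse=True)[:m])
    return [e if j in keep else 0.0 for j, e in enumerate(energies)]
```

Zeroing instead of dropping keeps the vector length fixed, so each dimension still refers to the same frequency band in later comparisons.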
[0020] Preferably, in step S2, the preset window integration conditions include the following logic: calculate the similarity distance between the feature vector $V_{curr}$ of the audio frame to be stored and the first-frame feature vector $V_{head}$ in the feature cache pool; if the similarity distance is greater than the preset difference threshold (indicating that the signal properties have changed), or the accumulated time length in the feature cache pool exceeds the preset maximum window length (to prevent the fingerprint unit from being too long), then it is determined that the window integration condition is met (i.e., the generation of a new fingerprint unit is triggered). If the similarity distance is less than or equal to the preset difference threshold, and the accumulated time length in the feature cache pool does not exceed the preset maximum window length, then it is determined that the window integration condition is not met, and $V_{curr}$ is stored in the feature cache pool.
[0021] In this scheme, the similarity distance is an indicator that measures the degree of difference between vectors. The larger the value, the greater the difference (the lower the similarity).
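The window integration logic can be sketched as a loop over timestamped feature vectors. Several details are assumptions made only for illustration: Euclidean distance is used as the similarity distance, the accumulated time is measured from the start of the first cached frame to the start of the incoming frame, the frame that triggers a flush becomes the head of the next pool, and a partially filled pool is flushed at the end of the input.

```python
import math
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def euclidean(a: Vector, b: Vector) -> float:
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def segment_frames(frames: Sequence[Tuple[float, Vector]],
                   diff_threshold: float,
                   max_window_ms: float = 500.0) -> List[List[Vector]]:
    """Group (start_ms, feature_vector) frames into cache pools.

    Each returned pool corresponds to one future fingerprint unit.
    """
    segments: List[List[Vector]] = []
    pool: List[Vector] = []
    pool_start = 0.0
    for t_ms, vec in frames:
        if pool:
            changed = euclidean(vec, pool[0]) > diff_threshold
            full = (t_ms - pool_start) >= max_window_ms
            if changed or full:
                segments.append(pool)  # window integration condition met
                pool = []
        if not pool:
            pool_start = t_ms          # incoming frame starts a new pool
        pool.append(vec)
    if pool:
        segments.append(pool)          # flush the trailing pool (assumption)
    return segments
```

Comparing against the first frame of the pool rather than the most recent one prevents a slow drift from keeping a single window open indefinitely.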
[0022] Preferably, in step S2, the feature vectors in the feature cache pool are aggregated and calculated to generate a fingerprint unit representing the time period, specifically including: Calculate the average value of all feature vectors in the feature cache pool across all dimensions to obtain the average feature vector; The average feature vector is hash-encoded to generate the fingerprint unit.
[0023] By averaging, short-term fluctuation noise within the window can be smoothed out, and the statistical average characteristics within that time period can be extracted.
[0024] Preferably, the specific steps for hash encoding the average feature vector include: The average feature vector of the fingerprint units generated in the previous time period is used as the reference vector; Each dimension of the average feature vector of the current fingerprint unit is compared with the corresponding dimension of the reference vector one by one; If the value of the current dimension is greater than the value of the corresponding reference dimension, then the binary bit of that dimension is marked as "1", otherwise it is marked as "0", thus generating a binary fingerprint sequence composed of "0" and "1".
[0025] This time-domain differential coding allows the generated fingerprint to focus on the trend of spectral energy change over time, rather than the absolute energy value. Therefore, the fingerprint is naturally invariant to changes in volume, greatly improving the accuracy of matching.
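Aggregation and time-domain differential encoding can be sketched together. The function names are hypothetical, and the handling of the very first window is an assumption: it has no reference vector, so this sketch emits no unit for it and uses it only as the reference for the second window.

```python
from typing import List, Optional, Sequence


def mean_vector(pool: Sequence[Sequence[float]]) -> List[float]:
    """Per-dimension average of all feature vectors in a cache pool."""
    n = len(pool)
    return [sum(column) / n for column in zip(*pool)]


def diff_hash(v_avg: Sequence[float], v_prev: Sequence[float]) -> List[int]:
    """Bit j is 1 when the current average exceeds the reference in dimension j."""
    return [1 if a > p else 0 for a, p in zip(v_avg, v_prev)]


def fingerprint_units(segments: Sequence[Sequence[Sequence[float]]]) -> List[List[int]]:
    """Encode each pool against the averaged vector of the previous pool."""
    units: List[List[int]] = []
    prev: Optional[List[float]] = None
    for pool in segments:
        avg = mean_vector(pool)
        if prev is not None:
            units.append(diff_hash(avg, prev))
        prev = avg
    return units
```

Scaling every feature value by the same positive gain leaves each comparison unchanged, which is the volume invariance described above.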
[0026] Preferably, before step S1, a step of determining the effective audio fingerprint extraction area is included: Calculate the energy envelope of the entire audio signal; Identify continuous regions within the energy envelope that consistently exceed a preset proportion of the average energy, and designate them as the core region; Steps S1 to S3 are performed only on the core area to exclude silence or invalid information at the beginning and end of the audio.
[0027] This step automatically avoids silent or meaningless sections in the intro and outro, allowing subsequent processing to focus on the chorus or climax.
[0028] Preferably, step S3, which calculates the similarity between audio files based on the audio fingerprint, specifically includes: Similarity is measured by calculating the Hamming distance between two audio fingerprint sequences; If the Hamming distance is less than a preset matching threshold, then the two audio signals are determined to belong to the same cluster.
[0029] The audio fingerprint generated by this invention is a binary sequence composed of several fingerprint units. The clustering process in step S3 aims to group different copies (which may contain recording noise, different compression rates, or edited versions) belonging to the same song (or the same type of audio) into one category. Specifically, it includes the following sub-steps: Step S31: Construct a similarity matrix (or adjacency list); for the audio set to be clustered, calculate the normalized Hamming distance between each pair of its fingerprints: $BER = \mathrm{HammingDistance}(FP_A, FP_B) / \mathrm{TotalBits}$; where $FP_A$ and $FP_B$ are the fingerprint sequences of two audio segments.
[0030] Step S32: Preliminary matching determination; Set a matching threshold $D_{th}$ (e.g., 0.3).
[0031] If $BER < D_{th}$, it is determined that audio A and audio B are strongly correlated, and an edge is established between them.
[0032] Because this invention uses an adaptive step size, the fingerprint lengths generated by the two audio segments may not be exactly the same. Before calculating the distance, a sliding window matching or dynamic time warping (DTW) algorithm can be used to align the fingerprint sequences and find the minimum distance at the best matching position as the final distance.
[0033] Step S33: Cluster generation based on connected components; Treat all audio as nodes in the graph, and construct an undirected graph based on the edges established in step S32.
[0034] Traverse the graph to find all connected components, each of which represents a cluster.
[0035] For example, if audio A is similar to B, and B is similar to C, even if the direct similarity between A and C is slightly lower, A, B, and C will be grouped into the same cluster through the bridging of B, thus effectively handling different variations of audio (such as A being the original, B being a 128kbps compressed version, and C being a live recording).
[0036] Step S34: Cluster center extraction (optional); For each generated cluster, the audio fingerprint with the most connections (highest degree) to other nodes is selected as the central fingerprint of the cluster for subsequent fast indexing or representative display.
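Steps S33 and S34 can be sketched with a depth-first traversal. The function names are hypothetical, audio items are assumed to be identified by integer indices 0..n-1, and the edge list is assumed to come from the threshold test in step S32.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set, Tuple


def cluster_by_components(n: int, edges: Iterable[Tuple[int, int]]
                          ) -> Tuple[List[List[int]], Dict[int, Set[int]]]:
    """Return (clusters, adjacency) for an undirected graph on nodes 0..n-1."""
    adj: Dict[int, Set[int]] = defaultdict(set)
    for a, b in edges:
        adj[a].add(b)
        adj[b].add(a)
    seen: Set[int] = set()
    clusters: List[List[int]] = []
    for start in range(n):
        if start in seen:
            continue
        stack, component = [start], []
        seen.add(start)
        while stack:                      # iterative depth-first search
            node = stack.pop()
            component.append(node)
            for neighbour in adj[node]:
                if neighbour not in seen:
                    seen.add(neighbour)
                    stack.append(neighbour)
        clusters.append(sorted(component))
    return clusters, adj


def cluster_center(cluster: List[int], adj: Dict[int, Set[int]]) -> int:
    """Node with the highest degree; the first such node wins on ties."""
    return max(cluster, key=lambda node: len(adj[node]))
```

With edges A-B and B-C only, all three land in one cluster and B is chosen as the center, matching the bridging example above.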
[0037] The substantial effects of this invention are: (1) Significantly reduced storage and computation costs: By adjusting the adaptive step size based on spectral flux, this invention automatically reduces the sampling density in stable audio regions; combined with the adaptive window integration of the feature buffer pool, consecutive similar frames are merged into a single fingerprint unit. Experiments show that, while maintaining the same clustering accuracy, this method can reduce the amount of generated audio fingerprint data by 50%-80%. This directly and significantly reduces the storage space requirements of the fingerprint database and significantly reduces the CPU computation load when calculating the Hamming distance, greatly improving system efficiency.
[0038] (2) Improved clustering accuracy and robustness: This invention automatically densifies sampling in regions of drastic audio change (high-flux regions), ensuring the fine capture of transient features (such as drum beats and rhythms) and avoiding missed sampling caused by fixed large step sizes; at the same time, through Bark scale segmentation, Top-M sparsification and temporal differential hashing, the fingerprint has strong robustness to background noise, compression distortion and volume changes.
[0039] (3) Intelligent removal of invalid information: By extracting the core region based on the energy envelope, the present invention can automatically identify and focus on the most representative segments in the audio (such as the climax of a song), effectively eliminating the interference of silence at the beginning and end and non-critical information, and further improving the accuracy and recall of clustering.
Attached Figure Description
[0040] Figure 1 is a flowchart of an audio clustering method based on adaptive audio fingerprinting according to the present invention.
Detailed Implementation
[0041] The technical solution of the present invention will be further described in detail below through embodiments and in conjunction with the accompanying drawings.
[0042] The method provided by this invention can run on any electronic device with computing capabilities, such as a server, workstation, personal computer, or cloud computing cluster. This device typically includes a processor (CPU / GPU), memory (RAM / ROM), and audio input / output interfaces. For large-scale audio clustering tasks, it is preferably executed in a distributed computing environment, where the memory is used to maintain a feature cache pool and a fingerprint index library, and the processor is used to execute the signal processing logic described below.
[0043] Example: An audio clustering method based on adaptive audio fingerprinting, as shown in Figure 1, mainly includes four stages: preprocessing and segmentation, adaptive framing and feature extraction, fingerprint unit generation, and cluster analysis.
[0044] 1. Preprocessing and effective region extraction (intelligent segment selection) To avoid interference with clustering accuracy from silence, noise, or non-core content (such as the intro of a song or gaps in dialogue) at the beginning and end of the audio, this embodiment first uses the energy envelope segmentation method to determine the effective processing area.
[0045] Envelope calculation: Calculate the short-time energy of the entire audio signal.
[0046] Threshold determination: Calculate the average energy $E_{avg}$ of the entire curve and set the energy threshold $T_{energy} = \lambda \cdot E_{avg}$ (for example, λ=1.2).
[0047] Region identification: Scan the energy envelope and identify continuous regions whose duration exceeds $T_{dur}$ (e.g., 15 seconds) and whose energy remains above $T_{energy}$. Usually, the two longest sections (often corresponding to the chorus or climax) are selected as the core region for subsequent processing. This is more versatile than a fixed selection of 85-100 seconds.
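The three preprocessing steps above can be sketched as follows. The function names are hypothetical, the envelope is assumed to be computed on non-overlapping frames, and regions are returned as half-open index ranges into that envelope.

```python
from typing import List, Sequence, Tuple


def short_time_energy(samples: Sequence[float], frame_len: int) -> List[float]:
    """Mean squared amplitude over consecutive non-overlapping frames."""
    return [sum(s * s for s in samples[i:i + frame_len]) / frame_len
            for i in range(0, len(samples) - frame_len + 1, frame_len)]


def core_regions(energy: Sequence[float], frame_ms: float, lam: float = 1.2,
                 min_dur_ms: float = 15000.0, keep: int = 2
                 ) -> List[Tuple[int, int]]:
    """Up to `keep` longest runs with energy above lam * mean energy.

    Only runs longer than min_dur_ms qualify; each run is (start, end_exclusive).
    """
    if not energy:
        return []
    threshold = lam * sum(energy) / len(energy)
    runs: List[Tuple[int, int]] = []
    start = None
    # A trailing sentinel closes a run that reaches the end of the envelope.
    for i, e in enumerate(list(energy) + [float("-inf")]):
        if e > threshold and start is None:
            start = i
        elif e <= threshold and start is not None:
            if (i - start) * frame_ms > min_dur_ms:
                runs.append((start, i))
            start = None
    runs.sort(key=lambda r: r[1] - r[0], reverse=True)
    return sorted(runs[:keep])
```

Because the threshold is a multiple of the file's own average energy, the same λ applies to quiet and loud recordings alike.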
[0048] 2. Adaptive framing and dynamic step size calculation (corresponding to step S1) Within the core area, this invention abandons the traditional fixed frame length and fixed overlap rate, and instead dynamically adjusts the sampling step size according to the intensity of the signal content.
[0049] S11: Spectral flux calculation: Set the basic analysis window length to $L_{win}$ (e.g., 20ms). For the analysis frame at the current time t, first perform an FFT to obtain the amplitude spectrum $|X_i(k)|$. Calculate the spectral flux SF of the current frame relative to the previous frame, preferably using the L2 norm (squared difference) formula to amplify the difference in abrupt signals: $SF_i = \sum_{k=1}^{N} \left(|X_i(k)| - |X_{i-1}(k)|\right)^2$, where N is the number of FFT points (e.g., 1024 or 2048), and k is the frequency index.
[0050] In other implementations, the L1 norm (absolute difference) or flux calculation methods after half-wave rectification (which only calculates the added energy) can also be used.
[0051] S12: Threshold setting strategy: To accommodate audio of varying volume levels, the high and low thresholds for the spectral flux (SF) should not be set to fixed constants; an adaptive statistical strategy should be employed instead. Calculate the mean spectral flux $\mu_{sf}$ and standard deviation $\sigma_{sf}$ of the first 5 seconds of the current audio segment.
[0052] Set the low flux threshold $T_{low} = \mu_{sf} + 0.5\sigma_{sf}$ (corresponding to a stable background); set the high flux threshold $T_{high} = \mu_{sf} + 2\sigma_{sf}$ (corresponding to transient change).
[0053] S13: Dynamic adjustment of step size: According to $SF_i$, determine the sampling step size Step for the next frame (i.e., the time difference between the start point of the current frame and the start point of the next frame): Drastic change zone ($SF_i > T_{high}$): Set $Step = H_{min}$ (for example, 10ms). At this time, the overlap rate is extremely high (e.g., if the frame length is 20ms and the step size is 10ms, the overlap rate is 50%; if the frame length is 37ms, the overlap rate will be even higher), the purpose of which is to avoid missing any transient details.
[0054] Stable zone ($SF_i < T_{low}$): Set $Step = H_{max}$ (for example, 40ms~60ms). At this time, the overlap rate is extremely low or even negative (frame skipping), the purpose of which is to significantly compress the amount of data in the information redundancy area.
[0055] Transition zone ($T_{low} \le SF_i \le T_{high}$): Use linear interpolation for a smooth transition: $Step = H_{min} + (H_{max} - H_{min})(T_{high} - SF_i) / (T_{high} - T_{low})$.
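The overlap figures quoted above, and the way variable steps turn into frame positions, can be checked with a short sketch. Everything named here is hypothetical: `flux_at` stands in for whatever computes the spectral flux of the frame starting at a given time, and `step_for` stands in for the flux-to-step mapping of S13.

```python
from typing import Callable, List


def overlap_ratio(frame_ms: float, step_ms: float) -> float:
    """Share of a frame covered again by the next frame.

    A negative value means there is a gap between frames (frame skipping).
    """
    return 1.0 - step_ms / frame_ms


def adaptive_frame_starts(duration_ms: float, frame_ms: float,
                          flux_at: Callable[[float], float],
                          step_for: Callable[[float], float],
                          start_ms: float = 0.0) -> List[float]:
    """Start times (ms) of analysis frames under a content-dependent step."""
    starts: List[float] = []
    t = start_ms
    while t + frame_ms <= duration_ms:
        starts.append(t)
        step = step_for(flux_at(t))
        if step <= 0:
            raise ValueError("step must be positive")
        t += step
    return starts
```

For a 20ms frame, a 10ms step gives an overlap of 0.5 and a 40ms step gives -1.0, meaning a full frame length is skipped between consecutive frames.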
[0056] 3. Feature extraction and sparsification (preprocessing for step S2) After extracting audio frames at positions determined by the aforementioned dynamic step size, frequency domain feature extraction is performed: Bark banding: The frequency band from 300Hz to 2000Hz, which is sensitive to human hearing, is divided into 33 critical bands according to the Bark scale. The logarithmic energy of each sub-band is calculated to form a 33-dimensional feature vector.
[0057] Top-M sparsification: To improve noise immunity, the energy values of these 33 subbands are sorted, and only the top M with the highest energy (e.g., M=5 or 8) are retained, while the values of the remaining 33−M subbands are set to 0.
[0058] Technical effect: This step simulates the masking effect of the human ear, where loud sounds mask soft sounds. Preserving strong features effectively combats background white noise and high-frequency distortion introduced by MP3 compression.
[0059] 4. Fingerprint generation based on cache pool (corresponding to step S2) Traditional algorithms output fingerprints for each frame, resulting in a massive amount of data. This embodiment introduces a feature caching pool for temporal integration.
[0060] S21: Caching logic: The system maintains a cache pool. For a newly arrived sparsified feature vector $V_{curr}$, calculate the similarity distance between it and the first-frame vector $V_{head}$ in the cache pool.
[0061] The preferred similarity distance here is the reciprocal of the Euclidean distance or the cosine distance.
[0062] Continue caching: If $\mathrm{Distance}(V_{curr}, V_{head}) \le D_{th}$ (difference threshold) and the cache duration $< T_{max}$ (e.g., 500ms), the current frame still belongs to the same acoustic event (e.g., the same long piano note), and $V_{curr}$ is added to the cache pool.
[0063] Trigger generation: If $\mathrm{Distance}(V_{curr}, V_{head}) > D_{th}$ or the cache duration $\ge T_{max}$, the acoustic event has changed or the window is full. At this point, an aggregation operation is performed.
[0064] S22: Aggregation and Encoding: Aggregation: Calculate the mean of all vectors in the cache pool to obtain $V_{avg}$.
[0065] Differential hashing: Compare the current $V_{avg}$ with the $V_{prev}$ generated in the previous time period, dimension by dimension.
[0066] For the j-th frequency band (j=1...33): If $V_{avg}[j] > V_{prev}[j]$, the j-th fingerprint bit is set to 1; otherwise, the j-th fingerprint bit is set to 0.
[0067] This generates a 33-bit binary fingerprint unit. This differential encoding utilizes the relative variation trend of energy, making the fingerprint insensitive to the absolute magnitude of the volume and exhibiting volume invariance.
[0068] 5. Audio clustering (corresponding to step S3) The audio files are clustered based on the generated fingerprint sequences. Due to the use of an adaptive step size, the fingerprint sequence lengths may differ for different audio files. The clustering process is as follows: S31: Distance Calculation: The difference between two fingerprint units is calculated using Hamming distance. For the overall fingerprint sequence, since the lengths vary, it is preferable to use a sliding window or dynamic time warping (DTW) algorithm to find the best matching path and calculate the average bit error rate on the path.
[0069] S32: Clustering Construction: Set a matching threshold (e.g., BER < 0.3). Construct an undirected graph, treating each audio segment as a node. If the BER of two audio segments is below the threshold, establish a connection edge.
[0070] Traverse the graph structure and extract all connected components. Each connected component represents a cluster, and the audio within the cluster is considered to be different versions of the same song (such as the original, cover, live version, or versions with different compression rates).
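The distance calculation in S31 can be sketched for the sliding-window option only; DTW is not shown. The names are hypothetical, each fingerprint unit is assumed to be packed into a Python integer, and only offsets where the shorter sequence lies fully inside the longer one are compared.

```python
from typing import Sequence


def hamming(a: int, b: int) -> int:
    """Number of differing bits between two packed fingerprint units."""
    return bin(a ^ b).count("1")


def min_ber(fp_a: Sequence[int], fp_b: Sequence[int],
            bits_per_unit: int = 33) -> float:
    """Lowest bit error rate over all alignments of the shorter sequence."""
    short, long_ = sorted((fp_a, fp_b), key=len)
    if not short:
        raise ValueError("fingerprints must not be empty")
    total_bits = len(short) * bits_per_unit
    best = 1.0
    for offset in range(len(long_) - len(short) + 1):
        errors = sum(hamming(unit, long_[offset + i])
                     for i, unit in enumerate(short))
        best = min(best, errors / total_bits)
    return best
```

Two recordings would then be connected by an edge when `min_ber` falls below the matching threshold, for example 0.3. A sliding window tolerates a trimmed start or end, whereas DTW would additionally tolerate unit sequences of different density.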
[0071] This embodiment corresponds to the above method and provides an apparatus that includes the following logic modules (these modules can be implemented by software code running on a processor, or by FPGA / ASIC hardware): Adaptive framing module: Used to calculate spectral flux in real time and control the sampling rate of the audio stream according to the "flux-step negative correlation model". This module has an embedded threshold adaptive calculation unit.
[0072] The fingerprint generation module includes: Feature extraction unit: performs FFT, Bark banding, and Top-M sparsification.
[0073] Cache control unit: Maintains the feature cache pool, performs similarity comparison, and determines whether the window integration conditions are met.
[0074] Hash encoding unit: Performs time-domain difference comparison and outputs a binary stream.
[0075] Clustering module: Used to store a massive fingerprint database and perform clustering analysis based on Hamming distance and graph theory algorithms.
[0076] Experiments were conducted on a test set containing 10,000 songs to compare the traditional fixed-step algorithm (10ms step size) with the adaptive algorithm of this invention: Data volume comparison: The total size of the fingerprint generated by this invention is reduced by approximately 65%. This is because the step size of the stable region is increased to 40-60ms, and the cache pool further merges redundant frames.
[0077] Clustering time: Due to the reduction in data volume, the time spent on full-database clustering comparison was reduced by approximately 60%.
[0078] Accuracy (F1-Score): Clustering accuracy improved from 92.5% to 94.8%. The improvement is mainly attributed to Top-M sparsification, which removes noise interference, and the adaptive step size, which preserves richer details in the transient region (high-flux region) than traditional algorithms.
[0079] The specific embodiments described herein are merely illustrative of the spirit of the invention. Those skilled in the art to which this invention pertains may make various modifications or additions to the described specific embodiments or use similar methods to substitute them, without departing from the spirit of the invention or exceeding the scope defined by the appended claims.
[0080] Although this document uses various terms extensively, the possibility of using other terms is not excluded. These terms are used merely for the convenience of describing and explaining the essence of the invention; interpreting them as any additional limitation would contradict the spirit of the invention.
Claims
1. A method of audio clustering based on adaptive audio fingerprints, characterized in that, The method comprises the following steps: S1: performing adaptive frame processing on an audio signal to obtain an audio frame sequence; The adaptive frame processing comprises: taking a preset initial time point as a starting point, calculating a spectral flux of a current analysis frame, and determining a sampling step of a next frame based on the spectral flux of the current analysis frame; wherein the spectral flux of the current analysis frame and the sampling step of the next frame are in a negative correlation relationship; S2: generating a fingerprint unit based on the audio frame sequence, specifically: establishing a feature cache pool, and sequentially storing feature vectors of consecutive audio frames in the audio frame sequence into the feature cache pool; when the feature cache pool meets a preset window integration condition, performing aggregation calculation on all feature vectors in the feature cache pool to generate a fingerprint unit representing a time period, and emptying the feature cache pool; S3: constructing an audio fingerprint based on the generated fingerprint units, and calculating the similarity between audios based on the audio fingerprint to cluster the audio signals.
2. The audio clustering method based on adaptive audio fingerprints according to claim 1, characterized in that the specific formula for calculating the spectral flux of the current analysis frame in step S1 is: [formula not reproduced in the source text]; where SF_i represents the spectral flux of the i-th analysis frame, N represents the total number of frequency points, |X_i(k)| represents the amplitude spectrum of the i-th frame at the k-th frequency point, and |X_{i-1}(k)| represents the amplitude spectrum of the frame preceding the i-th frame at the k-th frequency point.
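As an illustrative sketch only (not part of the claim): the formula of claim 2 does not survive in this text, so the code below assumes one common definition of spectral flux, the sum over the N frequency points of the squared difference between |X_i(k)| and |X_{i-1}(k)|. The function name `spectral_flux` and the sample spectra are hypothetical.

```python
def spectral_flux(prev_mag, curr_mag):
    """Spectral flux between two magnitude spectra of equal length N.

    ASSUMPTION: squared-difference form, sum_k (|X_i(k)| - |X_{i-1}(k)|)^2.
    The exact formula of claim 2 is not available in the text.
    """
    if len(prev_mag) != len(curr_mag):
        raise ValueError("magnitude spectra must have the same length")
    return sum((c - p) ** 2 for p, c in zip(prev_mag, curr_mag))


# A stationary spectrum yields zero flux; a changing one yields a positive value.
steady = spectral_flux([1.0, 2.0, 3.0], [1.0, 2.0, 3.0])
onset = spectral_flux([1.0, 2.0, 3.0], [4.0, 2.0, 1.0])
```

Under this assumed definition, steady passages give values near zero and abrupt spectral changes give large values, which is the behaviour the step-size rule of claim 3 relies on.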
3. The audio clustering method based on adaptive audio fingerprints according to claim 1 or 2, characterized in that the specific process of dynamically determining the sampling step of the next frame based on the spectral flux in step S1 comprises: presetting a minimum step size H_min, a maximum step size H_max, a high flux threshold T_high, and a low flux threshold T_low; acquiring the spectral flux SF_i of the current analysis frame; if SF_i > T_high, setting the sampling step Step of the next frame to H_min; if SF_i < T_low, setting the sampling step Step of the next frame to H_max; if T_low ≤ SF_i ≤ T_high, calculating the sampling step Step of the next frame according to a linear interpolation strategy, with the calculation formula: Step = H_min + (H_max - H_min)(T_high - SF_i) / (T_high - T_low).
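As an illustrative sketch only (not part of the claim), the piecewise rule of claim 3 can be written as a small function. The function name `next_step` and the numeric values (steps in milliseconds, thresholds in arbitrary flux units) are hypothetical and chosen only to exercise each branch.

```python
def next_step(sf, h_min, h_max, t_low, t_high):
    """Sampling step for the next frame given the current spectral flux sf."""
    if t_high <= t_low:
        raise ValueError("t_high must be greater than t_low")
    if sf > t_high:
        return h_min  # rapidly changing audio: sample densely
    if sf < t_low:
        return h_max  # nearly stationary audio: sample sparsely
    # Linear interpolation between the two extremes, as in the claim's formula.
    return h_min + (h_max - h_min) * (t_high - sf) / (t_high - t_low)


# Hypothetical parameters: steps of 5 ms to 45 ms, thresholds 1.0 and 5.0.
dense = next_step(9.0, 5.0, 45.0, 1.0, 5.0)
sparse = next_step(0.2, 5.0, 45.0, 1.0, 5.0)
middle = next_step(3.0, 5.0, 45.0, 1.0, 5.0)
```

At SF_i = T_high the interpolated value equals H_min and at SF_i = T_low it equals H_max, so the three branches join continuously.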
4. The audio clustering method based on adaptive audio fingerprints according to claim 1, characterized in that, before the feature vectors of consecutive audio frames in the audio frame sequence are sequentially stored into the feature cache pool in step S2, the method further comprises a feature extraction step: A1: performing windowing and a fast Fourier transform (FFT) on each frame in the audio frame sequence; A2: dividing the frequency range of 300 Hz to 2000 Hz into 33 sub-bands according to the Bark scale; A3: calculating the energy value of each sub-band to construct a 33-dimensional feature vector for the frame.
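As an illustrative sketch only (not part of the claim), steps A1 to A3 might look as follows. Several choices here are assumptions not stated in the claim: a Hann window, sub-bands of equal width on the Bark axis, the Traunmüller approximation for the Hz/Bark conversion, and a naive DFT standing in for the FFT to keep the example dependency-free. All function names are hypothetical.

```python
import cmath
import math


def hz_to_bark(f):
    # ASSUMPTION: Traunmüller approximation of the Bark scale.
    return 26.81 * f / (1960.0 + f) - 0.53


def bark_to_hz(z):
    # Algebraic inverse of hz_to_bark.
    return 1960.0 * (z + 0.53) / (26.28 - z)


def band_edges(f_lo=300.0, f_hi=2000.0, n_bands=33):
    """Edges (Hz) of n_bands sub-bands equally spaced on the Bark axis."""
    z_lo, z_hi = hz_to_bark(f_lo), hz_to_bark(f_hi)
    return [bark_to_hz(z_lo + (z_hi - z_lo) * i / n_bands)
            for i in range(n_bands + 1)]


def frame_features(frame, sample_rate, n_bands=33):
    """A1-A3: window the frame, transform it, and sum energy per sub-band."""
    n = len(frame)
    windowed = [x * (0.5 - 0.5 * math.cos(2.0 * math.pi * t / (n - 1)))
                for t, x in enumerate(frame)]  # Hann window (assumed)
    edges = band_edges(n_bands=n_bands)
    energies = [0.0] * n_bands
    for k in range(n // 2 + 1):
        freq = k * sample_rate / n
        if not edges[0] <= freq < edges[-1]:
            continue  # bin lies outside the 300 Hz to 2000 Hz range
        # Naive DFT of bin k; a real implementation would use an FFT.
        x_k = sum(w * cmath.exp(-2j * math.pi * k * t / n)
                  for t, w in enumerate(windowed))
        band = max(b for b in range(n_bands) if edges[b] <= freq)
        energies[band] += abs(x_k) ** 2
    return energies


# Hypothetical input: a 1000 Hz tone, 512 samples at 8000 Hz.
sample_rate, n = 8000, 512
tone = [math.sin(2.0 * math.pi * 1000.0 * t / sample_rate) for t in range(n)]
features = frame_features(tone, sample_rate)
strongest = features.index(max(features))
```

With short frames the lowest sub-bands are narrower than the spacing between frequency bins, so some of them may receive no bins at all. A practical implementation would choose the frame length and sample rate with that in mind.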
5. The audio clustering method based on adaptive audio fingerprints according to claim 4, characterized in that, after the energy value of each sub-band is calculated, the method further comprises a sparsification step: sorting the energy values of the 33 sub-bands; retaining only the energy data of the M sub-bands with the largest energy values, and setting the energy data of the remaining sub-bands to zero or discarding them, wherein M is a positive integer less than 33; and performing the subsequent storage into the feature cache pool using the sparsified feature vector.
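As an illustrative sketch only (not part of the claim), the top-M sparsification can be expressed as below. The sketch takes the "set to zero" option so that the vector keeps its dimensionality. The function name `sparsify_top_m` and the five-element example vector are hypothetical (the claim uses 33 sub-bands).

```python
def sparsify_top_m(energies, m):
    """Keep the m largest sub-band energies and zero out the rest."""
    if not 0 < m < len(energies):
        raise ValueError("m must be a positive integer below the band count")
    order = sorted(range(len(energies)), key=lambda i: energies[i],
                   reverse=True)
    keep = set(order[:m])
    return [e if i in keep else 0.0 for i, e in enumerate(energies)]


sparse_vec = sparsify_top_m([0.2, 3.0, 1.5, 0.1, 2.2], 2)
```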
6. The audio clustering method based on adaptive audio fingerprints according to claim 4 or 5, characterized in that, in step S2, the preset window integration condition comprises the following logic: calculating a similarity distance between the feature vector V_curr of the current audio frame and the feature vector V_head of the first frame in the feature cache pool; if the similarity distance is greater than a preset difference threshold, or the accumulated time length in the feature cache pool exceeds a preset maximum window length, determining that the window integration condition is met; if the similarity distance is less than or equal to the preset difference threshold and the accumulated time length in the feature cache pool does not exceed the preset maximum window length, determining that the window integration condition is not met and storing V_curr in the feature cache pool.
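As an illustrative sketch only (not part of the claim), the window integration logic can be simulated over a list of frames. Three points are assumptions, since the claim leaves them open: the similarity distance is taken to be Euclidean, the frame that triggers the condition seeds the next pool, and any frames left in the pool at the end are flushed as a final window. The names `segment_frames`, `diff_threshold`, and `max_window` are hypothetical.

```python
import math


def euclidean(a, b):
    # ASSUMPTION: the claim does not name the distance metric.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def segment_frames(frames, diff_threshold, max_window):
    """Group (duration, vector) frames into windows of pooled vectors."""
    windows, pool, pool_time = [], [], 0.0
    for duration, v_curr in frames:
        if pool and (euclidean(v_curr, pool[0]) > diff_threshold
                     or pool_time > max_window):
            windows.append(pool)        # condition met: hand the pool over
            pool, pool_time = [], 0.0   # and empty it
        pool.append(v_curr)
        pool_time += duration
    if pool:
        windows.append(pool)            # assumed: flush the trailing frames
    return windows


# Hypothetical frames of 10 ms each, in a 2-dimensional feature space.
frames = [(10.0, [1.0, 1.0]), (10.0, [1.1, 1.0]), (10.0, [5.0, 5.0]),
          (10.0, [5.0, 5.1]), (10.0, [5.0, 5.0]), (10.0, [5.0, 5.0])]
windows = segment_frames(frames, diff_threshold=1.0, max_window=25.0)
window_sizes = [len(w) for w in windows]
```

In this example the first window closes because the third frame differs from V_head by more than the threshold, and the second closes because the accumulated 30 ms exceeds the 25 ms maximum window length.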
7. The audio clustering method based on adaptive audio fingerprints according to claim 6, characterized in that, in step S2, performing the aggregation calculation on the feature vectors in the feature cache pool to generate a fingerprint unit representing the time period comprises: calculating the average value of all feature vectors in the feature cache pool in each dimension to obtain an average feature vector; and hash encoding the average feature vector to generate the fingerprint unit.
8. The audio clustering method based on adaptive audio fingerprints according to claim 7, characterized in that the specific steps of hash encoding the average feature vector comprise: obtaining the average feature vector of the fingerprint unit generated in the previous time period as a reference vector; comparing each dimension of the average feature vector of the current fingerprint unit with the corresponding dimension of the reference vector one by one; if the value of the current dimension is greater than the value of the corresponding reference dimension, marking the binary bit of that dimension as "1", and otherwise marking it as "0", thereby generating a binary fingerprint sequence composed of "0" and "1".
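As an illustrative sketch only (not part of the claims), the averaging of claim 7 and the differential hash of claim 8 can be combined as follows. The claims do not say what reference the first window uses, so this sketch assumes the first window only provides a reference and emits no fingerprint unit. All function names and sample values are hypothetical.

```python
def mean_vector(pool):
    """Per-dimension average of the feature vectors in one pool."""
    n = len(pool)
    return [sum(column) / n for column in zip(*pool)]


def hash_against_reference(current_mean, reference_mean):
    """Bit is 1 where the current mean exceeds the reference, else 0."""
    return [1 if c > r else 0 for c, r in zip(current_mean, reference_mean)]


def fingerprint_units(windows):
    units, reference = [], None
    for pool in windows:
        mean = mean_vector(pool)
        if reference is not None:  # ASSUMPTION: first window only seeds
            units.append(hash_against_reference(mean, reference))
        reference = mean
    return units


# Hypothetical pools; their means are [2, 3], [5, 1] and [5, 2].
pools = [[[1.0, 4.0], [3.0, 2.0]], [[5.0, 1.0]], [[4.0, 1.0], [6.0, 3.0]]]
units = fingerprint_units(pools)
```

Because the comparison is strictly "greater than", a dimension whose mean is unchanged from the previous window encodes as "0", as in the first bit of the second unit above.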
9. The audio clustering method based on adaptive audio fingerprints according to claim 1, characterized in that, before step S1, the method further comprises a step of determining the effective fingerprint extraction region of the audio: calculating the energy envelope over the full length of the audio signal; identifying, as the core region, the continuous region of the energy envelope that stays above the average energy by a preset proportion; and performing steps S1 to S3 only on the core region, so as to exclude silence or invalid information at the beginning and end of the audio.
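As an illustrative sketch only (not part of the claim), one reading of the core-region step is shown below. The interpretation is an assumption: the envelope is computed as the energy of non-overlapping frames, a frame counts as "high" when its energy exceeds `ratio` times the mean frame energy, and the longest run of such frames is taken as the core region. The names `core_region`, `frame_len`, and `ratio` are hypothetical.

```python
def core_region(samples, frame_len, ratio):
    """Return (start, end) sample indices of the assumed core region,
    or None when no frame exceeds the threshold."""
    energies = [sum(x * x for x in samples[i:i + frame_len])
                for i in range(0, len(samples) - frame_len + 1, frame_len)]
    if not energies:
        return None
    threshold = ratio * sum(energies) / len(energies)
    best, run_start = None, None
    # The trailing sentinel closes a run that reaches the last frame.
    for idx, energy in enumerate(energies + [float("-inf")]):
        if energy > threshold:
            if run_start is None:
                run_start = idx
        elif run_start is not None:
            if best is None or idx - run_start > best[1] - best[0]:
                best = (run_start, idx)
            run_start = None
    if best is None:
        return None
    return best[0] * frame_len, best[1] * frame_len


# Hypothetical signal: silence, eight loud samples, silence.
signal = [0.0] * 4 + [1.0] * 8 + [0.0] * 4
region = core_region(signal, frame_len=4, ratio=0.5)
```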
10. The audio clustering method based on adaptive audio fingerprints according to claim 1, characterized in that, in step S3, calculating the similarity between audio signals based on the audio fingerprints specifically comprises: measuring the similarity by calculating the Hamming distance between the two audio fingerprint sequences; and if the Hamming distance is less than a preset matching threshold, determining that the two audio signals belong to the same cluster.
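As an illustrative sketch only (not part of the claim), the pairwise rule of claim 10 can be turned into cluster labels with a union-find structure. Two points are assumptions: fingerprints are compared only when they have equal length (the claim does not say how sequences of different length are aligned), and matches are merged transitively. The names `hamming` and `cluster` and the 8-bit example fingerprints are hypothetical.

```python
def hamming(a, b):
    """Number of positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("fingerprints must have the same length")
    return sum(x != y for x, y in zip(a, b))


def cluster(fingerprints, match_threshold):
    """Label each fingerprint with the index of its cluster representative."""
    parent = list(range(len(fingerprints)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i in range(len(fingerprints)):
        for j in range(i + 1, len(fingerprints)):
            if hamming(fingerprints[i], fingerprints[j]) < match_threshold:
                parent[find(j)] = find(i)  # ASSUMPTION: transitive merge
    return [find(i) for i in range(len(fingerprints))]


fingerprints = ["10110010", "10110011", "01001101", "01001100"]
labels = cluster(fingerprints, match_threshold=3)
```

The all-pairs loop is quadratic in the number of audio items. For a large corpus some form of indexing or candidate pre-selection would be needed, which is outside the scope of this sketch.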