Method, device and equipment for time alignment of multi-view heterogeneous sensor data and medium
By extracting an audio segment from multi-view heterogeneous sensor data as a feature template and applying cross-correlation and time clipping, the method addresses the lack of a lightweight, automated approach to time alignment of multi-view heterogeneous sensor data in existing technologies. It achieves high-precision time alignment and is suitable for unified processing of multiple devices and multiple types of sensors.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Applications(China)
- Current Assignee / Owner
- SHENZHEN WUJIE ZHIHANG TECHNOLOGY CO LTD
- Filing Date
- 2026-04-03
- Publication Date
- 2026-06-19
AI Technical Summary
Existing technologies cannot achieve lightweight, automated, high-precision, multi-view heterogeneous sensor data time-series alignment. Especially in field or lightweight scenarios, hardware synchronization methods increase system complexity, while manual alignment methods require a lot of human intervention and are difficult to scale to massive amounts of data.
By extracting continuous audio segments from the reference audio signal as feature templates, cross-correlation operations are used to determine the time delay of the non-reference audio signal, and time clipping is performed on the local valid data intervals of each acquisition device based on the time delay and the common valid time window, thus achieving time alignment of multi-view heterogeneous sensor data.
It achieves high-precision alignment of multiple devices and multiple types of sensors under a unified time reference, adapts to scenarios where the effective range of data from different acquisition devices is inconsistent, improves the time alignment accuracy, and reduces system deployment costs and operational complexity.
Figure CN122247546A_ABST
Description
Technical Field
[0001] This application relates to the field of data processing technology, and in particular to a method, apparatus, device and medium for time-series alignment of multi-view heterogeneous sensor data.
Background Technology
[0002] With the rapid development of wearable computing, intelligent health monitoring, human-computer interaction, and immersive media, multi-device collaborative sensing systems are becoming increasingly common. In typical application scenarios, users usually wear multiple wearable sensors to simultaneously collect heterogeneous sensor data from different perspectives, including video, audio, motion, and physiological data. However, in actual data acquisition, because each wearable device typically uses an independent local clock and lacks a hardware-level synchronization mechanism, the collected data stream generally suffers from unknown time offsets and sampling frequency drift. This not only causes misalignment of data across different modalities on the timeline but also makes it difficult to determine the effective data intersection between multiple devices, severely impacting subsequent multimodal fusion, event labeling, and model training.
[0003] Currently, there are two main types of commonly used timing alignment methods. One is a hardware synchronization method that relies on an external timecode generator or synchronization cable. While this method offers high accuracy, it increases system complexity and the burden on the wearer, making it unsuitable for outdoor or lightweight scenarios. The other is a manual alignment method that generates synchronization events by clapping or striking a clapperboard. This method requires significant manual intervention and is difficult to scale to massive amounts of data. Therefore, neither of these timing alignment methods can achieve lightweight, automated, and high-precision timing alignment.
Summary of the Invention
[0004] In view of this, this application provides a method, apparatus, device and medium for timing alignment of multi-view heterogeneous sensor data, the main purpose of which is to solve the problem that the timing alignment methods of the prior art cannot achieve lightweight, automated and high-precision timing alignment.
[0005] According to a first aspect of this application, a method for temporal alignment of multi-view heterogeneous sensing data is provided, the method comprising: A continuous audio segment is extracted from a reference audio signal as a feature template. The reference audio signal is one of multiple audio signals, which are acquired by a distributed acquisition device. For each non-reference audio signal, the time delay of the non-reference audio signal relative to the reference audio signal is determined by cross-correlation operation within a preset search window; Based on the time delay, the local valid data intervals corresponding to each acquisition device are converted to a unified time coordinate system, and the temporal intersection of the converted valid data intervals is calculated to obtain a common valid time window; Based on the time delay and the common effective time window, the time of the multi-view heterogeneous sensing data corresponding to each acquisition device is clipped to obtain time-aligned multi-view heterogeneous sensing data.
[0006] Furthermore, the step of extracting continuous audio segments from the reference audio signal as feature templates specifically includes: Acquire multi-view heterogeneous sensor data concurrently collected by distributed acquisition devices. The multi-view heterogeneous sensor data includes multiple video files, each of which contains an audio track. Extract the corresponding audio tracks from each video file to obtain multiple audio signals; One of the multiple audio signals is used as a reference audio signal, and continuous audio segments are extracted from the reference audio signal to obtain a feature template. The step of using one audio signal from the multiple audio signals as a reference audio signal, and extracting continuous audio segments from the reference audio signal to obtain a feature template, specifically includes: Based on at least one of signal energy and effective speech segment length, one audio signal is selected from the multiple audio signals as a reference audio signal; A statistically significant continuous audio segment is extracted from the reference audio signal as a feature template, the statistical significance being determined by the difference in energy or spectrum between it and its noise baseline.
[0007] Furthermore, after extracting the corresponding audio tracks from each video file to obtain multiple audio signals, the method further includes: The sampling frequencies of the multiple audio signals are unified and their amplitudes are normalized to obtain standardized multiple audio signals. Accordingly, one audio signal from the standardized multiple audio signals is used as a reference audio signal, and continuous audio segments are extracted from the reference audio signal to obtain a feature template.
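The standardization step can be illustrated with a minimal sketch. It assumes linear interpolation for the sampling-frequency conversion and peak normalization for the amplitude; the application does not specify either technique, and the function names `resample_linear` and `normalize_peak` are hypothetical. A production pipeline would normally use a resampler with anti-aliasing filtering.

```python
def resample_linear(signal, src_rate, dst_rate):
    """Resample by linear interpolation (illustrative; no anti-aliasing filter)."""
    if src_rate == dst_rate or len(signal) < 2:
        return list(signal)
    # Number of output samples that fit inside the original time span.
    n_out = int((len(signal) - 1) * dst_rate / src_rate) + 1
    out = []
    for k in range(n_out):
        pos = k * src_rate / dst_rate          # position in source samples
        i = min(int(pos), len(signal) - 2)     # left neighbour index
        frac = pos - i
        out.append(signal[i] * (1.0 - frac) + signal[i + 1] * frac)
    return out


def normalize_peak(signal):
    """Scale the signal so that its largest absolute amplitude is 1."""
    peak = max((abs(x) for x in signal), default=0.0)
    return [x / peak for x in signal] if peak > 0 else list(signal)


# Toy usage: bring a 2 Hz track to 4 Hz, then normalize its amplitude.
upsampled = resample_linear([0.0, 2.0, 4.0, 6.0], 2, 4)
standardized = normalize_peak(upsampled)
```

Unifying the rate first means that a delay measured in samples converts to seconds in the same way for every device.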
[0008] Furthermore, after performing time clipping on the multi-view heterogeneous sensing data corresponding to each acquisition device based on the time delay and the common effective time window to obtain time-aligned multi-view heterogeneous sensing data, the method further includes: analyzing the time synchronization deviation based on the time-aligned multi-view heterogeneous sensing data, and feeding the time synchronization deviation back to adjust the exposure trigger timing of the distributed acquisition devices, thereby achieving synchronous exposure control of the distributed acquisition devices.
[0009] Furthermore, the step of determining the time delay of each non-reference audio signal relative to the reference audio signal through cross-correlation calculation within a preset search window specifically includes: For each non-reference audio signal, calculate the cross-correlation function between the non-reference audio signal and the feature template within a preset search window; Locate the peak position in the cross-correlation function; The time delay of the non-reference audio signal relative to the reference audio signal is determined based on the offset between the peak position and the starting position of the feature template in the reference audio signal.
[0010] Furthermore, the step of converting the local valid data intervals corresponding to each acquisition device to a unified time coordinate system based on the time delay, and calculating the temporal intersection of the converted valid data intervals to obtain a common valid time window, specifically includes: Obtain the start and end times of valid data from each acquisition device under the local timestamp to form a local valid data range; Based on the time delay, the local valid data interval is mapped to a unified time coordinate system to obtain the global valid data interval corresponding to each acquisition device. The unified time coordinate system is based on the time axis of the reference audio signal. Calculate the temporal intersection of all globally valid data intervals to obtain the common valid time window.
[0011] Furthermore, the step of performing time clipping on the multi-view heterogeneous sensing data corresponding to each acquisition device based on the time delay and the common effective time window to obtain time-aligned multi-view heterogeneous sensing data specifically includes: For each acquisition device, based on the time delay, the common effective time window is converted into a clipping time interval of the acquisition device in the local time coordinate system, with the local time coordinate system based on the local time axis of each acquisition device; Based on the aforementioned clipping time interval, multi-view heterogeneous sensor data corresponding to each acquisition device are synchronously captured. The multi-view heterogeneous sensor data captured by all acquisition devices are combined to form time-aligned multi-view heterogeneous sensor data.
[0012] According to a second aspect of this application, a time-series alignment device for multi-view heterogeneous sensing data is provided, the device comprising: The interception unit is used to extract continuous audio segments from a reference audio signal as feature templates. The reference audio signal is one audio signal among multiple audio signals, which are acquired by a distributed acquisition device. The determining unit is used to determine the time delay of each non-reference audio signal relative to the reference audio signal by performing cross-correlation calculation within a preset search window for each non-reference audio signal. The calculation unit is used to convert the local valid data intervals corresponding to each acquisition device to a unified time coordinate system based on the time delay, calculate the temporal intersection of the converted valid data intervals, and obtain a common valid time window. The clipping unit is used to clip the time of the multi-view heterogeneous sensor data corresponding to each acquisition device based on the time delay and the common effective time window, so as to obtain time-aligned multi-view heterogeneous sensor data.
[0013] Furthermore, the interception unit includes: The acquisition module is used to acquire multi-view heterogeneous sensor data concurrently acquired by distributed acquisition devices. The multi-view heterogeneous sensor data includes multiple video files, and each video file contains an audio track. The extraction module is used to extract the corresponding audio tracks from each video file to obtain multiple audio signals. The interception module is used to take one of the multiple audio signals as a reference audio signal, and intercept continuous audio segments from the reference audio signal to obtain a feature template. The interception module is specifically used for: Based on at least one of signal energy and effective speech segment length, one audio signal is selected from the multiple audio signals as a reference audio signal; A statistically significant continuous audio segment is extracted from the reference audio signal as a feature template, the statistical significance being determined by the difference in energy or spectrum between it and its noise baseline.
[0014] Furthermore, the interception unit also includes: The processing module is used to, after the corresponding audio tracks are extracted from each video file to obtain multiple audio signals, unify the sampling frequency and normalize the amplitude of the multiple audio signals to obtain standardized multiple audio signals. Correspondingly, the interception module is also used to take one audio signal from the standardized multiple audio signals as a reference audio signal, and intercept continuous audio segments from the reference audio signal to obtain a feature template.
[0015] Furthermore, the device also includes a control unit. After time clipping has been performed on the multi-view heterogeneous sensor data corresponding to each acquisition device based on the time delay and the common effective time window to obtain time-aligned multi-view heterogeneous sensor data, the control unit analyzes the time synchronization deviation based on the time-aligned multi-view heterogeneous sensor data and feeds the time synchronization deviation back to adjust the exposure trigger timing of the distributed acquisition devices, thereby achieving synchronous exposure control of the distributed acquisition devices.
[0016] Furthermore, the determining unit is specifically used for: For each non-reference audio signal, calculate the cross-correlation function between the non-reference audio signal and the feature template within a preset search window; Locate the peak position in the cross-correlation function; The time delay of the non-reference audio signal relative to the reference audio signal is determined based on the offset between the peak position and the starting position of the feature template in the reference audio signal.
[0017] Furthermore, the calculation unit is specifically used for: Obtain the start and end times of valid data from each acquisition device under the local timestamp to form a local valid data interval; Based on the time delay, the local valid data interval is mapped to a unified time coordinate system to obtain the global valid data interval corresponding to each acquisition device. The unified time coordinate system is based on the time axis of the reference audio signal. Calculate the temporal intersection of all globally valid data intervals to obtain the common valid time window.
[0018] Furthermore, the clipping unit is specifically used for: For each acquisition device, based on the time delay, the common effective time window is converted into a clipping time interval of the acquisition device in the local time coordinate system, with the local time coordinate system based on the local time axis of each acquisition device; Based on the aforementioned clipping time interval, multi-view heterogeneous sensor data corresponding to each acquisition device are synchronously captured. The multi-view heterogeneous sensor data captured by all acquisition devices are combined to form time-aligned multi-view heterogeneous sensor data.
[0019] According to a third aspect of this application, a timing alignment device for multi-view heterogeneous sensor data is provided, including a storage medium, a processor, and a computer program stored on the storage medium and executable on the processor. When the processor executes the program, it implements the aforementioned timing alignment method for multi-view heterogeneous sensor data.
[0020] According to a fourth aspect of this application, a storage medium is provided having a computer program stored thereon, which, when executed by a processor, implements the above-described method for timing alignment of multi-view heterogeneous sensing data.
[0021] By employing the above technical solution, this application provides a method, apparatus, device, and medium for time alignment of multi-view heterogeneous sensor data. Compared with current methods that use hardware synchronization or manual alignment to achieve time alignment of multi-view heterogeneous sensor data, this application extracts continuous audio segments from a reference audio signal as feature templates. The reference audio signal is one of multiple audio signals, which are acquired by distributed acquisition devices. For each non-reference audio signal, cross-correlation is performed within a preset search window to determine the time delay of the non-reference audio signal relative to the reference audio signal. Based on the time delay, the local valid data intervals corresponding to each acquisition device are transformed to a unified time coordinate system, and the temporal intersection of the transformed valid data intervals is calculated to obtain a common valid time window. Based on the time delay and the common valid time window, the multi-view heterogeneous sensor data corresponding to each acquisition device is time-trimmed to obtain time-aligned multi-view heterogeneous sensor data. The entire process achieves accurate estimation of time delay across multiple devices through audio cross-correlation. Based on this, the local effective data ranges of each acquisition device are unified to the same time coordinate system, ensuring that the data used subsequently are common data collected simultaneously and effectively by multiple devices. Furthermore, the time window and time delay formed by the common data are combined to perform time clipping on the multi-view heterogeneous sensor data. This enables the alignment of multiple devices and multiple types of sensors under a unified time reference. 
It not only effectively solves the problem of insufficient accuracy when multiple acquisition devices use hardware synchronization for time alignment, but also adapts to scenarios where the effective data ranges of different acquisition devices are inconsistent, significantly improving the time alignment accuracy of multi-view heterogeneous sensor data.
[0022] The above description is only an overview of the technical solution of this application. In order to better understand the technical means of this application and to implement it in accordance with the contents of the specification, and to make the above and other objects, features and advantages of this application more obvious and understandable, specific embodiments of this application are set out below.
Brief Description of the Drawings
[0023] The accompanying drawings, which are included to provide a further understanding of this application and form part of this application, illustrate exemplary embodiments and are used to explain this application, but do not constitute an undue limitation of this application. In the drawings: Figure 1 is a flowchart of a method for temporal alignment of multi-view heterogeneous sensor data in one embodiment of this application; Figure 2 is a flowchart of a specific implementation of step 101 in Figure 1; Figure 3 is a flowchart of a specific implementation of step 203 in Figure 2; Figure 4 is a flowchart of another specific implementation of step 101 in Figure 1; Figure 5 is a flowchart of a specific implementation of step 102 in Figure 1; Figure 6 is a schematic diagram of a specific implementation of step 103 in Figure 1; Figure 7 is a flowchart of a specific implementation of step 104 in Figure 1; Figure 8 is a schematic structural diagram of a time-series alignment device for multi-view heterogeneous sensor data in one embodiment of this application; Figure 9 is a schematic structural diagram of a computer device provided in an embodiment of this application.
Detailed Implementation
[0024] The present application will be described in detail below with reference to the accompanying drawings and embodiments. It should be noted that, unless otherwise specified, the embodiments and features described in the embodiments of the present application can be combined with each other.
[0025] In related technologies, there are two main types of commonly used timing alignment methods. One is a hardware synchronization method that relies on an external timecode generator or synchronization cable. Although this method has high accuracy, it increases system complexity and the burden on the wearer, making it unsuitable for outdoor or lightweight scenarios. The other is a manual alignment method that generates synchronization events by clapping or striking a clapperboard. This method requires a lot of manual intervention and is difficult to scale to massive amounts of data. Therefore, neither of these timing alignment methods can achieve lightweight, automated, and high-precision timing alignment.
[0026] To address this issue, this embodiment provides a method for temporal alignment of multi-view heterogeneous sensor data. As shown in Figure 1, the method includes the following steps:
101. Extract continuous audio segments from the reference audio signal as feature templates.
[0027] In this system, the reference audio signal is one of multiple audio signals, each acquired by a distributed acquisition device. These devices are deployed separately and acquire data independently. Under the same acquisition scenario, corresponding audio signals are generated, resulting in multiple audio signals from different sources with independent acquisition sequences. This process can be applied in scenarios where distributed acquisition devices collaboratively acquire data. For example, when testers wear multiple acquisition devices for motion monitoring, health data acquisition, or on-site status perception, each device typically acquires multiple types of sensor data simultaneously, such as audio, acceleration, and angular velocity. These acquisition devices can be wearable, such as smart bracelets, wearable cameras, smart badges, or motion monitoring wristbands, which can be worn close to the body and distributed across different test subjects or different parts of the same subject to independently acquire heterogeneous data such as audio, heart rate, acceleration, and video. Alternatively, these acquisition devices can be portable, such as handheld acquisition terminals, mobile recording devices, or portable sensor nodes, which can be flexibly deployed in different locations within the acquisition scenario without fixed installation, meeting the requirements of distributed deployment and independent acquisition.
[0028] To achieve time matching between multiple signals, a unified reference standard needs to be determined among the multiple audio signals. This application selects one of the multiple audio signals as the reference audio signal. This reference signal can be audio data acquired by any acquisition device, such as the audio signal corresponding to a device worn by a specific subject, a device with clearer signal quality, or a device located in a central position.
[0029] Furthermore, after determining the reference audio signal, this embodiment can extract a continuous audio segment from within the reference audio signal as a feature template. For example, in a motion scene, a continuous audio segment containing footsteps, collision sounds, ambient background noise, and other sounds with obvious temporal characteristics can be selected. This segment can reflect the typical characteristics of the reference signal, facilitating similarity comparison with audio signals from other devices. By extracting real, continuously acquired audio as a feature template, rather than artificially constructed signals, the accuracy and scene adaptability of subsequent cross-correlation calculations can be improved, providing a stable and reliable matching basis for accurately calculating the time delay of each non-reference audio signal relative to the reference audio signal.
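One way to pick such a segment is sketched below, under assumptions the application does not fix: significance is judged by frame energy only, the noise baseline is approximated by the median frame energy, and a segment qualifies when its mean energy exceeds the baseline by a hypothetical factor `ratio`. The function names are illustrative.

```python
def frame_energies(signal, frame_len):
    """Mean squared amplitude of consecutive non-overlapping frames."""
    n_frames = len(signal) // frame_len
    return [
        sum(x * x for x in signal[i * frame_len:(i + 1) * frame_len]) / frame_len
        for i in range(n_frames)
    ]


def pick_template(signal, frame_len, template_frames, ratio=4.0):
    """Return (start_sample, end_sample) of the most energetic window,
    or None if no window stands out from the noise baseline."""
    energies = frame_energies(signal, frame_len)
    if len(energies) < template_frames:
        return None
    # Assumption: the median frame energy approximates the noise baseline.
    baseline = sorted(energies)[len(energies) // 2]
    best_start, best_energy = None, 0.0
    for start in range(len(energies) - template_frames + 1):
        window = sum(energies[start:start + template_frames]) / template_frames
        if window > best_energy:
            best_start, best_energy = start, window
    if best_start is None or best_energy <= ratio * baseline:
        return None
    return best_start * frame_len, (best_start + template_frames) * frame_len


# Toy usage: quiet background with a loud 20-sample burst starting at sample 40.
reference = [0.01] * 40 + [1.0, -1.0] * 10 + [0.01] * 40
segment = pick_template(reference, frame_len=10, template_frames=2)  # (40, 60)
```

Returning `None` when nothing stands out lets the caller fall back to another candidate reference signal instead of correlating against noise.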
[0030] 102. For each non-reference audio signal, the time delay of the non-reference audio signal relative to the reference audio signal is determined by cross-correlation operation within a preset search window.
[0031] Understandably, to achieve time synchronization among multiple acquisition devices, it is necessary to accurately calculate the time delay between each non-reference audio signal and the reference audio signal. This time delay essentially reflects the offset between the corresponding acquisition device and the reference acquisition device in the audio acquisition timing, serving as a core prerequisite for subsequent multi-device data time alignment. The non-reference audio signal refers to all other audio signals acquired by the multiple acquisition devices that are not selected as the reference audio signal; each non-reference audio signal corresponds to an independent acquisition device.
[0032] In this embodiment, cross-correlation, a classic and reliable method in signal processing for measuring the similarity between two signals, can effectively capture the matching degree of two audio signals at different time offsets, thereby accurately locating the time difference between the non-reference audio signal and the reference audio signal. Specifically, cross-correlation calculates the correlation coefficient between the feature template and each non-reference audio signal. When the correlation coefficient reaches its maximum value, the corresponding time offset is the time delay of the non-reference audio signal relative to the reference audio signal. This calculation method has strong resistance to environmental noise interference, can adapt to the audio acquisition needs of the acquisition device in complex scenarios, and ensures the accuracy of time delay estimation.
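The correlation coefficient described above can be written out as follows. This is one common normalized form rather than a formula given in the application. The symbols are introduced here for illustration: $p[n]$ is the feature template of $N$ samples starting at sample $s$ of the reference signal, $x_i$ is the $i$-th non-reference signal, $W$ is the half-width of the preset search window in samples, and $f_s$ is the common sampling frequency. The sign of $d_i$ is an assumed convention under which a device that started recording later than the reference has a positive delay.

```latex
R_i(\tau) = \frac{\sum_{n=0}^{N-1} x_i[s + \tau + n]\, p[n]}
                 {\sqrt{\sum_{n=0}^{N-1} x_i[s + \tau + n]^2}\;\sqrt{\sum_{n=0}^{N-1} p[n]^2}},
\qquad
\hat{\tau}_i = \arg\max_{|\tau| \le W} R_i(\tau),
\qquad
d_i = -\frac{\hat{\tau}_i}{f_s}
```

Because of the normalization, $R_i(\tau)$ lies in $[-1, 1]$ and is insensitive to gain differences between devices.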
[0033] To further improve the efficiency and accuracy of time delay calculation, this embodiment of the invention introduces a preset search window as a constraint during the cross-correlation operation. The preset search window is a time range pre-set based on the acquisition characteristics of the acquisition device and the actual needs of the application scenario. Its setting is mainly based on factors such as the start-up time difference range of the acquisition device, sampling frequency deviation, and sound propagation time difference caused by the device deployment distance. It is typically set to a reasonable range that can cover all possible time delays, avoiding both inefficient calculations due to an excessively large search range and overlooking true time delays due to an excessively small search range.
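A rough way to size the window from the factors listed above is sketched here. The additive model, the safety margin, and the function name `search_window_half_width` are assumptions made for illustration; the application only states which factors are considered, not how they are combined.

```python
SPEED_OF_SOUND_M_S = 343.0  # approximate speed of sound in air at about 20 °C


def search_window_half_width(max_startup_diff_s, max_distance_m,
                             drift_ppm, elapsed_s, margin=1.5):
    """Half-width (seconds) of a search window covering the expected delays.

    Assumption: worst-case contributions simply add up, then a margin is applied.
    """
    propagation_s = max_distance_m / SPEED_OF_SOUND_M_S   # acoustic path difference
    drift_s = drift_ppm * 1e-6 * elapsed_s                # accumulated clock drift
    return margin * (max_startup_diff_s + propagation_s + drift_s)


# Toy usage: 0.3 s start-up spread, devices up to 3.43 m apart,
# 50 ppm clock drift, template located 600 s into the recording.
half_width = search_window_half_width(0.3, 3.43, 50, 600)  # about 0.51 s
```

For body-worn devices the propagation term is usually negligible next to the start-up spread, which is why a window of a few hundred milliseconds can be enough.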
[0034] To illustrate with a specific application scenario, the tester wore three wearable devices: Device A, Device B, and Device C. Audio signals were simultaneously collected during movement. The audio signal collected by Device A was selected as the reference audio signal, and a continuous segment containing footsteps was extracted from it as a feature template. The audio signals collected by Devices B and C were the non-reference audio signals. Because Device B started up 0.2 seconds later than Device A, and Device C started up 0.1 seconds earlier than Device A, there was a slight difference in the startup time of the three devices. Furthermore, the different wearing positions of the devices caused subtle time differences in sound propagation. Cross-correlation calculations were needed to determine the time delay between the non-reference audio signals of Devices B and C and the reference audio signal of Device A.
[0035] At this point, for the non-reference audio signal corresponding to device B, the preset search window is set to [-0.5 seconds, 0.5 seconds], which is wide enough to cover the start-up time difference and the propagation time difference. The feature template and the non-reference audio signal of device B are cross-correlated within the preset search window, one candidate time offset at a time. The correlation coefficient is maximized at a time offset of 0.2 seconds, so the time delay of the non-reference audio signal of device B relative to the reference audio signal is determined to be 0.2 seconds. Similarly, for the non-reference audio signal corresponding to device C, cross-correlation is performed within the same preset search window. The time offset corresponding to the maximum correlation coefficient is -0.1 seconds, so the time delay of the non-reference audio signal of device C relative to the reference audio signal is determined to be -0.1 seconds. It should be noted that the negative sign here indicates that the audio acquisition timing of device C is earlier than that of the reference device A.
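The three-device scenario can be reproduced with a small sketch. It uses a direct, normalized cross-correlation over the search window and a toy sampling rate of 10 Hz so the numbers stay readable. The function name `estimate_delay` and the sign convention (delay = template start in the reference minus peak position in the device signal) are assumptions chosen so that a later-starting device gets a positive delay, as in the scenario.

```python
def estimate_delay(signal, template, template_start, sample_rate, window_s):
    """Delay (seconds) of `signal` relative to the reference signal.

    template_start: sample index where the template begins in the reference.
    window_s: half-width of the search window in seconds.
    Assumed convention: delay = (template_start - peak_index) / sample_rate.
    """
    half = int(round(window_s * sample_rate))
    lo = max(0, template_start - half)
    hi = min(len(signal) - len(template), template_start + half)
    t_norm = sum(x * x for x in template) ** 0.5
    best_pos, best_score = None, float("-inf")
    for pos in range(lo, hi + 1):
        seg = signal[pos:pos + len(template)]
        s_norm = sum(x * x for x in seg) ** 0.5
        if s_norm == 0.0 or t_norm == 0.0:
            continue  # silent segment: correlation coefficient undefined
        score = sum(a * b for a, b in zip(seg, template)) / (s_norm * t_norm)
        if score > best_score:
            best_pos, best_score = pos, score
    if best_pos is None:
        raise ValueError("no usable segment inside the search window")
    return (template_start - best_pos) / sample_rate


RATE = 10  # toy sampling rate in Hz
ref_a = [0.0] * 100
ref_a[40:45] = [1.0, -2.0, 3.0, -1.0, 2.0]   # distinctive burst in device A
template = ref_a[40:45]
dev_b = ref_a[2:] + [0.0, 0.0]               # B started 0.2 s later than A
dev_c = [0.0] + ref_a[:-1]                   # C started 0.1 s earlier than A

delay_b = estimate_delay(dev_b, template, 40, RATE, 0.5)   # 0.2
delay_c = estimate_delay(dev_c, template, 40, RATE, 0.5)   # -0.1
```

For real recordings, an FFT-based correlation would be far faster than this loop. The direct form is used only to keep the peak search explicit.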
[0036] 103. Based on the time delay, convert the local valid data intervals corresponding to each acquisition device to a unified time coordinate system, calculate the temporal intersection of the converted valid data intervals, and obtain a common valid time window.
[0037] Since different acquisition devices typically use their own independent local clocks for data acquisition, the start time, end time, and duration of their valid data often differ significantly. Even if relative timing calibration between signals is achieved through time delays, the valid data intervals of each device are still based on their local timelines and cannot be directly used for joint processing of multi-device data. Therefore, it is necessary to unify the valid data intervals scattered under different local clocks to a common time base for comparison, thereby determining the time period during which all acquisition devices are simultaneously in a valid acquisition state.
[0038] Specifically, during the conversion to a unified time coordinate system, the timestamp of the acquisition device corresponding to the reference audio signal can be used as a benchmark. Based on the time delay of each acquisition device corresponding to the reference audio signal, the start and end times of its local valid data interval are shifted and corrected, ensuring that the valid data intervals of all acquisition devices are represented on the same time axis. For the acquisition device corresponding to the reference audio signal, its time delay is zero, and its valid data interval remains unchanged. For other acquisition devices not corresponding to the reference audio signal, the local valid data interval is shifted forward or backward based on the sign of the time delay, aligning it with the timing of the acquisition device corresponding to the reference audio signal. After the conversion, each valid data interval no longer depends on its own local clock but has a unified timing reference, intuitively reflecting the valid acquisition period of each acquisition device on the same time axis.
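The shift can be sketched as a one-line mapping. It assumes the convention `t_reference = t_local + delay`, so a device that started later than the reference has a positive delay, and it uses integer milliseconds to avoid floating-point rounding. The function name `to_global` and the interval values are hypothetical.

```python
def to_global(local_interval_ms, delay_ms):
    """Map a (start, end) interval from a device's local time axis to the
    reference time axis. Assumed convention: t_reference = t_local + delay."""
    start, end = local_interval_ms
    return (start + delay_ms, end + delay_ms)


# Toy usage with hypothetical intervals (milliseconds on each local clock).
interval_a = to_global((2000, 10000), 0)      # reference device: unchanged
interval_b = to_global((800, 8800), 200)      # started 0.2 s later: shifted forward
interval_c = to_global((3100, 11100), -100)   # started 0.1 s earlier: shifted back
```

After this step, all three intervals are expressed on device A's time axis and can be compared directly.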
[0039] Accordingly, after completing the unified time-series transformation of the valid data intervals, a time-series intersection calculation is performed on the transformed valid data intervals of all acquisition devices. The time-series intersection represents the overlapping portion of the valid data intervals of all devices. Only data falling within this overlapping time period can guarantee that each acquisition device is in a normal acquisition state and the data is valid. The time period obtained by calculating the time-series intersection is the common valid time window. This common valid time window can automatically filter out time periods where some acquisition devices are valid while other acquisition devices have not acquired data or the data is invalid, fundamentally avoiding data loss, misalignment, or anomalies caused by non-overlapping time sequences.
[0040] Taking motion monitoring as an example, when testers wear different data acquisition devices to collect data synchronously, the effective data ranges for each device are different. Assume the effective data range for the acquisition device corresponding to the reference audio signal is from second 2 to second 10 in a unified time coordinate system; the effective range for the first acquisition device corresponding to the non-reference audio signal, after time delay conversion, is from second 1 to second 9; and the effective range for the second acquisition device corresponding to the non-reference audio signal, after conversion, is from second 3 to second 11. Taking the temporal intersection of these three effective data ranges, the overlapping portion is from second 3 to second 9. This time period is the common effective time window for this data collection. Only data falling within this common effective time window can ensure that all acquisition devices are simultaneously in an effective acquisition state, thus providing an accurate and reliable basis for subsequent time clipping of multi-view heterogeneous sensor data.
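The intersection in this example reduces to taking the latest start and the earliest end. A minimal sketch, with the hypothetical function name `common_window` and the intervals from the example expressed in seconds:

```python
def common_window(global_intervals):
    """Intersect (start, end) intervals given on the unified time axis.
    Returns None when the devices never record validly at the same time."""
    start = max(s for s, _ in global_intervals)
    end = min(e for _, e in global_intervals)
    if start >= end:
        return None
    return (start, end)


# Intervals from the example: reference device, first and second non-reference devices.
window = common_window([(2, 10), (1, 9), (3, 11)])  # (3, 9)
```

Handling the empty case explicitly matters in practice: if one device was switched on only after another had stopped, there is nothing to align.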
[0041] 104. Based on the time delay and the common effective time window, time clipping is performed on the multi-view heterogeneous sensing data corresponding to each acquisition device to obtain time-aligned multi-view heterogeneous sensing data.
[0042] It should be noted that multi-view heterogeneous sensor data is not limited to the audio signals used for time delay estimation mentioned earlier, but also includes various sensor data such as acceleration, angular velocity, heart rate, and blood oxygen collected synchronously by different devices. Due to issues such as local clock deviations and asynchronous startup times among different acquisition devices, even after the conversion of the effective data range to a unified time coordinate system is completed, the heterogeneous sensor data corresponding to each device still have temporal misalignments, and some data fall outside the common effective time window and are invalid. If directly used for joint processing, it will lead to deviations in the analysis results. Therefore, precise alignment must be achieved through time clipping.
[0043] The execution logic of time trimming relies on the time delay and common effective time window mentioned earlier. The time delay is used to calibrate the timing offset of local sensor data from each acquisition device, ensuring that the start and end times of the data from each acquisition device are consistent with the reference device in a unified time coordinate system. The common effective time window defines a unique period during which all acquisition devices effectively acquire data, ensuring that the trimmed data is valid information synchronously acquired by multiple acquisition devices. Specifically, for each acquisition device, firstly, based on its corresponding time delay, all multi-view heterogeneous sensor data acquired locally by the acquisition device are shifted to a unified time coordinate system to correct timing deviations; then, using the common effective time window as a filtering criterion, sensor data within this window is extracted, and invalid data outside the window is removed, completing the sensor data trimming for a single device. The above time trimming process is entirely based on the time delay and common effective time window obtained earlier, requiring no manual intervention, exhibiting a high degree of automation, adapting to scenarios of parallel acquisition by multiple acquisition devices, and not relying on complex hardware synchronization modules. High-precision time alignment can be achieved solely through software-level timing calibration and data trimming, significantly reducing system deployment costs and operational complexity.
[0044] To illustrate with a specific application scenario, a tester wears three wearable devices, Device A, Device B, and Device C, which simultaneously collect heterogeneous sensor data comprising audio, acceleration, and heart rate. Device A serves as the reference audio signal acquisition device and has a time delay of 0. Device B has a time delay of 0.2 seconds, meaning its acquisition timing is 0.2 seconds later than Device A's. Device C has a time delay of -0.1 seconds, meaning its acquisition timing is 0.1 seconds earlier than Device A's. The common effective time window calculated through the preceding steps runs from the 3rd to the 9th second in the unified time coordinate system. For Device A, since its time delay is 0, no time shift of the local heterogeneous sensor data is needed. The audio, acceleration, and heart rate data from the 3rd to the 9th second are extracted directly and constitute Device A's effective heterogeneous sensor data after cropping. For Device B, based on the 0.2-second time delay, all of its local sensor data are shifted backward by 0.2 seconds (0.2 seconds is added to each local timestamp) to correct the timing deviation and align with reference Device A. Then, all sensor data from the 3rd to the 9th second after the shift are extracted, and invalid data outside this window are discarded. For example, audio data collected locally by Device B from 2.8 seconds to 3 seconds corresponds, after the shift, to 3 seconds to 3.2 seconds in the unified time coordinate system. It falls within the common effective time window and is retained. Audio data from Device B before local 2.8 seconds remains outside this window and is discarded. For Device C, based on the -0.1-second time delay, all of its local sensor data are shifted forward by 0.1 seconds (0.1 seconds is subtracted from each local timestamp) to correct the timing deviation. Then, sensor data from 3 seconds to 9 seconds after the shift are extracted, completing the cropping. 
Correspondingly, after cropping the sensor data from all devices, the cropped multi-view heterogeneous sensor data from devices A, B, and C are integrated to obtain time-aligned multi-view heterogeneous sensor data. At this point, the audio, acceleration, and heart rate data from the three devices are completely synchronized within 3 seconds to 9 seconds in a unified time coordinate system. The sensor data from all devices at each time point are valid information collected at the same time. For example, the sensor data corresponding to the 5th second in the unified time coordinate system includes the audio, acceleration, and heart rate data collected by device A in the 5th second, the corresponding data collected by device B in the 5th second after translation (i.e., the local 4.8th second), and the corresponding data collected by device C in the 5th second after translation (i.e., the local 5.1st second). The timing of the three is completely consistent, and there is no misalignment, missing, or redundancy.
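The shift-then-crop procedure for Devices A, B, and C can be sketched as follows. This is a minimal illustration that assumes the unified timestamp equals the local timestamp plus the device's time delay, which is the convention implied by the worked example (local 2.8 seconds on Device B maps to 3 seconds). The 10 Hz sample stream and the name `clip_to_window` are hypothetical.

```python
from typing import List, Tuple

Sample = Tuple[float, int]  # (timestamp in seconds, sample value)


def clip_to_window(samples: List[Sample], delay: float,
                   window: Tuple[float, float]) -> List[Sample]:
    """Shift local timestamps by `delay`, then keep samples inside `window`."""
    start, end = window
    # Rounding to microseconds avoids floating-point noise at the window edges.
    shifted = [(round(t + delay, 6), v) for t, v in samples]
    return [(t, v) for t, v in shifted if start <= t <= end]


# Hypothetical 10 Hz stream; each value is its local timestamp in tenths of a second.
local = [(i / 10, i) for i in range(121)]
window = (3.0, 9.0)
delays = {"A": 0.0, "B": 0.2, "C": -0.1}
aligned = {dev: clip_to_window(local, d, window) for dev, d in delays.items()}

for dev, data in aligned.items():
    # First and last retained sample: (unified time, local time in tenths of a second)
    print(dev, data[0], data[-1], len(data))
```

With these inputs the first retained sample of Device B carries the local value 28 (local 2.8 seconds) and that of Device C carries 31 (local 3.1 seconds), matching the description above.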
[0045] In practical applications, because audio and video have fundamentally different sampling mechanisms, cutting the data directly at the time boundaries can lead to keyframe loss or semantic incompleteness. This embodiment, after calculating the common effective time window, does not mechanically segment the data by time, but instead fine-tunes the clipping boundaries based on the semantic units of the heterogeneous data to ensure semantic integrity. Specifically, a semantic boundary constraint can be added on top of the time delay and the common effective time window. This constraint allows querying the semantic unit attributes of each acquisition device near the time boundary, such as frame type, target ID continuity, and data packet integrity. Based on preset semantic integrity rules, the initial clipping region is fine-tuned to generate the final clipping region, ensuring that the final clipping region is semantically complete and self-contained.
[0046] Specifically, after calculating the common effective time window, a semantic index mapping table can be constructed for multi-view heterogeneous sensor data. Before time clipping, a lightweight metadata index can be performed for each heterogeneous sensor data stream. This metadata index only records key semantic information, including but not limited to: timestamps, frame types, and scene change markers for each frame of the video stream, as well as timestamps, voice activity detection status, and transient impact markers for the audio stream. Accordingly, during the time-trimming process of the multi-view heterogeneous sensor data corresponding to each acquisition device, the initial start time and initial end time of the common effective time window are obtained; for each heterogeneous sensor data stream, the semantic unit identification information in its data stream is parsed, where the semantic unit identification information includes at least the keyframe type of the video stream or the voice activity status information of the audio stream; based on the semantic unit identification information, within the preset neighborhood range of the initial start time and initial end time, boundary fine-tuning operation is performed: if the initial boundary is detected to have truncated a video frame that cannot be independently decoded or an incomplete target tracking trajectory, the corresponding boundary is moved to the nearest semantically complete boundary point; combining the fine-tuning results of each heterogeneous sensor data stream, the final effective time window is determined through a preset priority arbitration strategy; data trimming is performed based on the final effective time window.
[0047] The aforementioned priority arbitration strategy configuration may include: when the semantic integrity boundary adjustment requirements of the video stream conflict with those of the radar stream, if the current test scenario is marked as visual perception verification, the adjustment result of the video stream shall prevail; through this dynamic, content-aware boundary fine-tuning, the final generated common time window is not only synchronized on the timeline, but also achieves a high degree of integrity at the data semantic level.
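The boundary fine-tuning and priority arbitration described in the preceding paragraphs can be sketched as follows. This is only one possible reading: the semantic index is reduced to a list of semantically complete boundary points per stream, and the function names, the 0.2-second neighborhood, and the sample timestamps are all assumptions.

```python
from typing import Dict, Sequence, Tuple

Window = Tuple[float, float]


def snap_to_nearest(boundary: float, points: Sequence[float],
                    neighborhood: float) -> float:
    """Move a boundary to the nearest semantic point within the neighborhood."""
    nearby = [p for p in points if abs(p - boundary) <= neighborhood]
    return min(nearby, key=lambda p: abs(p - boundary)) if nearby else boundary


def fine_tune(window: Window, points: Sequence[float],
              neighborhood: float) -> Window:
    """Fine-tune both boundaries of the initial window for one data stream."""
    start, end = window
    return (snap_to_nearest(start, points, neighborhood),
            snap_to_nearest(end, points, neighborhood))


def arbitrate(adjusted: Dict[str, Window], priority: Sequence[str]) -> Window:
    """Pick the result of the highest-priority stream that produced one."""
    for stream in priority:
        if stream in adjusted:
            return adjusted[stream]
    raise ValueError("no stream in the priority list has an adjusted window")


initial = (3.0, 9.0)
# Hypothetical semantic indexes: video keyframe times and radar packet boundaries.
index = {"video": [2.5, 3.1, 5.0, 8.9, 9.6], "radar": [2.95, 9.05]}
adjusted = {name: fine_tune(initial, pts, 0.2) for name, pts in index.items()}

# Scenario marked as visual perception verification: the video stream prevails.
final_window = arbitrate(adjusted, priority=["video", "radar"])
print(adjusted, final_window)
```

Snapping to the nearest point may move a boundary slightly outside the initial common window, as the radar stream does here. A stricter variant could restrict candidates to points inside the window; the application does not specify which behavior is intended.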
[0048] The timing alignment method for multi-view heterogeneous sensor data provided in this application, compared with the current methods of hardware synchronization or manual alignment, extracts continuous audio segments as feature templates from a reference audio signal. The reference audio signal is one of multiple audio signals acquired by distributed acquisition devices. For each non-reference audio signal, cross-correlation is performed within a preset search window to determine the time delay of the non-reference audio signal relative to the reference audio signal. Based on the time delay, the local valid data intervals corresponding to each acquisition device are transformed to a unified time coordinate system, and the timing intersection of the transformed valid data intervals is calculated to obtain a common valid time window. Based on the time delay and the common valid time window, the multi-view heterogeneous sensor data corresponding to each acquisition device is time-trimmed to obtain time-aligned multi-view heterogeneous sensor data. The entire process achieves accurate estimation of time delay across multiple devices through audio cross-correlation. Based on this, the local effective data ranges of each acquisition device are unified to the same time coordinate system, ensuring that the data used subsequently are common data collected simultaneously and effectively by multiple devices. Furthermore, the time window and time delay formed by the common data are combined to perform time clipping on the multi-view heterogeneous sensor data. This enables the alignment of multiple devices and multiple types of sensors under a unified time reference. 
It not only effectively solves the problem of insufficient accuracy when multiple acquisition devices use hardware synchronization for time alignment, but also adapts to scenarios where the effective data ranges of different acquisition devices are inconsistent, significantly improving the time alignment accuracy of multi-view heterogeneous sensor data.
[0049] In practical applications, complete reference audio signals often contain a large number of invalid, stationary, or noise-dominated periods. Directly using the complete signal for cross-correlation calculations significantly increases the computational load and is prone to errors in time delay estimation due to noise interference or unclear features, failing to meet the application requirements of low-computing-power, high-precision acquisition devices. Selecting continuous audio segments as feature templates is essentially performing feature extraction and dimensionality reduction on the reference signal, which can accurately identify signal segments with significant time-domain or frequency-domain features, providing a comparison basis for subsequent cross-device signal matching. Specifically, in the above embodiments, as shown in Figure 2, step 101 includes the following steps:
201. Acquire multi-view heterogeneous sensor data concurrently collected by distributed acquisition devices.
[0050] 202. Extract the corresponding audio tracks from each video file to obtain multiple audio signals.
[0051] 203. Take one of the multiple audio signals as a reference audio signal, extract continuous audio segments from the reference audio signal, and obtain a feature template.
[0052] In this embodiment, concurrent acquisition refers to the distributed acquisition devices simultaneously starting acquisition work within the same application scenario and time period, ensuring that the acquired data corresponds to the same monitoring scenario or behavioral process. The corresponding multi-view heterogeneous sensor data refers to various types of sensor data collected by the distributed acquisition devices from different wearing positions and monitoring angles, not single-dimensional data. It explicitly includes multiple video files, and each video file corresponds to an audio track. This is because during the acquisition process, the video file simultaneously records visual and environmental sound information. Its built-in audio track has better scene correlation and temporal synchronization compared to the audio signal acquired separately by the acquisition device, accurately corresponding to the real-time scene during device acquisition. Furthermore, it eliminates the need for an additional audio acquisition module, reducing device deployment complexity.
[0053] In practical applications, testers wore wearable head cameras, wrist recorders, waist monitors, and foot sensors, all four devices operating simultaneously to monitor the testers' synchronized movements, such as running and jumping. The head camera captured video footage of facial and upper body movements, with a synchronized audio track recording ambient sounds, breathing, and footsteps. The wrist recorder captured video footage of hand movements, with an audio track simultaneously recording the sounds of clothing rubbing near the wrist and ambient background noise. The waist and foot sensors captured video footage and corresponding audio tracks, recording waist movements and foot landings, along with their corresponding sound signals. All the data collected by these four devices together constituted multi-view heterogeneous sensor data, with each video file's corresponding audio track accurately reflecting the real-time sound scene at the time of acquisition by that device.
[0054] Next, apart from the visual information in the video files, the audio signals that can be used for timing comparison need to be extracted. Since audio signals from different wearable devices may contain the same scene sounds in a concurrent multi-device acquisition scenario, these common sound features can serve as a basis for timing comparison across multiple acquisition devices. Specifically, existing audio extraction algorithms can be used to separate independent audio tracks from each video file, convert them to a standard audio format, and obtain multiple audio signals corresponding to each acquisition device. Each audio signal uniquely corresponds to one acquisition device, and its timing characteristics are completely synchronized with the local acquisition clock of that device, accurately reflecting the acquisition timing of the corresponding device.
[0055] Continuous audio segments are selected in order to pick out signal segments with distinct features and strong stability. These segments provide a clear and reliable comparison benchmark for the subsequent cross-correlation calculations while significantly reducing the amount of data to be processed, thus meeting the lightweight processing requirements of acquisition devices. For example, in the aforementioned distributed acquisition scenario, the audio signal corresponding to the head-mounted wearable camera can be selected as the reference audio signal. This reference audio signal contains continuous footstep sound segments, which can be extracted as feature templates.
[0056] Specifically, in the above embodiments, as shown in Figure 3, step 203 above includes the following steps:
301. Based on at least one of signal energy and effective speech segment length, select one audio signal from the multiple audio signals as a reference audio signal.
[0057] 302. Extract statistically significant continuous audio segments from the reference audio signal as feature templates.
[0058] Understandably, signal energy and effective speech segment length are two key quantifiable indicators for measuring audio signal quality, and are preferred as the basis for selecting reference audio signals. They can be used individually or in combination, both enabling accurate selection of reference audio signals. Signal energy reflects the intensity of the audio signal. Audio signals with moderate and stable energy are less affected by environmental noise, have clearer waveform characteristics, and can effectively avoid problems such as the signal being masked by noise due to being too weak, or distortion due to being too strong. Effective speech segment length refers to the total length of continuous segments in the audio signal that have actual scene characteristics. The longer the effective segment, the richer the features available for matching in the signal, which can improve the flexibility of subsequent feature template extraction and the reliability of matching.
[0059] For example, devices 1, 2, and 3 concurrently acquire video and corresponding audio signals during a running process, extracting three audio signals from the three video streams. The signal energy and effective audio segment length of the three audio signals are as follows: Device 1's audio signal energy is stable between -20dB and -15dB, with a total effective footstep sound segment length of 8 seconds, accounting for 80% of the total acquisition time; Device 2's audio signal energy fluctuates between -30dB and -10dB, significantly affected by outdoor wind noise, with a total effective footstep sound segment length of only 3 seconds, accounting for 30% of the total acquisition time; Device 3's audio signal energy fluctuates between -40dB and -25dB, with a total effective footstep sound segment length of 5 seconds, accounting for 50% of the total acquisition time. At this point, if only signal energy is considered, the audio signal energy of device 1 is the most stable, so it is selected as the reference audio signal; if only the effective speech segment length is considered, the effective segment of device 1 is the longest, so it is selected as the reference audio signal; if both indicators are considered, device 1 is the best in both indicators, so it is selected as the reference audio signal to ensure that the reference signal has a high-quality matching basis.
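The selection in this example can be sketched as a small ranking routine. Here "most stable energy" is interpreted as the smallest spread between the minimum and maximum energy, and the combined criterion is a simple sum of ranks. Both interpretations, and the names `AudioStats` and `select_reference`, are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class AudioStats:
    device: str
    energy_min_db: float
    energy_max_db: float
    effective_seconds: float  # total length of effective segments


def energy_spread(s: AudioStats) -> float:
    """Smaller spread is treated as more stable energy (an assumption)."""
    return s.energy_max_db - s.energy_min_db


def select_reference(stats: List[AudioStats], criterion: str = "both") -> AudioStats:
    if criterion == "energy":
        return min(stats, key=energy_spread)
    if criterion == "length":
        return max(stats, key=lambda s: s.effective_seconds)
    # Combined criterion: sum of the two ranks, lowest total wins.
    by_spread = sorted(stats, key=energy_spread)
    by_length = sorted(stats, key=lambda s: -s.effective_seconds)
    rank = {s.device: by_spread.index(s) + by_length.index(s) for s in stats}
    return min(stats, key=lambda s: rank[s.device])


stats = [
    AudioStats("device 1", -20.0, -15.0, 8.0),
    AudioStats("device 2", -30.0, -10.0, 3.0),
    AudioStats("device 3", -40.0, -25.0, 5.0),
]
for criterion in ("energy", "length", "both"):
    print(criterion, select_reference(stats, criterion).device)
```

All three criteria pick device 1 for the figures given in the example.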
[0060] The statistical significance mentioned above is determined by the difference in energy or spectrum between the extracted continuous audio segment and its noise baseline. Here, statistical significance means that the extracted continuous audio segment has clear, distinguishable characteristics and is significantly different from the noise portion of the reference audio signal.
[0061] It should be noted that the determination of statistical significance can be flexibly made by choosing between energy difference or spectrum difference, or by combining both: in scenarios with low noise interference, such as indoor monitoring, energy difference alone can be used for accurate determination; in scenarios with high noise interference and complex spectrum, such as outdoor noisy environments, energy difference and spectrum difference can be combined for determination.
[0062] Continuing with the example of the concurrent running scenario, after selecting the audio signal of device 1 as the reference audio signal, the noise baseline of the reference audio signal is first determined: extract the audio segment of the 1-second silent period at the beginning of the acquisition of device 1, calculate its average energy as -35dB, and the spectrum is mainly concentrated in the high frequency band above 2kHz. This segment is the noise baseline. Subsequently, continuous audio segments were screened from the reference audio signal and their statistical significance was tested: a continuous footstep sound segment with a duration of 2 seconds and an average energy of -18dB was selected. Its average energy of -18dB was significantly higher than the average energy of the noise baseline of -35dB, and its spectrum was mainly concentrated in the mid-low frequency band of 500Hz to 1kHz, which was significantly different from the high frequency spectrum distribution of the noise baseline. Therefore, the segment was determined to be statistically significant and was used as a feature template. If a segment with a duration of 1 second and an average energy of -32dB was selected, its energy was close to the noise baseline of -35dB, and its spectrum distribution was basically consistent with the noise baseline with no obvious feature difference. Therefore, the segment was determined to be not statistically significant, and the audio segment was discarded. A new continuous audio segment that met the requirements was selected.
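The energy-difference test in this example can be sketched as a threshold comparison against the noise baseline. The 10 dB margin is an assumed value; the application does not state a threshold, and the spectrum-difference test is omitted here.

```python
import math
from typing import Sequence


def mean_energy_db(samples: Sequence[float]) -> float:
    """Average energy of a segment in dB relative to full scale (amplitude 1.0)."""
    power = sum(x * x for x in samples) / len(samples)
    return 10.0 * math.log10(max(power, 1e-12))  # floor avoids log10(0)


def is_significant(segment_db: float, baseline_db: float,
                   margin_db: float = 10.0) -> bool:
    """Treat a segment as significant if it exceeds the baseline by the margin."""
    return segment_db - baseline_db >= margin_db


noise_baseline_db = -35.0
print(is_significant(-18.0, noise_baseline_db))  # True: 17 dB above the baseline
print(is_significant(-32.0, noise_baseline_db))  # False: only 3 dB above it
```

`mean_energy_db` shows one way the average energy figures could be obtained from raw samples normalized to the range -1 to 1.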
[0063] It is understandable that distributed acquisition devices, due to differences in hardware models and acquisition parameter settings, may have audio acquisition modules using different sampling frequencies. To eliminate non-systematic errors caused by hardware differences between different acquisition devices and to ensure the fairness and accuracy of subsequent signal comparisons, as shown in Figure 4, after step 202 above, the method further includes the following step:
401. Perform sampling frequency unification and amplitude normalization on the multi-channel audio signals to obtain standardized multi-channel audio signals.
[0064] Accordingly, in step 203, one audio signal from the standardized multi-channel audio signal is used as a reference audio signal, and continuous audio segments are extracted from the reference audio signal to obtain a feature template.
[0065] Due to differences in hardware configuration, microphone sensitivity, and acquisition parameters among different devices, directly using the raw audio signals for cross-correlation calculations will result in significant computational deviations. In this embodiment, during the process of unifying the sampling frequency and normalizing the amplitude of multiple audio signals, frequency unification standardizes the audio signals from different devices to the same time-domain scale, avoiding timing misalignments caused by varying sampling point density. Amplitude normalization eliminates signal strength differences caused by microphone gain, wearing distance, and recording environment, preventing amplitude interference with feature matching. After these two standardization steps, all audio signals maintain consistency in data format, numerical range, and time resolution.
[0066] Correspondingly, after standardizing the multiple audio signals, one signal can be selected as the reference audio signal. The purpose of the reference audio signal is to provide a unified timing reference for all acquisition devices. Typically, a signal with clear quality, low noise, and distinct characteristics is chosen. Since sampling frequency unification and amplitude normalization have already been completed, the reference audio signal now has exactly the same format and numerical range as the other signals, ensuring the fairness and accuracy of subsequent cross-correlation calculations. For example, in a motion recording scenario with multiple wearable cameras, different devices may acquire audio at 44.1kHz, 16kHz, and 8kHz respectively, and their amplitudes may differ significantly. After standardization, all signals are unified to a 16kHz sampling rate, and the amplitudes are mapped to the same range. At this point, any one of these signals can serve as a stable reference audio signal.
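Sampling frequency unification and amplitude normalization can be sketched with NumPy as follows. Linear interpolation and peak normalization are simplifications chosen to keep the example short, and the name `standardize` is hypothetical. Linear interpolation applies no anti-aliasing filter, so a filtered resampler would normally be preferred when downsampling real recordings.

```python
import numpy as np


def standardize(signal, src_rate: int, target_rate: int = 16000) -> np.ndarray:
    """Resample to `target_rate` and scale the peak amplitude to 1.0."""
    signal = np.asarray(signal, dtype=np.float64)
    if signal.size == 0:
        return signal
    n_out = int(round(signal.size / src_rate * target_rate))
    t_src = np.arange(signal.size) / src_rate   # original sample times
    t_out = np.arange(n_out) / target_rate      # target sample times
    resampled = np.interp(t_out, t_src, signal)  # linear interpolation
    peak = np.max(np.abs(resampled))
    return resampled / peak if peak > 0 else resampled


# One second of a 100 Hz tone recorded at 8 kHz with a low gain.
t = np.arange(8000) / 8000
quiet = 0.3 * np.sin(2 * np.pi * 100 * t)
out = standardize(quiet, src_rate=8000)
print(len(out), float(np.max(np.abs(out))))
```

After this step, signals recorded at 44.1kHz, 16kHz, or 8kHz all share one sample index per 1/16000 second, so a lag in samples means the same time offset for every device.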
[0067] In practical applications, in distributed multi-camera acquisition systems, each acquisition device typically uses an independent clock source and is subject to clock drift. Existing hardware synchronization solutions are limited by transmission latency and device response speed, often achieving a synchronization accuracy of only around 20ms. However, in scenarios involving high-speed motion, rapid zooming, or high-frequency vibration, a 20ms error can lead to motion asynchrony.
[0068] To address the aforementioned issues, this embodiment proposes first utilizing the inherent synchronicity of ambient sound to perform correlation analysis on the multi-view audio and video data recorded by the distributed acquisition devices, calculating the relative time offset between the devices. Here, the relative time offset can be understood as the difference in exposure start times. Subsequently, based on the relative time offset, timeline remapping and common window cropping are performed on each video stream. While this method cannot change the actual exposure time of the hardware, it can effectively compensate, at the later data processing level, for time errors caused by asynchronous device startup and crystal oscillator drift, thereby outputting logically strictly aligned multi-view video data and eliminating the fusion quality degradation caused by time asynchrony. Accordingly, after step 104, the above method further includes the following step: analyzing the time synchronization deviation based on the time-aligned multi-view heterogeneous sensor data, and using the time synchronization deviation as feedback to adjust the exposure trigger timing of the distributed acquisition devices, thereby realizing synchronous exposure control of the distributed acquisition devices.
[0069] Understandably, after obtaining time-aligned multi-view heterogeneous sensor data, feature analysis can be used to uncover the implicit time synchronization deviations within the data. These deviations refer not only to simple clock drift but also to microscopic time errors caused by hardware trigger jitter, transmission delay fluctuations, and inconsistent exposure start points. Since the input data has already undergone rigorous common window cropping, the deviations analyzed at this point can more accurately reflect systematic errors at the hardware level, rather than interference caused by dynamic scene changes. The time synchronization deviations can be quantified into specific error vectors or compensation parameters, serving as the basis for feedback control.
[0070] Specifically, if analysis reveals a lag between the actual exposure start point of the slave acquisition device and the master acquisition device, the trigger signal transmission time of the slave device can be advanced in the next acquisition cycle, or its internal exposure delay register can be adjusted. This adjustment is performed in real-time or near real-time, forming a closed-loop control circuit. Through continuous iterative adjustments, the physical exposure windows of each heterogeneous sensor can be forced to gradually converge, making them coincide as closely as possible with the ideal common time window at the hardware level.
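The closed-loop adjustment can be illustrated with a simple proportional update. This is only a conceptual sketch: the gain of 0.5, the fixed 8 ms hardware lag, and the function name are assumptions, and no real device register or trigger interface is modeled.

```python
def next_trigger_offset(offset_ms: float, measured_lag_ms: float,
                        gain: float = 0.5) -> float:
    """Advance the slave trigger by a fraction of the measured lag."""
    return offset_ms - gain * measured_lag_ms


true_lag_ms = 8.0  # hypothetical fixed lag of the slave device's exposure start
offset_ms = 0.0    # correction applied to the slave trigger so far
history = []
for _ in range(10):
    # Residual lag that the offline alignment analysis would report this cycle.
    residual = true_lag_ms + offset_ms
    history.append(residual)
    offset_ms = next_trigger_offset(offset_ms, residual)

print(history[:4], offset_ms)
```

In this idealized setting the residual lag halves in every acquisition cycle, which mirrors the gradual convergence of the exposure windows described above. Real devices would add jitter and drift that this sketch does not model.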
[0071] In practical applications, due to the dispersed deployment and independent acquisition of distributed acquisition devices, and the lack of a unified clock among these devices, there will be start-up time differences, sampling deviations, and timing offsets. Direct data fusion will lead to timing misalignment. Therefore, it is necessary to perform similarity matching between each non-reference audio signal and the reference audio signal to quantify timing differences. Specifically, as shown in Figure 5, step 102 above includes the following steps:
501. For each non-reference audio signal, calculate the cross-correlation function between the non-reference audio signal and the feature template within a preset search window.
[0072] 502. Locate the peak position in the cross-correlation function.
[0073] 503. Determine the time delay of the non-reference audio signal relative to the reference audio signal based on the offset between the peak position and the starting position of the feature template in the reference audio signal.
[0074] The aforementioned non-reference audio signals refer to all other audio signals acquired by multiple distributed acquisition devices that were not selected as reference audio signals. Each non-reference audio signal uniquely corresponds to one acquisition device. The cross-correlation function, a classic and reliable method in signal processing for measuring the similarity between two signals, works by calculating the correlation coefficient between the feature template and the non-reference audio signal at different time offsets. A higher correlation coefficient indicates a higher similarity between the two signals at the time offset position, and vice versa.
[0075] Understandably, to adapt to the deployment characteristics of distributed acquisition devices, the preset search window is a time range pre-set based on the actual sampling scenario of the distributed acquisition devices. This setting is based on, but is not limited to, the startup time difference range of each acquisition device, sampling frequency deviation, and sound propagation time difference caused by device deployment distance. It is typically set to a reasonable range that can cover all possible time delays. For example, if the distributed acquisition device consists of three wearable cameras deployed in an outdoor collaborative acquisition scenario, based on the device startup characteristics and deployment distance, the preset search window is set to [-0.8 seconds, 0.8 seconds] to ensure coverage of all possible timing offsets while controlling computational load to adapt to the lightweight computing power requirements of wearable devices.
[0076] The peak position of the aforementioned cross-correlation function corresponds to the time point offset where the non-reference audio signal and the feature template have the highest similarity, and is also key to achieving accurate time delay positioning. Since the feature template has significant time-domain or frequency-domain characteristics, there must be a segment in the non-reference audio signal acquired by the distributed acquisition device that is highly similar to the feature template, corresponding to a distinct peak in the cross-correlation function. This peak, unlike spurious peaks caused by noise interference, has high amplitude, strong distinguishability, and rapid decay of surrounding correlation coefficients. By setting a peak threshold, interference from spurious peaks can be eliminated, ensuring the accuracy of peak position positioning.
[0077] In this embodiment, the time delay essentially reflects the offset between the corresponding distributed acquisition device and the reference acquisition device in the audio acquisition timing, and is the core technical support for subsequent multi-device data time alignment. Because the distributed acquisition devices are deployed separately and acquire data independently, without a unified local clock, there are slight differences in the startup time and sampling frequency of each device. This causes the timing of the non-reference audio signal and the reference audio signal to be unable to synchronize naturally. Accurate estimation of the time delay is achieved through cross-correlation function calculation, peak location, and offset calculation. Since the starting position of the feature template in the reference audio signal is only used to define the coordinates of the feature template in the reference timing, and the cross-correlation function calculates the relative offset between the feature template and the non-reference audio signal, this relative offset directly reflects the timing difference between the non-reference audio signal and the reference audio signal, i.e., the time delay. For example, if the feature template starts at the 3rd second in device A, and the peak offset of the cross-correlation function in device B is 0.3 seconds, then the time delay of the non-reference audio signal in device B relative to the reference audio signal is 0.3 seconds, meaning that the audio acquisition timing of device B is 0.3 seconds later than that of device A. Similarly, if the peak offset of the cross-correlation function in device C is -0.2 seconds, then the time delay of the non-reference audio signal in device C relative to the reference audio signal is -0.2 seconds, meaning that the audio acquisition timing of device C is 0.2 seconds earlier than that of device A.
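Steps 501 to 503 can be sketched with NumPy as a normalized cross-correlation evaluated only at lags inside the search window. The sign convention is an assumption chosen to match the worked examples: a positive delay means the non-reference device's timing is later than the reference, so an event at reference time t appears at local time t minus the delay. The function name, the peak threshold of 0.5, and the synthetic noise signals are illustrative.

```python
import numpy as np


def estimate_delay(reference, other, rate, template_start_s, template_len_s,
                   search_s=0.8, peak_threshold=0.5):
    """Return the delay of `other` relative to `reference` in seconds, or None."""
    reference = np.asarray(reference, dtype=np.float64)
    other = np.asarray(other, dtype=np.float64)
    start = int(round(template_start_s * rate))
    length = int(round(template_len_s * rate))
    template = reference[start:start + length]
    template = template - template.mean()
    max_lag = int(round(search_s * rate))
    best_lag, best_score = None, -1.0
    for lag in range(-max_lag, max_lag + 1):
        pos = start - lag  # where the template would begin in `other`
        if pos < 0 or pos + length > other.size:
            continue
        segment = other[pos:pos + length]
        segment = segment - segment.mean()
        denom = np.linalg.norm(template) * np.linalg.norm(segment)
        score = float(template @ segment / denom) if denom > 0 else 0.0
        if score > best_score:  # track the correlation peak
            best_lag, best_score = lag, score
    if best_lag is None or best_score < peak_threshold:
        return None  # no peak above the threshold: treat as unreliable
    return best_lag / rate


rate = 1000
rng = np.random.default_rng(0)
device_a = rng.standard_normal(10 * rate)          # reference recording
device_b = device_a[300:]                          # started 0.3 s later than A
device_c = np.concatenate([rng.standard_normal(200), device_a])  # 0.2 s earlier

print(estimate_delay(device_a, device_b, rate, 3.0, 2.0))  # 0.3
print(estimate_delay(device_a, device_c, rate, 3.0, 2.0))  # -0.2
```

The explicit loop keeps the search-window restriction easy to see. For long templates an FFT-based correlation would be faster, at the cost of a less direct mapping to the steps above.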
[0078] In practical applications, each acquisition device uses an independent local clock, so their valid data intervals are misaligned on the time axis and cannot be fused directly across devices. Converting the local valid data interval of each device to a unified time coordinate system based on the time delay eliminates the timing offsets and places all data on the same time base. Specifically, as shown in Figure 6, step 103 above includes the following steps: 601. Obtain the start and end times of valid data from each acquisition device under the local timestamp to form a local valid data interval.
[0079] 602. Based on the time delay, the local valid data interval is mapped to a unified time coordinate system to obtain the global valid data interval corresponding to each acquisition device.
[0080] 603. Calculate the temporal intersection of all globally valid data intervals to obtain the common valid time window.
[0081] Understandably, because each distributed data acquisition device operates independently and uses its local clock for data collection, the generated sensor data is based on its own local timestamp, resulting in inherent time-series discrepancies between different devices. To achieve unified comparison of data from multiple devices, it is first necessary to obtain the start and end times of valid data for each acquisition device under its local timestamp, forming a corresponding local valid data interval. This interval represents the valid time period from the start of normal data acquisition by a single acquisition device to the end of acquisition; it only reflects the working status of the acquisition device itself and does not provide a time-series basis for direct comparison with other acquisition devices.
[0082] After obtaining the local valid data intervals of each acquisition device, based on the time delay calculated in the preceding steps, the local valid data intervals of each acquisition device can be mapped to a unified time coordinate system to obtain the corresponding global valid data intervals. Here, the unified time coordinate system uses the time axis of the device containing the reference audio signal as a reference. The start and end times of the local valid data are shifted and corrected by the time delay, so that the valid data intervals of all acquisition devices are aligned on the same time axis, eliminating the time sequence offset caused by independent acquisition, and providing a unified benchmark for subsequent cross-comparison of valid time periods of multiple devices.
[0083] Taking a distributed wearable data acquisition device as an example, the device corresponding to the reference audio signal is equivalent to a reference device, and its local time is the unified time base, with an effective range of 2 seconds to 10 seconds. If the time delay of the first non-reference device is 0.3 seconds, its local effective range is 1.7 seconds to 9.7 seconds, which becomes 2 seconds to 10 seconds after mapping to the unified coordinate system. The time delay of the second non-reference device is -0.2 seconds, and its local effective range is 2.2 seconds to 10.2 seconds, which also becomes 2 seconds to 10 seconds after mapping. After mapping, the effective data ranges of the three devices are accurately aligned on the same time axis.
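The mapping in this example amounts to shifting both endpoints of the local interval by the time delay. A minimal sketch follows; the function name `to_unified` and the tuple representation of an interval are assumptions made for illustration.

```python
import math

def to_unified(local_interval, delay):
    """Map a (start, end) interval from a device's local axis to the unified axis.

    Follows the worked example: unified time = local time + time delay.
    """
    start, end = local_interval
    return (start + delay, end + delay)

# Values taken from the example above (seconds).
first_device = to_unified((1.7, 9.7), 0.3)
second_device = to_unified((2.2, 10.2), -0.2)

# Both intervals land on 2 s to 10 s, up to floating-point rounding.
aligned = all(
    math.isclose(a, b, abs_tol=1e-9)
    for interval in (first_device, second_device)
    for a, b in zip(interval, (2.0, 10.0))
)
```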
[0084] In this embodiment, the temporal intersection is the time period where all globally valid data intervals overlap on a unified time axis. Within this time period, every acquisition device is in a normal acquisition state with valid data; there is no situation where a device has not started acquisition, has stopped acquisition, or has invalid data. By calculating the temporal intersection of all globally valid data intervals, time periods where only some acquisition devices are valid and others are missing can be automatically eliminated, fundamentally avoiding analysis errors caused by data asynchrony or incompleteness. Taking a distributed wearable acquisition device as an example, the globally valid data intervals under a unified time coordinate system are: 2 seconds to 10 seconds for the reference device, 1 second to 9 seconds for the first non-reference device, and 3 seconds to 11 seconds for the second non-reference device. Taking the temporal intersection of these three globally valid data intervals, the overlapping part is 3 seconds to 9 seconds, which is the common valid time window for this acquisition. In other words, only data falling within 3 seconds to 9 seconds can ensure that all acquisition devices are valid simultaneously.
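The temporal intersection is the latest start and the earliest end over all global valid data intervals. The sketch below illustrates this under the assumption that intervals are `(start, end)` tuples in seconds. Returning `None` when the devices never overlap is a choice made here for illustration; the text does not specify this case.

```python
def common_window(global_intervals):
    """Intersect (start, end) intervals on the unified time axis."""
    start = max(s for s, _ in global_intervals)
    end = min(e for _, e in global_intervals)
    # Assumption: an empty intersection is reported as None.
    return (start, end) if start < end else None

# Intervals from the example above: reference, first and second non-reference device.
window = common_window([(2, 10), (1, 9), (3, 11)])
no_overlap = common_window([(0, 2), (3, 5)])
```

For the three intervals in the example, `window` is `(3, 9)`.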
[0085] In practical applications, because distributed devices acquire data independently and their local clocks are not synchronized, timing offsets and inconsistent valid time periods can occur. The time delay is used to calibrate the data from each device to a unified time coordinate system, eliminating the misalignment caused by clock deviations; the common effective time window identifies the period during which all devices acquire valid data simultaneously. Combining the two to perform time clipping on the multi-view heterogeneous sensor data achieves timing alignment and valid-data filtering at once, removing invalid, non-overlapping, and asynchronous data segments. This ensures that the final sensor data is aligned and complete in time, providing a stable foundation for subsequent multi-source data fusion. Specifically, as shown in Figure 7, step 104 above includes the following steps: 701. For each acquisition device, based on the time delay, convert the common effective time window into a clipping time interval of the acquisition device in the local time coordinate system.
[0086] 702. Based on the aforementioned clipping time interval, synchronously capture the multi-view heterogeneous sensor data corresponding to each acquisition device.
[0087] 703. Combine the multi-view heterogeneous sensor data captured by all acquisition devices to form time-aligned multi-view heterogeneous sensor data.
[0088] The local time coordinate system is based on the local time axis of each acquisition device. The specific conversion logic is based on time delay. The common effective time window under the unified time reference is reverse-mapped to the local time axis of each acquisition device to obtain the clipping interval that adapts to the local data storage time sequence of each acquisition device, ensuring that the clipping interval is accurately matched with the timestamp of the local data of the acquisition device.
[0089] To illustrate with a specific scenario, the distributed acquisition devices include device A, device B, and device C. The audio signal acquired by device A is used as the reference audio signal, so the time delay of device A is 0. The time delay of device B is 0.3 seconds (i.e., the acquisition timing of device B is 0.3 seconds later than that of device A), and the time delay of device C is -0.2 seconds (i.e., the acquisition timing of device C is 0.2 seconds earlier than that of device A). Through the above calculations, the common effective time window is obtained as 3 seconds to 9 seconds in the unified time coordinate system. For device A, since its time delay is 0, the local clipping interval coincides with the common effective time window, i.e., 3 seconds to 9 seconds. For device B, the 3 seconds to 9 seconds of unified time needs to be converted back to local time. The conversion formula is: local clipping interval start / end time = unified start / end time - time delay. That is, the start time is 3 - 0.3 = 2.7 seconds, and the end time is 9 - 0.3 = 8.7 seconds. Therefore, the local clipping interval for device B is 2.7 seconds to 8.7 seconds. For device C, the conversion is similar: the start time is 3 - (-0.2) = 3.2 seconds, and the end time is 9 - (-0.2) = 9.2 seconds. Therefore, the local clipping interval for device C is 3.2 seconds to 9.2 seconds. This conversion ensures that the clipping interval of each acquisition device corresponds exactly to the timing of the data stored locally.
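The reverse mapping in this scenario can be written as a single subtraction per endpoint. The sketch below reuses the numbers of the example; the function name `to_local` and the dictionary of delays are assumptions made for illustration.

```python
import math

def to_local(unified_window, delay):
    """Convert the common window to a device's local clipping interval.

    Implements the stated formula: local time = unified time - time delay.
    """
    start, end = unified_window
    return (start - delay, end - delay)

COMMON_WINDOW = (3.0, 9.0)                    # seconds, unified time axis
DELAYS = {"A": 0.0, "B": 0.3, "C": -0.2}      # seconds, from the example

clip_intervals = {name: to_local(COMMON_WINDOW, d) for name, d in DELAYS.items()}

# Rounded copy for display, since the subtraction is done in floating point.
rounded = {name: tuple(round(t, 3) for t in iv) for name, iv in clip_intervals.items()}
```

After rounding, the intervals are 3 to 9 seconds for device A, 2.7 to 8.7 seconds for device B, and 3.2 to 9.2 seconds for device C.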
[0090] It should be noted that synchronous capture does not mean that all acquisition devices perform the capture operation at the same moment. Rather, the captured data all correspond to the common valid time window under the unified time coordinate system and are therefore synchronized in time sequence. That is, the data of device A from 3 seconds to 9 seconds, of device B from 2.7 seconds to 8.7 seconds, and of device C from 3.2 seconds to 9.2 seconds all cover the same period on the unified time axis.
[0091] Specifically, in the process of combining the multi-view heterogeneous sensor data captured by all acquisition devices, the heterogeneous data captured by each acquisition device must be correlated in time according to the common effective time window of the unified time coordinate system. This ensures that each time node on the unified time axis contains all types of sensor data from all acquisition devices. For example, at a unified time of 3.5 seconds, the combined data includes the audio, video, and acceleration data of device A at local time 3.5 seconds, the corresponding data of device B at local time 3.5 - 0.3 = 3.2 seconds, and the corresponding data of device C at local time 3.5 + 0.2 = 3.7 seconds. All data are synchronized in time, without misalignment or invalid segments.
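Steps 701 to 703 can be sketched as clipping each device's timestamped samples and re-stamping them on the unified axis. This is an illustrative sketch only. The `(local_time, value)` sample layout, integer millisecond timestamps (used here to avoid floating-point comparisons), the function names, and the synthetic data are all assumptions.

```python
def clip_and_restamp(samples, delay_ms, window_ms):
    """Keep samples inside the common window and key them by unified time.

    samples: list of (local_time_ms, value); unified time = local time + delay.
    """
    start, end = window_ms
    return {
        local + delay_ms: value
        for local, value in samples
        if start <= local + delay_ms <= end
    }

def combine(devices, delays_ms, window_ms):
    """Return [(unified_time_ms, {device: value})] for times present on all devices."""
    per_device = {
        name: clip_and_restamp(samples, delays_ms[name], window_ms)
        for name, samples in devices.items()
    }
    shared = sorted(set.intersection(*(set(d) for d in per_device.values())))
    return [(t, {name: per_device[name][t] for name in per_device}) for t in shared]

# Synthetic recordings: one sample every 100 ms, labeled with its local timestamp.
DELAYS_MS = {"A": 0, "B": 300, "C": -200}
devices = {
    name: [(t, f"{name}@{t}") for t in range(0, 12001, 100)]
    for name in DELAYS_MS
}

aligned = combine(devices, DELAYS_MS, window_ms=(3000, 9000))
at_3500 = dict(aligned)[3500]
```

Here `at_3500` pairs device A's sample at local 3500 ms with device B's at 3200 ms and device C's at 3700 ms, matching the arithmetic in the example.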
[0092] Furthermore, as a specific implementation of the above method, embodiments of this application provide a time-series alignment device for multi-view heterogeneous sensor data. As shown in Figure 8, the device includes: an interception unit 81, a determining unit 82, a calculation unit 83, and a clipping unit 84.
[0093] The interception unit 81 is used to intercept continuous audio segments from the reference audio signal as feature templates. The reference audio signal is one audio signal among multiple audio signals, which are acquired by a distributed acquisition device. The determining unit 82 is used to determine the time delay of the non-reference audio signal relative to the reference audio signal by performing cross-correlation calculation within a preset search window for each non-reference audio signal. The calculation unit 83 is used to convert the local valid data intervals corresponding to each acquisition device to a unified time coordinate system based on the time delay, calculate the temporal intersection of the converted valid data intervals, and obtain a common valid time window. The clipping unit 84 is used to clip the time of the multi-view heterogeneous sensing data corresponding to each acquisition device based on the time delay and the common effective time window, so as to obtain time-aligned multi-view heterogeneous sensing data.
[0094] Compared with current approaches that rely on hardware synchronization or manual alignment for time alignment of multi-view heterogeneous sensor data, the time-series alignment device provided in this application extracts a continuous audio segment from a reference audio signal as a feature template, the reference audio signal being one of multiple audio signals acquired by distributed acquisition devices. For each non-reference audio signal, the time delay relative to the reference audio signal is determined by cross-correlation within a preset search window. Based on the time delay, the local valid data intervals corresponding to each acquisition device are converted to a unified time coordinate system, and the temporal intersection of the converted intervals is calculated to obtain a common valid time window. Based on the time delay and the common valid time window, the multi-view heterogeneous sensor data corresponding to each acquisition device is time-clipped to obtain time-aligned multi-view heterogeneous sensor data. The entire process estimates the time delay across multiple devices through audio cross-correlation. On this basis, the local valid data intervals of each acquisition device are unified to the same time coordinate system, ensuring that the data used subsequently are common data acquired simultaneously and validly by all devices. The common valid time window and the time delay are then combined to time-clip the multi-view heterogeneous sensor data, which aligns multiple devices and multiple types of sensors under a unified time reference. This not only addresses the problem of insufficient accuracy when multiple acquisition devices use hardware synchronization for time alignment, but also adapts to scenarios where the valid data intervals of different acquisition devices are inconsistent, improving the time alignment accuracy of multi-view heterogeneous sensor data.
[0095] In specific application scenarios, the interception unit includes an acquisition module, an extraction module, and an interception module. The acquisition module is used to acquire multi-view heterogeneous sensor data concurrently acquired by distributed acquisition devices; the multi-view heterogeneous sensor data includes multiple video files, and each video file contains an audio track. The extraction module is used to extract the corresponding audio track from each video file to obtain multiple audio signals. The interception module is used to take one of the multiple audio signals as the reference audio signal and intercept a continuous audio segment from it to obtain the feature template. Specifically, the interception module selects one audio signal from the multiple audio signals as the reference audio signal based on at least one of signal energy and effective speech segment length, and extracts a statistically significant continuous audio segment from the reference audio signal as the feature template, the statistical significance being determined by the difference in energy or spectrum between the segment and its noise baseline.
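One possible reading of the reference selection and template extraction is sketched below. It uses signal energy only (the text also allows effective speech segment length), splits the signal into non-overlapping frames, takes the median frame energy as the noise baseline, and requires the chosen frame to exceed that baseline by a fixed ratio. The frame scheme, the median baseline, the ratio of 4.0, and all function names are assumptions, not details given in this application.

```python
import statistics

def mean_energy(signal):
    """Mean squared amplitude of a list of samples."""
    return sum(x * x for x in signal) / len(signal) if signal else 0.0

def select_reference(signals):
    """Pick the name of the signal with the highest mean energy."""
    return max(signals, key=lambda name: mean_energy(signals[name]))

def extract_template(signal, frame_len, ratio_threshold=4.0):
    """Return (start_index, frame) for the most energetic frame, or None.

    The frame must exceed ratio_threshold times the noise baseline, where the
    baseline is assumed to be the median energy over all frames.
    """
    starts = range(0, len(signal) - frame_len + 1, frame_len)
    energies = [(mean_energy(signal[s:s + frame_len]), s) for s in starts]
    if not energies:
        return None
    baseline = statistics.median([e for e, _ in energies])
    best_energy, best_start = max(energies)
    if best_energy == 0 or best_energy <= ratio_threshold * baseline:
        return None  # nothing stands out from the noise floor
    return best_start, signal[best_start:best_start + frame_len]

# Toy signals: low-level noise, and the same noise with a loud burst at samples 40-49.
quiet = [0.01, -0.01] * 50
loud = list(quiet)
loud[40:50] = [1.0, -1.0] * 5

reference_name = select_reference({"quiet": quiet, "loud": loud})
template_info = extract_template(loud, frame_len=10)
```

In this toy case the louder recording is chosen as the reference and the burst frame starting at sample 40 becomes the template, while a recording with no distinct event yields no template.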
[0096] In specific application scenarios, the interception unit further includes a processing module. After the corresponding audio tracks are extracted from each video file to obtain multiple audio signals, the processing module performs sampling frequency unification and amplitude normalization on the multiple audio signals to obtain standardized multiple audio signals. Correspondingly, the interception module is also used to take one of the standardized audio signals as the reference audio signal, and intercept a continuous audio segment from it to obtain the feature template.
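Sampling frequency unification and amplitude normalization could look like the following sketch. The text does not name a resampling method, so linear interpolation and peak normalization are assumptions chosen for brevity. A production resampler would normally apply an anti-aliasing filter before downsampling, which is omitted here.

```python
def resample_linear(signal, src_rate, dst_rate):
    """Resample by linear interpolation (assumed method, no anti-alias filter)."""
    if not signal or src_rate == dst_rate:
        return list(signal)
    n_out = int(len(signal) * dst_rate / src_rate)
    out = []
    for i in range(n_out):
        pos = i * src_rate / dst_rate          # position on the source sample grid
        left = int(pos)
        right = min(left + 1, len(signal) - 1)
        frac = pos - left
        out.append(signal[left] * (1 - frac) + signal[right] * frac)
    return out

def normalize_peak(signal):
    """Scale so that the largest absolute amplitude is 1 (silence is left as is)."""
    peak = max((abs(x) for x in signal), default=0.0)
    return [x / peak for x in signal] if peak > 0 else list(signal)

def standardize(signal, src_rate, target_rate):
    """Unify the sampling frequency, then normalize the amplitude."""
    return normalize_peak(resample_linear(signal, src_rate, target_rate))

upsampled = resample_linear([0.0, 2.0, 4.0, 6.0], src_rate=4, dst_rate=8)
downsampled = resample_linear([0, 1, 2, 3, 4, 5, 6, 7], src_rate=8, dst_rate=4)
standardized = standardize([0.0, 2.0, 4.0, 6.0], src_rate=4, target_rate=8)
```

Once every track passes through `standardize` with the same target rate, sample indices are comparable across devices, which the cross-correlation step relies on.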
[0097] In specific application scenarios, the device further includes a control unit. After the multi-view heterogeneous sensor data corresponding to each acquisition device has been time-clipped based on the time delay and the common effective time window to obtain time-aligned multi-view heterogeneous sensor data, the control unit analyzes the time synchronization deviation based on the time-aligned data and uses the time synchronization deviation as feedback to adjust the exposure trigger timing of the distributed acquisition devices, achieving synchronous exposure control of the distributed acquisition devices.
[0098] In specific application scenarios, the determining unit is specifically used for: For each non-reference audio signal, calculate the cross-correlation function between the non-reference audio signal and the feature template within a preset search window; Locate the peak position in the cross-correlation function; The time delay of the non-reference audio signal relative to the reference audio signal is determined based on the offset between the peak position and the starting position of the feature template in the reference audio signal.
[0099] In specific application scenarios, the calculation unit is specifically used for: obtaining the start and end times of valid data from each acquisition device under the local timestamp to form a local valid data interval; mapping, based on the time delay, the local valid data interval to a unified time coordinate system to obtain the global valid data interval corresponding to each acquisition device, the unified time coordinate system being based on the time axis of the reference audio signal; and calculating the temporal intersection of all global valid data intervals to obtain the common valid time window.
[0100] In specific application scenarios, the clipping unit is specifically used for: converting, for each acquisition device and based on the time delay, the common effective time window into a clipping time interval of the acquisition device in the local time coordinate system, the local time coordinate system being based on the local time axis of each acquisition device; synchronously capturing, based on the clipping time interval, the multi-view heterogeneous sensor data corresponding to each acquisition device; and combining the multi-view heterogeneous sensor data captured by all acquisition devices to form time-aligned multi-view heterogeneous sensor data.
[0101] Based on the above-mentioned time-series alignment method for multi-view heterogeneous sensor data, this application embodiment also provides a storage medium storing a computer program thereon, which, when executed by a processor, implements the above-mentioned time-series alignment method for multi-view heterogeneous sensor data.
[0102] Based on this understanding, the technical solution of this application can be embodied in the form of a software product. The software product can be stored in a non-volatile storage medium (such as a CD-ROM, USB flash drive, or portable hard drive), and includes several instructions to cause a computer device (such as a personal computer, server, or network device) to execute the methods described in the various implementation scenarios of this application.
[0103] Based on the above-described method for time-series alignment of multi-view heterogeneous sensor data and the corresponding virtual device embodiments, in order to achieve the above objectives, this application also provides a physical device for time-series alignment of multi-view heterogeneous sensor data. Specifically, it can be a computer, smartphone, tablet computer, smartwatch, server, or network device, etc. The physical device includes a storage medium and a processor; the storage medium is used to store a computer program; the processor is used to execute the computer program to implement the above-described method for time-series alignment of multi-view heterogeneous sensor data.
[0104] Optionally, the physical device may also include a user interface, a network interface, a camera, radio frequency (RF) circuitry, sensors, audio circuitry, a Wi-Fi module, etc. The user interface may include a display screen, input units such as a keyboard, etc., and optional user interfaces may also include USB interfaces, card reader interfaces, etc. The network interface may optionally include standard wired interfaces, wireless interfaces (such as Wi-Fi interfaces), etc.
[0105] In an exemplary embodiment, referring to Figure 9, the aforementioned physical device includes a communication bus, a processor, a memory, and a communication interface. It may also include an input / output interface and a display device. The various functional units can communicate with each other via the bus. The memory stores a computer program, and the processor executes the program stored in the memory to perform the time-series alignment method for multi-view heterogeneous sensor data described in the above embodiments.
[0106] Those skilled in the art will understand that the physical device structure for time-series alignment of multi-view heterogeneous sensing data provided in this embodiment does not constitute a limitation on the physical device, and may include more or fewer components, or combine certain components, or have different component arrangements.
[0107] The storage medium may also include an operating system and a network communication module. The operating system is a program that manages the hardware and software resources of the physical device for time-alignment of the aforementioned multi-view heterogeneous sensor data, supporting the operation of information processing programs and other software and / or programs. The network communication module is used to enable communication between the various components within the storage medium, as well as communication with other hardware and software in the information processing physical device.
[0108] Through the above description of the embodiments, those skilled in the art can clearly understand that this application can be implemented by means of software plus necessary general-purpose hardware platforms, or it can be implemented by hardware. By applying the technical solution of this application, compared with the existing methods, this application achieves accurate estimation of multi-device time delay through audio cross-correlation. On this basis, the local effective data intervals of each acquisition device are unified to the same time coordinate system, ensuring that the data used subsequently are all common data effectively acquired by multiple devices simultaneously. Furthermore, by combining the time window and time delay formed by the common data, time clipping is performed on the multi-view heterogeneous sensor data. It can complete the alignment processing of multiple devices and multiple types of sensors under a unified time reference. This not only effectively solves the problem of insufficient accuracy when multiple acquisition devices use hardware synchronization for time alignment, but also adapts to scenarios where the effective data intervals of different acquisition devices are inconsistent, greatly improving the time alignment accuracy of multi-view heterogeneous sensor data.
[0109] Those skilled in the art will understand that the accompanying drawings are merely schematic diagrams of a preferred embodiment, and the modules or processes shown in the drawings are not necessarily essential for implementing this application. Those skilled in the art will understand that the modules in the apparatus of the embodiment can be distributed within the apparatus of the embodiment as described, or can be modified to be located in one or more apparatuses different from this embodiment. The modules of the above-described embodiment can be combined into one module, or further divided into multiple sub-modules.
[0110] The serial numbers in this application are for descriptive purposes only and do not represent the superiority or inferiority of any particular implementation scenario. The above disclosures are merely a few specific implementation scenarios of this application; however, this application is not limited thereto, and any variations conceived by those skilled in the art should fall within the protection scope of this application.
Claims
1. A method for temporal alignment of multi-view heterogeneous sensor data, characterized in that the method includes: extracting a continuous audio segment from a reference audio signal as a feature template, wherein the reference audio signal is one audio signal among multiple audio signals, and the multiple audio signals are acquired by distributed acquisition devices, each of which is deployed separately and acquires independently; for each non-reference audio signal, determining the time delay of the non-reference audio signal relative to the reference audio signal by a cross-correlation operation within a preset search window; based on the time delay, converting the local valid data intervals corresponding to each acquisition device to a unified time coordinate system, and calculating the temporal intersection of the converted valid data intervals to obtain a common valid time window; and based on the time delay and the common valid time window, performing time clipping on the multi-view heterogeneous sensor data corresponding to each acquisition device to obtain time-aligned multi-view heterogeneous sensor data.
2. The method according to claim 1, characterized in that, The step of extracting continuous audio segments from the reference audio signal as feature templates specifically includes: Acquire multi-view heterogeneous sensor data concurrently collected by distributed acquisition devices. The multi-view heterogeneous sensor data includes multiple video files, each of which contains an audio track. Extract the corresponding audio tracks from each video file to obtain multiple audio signals; One of the multiple audio signals is used as a reference audio signal, and continuous audio segments are extracted from the reference audio signal to obtain a feature template. The step of using one audio signal from the multiple audio signals as a reference audio signal, and extracting continuous audio segments from the reference audio signal to obtain a feature template, specifically includes: Based on at least one of signal energy and effective speech segment length, one audio signal is selected from the multiple audio signals as a reference audio signal; A statistically significant continuous audio segment is extracted from the reference audio signal as a feature template, the statistical significance being determined by the difference in energy or spectrum between it and its noise baseline.
3. The method according to claim 2, characterized in that, After extracting the corresponding audio tracks from each video file to obtain multiple audio signals, the method further includes: The sampling frequency of the multi-channel audio signals is uniformly processed and the amplitude is normalized to obtain standardized multi-channel audio signals. Accordingly, one audio signal from the standardized multi-channel audio signal is used as a reference audio signal, and continuous audio segments are extracted from the reference audio signal to obtain a feature template.
4. The method according to claim 1, characterized in that, after performing time clipping on the multi-view heterogeneous sensor data corresponding to each acquisition device based on the time delay and the common effective time window to obtain time-aligned multi-view heterogeneous sensor data, the method further includes: analyzing the time synchronization deviation based on the time-aligned multi-view heterogeneous sensor data, and using the time synchronization deviation as feedback to adjust the exposure trigger timing of the distributed acquisition devices, so as to achieve synchronous exposure control of the distributed acquisition devices.
5. The method according to claim 1, characterized in that, For each non-reference audio signal, the time delay of the non-reference audio signal relative to the reference audio signal is determined through cross-correlation calculation within a preset search window, specifically including: For each non-reference audio signal, calculate the cross-correlation function between the non-reference audio signal and the feature template within a preset search window; Locate the peak position in the cross-correlation function; The time delay of the non-reference audio signal relative to the reference audio signal is determined based on the offset between the peak position and the starting position of the feature template in the reference audio signal.
6. The method according to any one of claims 1-5, characterized in that, The process of converting the local valid data intervals corresponding to each acquisition device to a unified time coordinate system based on the time delay, and calculating the temporal intersection of the converted valid data intervals to obtain a common valid time window, specifically includes: Obtain the start and end times of valid data from each acquisition device under the local timestamp to form a local valid data range; Based on the time delay, the local valid data interval is mapped to a unified time coordinate system to obtain the global valid data interval corresponding to each acquisition device. The unified time coordinate system is based on the time axis of the reference audio signal. Calculate the temporal intersection of all globally valid data intervals to obtain the common valid time window.
7. The method according to any one of claims 1-5, characterized in that, The step of time-clipping the multi-view heterogeneous sensing data corresponding to each acquisition device based on the time delay and the common effective time window to obtain time-aligned multi-view heterogeneous sensing data specifically includes: For each acquisition device, based on the time delay, the common effective time window is converted into a clipping time interval of the acquisition device in the local time coordinate system, with the local time coordinate system based on the local time axis of each acquisition device; Based on the aforementioned clipping time interval, multi-view heterogeneous sensor data corresponding to each acquisition device are synchronously captured. The multi-view heterogeneous sensor data captured by all acquisition devices are combined to form time-aligned multi-view heterogeneous sensor data.
8. A time-series alignment device for multi-view heterogeneous sensor data, characterized in that the device includes: an interception unit, used to extract a continuous audio segment from a reference audio signal as a feature template, wherein the reference audio signal is one audio signal among multiple audio signals acquired by distributed acquisition devices; a determining unit, used to determine, for each non-reference audio signal, the time delay of the non-reference audio signal relative to the reference audio signal by performing a cross-correlation calculation within a preset search window; a calculation unit, used to convert the local valid data intervals corresponding to each acquisition device to a unified time coordinate system based on the time delay, and to calculate the temporal intersection of the converted valid data intervals to obtain a common valid time window; and a clipping unit, used to perform time clipping on the multi-view heterogeneous sensor data corresponding to each acquisition device based on the time delay and the common valid time window, so as to obtain time-aligned multi-view heterogeneous sensor data.
9. A computer device comprising a memory and a processor, wherein the memory stores a computer program, characterized in that, When the processor executes the computer program, it implements the steps of the method according to any one of claims 1 to 7.
10. A computer storage medium having a computer program stored thereon, characterized in that, When the computer program is executed by a processor, it implements the steps of the method according to any one of claims 1 to 7.