A behavior monitoring method and device, electronic equipment and storage medium
By extracting individual pig audio from mixed audio and combining it with video features, and using a multimodal model for behavior monitoring, the problem of low accuracy in identifying abnormal vocalizations in pigs has been solved. This enables efficient and accurate location of abnormal pigs and is suitable for large-scale farms.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- CHINA MOBILE CHENGDU INFORMATION & TELECOMM TECH CO LTD
- Filing Date
- 2021-07-22
- Publication Date
- 2026-06-23
AI Technical Summary
Existing technologies have low accuracy and efficiency in identifying abnormal vocalizations in pigs, making it difficult to achieve efficient monitoring in large-scale farms.
By extracting individual audio samples from mixed audio and combining them with video features, a multimodal model is used for behavior monitoring, including speech separation and behavioral feature matching, to identify abnormal behaviors in pigs.
It improves the accuracy and efficiency of behavior monitoring, enabling precise and rapid location of pigs exhibiting abnormal behavior, reducing human intervention, and is suitable for large-scale farms.
Figure CN115700880B_ABST
Abstract
Description
Technical Field
[0001] This application relates to the field of monitoring technology, and in particular to a behavior monitoring method, device, electronic device, and storage medium.
Background Technology
[0002] In animal husbandry, the health status and reproductive efficiency of pigs are important indicators for evaluating the breeding technology of a farm. The health status of pigs, in particular, is crucial; an outbreak of infectious diseases in the herd can severely impact a farm's profits and even cause incalculable economic losses. With the development of technology, remotely monitoring pig vocalizations to track their health status has become a key and challenging research area.
[0003] Among related technologies, there are three main methods for monitoring the health of pigs:
[0004] 1. Manual Inspection. Observers inspect the pigs' health in the pens, paying attention to any abnormal vocalizations. If any pigs exhibiting abnormal vocalizations are found, they are marked and their information is recorded. While this method is comprehensive and reliable, it is time-consuming and labor-intensive, relies on the observer's experience, and is only suitable for small-scale farming scenarios. It is not applicable to large-scale, high-efficiency management scenarios.
[0005] 2. Envelope template matching. Pig vocalizations are matched against envelope templates to identify pigs exhibiting abnormal vocal behavior. Specifically, abnormal vocalizations of various pigs are collected in advance, and envelope templates for these abnormal vocalizations are created. Vocalization data of the pigs to be monitored is then collected and matched against the envelope templates to determine whether abnormal vocalizations are present. The drawback of this method is that some sounds that resemble abnormal vocalizations but are not actually abnormal will also match the envelope templates, resulting in low monitoring accuracy. Furthermore, it cannot monitor abnormal vocalizations in multiple pigs simultaneously.
[0006] 3. Machine learning classification. Audio data is collected from the pig herd with audio equipment, and machine learning or deep learning methods are used to detect and classify abnormal sounds. Specifically, audio data is collected from the herd and manually labeled as coughing or non-coughing. Audio features extracted from the data, such as Mel Frequency Cepstral Coefficients (MFCC) or spectrograms, serve as input to an abnormal sound classification model. The model is trained using machine learning or deep learning methods such as Dynamic Time Warping (DTW), Vector Quantization (VQ), Fuzzy C-means Clustering (FCM), Hidden Markov Models (HMM), Artificial Neural Networks (ANN), and Convolutional Neural Networks (CNN). Newly collected herd audio is then input into the trained model. If the model determines that the audio contains an abnormal sound, the pigs exhibiting abnormal vocal behavior are identified manually, using the location of the herd that produced the sound. This method has several drawbacks. First, because the audio is collected from the herd as a whole, the method can only detect that abnormal vocalizations occurred in the herd; it cannot pinpoint the specific pig. Manual labor is therefore required to locate the target pig, which makes identification inefficient. Second, audio features such as MFCC, Power Spectral Density (PSD), and Linear Predictive Cepstral Coefficients (LPCC) are generally used as input data for the abnormal sound classification model. Models trained in this way discriminate poorly between abnormal sounds such as coughing, squealing, and chewing on metal, resulting in poor classification accuracy. Furthermore, training an abnormal sound classification model on pig audio data alone is insufficient to accurately identify abnormal vocalizations in pigs.
[0007] In other words, the relevant technologies still face technical problems in identifying pigs with abnormal vocal behavior, including low accuracy and low efficiency.
Summary of the Invention
[0008] In view of this, the main objective of the embodiments of this application is to provide a behavior monitoring method, device, electronic device and storage medium to solve the problems of low accuracy and low efficiency in identifying pigs with abnormal behavior in related technologies.
[0009] To achieve the above objectives, the technical solution of this application embodiment is implemented as follows:
[0010] This application provides a behavior monitoring method, the method comprising:
[0011] At least one second audio is extracted from a first audio; the first audio represents a sound emitted by at least two monitored objects; each of the at least one second audio corresponds to a sound emitted by one of the at least two monitored objects.
[0012] Each of the at least one second audio and its corresponding second video is input into a first set model to obtain a first behavioral feature corresponding to each of the at least two monitored objects.
[0013] The first behavioral feature corresponding to each of the at least two monitored objects is matched with a first predetermined behavioral feature to obtain a first behavioral monitoring result; wherein,
[0014] The second video represents a video that has been captured of the corresponding monitored object.
[0015] In the above scheme, matching the first behavioral feature corresponding to each of the at least two monitored objects with the first predetermined behavioral feature includes:
[0016] The first behavioral feature is determined to match the first defined behavioral feature if at least one of the following conditions is met:
[0017] The first spectrogram contains speech signals with amplitudes greater than a set threshold; wherein, the first spectrogram is the spectrogram of the second audio corresponding to the monitoring object corresponding to the first behavioral feature;
[0018] Based on the second video corresponding to the monitoring object corresponding to the first behavioral feature, it is determined that the corresponding monitoring object has performed the set behavior.
[0019] The method in the above scheme further includes:
[0020] If the first behavior monitoring result indicates that the first behavior feature of the monitored object matches the first set behavior feature, the corresponding second audio and the corresponding second video are input into the second set model to obtain the second behavior feature of the monitored object.
[0021] The obtained second behavioral features are matched with the second set behavioral features to obtain the second behavioral monitoring results for the corresponding monitoring object; wherein,
[0022] The second set behavioral characteristics characterize the abnormal behavior of the monitored object.
[0023] In the above scheme, matching the obtained second behavioral feature with the second set behavioral feature includes:
[0024] The second behavioral feature is determined to match the second defined behavioral feature if the obtained second behavioral feature satisfies at least one of the following conditions:
[0025] In the second spectrogram, the time interval between occurrences of speech signals with amplitudes greater than a set threshold is less than a set time interval; wherein, the second spectrogram is the spectrogram of the second audio corresponding to the monitoring object corresponding to the second behavioral feature;
[0026] The duration of a speech signal with an amplitude greater than a set threshold in the second spectrogram is longer than the set duration.
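The two conditions above can be illustrated with a minimal sketch. It assumes the second spectrogram has already been reduced to a per-frame amplitude envelope. The function names `find_bursts` and `matches_second_feature`, the frame duration, and all threshold values are hypothetical and are not taken from the application.

```python
from typing import List, Tuple


def find_bursts(envelope: List[float], threshold: float) -> List[Tuple[int, int]]:
    """Return (start, end) frame indices, end exclusive, of runs above threshold."""
    bursts = []
    start = None
    for i, amp in enumerate(envelope):
        if amp > threshold and start is None:
            start = i  # a run above the threshold begins here
        elif amp <= threshold and start is not None:
            bursts.append((start, i))  # the run ended on the previous frame
            start = None
    if start is not None:
        bursts.append((start, len(envelope)))  # run continues to the last frame
    return bursts


def matches_second_feature(
    envelope: List[float],
    threshold: float,
    frame_s: float,         # assumed duration of one frame, in seconds
    max_gap_s: float,       # the "set time interval" (hypothetical value)
    min_duration_s: float,  # the "set duration" (hypothetical value)
) -> bool:
    bursts = find_bursts(envelope, threshold)
    # Condition 1: two consecutive loud signals are closer than the set interval.
    short_gap = any(
        (next_start - prev_end) * frame_s < max_gap_s
        for (_, prev_end), (next_start, _) in zip(bursts, bursts[1:])
    )
    # Condition 2: a single loud signal lasts longer than the set duration.
    long_burst = any((end - start) * frame_s > min_duration_s for start, end in bursts)
    # "At least one of the following conditions" is a logical OR.
    return short_gap or long_burst
```

For example, `matches_second_feature([0.01, 0.2, 0.2, 0.01, 0.01, 0.2, 0.01], 0.1, 0.5, 2.0, 10.0)` is satisfied through the first condition, because the two loud runs are one second apart.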
[0027] The method in the above scheme further includes:
[0028] If the second behavior monitoring result indicates that the second behavior feature matches the second set behavior feature, the monitoring object corresponding to the second behavior feature is determined based on the audio encoding of the second audio of the monitoring object corresponding to the second behavior feature.
[0029] In the above scheme, after extracting at least one second audio from the first audio, the method further includes:
[0030] The monitoring object corresponding to the second audio is determined based on the audio encoding of the second audio.
[0031] Obtain the second video corresponding to the monitored object.
[0032] In the above scheme, before extracting at least one second audio from the first audio, the method further includes:
[0033] The sound emitted by each monitored object is input into a set voice encoder to obtain the audio code of the sound emitted by each monitored object;
[0034] Store the correspondence between each monitored object and the audio code of the emitted sound.
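As a rough illustration of storing and looking up the correspondence between monitored objects and audio codes, the sketch below keeps the codes in a dictionary and identifies an object by nearest match. The class name `AudioCodeRegistry`, the use of cosine similarity, and the 0.8 similarity floor are assumptions made for this example; the application does not specify how audio codes are compared. The three-element vectors in the usage note stand in for the output of the voice encoder.

```python
import math
from typing import Dict, List, Optional, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


class AudioCodeRegistry:
    """Hypothetical store mapping each monitored object to its audio code."""

    def __init__(self) -> None:
        self._codes: Dict[str, List[float]] = {}

    def enroll(self, object_id: str, audio_code: Sequence[float]) -> None:
        # Store the correspondence between the object and its audio code.
        self._codes[object_id] = list(audio_code)

    def identify(self, audio_code: Sequence[float], min_similarity: float = 0.8) -> Optional[str]:
        # Return the enrolled object whose code is most similar to the query,
        # or None when no stored code reaches the (assumed) similarity floor.
        best_id, best_score = None, min_similarity
        for object_id, stored in self._codes.items():
            score = cosine_similarity(audio_code, stored)
            if score >= best_score:
                best_id, best_score = object_id, score
        return best_id
```

After `registry.enroll("pig_1", [1.0, 0.0, 0.0])` and `registry.enroll("pig_2", [0.0, 1.0, 0.0])`, a query code close to the first vector resolves to `"pig_1"`, while a code unlike any stored one resolves to `None`.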
[0035] This application embodiment also provides a model training method, the method being used to train a first predetermined model in any of the above-described behavior monitoring methods, the method comprising:
[0036] Acquire audio and video samples of the monitored object; the audio sample represents the sound emitted by the monitored object; the video sample represents a video of the monitored object captured simultaneously with the audio sample.
[0037] The audio features corresponding to the audio sample and the video sample are input into a first set model to obtain a first output result; the first output result represents the first behavioral feature corresponding to the monitored object.
[0038] The loss value is calculated based on the first output result, and the weight parameters of the first set model are updated based on the loss value; wherein,
[0039] The audio features corresponding to the audio sample include cross-correlation matrix features, which represent the correlation coefficients between two adjacent frames in the spectrogram corresponding to the audio sample.
[0040] In the above scheme, the audio features corresponding to the audio samples also include at least one of the following:
[0041] The spectrogram corresponding to the audio sample;
[0042] The Mel frequency cepstral features corresponding to the audio samples;
[0043] The first-order difference feature corresponding to the audio sample;
[0044] The second-order difference features corresponding to the audio samples.
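One possible reading of the cross-correlation feature is sketched below: a Pearson correlation coefficient computed between each pair of adjacent spectrogram frames. The choice of Pearson correlation and the function names are assumptions for illustration; the application only states that the feature represents correlation coefficients between two adjacent frames.

```python
import math
from typing import List, Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0.0 or var_y == 0.0:
        return 0.0  # a constant frame has no defined correlation; use 0.0 here
    return cov / math.sqrt(var_x * var_y)


def adjacent_frame_correlation(spectrogram: List[List[float]]) -> List[float]:
    """spectrogram[t] holds the magnitudes of frame t across frequency bins.

    Returns one coefficient per adjacent frame pair (t, t + 1).
    """
    return [
        pearson(spectrogram[t], spectrogram[t + 1])
        for t in range(len(spectrogram) - 1)
    ]
```

A spectrogram with T frames yields T - 1 coefficients. Values near 1 indicate that the spectral shape is stable from one frame to the next, and values near -1 indicate an abrupt inversion of that shape.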
[0045] This application embodiment also provides a behavior monitoring device, the device comprising:
[0046] An extraction unit is configured to extract at least one second audio from a first audio; the first audio represents a sound emitted by at least two monitored objects; each of the at least one second audio corresponds to a sound emitted by one of the at least two monitored objects.
[0047] An input unit is used to input each of the at least one second audio and the corresponding second video into a first set model to obtain a first behavioral feature corresponding to each of the at least two monitoring objects;
[0048] A matching unit is configured to match the first behavioral feature corresponding to each of the at least two monitored objects with a first predetermined behavioral feature to obtain a first behavioral monitoring result; wherein,
[0049] The second video represents a video that has been captured of the corresponding monitored object.
[0050] This application embodiment also provides a model training apparatus, the apparatus comprising:
[0051] The acquisition unit is used to acquire audio samples and video samples of the monitored object; the audio sample represents the sound emitted by the monitored object; the video sample represents a video of the monitored object captured simultaneously with the audio sample.
[0052] An input unit is used to input the audio features corresponding to the audio sample and the video sample into a first set model to obtain a first output result; the first output result represents the first behavioral feature corresponding to the monitored object.
[0053] A calculation unit is configured to calculate a loss value based on the first output result, and update the weight parameters of the first predetermined model based on the loss value; wherein,
[0054] The audio features corresponding to the audio sample include cross-correlation matrix features, which represent the correlation coefficients between two adjacent frames in the spectrogram corresponding to the audio sample.
[0055] This application also provides an electronic device, including: a processor and a memory for storing a computer program capable of running on the processor, wherein,
[0056] When the processor is used to run the computer program, it performs the steps of any of the above methods.
[0057] This application also provides a storage medium having a computer program stored thereon, which, when executed by a processor, implements the steps of any of the above methods.
[0058] In this embodiment, at least one second audio is extracted from a first audio. The first audio represents the sound emitted by at least two monitored objects, and each of the at least one second audio represents the sound emitted by one of the at least two monitored objects. Each of the at least one second audio and its corresponding second video is input into a first set model to obtain a first behavioral feature corresponding to each of the at least two monitored objects. The first behavioral feature corresponding to each of the at least two monitored objects is matched with the first set behavioral feature to obtain a first behavioral monitoring result. The second video represents a video of the corresponding monitored object. In this way, the individual audio of each monitored object can be extracted from the mixed audio composed of the sounds of multiple monitored objects, and combined with the video corresponding to each monitored object for behavioral monitoring. This multimodal approach improves the accuracy of behavioral monitoring. Furthermore, since the audio of a single monitored object can be extracted from the mixed audio for behavioral monitoring, when abnormal behavior is detected, the monitored object with abnormal behavior can be accurately and quickly located, thereby improving the efficiency of locating the monitored object.
Attached Figure Description
[0059] Figure 1 is a schematic diagram illustrating the implementation process of the behavior monitoring method provided in this application embodiment;
[0060] Figure 2 is a schematic diagram illustrating the training process of the speech separation model provided in this application embodiment;
[0061] Figure 3 is a schematic diagram of a spectrogram provided in an embodiment of this application;
[0062] Figure 4 is a schematic diagram illustrating behavior monitoring using the second model provided in this application embodiment;
[0063] Figure 5 is a schematic diagram illustrating the implementation flow of the behavior monitoring method provided in an application embodiment of this application;
[0064] Figure 6 is a schematic diagram illustrating the implementation process of the model training method provided in this application embodiment;
[0065] Figure 7 is a schematic diagram of audio data processing provided in an embodiment of this application;
[0066] Figure 8 is a schematic diagram illustrating the training of video sample feature extraction according to an embodiment of this application;
[0067] Figure 9 is a schematic diagram illustrating the training of audio features extracted from audio samples according to an embodiment of this application;
[0068] Figure 10 is a schematic diagram of the behavior monitoring device provided in the embodiments of this application;
[0069] Figure 11 is a schematic diagram of the model training apparatus provided in the embodiments of this application;
[0070] Figure 12 is a schematic diagram of the hardware structure of the electronic device according to an embodiment of this application.
Detailed Implementation
[0071] In related technologies, there are still technical problems with low accuracy and low efficiency in identifying pigs exhibiting abnormal vocal behavior.
[0072] Based on this, embodiments of this application provide a behavior monitoring method, apparatus, electronic device, and storage medium. At least one second audio is extracted from a first audio, wherein the first audio represents the sound emitted by at least two monitored objects, and each of the at least one second audio represents the sound emitted by one of the at least two monitored objects. Each of the at least one second audio and its corresponding second video is input into a first preset model to obtain a first behavioral feature corresponding to each of the at least two monitored objects. The first behavioral feature corresponding to each of the at least two monitored objects is matched with the first preset behavioral feature to obtain a first behavior monitoring result. The second video represents a video recording of the corresponding monitored object. Thus, individual audio of each monitored object can be extracted from a mixed audio composed of the sounds of multiple monitored objects, and combined with the video corresponding to each monitored object for behavior monitoring. This multimodal approach improves the accuracy of behavior monitoring. Furthermore, since the audio of a single monitored object can be extracted from the mixed audio for behavior monitoring, when abnormal behavior is detected, the monitored object exhibiting abnormal behavior can be accurately and quickly located, thereby improving the efficiency of locating the behavior monitoring object.
[0073] The present application will now be described in further detail with reference to the accompanying drawings and embodiments. For ease of understanding, the embodiments of this application will use pigs as an example to illustrate the behavior monitoring method provided by this application.
[0074] Figure 1 is a schematic diagram illustrating the implementation flow of the behavior monitoring method provided in an embodiment of this application. As shown in Figure 1, the method includes:
[0075] Step 101: Extract at least one second audio from the first audio; the first audio represents the sound emitted by at least two monitored objects; each of the at least one second audio represents the sound emitted by one of the at least two monitored objects.
[0076] Here, to achieve more accurate behavior monitoring, at least one second audio is first extracted from the first audio. The first audio represents the sound emitted by at least two monitored objects, and each second audio represents the sound emitted by one monitored object. When the monitored object is a pig, the first audio represents a mixed audio consisting of sounds emitted by multiple pigs, and the second audio represents the sound emitted by any one of the pigs. By extracting the audio of a single pig from the mixed audio, interference from the audio of other pigs can be eliminated, facilitating behavior monitoring of each individual pig.
[0077] In practical applications, at least one second audio signal can be extracted from a first audio signal using a speech separation model. The speech separation model consists of two parts: an audio encoder and an audio filter.
[0078] For the audio encoder, during the data acquisition phase, the pigs are placed in separate pens and audio is collected from each pig individually to improve the accuracy of model training. The log-Mel-Cepstral Energy (LME) features of each pig's audio are extracted and input into a three-layer Long Short-Term Memory (LSTM) model to obtain a 256-dimensional audio vector for each pig, referred to below as the pig-vector. This audio vector represents the timbre of each pig and can uniquely identify the pig.
[0079] For the audio filter, the mixed audio consisting of the sounds emitted by multiple pigs is used as the input to the speech separation model. Combined with the pig-vector corresponding to each pig, and using the audio of the pig corresponding to the pig-vector as the label, a time-domain and frequency-domain filtering network is trained. That is, the input is the pig-vector of a single pig and the mixed audio consisting of the sounds emitted by multiple pigs. After training, the filtering network will remove the interference audio of other pigs and output the audio of the single pig corresponding to the pig-vector.
[0080] For ease of understanding, the audio separated by the speech separation model is used as the audio of the target pig.
[0081] Figure 2 is a schematic diagram illustrating the training process of the speech separation model provided in the embodiments of this application. As shown in Figure 2:
[0082] First, the audio of the target pig is input into a three-layer LSTM model to obtain the pig-vector corresponding to the target pig.
[0083] A short-time Fourier transform (STFT) is performed on the noisy mixed audio composed of sounds emitted by multiple pigs to obtain the spectrogram corresponding to the noisy mixed audio. The amplitude spectrum of the spectrogram and the pig-vector of the target pig are input into a filtering network, and the filtering network outputs soft mask features.
[0084] The soft mask features are multiplied by the spectrogram corresponding to the noisy mixed audio to obtain the enhanced amplitude spectrum. The original amplitude spectrum of the spectrogram corresponding to the noisy mixed audio is then merged with the enhanced amplitude spectrum to obtain the spectrogram mask.
[0085] The enhanced audio is obtained by performing an inverse STFT on the spectrogram mask.
[0086] The audio of the target pig is denoised to obtain clean audio. An STFT is performed on the clean audio of the target pig to obtain the corresponding spectrogram. The loss value is computed as the difference between the amplitude spectrum of the spectrogram mask and the spectrogram corresponding to the clean audio of the target pig. The parameters of the speech separation model are updated based on the loss value.
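The masking and loss steps can be illustrated with a simplified sketch. It multiplies a soft mask element by element with the amplitude spectrum of the noisy mixture and compares the result with the amplitude spectrum of the clean target audio. Two simplifications are assumed: the merge step is skipped, and mean squared error stands in for the "difference", which the application does not define further. The function names and the small matrices are hypothetical.

```python
from typing import List

Matrix = List[List[float]]  # rows are time frames, columns are frequency bins


def apply_soft_mask(mask: Matrix, noisy_magnitude: Matrix) -> Matrix:
    """Element-wise product of a soft mask (values in [0, 1]) and a magnitude spectrum."""
    return [
        [m * x for m, x in zip(mask_row, mag_row)]
        for mask_row, mag_row in zip(mask, noisy_magnitude)
    ]


def mean_squared_error(estimate: Matrix, target: Matrix) -> float:
    """Assumed loss: average squared difference over all time-frequency cells."""
    total = 0.0
    count = 0
    for est_row, tgt_row in zip(estimate, target):
        for e, t in zip(est_row, tgt_row):
            total += (e - t) ** 2
            count += 1
    return total / count


# Toy values: a mask value of 1.0 keeps a cell, 0.0 suppresses it entirely.
noisy = [[1.0, 2.0], [4.0, 0.5]]
mask = [[1.0, 0.5], [0.25, 0.0]]
clean = [[1.0, 1.0], [1.0, 0.0]]

enhanced = apply_soft_mask(mask, noisy)
loss = mean_squared_error(enhanced, clean)
```

In this toy case the mask recovers the clean amplitude spectrum exactly, so the loss is zero. During training the mask would come from the filtering network, and the loss would drive the parameter update.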
[0087] For example, the filtering network consists of 8 convolutional layers (CNN), 1 LSTM layer, and 2 fully connected layers (FC). Except for the last layer, the activation function of the remaining layers is the Rectified Linear Unit (ReLU), and the activation function of the last layer is the sigmoid function. In each layer, the pig-vector of the target pig is repeatedly concatenated with the output of the previous convolutional layer, and the concatenated value is used as the input of the next layer. The detailed parameters of each layer are shown in Table 1.
[0088]
[0089] Table 1
[0090] Where Width represents the width function, Dilation represents the dilation function, time represents the value in the time domain, freq represents the value in the frequency domain, and Filters / Nodes represent filters.
[0091] Step 102: Input each of the at least one second audio and the corresponding second video into the first set model to obtain the first behavioral feature corresponding to each of the at least two monitoring objects; wherein, the second video represents a video of the corresponding monitoring object.
[0092] Here, after extracting at least one second audio from the first audio, each second audio and its corresponding second video are input into a first set model. The second video represents a video of the corresponding monitored object. For example, if the second audio represents the sound made by pig 1, then the second video represents a video of pig 1.
[0093] It should be noted that the second audio and the second video are captured at the same time. In practical applications, video streams of multiple pigs can be captured with a terminal, and the corresponding audio and video for each pig can be extracted from these video streams.
[0094] After inputting the second audio and the corresponding second video into the first set model, the first behavioral feature corresponding to the monitored object is obtained.
[0095] When the monitoring subject is pigs, the first set model can be used to identify whether pigs have coughing behavior characteristics.
[0096] Step 103: Match the first behavioral feature corresponding to each of the at least two monitored objects with the first set behavioral feature to obtain the first behavioral monitoring result.
[0097] Here, after obtaining the first behavioral characteristics of each monitored object, the first behavioral characteristics of each monitored object are matched with the first set behavioral characteristics to obtain the first behavioral monitoring results.
[0098] In practical applications, the first defined behavioral characteristic represents the coughing behavior of pigs. The monitoring results of the first behavior can indicate whether the pigs corresponding to the first behavioral characteristic exhibit coughing behavior.
[0099] In one embodiment, matching the first behavioral feature corresponding to each of the at least two monitored objects with a first predetermined behavioral feature includes:
[0100] The first behavioral feature is determined to match the first defined behavioral feature if at least one of the following conditions is met:
[0101] The first spectrogram contains speech signals with amplitudes greater than a set threshold; wherein, the first spectrogram is the spectrogram of the second audio corresponding to the monitoring object corresponding to the first behavioral feature;
[0102] Based on the second video corresponding to the monitoring object corresponding to the first behavioral feature, it is determined that the corresponding monitoring object has performed the set behavior.
[0103] Here, the first defined behavioral characteristic represents the coughing behavior of the monitored subject. Coughing is a process caused by abdominal muscle contraction to generate subglottic pressure, resulting in a strong airflow impact in the vocal tract through repeated sudden opening of the glottis, accompanied by a typical sound. The amplitude of the speech signal change in the spectrogram of the audio corresponding to a pig's cough differs significantly from the amplitude of the speech signal change in the spectrogram of the audio corresponding to normal vocalization. When a pig vocalizes normally, the amplitude of the speech signal in the spectrogram is smaller; when a pig coughs, the amplitude of the speech signal in the spectrogram is larger. Therefore, the amplitude of the speech signal in the spectrogram can be used to determine whether coughing behavior has occurred. Specifically, if there is a speech signal in the spectrogram with an amplitude greater than a set threshold, the pig corresponding to the second audio is considered to have exhibited coughing behavior characteristics. Here, the set threshold represents the amplitude of the speech signal corresponding to the audio during normal vocalization.
[0104] Figure 3 is a schematic diagram of the spectrogram provided in the embodiments of this application. As shown in Figure 3:
[0105] In panel (a), the amplitude of the speech signal in time period I is significantly greater than that in time period II. If the threshold is set to 0.05, the amplitude of the speech signal in time period I is around 0.1, which is greater than the threshold of 0.05. This indicates that coughing behavior occurred in time period I, meaning that the first behavioral feature matches the first set behavioral feature in time period I. However, the amplitude of the speech signal in time period II is less than 0.05, indicating that no coughing behavior occurred in time period II, meaning that the first behavioral feature does not match the first set behavioral feature in time period II.
[0106] In panel (b), if the threshold is set to 0.1, the amplitude of the speech signal in time periods I and III is significantly greater than the threshold of 0.1. This indicates that coughing behavior occurred in time periods I and III, meaning that the first behavioral feature matches the first set behavioral feature in time periods I and III. However, the amplitude of the speech signal in time period II is less than 0.1, indicating that no coughing behavior occurred in time period II, meaning that the first behavioral feature does not match the first set behavioral feature in time period II.
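The per-period threshold comparison described for the spectrogram examples can be sketched as follows. The function name `periods_with_cough` and the sample amplitudes are hypothetical. The amplitudes are chosen only to mirror the first example, with values around 0.1 in time period I, values below 0.05 in time period II, and a threshold of 0.05.

```python
from typing import Dict, List


def periods_with_cough(period_amplitudes: Dict[str, List[float]], threshold: float) -> List[str]:
    """Return the names of time periods whose peak absolute amplitude exceeds the threshold."""
    matched = []
    for name, amplitudes in period_amplitudes.items():
        # An empty period contains no signal and therefore cannot match.
        if amplitudes and max(abs(a) for a in amplitudes) > threshold:
            matched.append(name)
    return matched


# Illustrative amplitudes only; they are not measurements from the application.
example = {
    "I": [0.02, 0.10, -0.09, 0.03],
    "II": [0.01, -0.03, 0.02, 0.00],
}
matched_periods = periods_with_cough(example, threshold=0.05)
```

Here only time period I is reported, which corresponds to the first behavioral feature matching the first set behavioral feature in that period.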
[0107] When pigs cough, it is usually accompanied by typical coughing behaviors, such as body shaking, back arching, and hind leg shaking. Therefore, based on the second video of the monitoring object corresponding to the first behavioral feature, it can be determined whether the corresponding monitoring object has exhibited a predetermined behavior, thus determining whether the first behavioral feature matches the first predetermined behavioral feature. The predetermined behavior can be body shaking, back arching, and hind leg shaking. If, based on the second video of the monitoring object corresponding to the first behavioral feature, it is determined that the corresponding monitoring object has exhibited the predetermined behavior, then the first behavioral feature matches the first predetermined behavioral feature.
[0108] In practical applications, after obtaining the second video corresponding to the second audio, the second video is first split into frames and valid images are extracted. Typically, the OpenCV library is used to process the second video frame by frame. A CNN extracts appearance features from each extracted frame, and an LSTM then learns temporal features across frames, producing a vector representation of the second video. When extracting valid images, frames showing the behaviors that accompany a pig's cough, such as body tremors, back arching, and hind leg shaking, are primarily selected as valid images.
[0109] By performing these two methods of judgment, it is possible to combine audio and video features to determine whether the first behavioral feature of the monitored object matches the first set behavioral feature, thereby improving the accuracy of the judgment results.
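A minimal sketch of combining the two judgments is given below. It assumes the audio branch has been reduced to a peak amplitude and the video branch to a set of detected behavior labels. The `FirstStageEvidence` type, the label strings, and the function name are hypothetical and serve only to show the "at least one condition" logic.

```python
from dataclasses import dataclass
from typing import FrozenSet

# Set behaviors named in the description, encoded as hypothetical label strings.
SET_BEHAVIORS: FrozenSet[str] = frozenset(
    {"body_shaking", "back_arching", "hind_leg_shaking"}
)


@dataclass(frozen=True)
class FirstStageEvidence:
    audio_peak_amplitude: float      # peak amplitude in the first spectrogram
    video_behaviors: FrozenSet[str]  # behavior labels detected in the second video


def matches_first_feature(evidence: FirstStageEvidence, amplitude_threshold: float) -> bool:
    # Audio condition: a signal with amplitude above the set threshold exists.
    audio_hit = evidence.audio_peak_amplitude > amplitude_threshold
    # Video condition: the monitored object performed at least one set behavior.
    video_hit = bool(evidence.video_behaviors & SET_BEHAVIORS)
    # The feature matches if at least one condition is met.
    return audio_hit or video_hit
```

Under this reading, a pig that is quiet on the audio channel but shows back arching in the video still matches, because either condition alone is sufficient.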
[0110] In one embodiment, the method further includes:
[0111] If the first behavior monitoring result indicates that the first behavior feature of the monitored object matches the first set behavior feature, the corresponding second audio and the corresponding second video are input into the second set model to obtain the second behavior feature of the monitored object.
[0112] The obtained second behavioral features are matched with the second set behavioral features to obtain the second behavioral monitoring results for the corresponding monitoring object; wherein,
[0113] The second set behavioral characteristics characterize the abnormal behavior of the monitored object.
[0114] Here, if the first behavior monitoring result indicates that the first behavioral feature corresponding to the monitored object matches the first set behavioral feature, it means that the first set model has determined that the corresponding pig exhibits coughing behavior. However, the pig's coughing may be caused by choking while drinking water or playing, and is not necessarily due to illness. Therefore, to further determine whether the pig's coughing behavior is caused by illness, the corresponding second audio and video are input into the second set model to obtain the second behavioral feature of the monitored object. The second set model is used to determine whether the monitored object possesses the second set behavioral feature based on its voice and video features. The second set behavioral feature represents abnormal behavior of the corresponding monitored object. In practical applications, the second set behavioral feature represents the disease-related behavioral characteristics of the monitored object.
[0115] The output of the second set model is the second behavioral feature corresponding to the monitored object. After obtaining the second behavioral feature, the second behavioral feature is matched with the second set behavioral feature to obtain the second behavioral monitoring result of the monitored object.
[0116] Once the first behavior monitoring result indicates a match, the second audio and the corresponding second video are input into the second set model, and the resulting second behavioral feature is matched with the second set behavioral feature to obtain the second behavior monitoring result. This makes it possible to further determine whether the corresponding monitoring object has the second set behavioral feature, thereby improving the accuracy of the behavior monitoring of the monitoring object.
[0117] In one embodiment, matching the obtained second behavioral feature with the second set behavioral feature includes:
[0118] The second behavioral feature is determined to match the second set behavioral feature if the obtained second behavioral feature satisfies at least one of the following conditions:
[0119] In the second spectrogram, the time interval between occurrences of speech signals with amplitudes greater than a set threshold is less than a set time interval; wherein, the second spectrogram is the spectrogram of the second audio corresponding to the monitoring object corresponding to the second behavioral feature;
[0120] The duration of a speech signal with an amplitude greater than a set threshold in the second spectrogram is longer than the set duration.
[0121] Here, if the obtained second behavioral feature satisfies at least one of the following conditions, it is determined that the second behavioral feature matches the second set behavioral feature: specifically, the time interval between occurrences of speech signals with amplitudes greater than a set threshold in the second spectrogram is less than a set time interval; and / or, the duration of speech signals with amplitudes greater than the set threshold in the second spectrogram is greater than a set duration. Wherein, the second spectrogram is the spectrogram of the second audio corresponding to the monitoring object corresponding to the second behavioral feature.
[0122] Because the amplitude variation of the speech signal in the spectrogram corresponding to the audio of a pig vocalizing normally is relatively small, while the amplitude variation of the speech signal in the spectrogram corresponding to the audio of a pig coughing is relatively large, the speech signal with an amplitude greater than a set threshold in the spectrogram can be considered as indicating that the pig is exhibiting coughing behavior. Pigs exhibiting disease-related behavior typically cough continuously within a short period of time, and the duration of each cough is also relatively long. Therefore, if the time interval between speech signals with amplitudes greater than the set threshold in the second spectrogram is shorter than the set time interval, it indicates that the pig's coughing behavior is occurring frequently, indicating that the pig exhibits disease-related behavior. In this case, the second behavioral characteristic matches the second set behavioral characteristic. If the duration of speech signals with amplitudes greater than the set threshold in the second spectrogram is longer than the set duration, it indicates that the pig's single cough is longer, indicating that the pig exhibits disease-related behavior. In this case, the second behavioral characteristic matches the second set behavioral characteristic.
[0123] By judging the time interval and duration of voice signals with amplitudes greater than a set threshold, it is possible to determine whether the second behavioral feature matches the second set behavioral feature, thereby improving the accuracy of the judgment result.
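The two audio conditions above can be illustrated with a minimal sketch. It assumes the spectrogram has already been reduced to one amplitude value per frame, and that the "time interval between occurrences" is measured from the end of one above-threshold run to the start of the next. The function names, the per-frame envelope, and all parameter names are illustrative assumptions, not part of the application.

```python
from typing import List, Tuple


def loud_events(envelope: List[float], frame_s: float,
                threshold: float) -> List[Tuple[float, float]]:
    """Return (start_s, end_s) for each run of consecutive frames above threshold."""
    events = []
    start = None
    for i, amp in enumerate(envelope):
        if amp > threshold and start is None:
            start = i                      # a loud run begins
        elif amp <= threshold and start is not None:
            events.append((start * frame_s, i * frame_s))
            start = None                   # the loud run ends
    if start is not None:                  # run still open at the end of the audio
        events.append((start * frame_s, len(envelope) * frame_s))
    return events


def matches_second_feature(envelope: List[float], frame_s: float, threshold: float,
                           max_gap_s: float, min_duration_s: float) -> bool:
    """True if loud events recur quickly OR any single loud event lasts long."""
    events = loud_events(envelope, frame_s, threshold)
    gaps = [nxt[0] - cur[1] for cur, nxt in zip(events, events[1:])]
    frequent = any(gap < max_gap_s for gap in gaps)
    prolonged = any(end - start > min_duration_s for start, end in events)
    return frequent or prolonged
```

Either condition alone is sufficient, which mirrors the "at least one of the following conditions" wording; the set threshold, set time interval, and set duration would have to be chosen empirically.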
[0124] In one embodiment, matching the obtained second behavioral feature with the second set behavioral feature further includes:
[0125] Based on the second video corresponding to the monitoring object corresponding to the second behavioral feature, it is determined that the corresponding monitoring object has performed a first set behavior, and the second behavioral feature is determined to match the second set behavioral feature.
[0126] Here, pigs with disease-related behavioral features display typical behaviors when coughing, such as open-mouth panting, nasal and oral drooling, a dog-sitting posture, and abdominal breathing. Therefore, based on the second video of the monitoring object corresponding to the second behavioral feature, it can be determined whether the monitoring object has exhibited the first set behavior, and hence whether the second behavioral feature matches the second set behavioral feature. The first set behavior can be one or more of open-mouth panting, nasal and oral drooling, a dog-sitting posture, and abdominal breathing. If, based on that second video, it is determined that the monitoring object has exhibited the first set behavior, then the second behavioral feature matches the second set behavioral feature.
[0127] It should be noted that the feature extraction process applied to the second video in the second set model in this embodiment is the same as that applied to the second video in the first set model described above. The difference lies in the image features that the two models extract.
[0128] Since coughing also occurs in normal daily life as a result of choking while drinking water or playing, when OpenCV is used to extract features frame by frame from a video, images of coughing caused by choking while drinking water or playing are also extracted. This allows such coughing to be better distinguished from coughing caused by illness.
[0129] By using whether the first set behavior occurs in the second video to determine whether the second behavioral feature matches the second set behavioral feature, the behavior monitoring result of the monitoring object can be determined more accurately from the video perspective.
[0130] Figure 4 is a schematic diagram of behavior monitoring by the second set model provided in an embodiment of this application. As shown in Figure 4:
[0131] The video of the pigs is segmented into frames, and features are extracted from each frame using CNN and LSTM to obtain the video output. The cough interval and single cough duration features are extracted from the audio, and the audio output is obtained through two fully connected (FC) layers. The video and audio outputs are concatenated, and the final behavioral monitoring result is obtained through two FC layers and one softmax layer to determine whether the pigs exhibit disease-related behavioral characteristics.
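The fusion step described above (concatenate the video and audio outputs, then two FC layers and one softmax layer) can be sketched as a plain forward pass. This is only an illustration under assumptions: the layer sizes, the ReLU activation between the two FC layers, and the random weights are not specified in the application, and the names `dense`, `random_layer`, and `fuse_and_classify` are hypothetical.

```python
import math
import random
from typing import List, Tuple

Vector = List[float]
Matrix = List[List[float]]
Layer = Tuple[Matrix, Vector]  # (weights with one row per output unit, bias)


def dense(x: Vector, weights: Matrix, bias: Vector) -> Vector:
    """Fully connected layer: one weighted sum plus bias per output unit."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]


def relu(x: Vector) -> Vector:
    return [max(0.0, v) for v in x]


def softmax(x: Vector) -> Vector:
    peak = max(x)                                 # subtract the max for stability
    exps = [math.exp(v - peak) for v in x]
    total = sum(exps)
    return [e / total for e in exps]


def random_layer(n_in: int, n_out: int, rng: random.Random) -> Layer:
    """Untrained placeholder weights, for illustration only."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


def fuse_and_classify(video_vec: Vector, audio_vec: Vector,
                      fc1: Layer, fc2: Layer) -> Vector:
    """Concatenate both modalities, apply two FC layers, then softmax."""
    fused = video_vec + audio_vec                 # list concatenation
    hidden = relu(dense(fused, *fc1))             # ReLU is an assumption
    return softmax(dense(hidden, *fc2))
```

With a two-unit final layer, the two output probabilities can be read as "disease-related behavior" versus "no disease-related behavior"; which index means which is a labeling convention fixed at training time.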
[0132] In one embodiment, the method further includes:
[0133] If the second behavior monitoring result indicates that the second behavior feature matches the second set behavior feature, the monitoring object corresponding to the second behavior feature is determined based on the audio encoding of the second audio of the monitoring object corresponding to the second behavior feature.
[0134] Here, if the second behavioral feature matches the second set behavioral feature, since the second set behavioral feature represents abnormal behavioral characteristics, it indicates that the monitoring object corresponding to the second behavioral feature has abnormal behavioral characteristics. In this case, it is necessary to determine the monitoring object with abnormal behavioral characteristics. Specifically, the monitoring object is determined based on the audio code of the second audio of the monitoring object corresponding to the second behavioral feature. Since each monitoring object corresponds to a unique audio code, and the audio code can uniquely identify the monitoring object, that is, there is a one-to-one correspondence between the audio code and the monitoring object, when determining the audio code of the second audio, the monitoring object corresponding to the audio code can be determined according to the correspondence between the audio code and the monitoring object.
[0135] When the second behavior monitoring result indicates abnormal behavioral features, the corresponding monitoring object is determined from the audio code of its second audio. This allows monitoring objects with abnormal behavioral features to be identified accurately, thereby improving the efficiency of finding such objects.
[0136] In one embodiment, after extracting at least one second audio from the first audio, the method further includes:
[0137] The monitoring object corresponding to the second audio is determined based on the audio encoding of the second audio.
[0138] Obtain the second video corresponding to the monitored object.
[0139] Here, after extracting at least one second audio from the first audio, the behavior monitoring method further includes determining the monitoring object corresponding to the second audio based on the audio encoding of the second audio. Since the audio encoding of the second audio can uniquely identify the corresponding monitoring object, the monitoring object corresponding to the second audio can be determined based on the audio encoding of the second audio. After determining the monitoring object, the second video corresponding to the monitoring object is obtained. The second video and the second audio were acquired at the same time.
[0140] By extracting the second audio and then determining the corresponding second video based on the audio encoding of the second audio, it is convenient to monitor the behavior of the monitored object based on the audio and video of the same monitored object, thereby improving the accuracy of behavior monitoring.
[0141] In one embodiment, before extracting at least one second audio from the first audio, the method further includes:
[0142] The sound emitted by each monitored object is input into a set voice encoder to obtain the audio code of the sound emitted by each monitored object;
[0143] Store the correspondence between each monitored object and the audio code of the emitted sound.
[0144] Here, to improve the training accuracy of the speech separation model, the sound emitted by each monitored object is collected individually. To better locate the specific monitored object based on its sound, the sound emitted by each monitored object is input into a designated speech encoder to obtain the audio code of the sound emitted by each monitored object. In practical applications, the designated speech encoder can be a three-layer LSTM model.
[0145] After obtaining the audio code of the sound emitted by each monitored object, the correspondence between each monitored object and the audio code of the emitted sound is stored. In practical applications, each monitored object is numbered; therefore, the correspondence between the number of the monitored object and the audio code of the emitted sound can be stored.
[0146] By obtaining the audio code of the sound emitted by each monitored object and storing the correspondence between the monitored object and the audio code, it is convenient to accurately determine the corresponding monitored object based on the audio code, thereby improving the efficiency and accuracy of the monitoring object determination.
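A minimal sketch of storing and using the correspondence between monitored-object numbers and audio codes is given below. The application only states that the correspondence is one-to-one; matching a query code to the stored code with the highest cosine similarity is an assumption made here for illustration, and the class name `AudioCodeRegistry` is hypothetical.

```python
import math
from typing import Dict, List


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity; defined as 0.0 when either vector has zero norm."""
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return sum(x * y for x, y in zip(a, b)) / norm if norm > 0 else 0.0


class AudioCodeRegistry:
    """Stores one audio code per numbered monitored object."""

    def __init__(self) -> None:
        self._codes: Dict[int, List[float]] = {}

    def register(self, pig_number: int, audio_code: List[float]) -> None:
        self._codes[pig_number] = list(audio_code)

    def locate(self, audio_code: List[float]) -> int:
        """Return the number whose stored code is most similar to audio_code."""
        if not self._codes:
            raise ValueError("no audio codes have been registered")
        return max(self._codes, key=lambda n: cosine(self._codes[n], audio_code))
```

In practice the audio code would come from the set speech encoder (a three-layer LSTM in the example above); here it is treated as an opaque list of floats.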
[0147] Figure 5 is a schematic diagram illustrating the implementation flow of the behavior monitoring method provided in an application embodiment of this application. As shown in Figure 5:
[0148] The audio vector corresponding to the target pig (pig-vector), together with the mixed audio of the sounds emitted by multiple pigs, is input into a speech separation model. The speech separation model extracts the audio of the target pig from the mixed audio. The audio and video of the target pig are input into the first set model to obtain the output results of the audio part and the video part. These two output results are concatenated and passed through an FC layer and a softmax layer to obtain a judgment result on whether the target pig exhibits coughing behavior. If the judgment result indicates that the target pig exhibits coughing behavior, the coughing time interval and the duration of a single cough are extracted from the target pig's audio and used as the input of the audio part model, and the video of the target pig's behavior is used as the input of the video part model. The output results of the audio part model and the video part model are concatenated and passed through an FC layer and a softmax layer to obtain the final judgment result. If the final judgment result indicates that the pig exhibits disease-related behavior, that pig is accurately located based on the audio number corresponding to the target pig's audio.
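The control flow of this two-stage process can be summarized in a short sketch. The four callables stand in for the speech separation model, the first set model, the second set model, and the lookup from audio code to pig number; their names and signatures are hypothetical placeholders rather than interfaces defined in the application.

```python
from typing import Any, Callable, Optional


def monitor_pig(pig_vector: Any, mixed_audio: Any, video: Any,
                separate: Callable[[Any, Any], Any],
                has_cough: Callable[[Any, Any], bool],
                has_disease_behavior: Callable[[Any, Any], bool],
                locate: Callable[[Any], int]) -> Optional[int]:
    """Return the number of the pig if disease-related behavior is found, else None."""
    # Stage 0: extract the target pig's audio from the mixed audio.
    target_audio = separate(mixed_audio, pig_vector)
    # Stage 1: cough / no cough, from audio and video together.
    if not has_cough(target_audio, video):
        return None
    # Stage 2: disease-related cough or not (e.g. choking while drinking).
    if not has_disease_behavior(target_audio, video):
        return None
    # Only now is the individual pig looked up from its audio code.
    return locate(pig_vector)
```

Running the second stage only after a positive first stage keeps the more specific disease check off the bulk of ordinary vocalizations.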
[0149] In this embodiment, at least one second audio is extracted from a first audio. The first audio represents the sound emitted by at least two monitored objects, and each of the at least one second audio represents the sound emitted by one of the at least two monitored objects. Each of the at least one second audio and its corresponding second video is input into a first set model to obtain a first behavioral feature corresponding to each of the at least two monitored objects. The first behavioral feature corresponding to each of the at least two monitored objects is matched with the first set behavioral feature to obtain a first behavioral monitoring result. The second video represents a video of the corresponding monitored object. In this way, the individual audio of each monitored object can be extracted from the mixed audio composed of the sounds of multiple monitored objects, and combined with the video corresponding to each monitored object for behavioral monitoring. This multimodal approach improves the accuracy of behavioral monitoring. Furthermore, since the audio of a single monitored object can be extracted from the mixed audio for behavioral monitoring, when abnormal behavior is detected, the monitored object with abnormal behavior can be accurately and quickly located, thereby improving the efficiency of locating the monitored object.
[0150] This application also provides a model training method. Figure 6 is a schematic diagram illustrating the implementation flow of the model training method provided in an embodiment of this application. As shown in Figure 6, the method includes:
[0151] Step 601: Obtain audio and video samples of the monitored object; the audio sample represents the sound emitted by the monitored object; the video sample represents a video of the monitored object captured simultaneously with the audio sample.
[0152] Here, audio and video samples of the monitored object are first obtained. The audio sample represents the sound emitted by the monitored object, and the video sample represents a video of the monitored object captured at the same time as the audio sample.
[0153] This application uses pigs as an example to illustrate the model training method.
[0154] For example, the monitoring subjects can be pigs aged five and a half months, including those exhibiting disease behavior, with each pig having an average weight of 60 kg. Thirty pigs can be used as monitoring subjects, each with a unique corresponding number.
[0155] Audio and video samples can be collected during seasonal transitions such as late winter and early summer, as these are typically the peak seasons for disease-related behaviors in pigs.
[0156] The pigsty is 27.5m long, 13.7m wide, and 3.2m high. It contains 30 enclosures, averaging one pig per enclosure, for a total of 30 pigs. Each enclosure is surrounded by a 1.1m high iron fence.
[0157] The recording equipment consisted of microphones. One microphone with a frequency range of 100Hz-16kHz was placed in each pigpen within the pigsty. These microphones were connected to the laptop's sound card, and recording was performed using recording software on the laptop. The microphones were fixed at a height of 1.4m above the ground and approximately 0.8m from the backs of the pigs. The laptop's sound card had a sampling rate of 44.1kHz and a resolution of 16 bits.
[0158] The video recording equipment is a bullet camera. Mixed audio and video data are collected from pigs housed together in multiple pens. For example, with 5 pigs per pen, the 30 pigs are divided into 6 pens, and mixed audio and video data are collected from the pigs in these 6 pens. The collection period can be 3 days.
[0159] To train the initial model, audio data for each pig needs to be acquired individually. Specifically, this can be done by placing the 30 pigs in separate pens and collecting audio data from each pig individually over a period of 3 days. After the audio data for each pig is collected, a dataset S1 is generated from it. The audio data from the different pigs is then denoised to obtain clean audio data, and dataset S11 is generated from this clean audio data. The specific noise reduction method is spectral subtraction. Noise generally comes from two sources: environmental noise and noise generated by the recording equipment itself. Spectral subtraction assumes that the sound signal in a pigsty environment is a superposition of a clean sound signal and a noise signal. Therefore, the average noise energy of the sound signal can be estimated from the silent part of the overall signal, and the stationary noise part is then removed from the sound signal to obtain a clean sound signal.
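The core of spectral subtraction as described above can be sketched on precomputed per-frame power spectra. The sketch assumes the silent frames are already known (passed as indices) and omits the short-time Fourier transform and the resynthesis back to a waveform; the function names and the zero floor are illustrative assumptions.

```python
from typing import List

Frame = List[float]  # power spectrum of one frame, one value per frequency bin


def estimate_noise(frames: List[Frame], silent_indices: List[int]) -> Frame:
    """Average the power spectra of the silent frames, bin by bin."""
    if not silent_indices:
        raise ValueError("at least one silent frame is needed to estimate noise")
    n_bins = len(frames[0])
    return [sum(frames[i][k] for i in silent_indices) / len(silent_indices)
            for k in range(n_bins)]


def spectral_subtract(frames: List[Frame], noise: Frame,
                      floor: float = 0.0) -> List[Frame]:
    """Subtract the noise estimate from every frame, clamping at `floor`."""
    return [[max(p - nz, floor) for p, nz in zip(frame, noise)] for frame in frames]
```

Clamping at a floor avoids negative power values where the noise estimate exceeds the observed energy; only stationary noise is removed by this method, consistent with the description above.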
[0160] To better train the initial model, the audio data of the five pigs in the same pen were selected from dataset S1 and fused. Since there were six pens in total and audio data was collected over three days, this resulted in six 72-hour audio segments. These six 72-hour audio segments were then divided into 15-second segments, and the segments containing sound were selected. Based on these segments, a mixed audio dataset S2 was generated. Each segment in S2 has a corresponding training label in S11. For dataset S11, the coughing and non-coughing components were manually separated, and each component was further divided into 15-second segments, which were saved as the cough dataset S12 and the non-cough dataset S13. It should be noted that S12 and S13 also contain other abnormal sounds; here, only coughing sounds are distinguished. All datasets are associated with pig numbers and collection times. Videos of the corresponding durations are extracted as video samples based on the audio collection times.
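Cutting a long recording into 15-second segments and keeping only the segments that contain sound might look like the sketch below. The application does not say how "with sound" is decided; a root-mean-square (RMS) energy threshold is assumed here, as is discarding a trailing segment shorter than 15 seconds. All names and the default threshold are hypothetical.

```python
import math
from typing import List


def split_segments(samples: List[float], sample_rate: int,
                   segment_s: float = 15.0) -> List[List[float]]:
    """Split into consecutive fixed-length segments; a short tail is dropped."""
    size = int(sample_rate * segment_s)
    return [samples[i:i + size] for i in range(0, len(samples) - size + 1, size)]


def rms(segment: List[float]) -> float:
    """Root-mean-square amplitude of one segment."""
    return math.sqrt(sum(x * x for x in segment) / len(segment))


def segments_with_sound(samples: List[float], sample_rate: int,
                        segment_s: float = 15.0,
                        rms_threshold: float = 0.01) -> List[List[float]]:
    """Keep only the segments whose RMS amplitude exceeds the threshold."""
    return [seg for seg in split_segments(samples, sample_rate, segment_s)
            if rms(seg) > rms_threshold]
```

At the 44.1 kHz sampling rate mentioned above, each 15-second segment would hold 661,500 samples.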
[0161] Figure 7 is a schematic diagram of audio data processing provided in an embodiment of this application. As shown in Figure 7:
[0162] The audio of the target pig is input into the set speech encoder to generate the corresponding pig-vector. The audio of the target pig is denoised to obtain the clean audio of the target pig. The clean audio is used as the training label of the speech separation model. The clean audio of the target pig and the noisy audio of other pigs are mixed together to form a mixed audio dataset. The mixed audio dataset is input into the speech separation model for training.
[0163] Step 602: Input the audio features corresponding to the audio sample and the video sample into the first set model to obtain a first output result; the first output result represents the first behavioral feature corresponding to the monitored object. The audio features corresponding to the audio sample include cross-correlation matrix features, which represent the correlation coefficients between two adjacent frames in the spectrogram corresponding to the audio sample.
[0164] Here, the audio features corresponding to the audio samples and the video samples are input into the first set model to obtain the first output result, which represents the first behavioral feature corresponding to the monitored object.
[0165] Specifically, the speech features of the cough dataset S12 and the non-cough dataset S13 are extracted respectively. The speech features corresponding to S12 and the videos corresponding to the cough dataset are input into the first set model, and the model is trained with coughing as the label to obtain a first output result; the speech features corresponding to S13 and the videos corresponding to the non-cough dataset are input into the first set model, and the model is trained with non-coughing as the label to obtain a first output result. The audio features also include cross-correlation matrix features, which represent the correlation coefficients between adjacent frames in the spectrogram of the audio sample. Since the amplitude variation of the speech signal in the spectrogram of a pig's cough differs significantly from that in the spectrogram of normal vocalization, similarity features between adjacent frames in the spectrogram of the audio sample are extracted for analysis.
[0166] After obtaining the spectral energy corresponding to the audio sample, the entire frequency domain is divided into M equal frequency bands on the Mel scale. There is overlap between the frequency bands. Specifically, the center frequency of the previous frequency band is the starting frequency of the next frequency band. The cross-correlation coefficients of corresponding frequency bands of two adjacent frames in the spectrogram are calculated, and the cross-correlation coefficients of the obtained M frequency bands are used as the dynamic characteristics of the input signal of one frame.
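The overlapping band layout described above, in which the center frequency of one band is the starting frequency of the next, can be sketched as follows. The conversion mel = 2595 * log10(1 + f / 700) is a commonly used Mel formula and is an assumption here, since the application does not state which variant is used; the function names are also hypothetical.

```python
import math
from typing import List, Tuple


def hz_to_mel(f_hz: float) -> float:
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)


def mel_to_hz(mel: float) -> float:
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def overlapping_mel_bands(f_min: float, f_max: float,
                          n_bands: int) -> List[Tuple[float, float, float]]:
    """Return (start_hz, center_hz, end_hz) for each of n_bands bands.

    n_bands + 2 points are spaced equally on the Mel scale; band m spans
    points m..m+2, so its center (point m+1) is the start of band m+1.
    """
    lo, hi = hz_to_mel(f_min), hz_to_mel(f_max)
    step = (hi - lo) / (n_bands + 1)
    points = [mel_to_hz(lo + i * step) for i in range(n_bands + 2)]
    return [(points[m], points[m + 1], points[m + 2]) for m in range(n_bands)]
```

For example, `overlapping_mel_bands(100.0, 16000.0, 8)` lays 8 bands over the 100 Hz to 16 kHz range of the microphones described above; each band has the same width on the Mel scale and overlaps half of its neighbor.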
[0167] Assuming s(n,k) represents the spectral energy corresponding to the k-th point after the Fast Fourier Transform (FFT) of the spectrogram of the n-th frame, the formula for calculating the cross-correlation coefficient cc(n,m) of the m-th frequency band of the n-th frame is:
[0168]
[0169] Where k_mi and k_mh are the start and end frequencies of the m-th frequency band after the FFT of the spectrogram of the n-th frame, respectively, N is the total number of frames, and M is the total number of frequency bands. An N × M cross-correlation matrix can be obtained through this calculation.
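The formula itself is not reproduced in this text, so the following sketch rests on an assumption: it uses a normalized cross-correlation between frame n and frame n-1 within each band, that is, the sum over k in the band of s(n,k)·s(n-1,k), divided by the square root of the product of the two frames' summed squared energies in that band. Setting the first row to zeros (the first frame has no predecessor) is likewise an assumed convention, and the function name is hypothetical.

```python
import math
from typing import List, Tuple


def cross_correlation_matrix(spec: List[List[float]],
                             bands: List[Tuple[int, int]]) -> List[List[float]]:
    """Build an N x M matrix of per-band correlations between adjacent frames.

    spec[n][k] is the spectral energy of FFT bin k in frame n.
    bands[m] is the (start_bin, end_bin) of band m, both inclusive.
    """
    if not spec:
        return []
    matrix = [[0.0] * len(bands)]          # frame 0 has no previous frame
    for n in range(1, len(spec)):
        row = []
        for k_start, k_end in bands:
            cur = spec[n][k_start:k_end + 1]
            prev = spec[n - 1][k_start:k_end + 1]
            num = sum(a * b for a, b in zip(cur, prev))
            den = math.sqrt(sum(a * a for a in cur) * sum(b * b for b in prev))
            row.append(num / den if den > 0 else 0.0)   # empty band energy -> 0
        matrix.append(row)
    return matrix
```

Under this assumed form, a value near 1 means the band's spectral shape barely changed between adjacent frames, while an abrupt event such as a cough onset would drive the value down.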
[0170] For the video sample part, OpenCV is used to segment the video samples into frames to extract valid images. For each image, CNN is used to extract appearance features, and then LSTM is used to learn temporal features, thereby realizing the video sample vector output.
[0171] Figure 8 is a schematic diagram of the training of video sample feature extraction provided in an embodiment of this application. As shown in Figure 8:
[0172] The video samples of pigs are input into the first set model. Based on the video samples, multiple frames of images of pigs are obtained. For each frame, the appearance features are extracted using CNN. Then, the results extracted by CNN are input into LSTM to learn temporal features. Finally, the results are output as image vectors.
[0173] Step 603: Calculate the loss value based on the first output result, and update the weight parameters of the first set model based on the loss value.
[0174] Here, the loss value between the first output result and the corresponding label is calculated, and the weight parameters of the first set model are updated based on the loss value. A large loss value means that the first set model fits poorly and that its output still has a large error. The weight parameters of the first set model therefore need to be updated based on the loss value so that the first output result is as close as possible to the label value; only then can the trained model discriminate well.
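The loss-then-update cycle of Step 603 can be illustrated on a single linear layer followed by softmax. The application does not name the loss function or the optimizer; cross-entropy loss and plain gradient descent are assumptions made here, and the function `train_step` and its learning rate are hypothetical.

```python
import math
from typing import List


def softmax(z: List[float]) -> List[float]:
    peak = max(z)
    exps = [math.exp(v - peak) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


def train_step(weights: List[List[float]], bias: List[float],
               x: List[float], label: int, lr: float = 0.1) -> float:
    """One gradient-descent step; returns the loss measured before the update."""
    logits = [sum(w * v for w, v in zip(row, x)) + b
              for row, b in zip(weights, bias)]
    probs = softmax(logits)
    loss = -math.log(probs[label])                 # cross-entropy for one sample
    for c in range(len(weights)):
        # Gradient of the loss w.r.t. logit c is (probability - one-hot target).
        grad = probs[c] - (1.0 if c == label else 0.0)
        for j in range(len(x)):
            weights[c][j] -= lr * grad * x[j]
        bias[c] -= lr * grad
    return loss
```

Repeating the step on the same sample makes the returned loss shrink, which is the "output as close as possible to the label value" behavior described above; a real training run would iterate over the whole labeled dataset instead.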
[0175] In one embodiment, the audio features corresponding to the audio sample further include at least one of the following:
[0176] The spectrogram corresponding to the audio sample;
[0177] The Mel frequency cepstral features corresponding to the audio samples;
[0178] The first-order difference feature corresponding to the audio sample;
[0179] The second-order difference features corresponding to the audio samples.
[0180] Here, in addition to the cross-correlation matrix features, the audio features of the audio samples also include at least one of the following: the spectrogram corresponding to the audio sample, the Mel frequency cepstral feature corresponding to the audio sample, the first-order difference feature corresponding to the audio sample, and the second-order difference feature corresponding to the audio sample.
[0181] Spectrograms are obtained by segmenting and windowing the original audio signal into multiple frames, performing a Fast Fourier Transform (FFT) on each frame to convert the time-domain signal to the frequency-domain signal, and then stacking the FFT-derived frequency-domain signals of each frame in time. Spectrograms effectively extract the time-domain and frequency-domain features of audio samples and display them as images. After obtaining the two-dimensional spectrogram of the audio sample, the spectrogram is saved as a 227*227*3 RGB color image.
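The framing, windowing, and per-frame transform steps above can be sketched with the standard library alone. A naive discrete Fourier transform is used instead of an FFT to keep the example dependency-free (the result is the same, only slower); the Hann window, the frame length, and the hop size are illustrative assumptions, and rendering the result as a 227*227*3 image is not shown.

```python
import cmath
import math
from typing import List


def hann(n: int) -> List[float]:
    """Symmetric Hann window of length n."""
    return [0.5 - 0.5 * math.cos(2 * math.pi * i / (n - 1)) for i in range(n)]


def power_spectrum(frame: List[float]) -> List[float]:
    """Squared DFT magnitude for bins 0..n/2 of one real-valued frame."""
    n = len(frame)
    out = []
    for k in range(n // 2 + 1):
        acc = sum(frame[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        out.append(abs(acc) ** 2)
    return out


def spectrogram(signal: List[float], frame_len: int = 32,
                hop: int = 16) -> List[List[float]]:
    """Frame the signal, window each frame, and stack the power spectra in time."""
    window = hann(frame_len)
    frames = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        frame = [s * w for s, w in zip(signal[start:start + frame_len], window)]
        frames.append(power_spectrum(frame))
    return frames
```

Each row of the result is one time step and each column one frequency bin, which is the time-frequency layout that the image representation then visualizes.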
[0182] Mel-frequency cepstral coefficients (MFCCs) represent the short-time power spectrum of a speech signal. They are obtained by applying a linear cosine transform to the logarithmic power spectrum of the speech signal on a nonlinear Mel frequency scale. MFCCs are primarily used to extract static features from audio samples and to reduce the dimensionality of the computation. MFCCs are generally derived through the following steps: pre-emphasis, framing, windowing, FFT, Mel filter bank, and Discrete Cosine Transform (DCT). Finally, the 2nd to 13th coefficients of the result are retained; these 12 coefficients constitute the MFCC feature.
[0183] First-order difference (delta) features, also known as differential coefficients, describe the dynamic characteristics of audio samples, that is, how the static coefficients change from frame to frame.
[0184] Second-order difference (delta-delta) features, also known as acceleration coefficients, describe the dynamic characteristics of audio samples at the next level, that is, how the first-order differences themselves change over time.
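First- and second-order difference features can be computed from a sequence of static coefficient vectors as sketched below. The sketch uses the common regression form, delta(t) = sum over n of n·(c(t+n) − c(t−n)) divided by 2·sum of n², with the edge frames repeated as padding; the window width of 2 and the function name are assumptions, since the application does not specify how the differences are computed.

```python
from typing import List


def delta(features: List[List[float]], width: int = 2) -> List[List[float]]:
    """Regression-based difference over time; features[t][i] is coefficient i at frame t."""
    n_frames = len(features)
    denom = 2 * sum(n * n for n in range(1, width + 1))

    def frame(t: int) -> List[float]:
        # Repeat the first / last frame when the window runs past the edges.
        return features[min(max(t, 0), n_frames - 1)]

    out = []
    for t in range(n_frames):
        row = []
        for i in range(len(features[t])):
            num = sum(n * (frame(t + n)[i] - frame(t - n)[i])
                      for n in range(1, width + 1))
            row.append(num / denom)
        out.append(row)
    return out
```

The second-order difference is obtained by applying the same function to its own output, for example `delta(delta(mfcc))` for a list of per-frame MFCC vectors `mfcc`.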
[0185] Figure 9 is a schematic diagram of training on the audio features extracted from audio samples according to an embodiment of this application. As shown in Figure 9:
[0186] Extract speech features from audio samples, such as spectrograms, MFCCs, first-order differences, second-order differences, and cross-correlation matrices, and input these speech features into the first set model.
[0187] For the spectrogram, a CRNN is used for training, combined with a fully connected (FC) layer, to obtain speech vectors. The cross-correlation matrix, MFCC, first-order difference, and second-order difference features are combined to obtain HFSs features. These HFSs features are then input into three FC layers for training to generate static and dynamic feature vectors. Finally, the speech vectors and the static and dynamic feature vectors are concatenated to obtain the speech output.
[0188] To implement the method of the embodiments of this application, the embodiments of this application also provide a behavior monitoring device. Figure 10 is a schematic diagram of the behavior monitoring device provided in an embodiment of this application. As shown in Figure 10, the device includes:
[0189] Extraction unit 1001 is configured to extract at least one second audio from a first audio; the first audio represents a sound emitted by at least two monitored objects; each of the at least one second audio corresponds to a sound emitted by one of the at least two monitored objects.
[0190] Input unit 1002 is used to input each of the at least one second audio and the corresponding second video into a first set model to obtain a first behavioral feature corresponding to each of the at least two monitoring objects;
[0191] Matching unit 1003 is used to match the first behavioral feature corresponding to each of the at least two monitored objects with a first preset behavioral feature to obtain a first behavioral monitoring result; wherein,
[0192] The second video represents a video that has been captured of the corresponding monitored object.
[0193] In one embodiment, the matching unit 1003 is further configured to determine that the first behavioral feature matches a first preset behavioral feature if the first behavioral feature satisfies at least one of the following conditions:
[0194] The first spectrogram contains speech signals with amplitudes greater than a set threshold; wherein, the first spectrogram is the spectrogram of the second audio corresponding to the monitoring object corresponding to the first behavioral feature;
[0195] Based on the second video corresponding to the monitoring object corresponding to the first behavioral feature, it is determined that the corresponding monitoring object has performed the set behavior.
[0196] In one embodiment, the device further includes: a second matching unit, configured to input the corresponding second audio and the corresponding second video into a second set model to obtain the second behavioral feature corresponding to the monitoring object when the first behavior monitoring result indicates that there is a first behavioral feature corresponding to the monitoring object that matches the first set behavioral feature;
[0197] The obtained second behavioral features are matched with the second set behavioral features to obtain the second behavioral monitoring results for the corresponding monitoring object; wherein,
[0198] The second set behavioral characteristics characterize the abnormal behavior of the monitored object.
[0199] In one embodiment, the second matching unit is further configured to determine that the second behavioral feature matches the second set behavioral feature if the obtained second behavioral feature satisfies at least one of the following conditions:
[0200] In the second spectrogram, the time interval between successive speech signals with amplitudes greater than a set threshold is less than a set time interval, where the second spectrogram is the spectrogram of the second audio of the monitored object to which the second behavioral feature corresponds;
[0201] In the second spectrogram, the duration of a speech signal with an amplitude greater than a set threshold is longer than a set duration.
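A minimal sketch of these two timing checks follows, assuming the spectrogram is a 2-D array (frequency bins by time frames) and that each frame covers a fixed `frame_seconds`. The helper names and the choice to treat a frame as "loud" when its peak amplitude exceeds the threshold are illustrative assumptions.

```python
import numpy as np

def loud_runs(spectrogram, amplitude_threshold):
    """Return (start, end) frame indices, end exclusive, of each run of loud frames."""
    # A frame counts as loud if its peak amplitude exceeds the threshold (assumption).
    active = np.max(np.abs(spectrogram), axis=0) > amplitude_threshold
    padded = np.concatenate(([False], active, [False])).astype(int)
    edges = np.diff(padded)
    starts = np.flatnonzero(edges == 1)
    ends = np.flatnonzero(edges == -1)
    return list(zip(starts.tolist(), ends.tolist()))

def second_condition_met(spectrogram, amplitude_threshold, frame_seconds,
                         set_interval_seconds, set_duration_seconds):
    """True if loud events recur within the set interval or one lasts longer than the set duration."""
    runs = loud_runs(spectrogram, amplitude_threshold)
    durations = [(end - start) * frame_seconds for start, end in runs]
    gaps = [(runs[i + 1][0] - runs[i][1]) * frame_seconds for i in range(len(runs) - 1)]
    frequent = any(gap < set_interval_seconds for gap in gaps)
    prolonged = any(duration > set_duration_seconds for duration in durations)
    return frequent or prolonged

# Toy spectrogram with one frequency bin: loud frames at indices 1-2 and 5.
toy = np.array([[0.0, 1.0, 1.0, 0.0, 0.0, 1.0, 0.0]])
```

With 0.1 s frames, the toy signal has loud runs of 0.2 s and 0.1 s separated by a 0.2 s gap, so the outcome depends on which set interval and set duration are chosen.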
[0202] In one embodiment, the device further includes a second determining unit, configured to: when the second behavior monitoring result indicates that the second behavioral feature matches the second set behavioral feature, determine the monitored object to which the second behavioral feature corresponds, based on the audio encoding of the second audio of that monitored object.
[0203] In one embodiment, the device further includes an acquisition unit, configured to determine the monitored object corresponding to the second audio based on the audio encoding of the second audio,
[0204] and to obtain the second video corresponding to that monitored object.
[0205] In one embodiment, the device further includes a storage unit, configured to input the sound emitted by each monitored object into a set speech encoder to obtain the audio encoding of the sound emitted by each monitored object,
[0206] and to store the correspondence between each monitored object and the audio encoding of its emitted sound.
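Storing the correspondence and later resolving a second audio back to a monitored object could be sketched as a small registry. The text does not say how encodings are compared, so the fixed-length vector representation, the cosine-similarity lookup, and the class name `EncodingRegistry` are assumptions made only for illustration.

```python
import numpy as np

class EncodingRegistry:
    """Hypothetical store mapping each monitored object to its audio encoding."""

    def __init__(self):
        self._codes = {}

    def store(self, object_id, encoding):
        # Keep a unit-length copy so lookups reduce to a dot product.
        vector = np.asarray(encoding, dtype=float)
        self._codes[object_id] = vector / np.linalg.norm(vector)

    def identify(self, encoding):
        # Return the object whose stored encoding is most similar (cosine similarity).
        query = np.asarray(encoding, dtype=float)
        query = query / np.linalg.norm(query)
        return max(self._codes, key=lambda object_id: float(self._codes[object_id] @ query))

# Toy encodings; a real speech encoder would produce these vectors.
registry = EncodingRegistry()
registry.store("pig_1", [1.0, 0.0, 0.0])
registry.store("pig_2", [0.0, 1.0, 0.0])
```

Once the object is identified this way, its identifier can be used to select the matching second video.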
[0207] In practical applications, the extraction unit 1001, the input unit 1002, the matching unit 1003, the second matching unit, the second determining unit, the acquisition unit, and the storage unit can be implemented by a processor in the terminal, such as a central processing unit (CPU), a digital signal processor (DSP), a microcontroller unit (MCU), or a field-programmable gate array (FPGA).
[0208] It should be noted that, when performing behavior monitoring, the behavior monitoring device provided in the above embodiments is described only in terms of the above division into program modules. In actual applications, the above processing can be assigned to different program modules as needed; that is, the internal structure of the device can be divided into different program modules to complete all or part of the processing described above. In addition, the behavior monitoring device embodiments and the behavior monitoring method embodiments belong to the same concept; for the specific implementation process, see the method embodiments, which are not repeated here.
[0209] To implement the method of the embodiments of this application, the embodiments of this application also provide a model training apparatus. Figure 11 is a schematic diagram of the model training apparatus provided in the embodiments of this application. As shown in Figure 11, the apparatus includes:
[0210] The acquisition unit 1101 is used to acquire audio samples and video samples of the monitored object; the audio sample represents the sound emitted by the monitored object; the video sample represents a video of the monitored object captured simultaneously with the audio sample.
[0211] Input unit 1102 is used to input the audio features corresponding to the audio sample and the video sample into a first set model to obtain a first output result; the first output result represents the first behavioral feature corresponding to the monitored object;
[0212] The calculation unit 1103 is used to calculate a loss value based on the first output result, and update the weight parameters of the first set model based on the loss value; wherein,
[0213] The audio features corresponding to the audio sample include cross-correlation matrix features, which represent the correlation coefficients between two adjacent frames in the spectrogram corresponding to the audio sample.
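One plausible reading of this feature is a Pearson correlation coefficient computed between each pair of adjacent spectrogram frames. The sketch below follows that reading; the function name, the array layout (frequency bins by time frames), and the output as a vector of length T-1 are assumptions, since the exact matrix layout is not specified in this passage.

```python
import numpy as np

def adjacent_frame_correlation(spectrogram, eps=1e-12):
    """Pearson correlation between each frame (column) and the next one.

    spectrogram: 2-D array, frequency bins x time frames (assumed layout).
    Returns an array of length T-1 for T frames.
    """
    s = np.asarray(spectrogram, dtype=float)
    centered = s - s.mean(axis=0, keepdims=True)   # remove each frame's mean
    a, b = centered[:, :-1], centered[:, 1:]        # frame t and frame t+1
    numerator = np.sum(a * b, axis=0)
    denominator = np.sqrt(np.sum(a * a, axis=0) * np.sum(b * b, axis=0)) + eps
    return numerator / denominator

# Toy spectrogram: frame 1 is a scaled copy of frame 0, frame 2 is reversed.
toy_spec = np.array([[1.0, 2.0, 3.0],
                     [2.0, 4.0, 2.0],
                     [3.0, 6.0, 1.0]])
coefficients = adjacent_frame_correlation(toy_spec)
```

Values near 1 indicate that the spectral shape is stable from one frame to the next, while values near -1 or 0 indicate an abrupt change.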
[0214] In one embodiment, the audio features corresponding to the audio sample further include at least one of the following:
[0215] The spectrogram corresponding to the audio sample;
[0216] The Mel frequency cepstral features corresponding to the audio samples;
[0217] The first-order difference feature corresponding to the audio sample;
[0218] The second-order difference features corresponding to the audio samples.
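The first-order and second-order difference features in the list above can be illustrated with a plain frame-to-frame difference. Note the assumption: speech toolkits often compute such "delta" features with a regression over a window of frames, whereas this sketch uses the simplest definition, and the function name `difference_features` is hypothetical.

```python
import numpy as np

def difference_features(features):
    """Return (first_order, second_order) differences along the time axis.

    features: 2-D array, coefficients x time frames (assumed layout).
    The first frame is repeated so the outputs keep the input shape.
    """
    f = np.asarray(features, dtype=float)
    first = np.diff(f, n=1, axis=1, prepend=f[:, :1])
    second = np.diff(first, n=1, axis=1, prepend=first[:, :1])
    return first, second

# Toy feature track with one coefficient over four frames.
toy_features = np.array([[1.0, 4.0, 9.0, 16.0]])
first_order, second_order = difference_features(toy_features)
```

Because the outputs have the same shape as the input, they can be stacked with the original features along the coefficient axis before being fed to a model.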
[0219] Based on the hardware implementation of the above program modules, and in order to implement the method of the embodiments of this application, the embodiments of this application also provide an electronic device. Figure 12 is a schematic diagram of the hardware structure of the electronic device provided in the embodiments of this application. As shown in Figure 12, the electronic device includes:
[0220] The communication interface 1201 enables information exchange with other devices, such as network devices.
[0221] The processor 1202 is connected to the communication interface 1201 to enable information interaction with other devices and, when running a computer program, executes the methods provided by one or more of the aforementioned terminal-side technical solutions. The computer program is stored in the memory 1203.
[0222] Specifically, the processor 1202 is configured to: extract at least one second audio from a first audio, where the first audio represents sound emitted by at least two monitored objects and each of the at least one second audio corresponds to the sound emitted by one of the at least two monitored objects; input each of the at least one second audio and its corresponding second video into a first set model to obtain a first behavioral feature corresponding to each of the at least two monitored objects; and match the first behavioral feature corresponding to each of the at least two monitored objects with the first set behavioral feature to obtain a first behavior monitoring result; wherein the second video represents a captured video of the corresponding monitored object.
[0223] In one embodiment, the processor 1202 is further configured to determine that the first behavioral feature matches the first set behavioral feature if the first behavioral feature satisfies at least one of the following conditions:
[0224] The first spectrogram contains a speech signal with an amplitude greater than a set threshold, where the first spectrogram is the spectrogram of the second audio of the monitored object to which the first behavioral feature corresponds;
[0225] Based on the second video of the monitored object to which the first behavioral feature corresponds, it is determined that the monitored object has performed the set behavior.
[0226] In one embodiment, the processor 1202 is further configured to: when the first behavior monitoring result indicates that a first behavioral feature corresponding to a monitored object matches the first set behavioral feature, input the corresponding second audio and the corresponding second video into a second set model to obtain the second behavioral feature corresponding to that monitored object;
[0227] and match the obtained second behavioral feature with the second set behavioral feature to obtain the second behavior monitoring result for the corresponding monitored object; wherein,
[0228] the second set behavioral feature characterizes abnormal behavior of the monitored object.
[0229] In one embodiment, the processor 1202 is further configured to determine that the second behavioral feature matches the second set behavioral feature if the obtained second behavioral feature satisfies at least one of the following conditions:
[0230] In the second spectrogram, the time interval between successive speech signals with amplitudes greater than a set threshold is less than a set time interval, where the second spectrogram is the spectrogram of the second audio of the monitored object to which the second behavioral feature corresponds;
[0231] In the second spectrogram, the duration of a speech signal with an amplitude greater than a set threshold is longer than a set duration.
[0232] In one embodiment, the processor 1202 is further configured to: when the second behavior monitoring result indicates that the second behavioral feature matches the second set behavioral feature, determine the monitored object to which the second behavioral feature corresponds, based on the audio encoding of the second audio of that monitored object.
[0233] In one embodiment, after extracting at least one second audio from the first audio, the processor 1202 is further configured to determine the monitored object corresponding to the second audio based on the audio encoding of the second audio,
[0234] and to obtain the second video corresponding to that monitored object.
[0235] In one embodiment, before extracting at least one second audio from the first audio, the processor 1202 is further configured to input the sound emitted by each monitored object into a set speech encoder to obtain the audio encoding of the sound emitted by each monitored object,
[0236] and to store the correspondence between each monitored object and the audio encoding of its emitted sound.
[0237] In one embodiment, the processor 1202 is further configured to acquire audio samples and video samples of the monitored object; the audio samples represent the sound emitted by the monitored object; the video samples represent a video of the monitored object captured simultaneously with the audio samples;
[0238] The audio features corresponding to the audio sample and the video sample are input into a first set model to obtain a first output result; the first output result represents the first behavioral feature corresponding to the monitored object.
[0239] The loss value is calculated based on the first output result, and the weight parameters of the first set model are updated based on the loss value; wherein,
[0240] The audio features corresponding to the audio sample include cross-correlation matrix features, which represent the correlation coefficients between two adjacent frames in the spectrogram corresponding to the audio sample.
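The training step described above (compute a loss from the first output result, then update the weight parameters from that loss) can be illustrated with a deliberately small stand-in. The application does not specify the model architecture or loss function here, so the linear model, the binary cross-entropy loss, the plain gradient-descent update, and the synthetic data below are all assumptions chosen for brevity.

```python
import numpy as np

def train_step(weights, features, labels, learning_rate=0.1):
    """One gradient-descent update of a logistic model (a stand-in for the first set model).

    Returns the updated weights and the loss measured before the update.
    """
    logits = features @ weights
    probabilities = 1.0 / (1.0 + np.exp(-logits))        # first output result (toy)
    eps = 1e-12
    loss = -np.mean(labels * np.log(probabilities + eps)
                    + (1.0 - labels) * np.log(1.0 - probabilities + eps))
    gradient = features.T @ (probabilities - labels) / len(labels)
    return weights - learning_rate * gradient, float(loss)

# Synthetic stand-in for fused audio and video features with binary labels.
rng = np.random.default_rng(0)
features = rng.normal(size=(64, 5))
labels = (features[:, 0] > 0).astype(float)

weights = np.zeros(5)
losses = []
for _ in range(50):
    weights, loss = train_step(weights, features, labels)
    losses.append(loss)
```

In a real system the features would be derived from the audio samples (including the cross-correlation matrix features) and the video samples, and the update would typically be handled by a deep learning framework's optimizer.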
[0241] In one embodiment, the audio features corresponding to the audio sample further include at least one of the following:
[0242] The spectrogram corresponding to the audio sample;
[0243] The Mel frequency cepstral features corresponding to the audio samples;
[0244] The first-order difference feature corresponding to the audio sample;
[0245] The second-order difference features corresponding to the audio samples.
[0246] Of course, in practical applications, the various components in the electronic device are coupled together through the bus system 1204. It can be understood that the bus system 1204 is used to realize the connection and communication between these components. In addition to the data bus, the bus system 1204 also includes a power bus, a control bus, and a status signal bus. However, for the sake of clarity, all buses are labeled as the bus system 1204 in Figure 12.
[0247] The memory 1203 in this embodiment is used to store various types of data to support the operation of the electronic device. Examples of such data include any computer program used to operate on the electronic device.
[0248] It is understood that the memory 1203 can be volatile memory or non-volatile memory, or both. Non-volatile memory can be read-only memory (ROM), programmable read-only memory (PROM), erasable programmable read-only memory (EPROM), electrically erasable programmable read-only memory (EEPROM), ferromagnetic random access memory (FRAM), flash memory, magnetic surface memory, optical disc, or compact disc read-only memory (CD-ROM); magnetic surface memory can be disk storage or magnetic tape storage. Volatile memory can be random access memory (RAM), which is used as an external cache. By way of example, but not limitation, many forms of RAM are available, such as Static Random Access Memory (SRAM), Synchronous Static Random Access Memory (SSRAM), Dynamic Random Access Memory (DRAM), Synchronous Dynamic Random Access Memory (SDRAM), Double Data Rate Synchronous Dynamic Random Access Memory (DDRSDRAM), Enhanced Synchronous Dynamic Random Access Memory (ESDRAM), SyncLink Dynamic Random Access Memory (SLDRAM), and Direct Rambus Random Access Memory (DRRAM). The memory 1203 described in the embodiments of this application is intended to include, but is not limited to, these and any other suitable types of memory.
[0249] The methods disclosed in the embodiments of this application can be applied to or implemented by the processor 1202. The processor 1202 may be an integrated circuit chip with signal processing capabilities. In the implementation process, each step of the above method can be completed by the integrated logic circuit of the hardware in the processor 1202 or by instructions in the form of software. The processor 1202 may be a general-purpose processor, a DSP, or other programmable logic devices, discrete gate or transistor logic devices, discrete hardware components, etc. The processor 1202 can implement or execute the methods, steps, and logic block diagrams disclosed in the embodiments of this application. The general-purpose processor may be a microprocessor or any conventional processor, etc. The steps of the methods disclosed in the embodiments of this application can be directly manifested as being executed by a hardware decoding processor, or being executed by a combination of hardware and software modules in the decoding processor. The software modules may be located in a storage medium, which is located in the memory 1203. The processor 1202 reads the program in the memory 1203 and completes the steps of the aforementioned method in conjunction with its hardware.
[0250] When the processor 1202 executes the program, it implements the corresponding processes in the various methods of the embodiments of this application.
[0251] In an exemplary embodiment, this application also provides a storage medium, namely a computer storage medium, specifically a computer-readable storage medium, such as a memory 1203 storing a computer program, which can be executed by a processor 1202 to complete the steps described in the aforementioned method. The computer-readable storage medium may be a memory such as FRAM, ROM, PROM, EPROM, EEPROM, Flash Memory, magnetic surface memory, optical disc, or CD-ROM.
[0252] In the several embodiments provided in this application, it should be understood that the disclosed apparatus, terminal, and method can be implemented in other ways. The device embodiments described above are merely illustrative. For example, the division of units is only a logical functional division, and in actual implementation, there may be other division methods, such as: multiple units or components can be combined, or integrated into another system, or some features can be ignored or not executed. In addition, the coupling, direct coupling, or communication connection between the various components shown or discussed can be through some interfaces, and the indirect coupling or communication connection between devices or units can be electrical, mechanical, or other forms.
[0253] The units described above as separate components may or may not be physically separate. The components shown as units may or may not be physical units, that is, they may be located in one place or distributed across multiple network units. Some or all of the units may be selected to achieve the purpose of this embodiment according to actual needs.
[0254] In addition, each functional unit in the various embodiments of this application can be integrated into one processing unit, or each unit can be a separate unit, or two or more units can be integrated into one unit; the integrated unit can be implemented in hardware or in the form of hardware plus software functional units.
[0255] Those skilled in the art will understand that all or part of the steps of the above method embodiments can be implemented by hardware related to program instructions. The aforementioned program can be stored in a computer-readable storage medium. When the program is executed, it performs the steps of the above method embodiments. The aforementioned storage medium includes various media that can store program code, such as mobile storage devices, ROM, RAM, magnetic disks, or optical disks.
[0256] Alternatively, if the integrated units described above are implemented as software functional modules and sold or used as independent products, they can also be stored in a computer-readable storage medium. Based on this understanding, the technical solutions of the embodiments of this application, or the parts that contribute to the prior art, can be embodied in the form of a software product. This computer software product is stored in a storage medium and includes several instructions to cause an electronic device (which may be a personal computer, server, or network device, etc.) to execute all or part of the methods described in the various embodiments of this application. The aforementioned storage medium includes various media capable of storing program code, such as mobile storage devices, ROM, RAM, magnetic disks, or optical disks.
[0257] The above description is merely a specific embodiment of this application, but the scope of protection of this application is not limited thereto. Any variations or substitutions that can be easily conceived by those skilled in the art within the technical scope disclosed in this application should be included within the scope of protection of this application. Therefore, the scope of protection of this application should be determined by the scope of the claims.
Claims
1. A behavior monitoring method, characterized in that the method includes:
At least one second audio is extracted from a first audio using a speech separation model and the audio encoding of each monitored object; the first audio represents the sound emitted by at least two monitored objects; each of the at least one second audio corresponds to the sound emitted by one of the at least two monitored objects; the audio encoding of each monitored object is used to uniquely identify the corresponding monitored object;
Each of the at least one second audio and its corresponding second video is input into a first set model to obtain a first behavioral feature corresponding to each of the at least two monitored objects; the first set model is used to identify whether the monitored object has coughing behavioral features;
The first behavioral feature corresponding to each of the at least two monitored objects is matched with a first set behavioral feature to obtain a first behavioral monitoring result; wherein, the second video represents a video of the corresponding monitored object; the first behavioral feature is matched with the first set behavioral feature when the first behavioral feature meets a first condition; the first condition includes: the presence of a speech signal with an amplitude greater than a set threshold in the first spectrogram, and / or, determining that the corresponding monitored object has performed a set behavior based on the second video of the monitored object corresponding to the first behavioral feature; the first spectrogram is the spectrogram of the second audio of the monitored object corresponding to the first behavioral feature; the set behavior includes one or more of body shaking, back arching, and hind limb shaking;
If the first behavior monitoring result indicates that the first behavior feature of the monitored object matches the first set behavior feature, the corresponding second audio and the corresponding second video are input into the second set model to obtain the corresponding second behavior feature of the monitored object; the second set model is used to identify whether the coughing behavior of the monitored object is caused by illness;
The obtained second behavioral feature is matched with the second set behavioral feature to obtain the second behavioral monitoring result for the corresponding monitoring object; wherein, the second set behavioral feature represents the abnormal behavior of the corresponding monitoring object; the second behavioral feature matches the second set behavioral feature when the obtained second behavioral feature meets the second condition; the second condition includes: the time interval of the speech signal with amplitude greater than the set threshold in the second spectrogram is less than the set time interval, the duration of the speech signal with amplitude greater than the set threshold in the second spectrogram is greater than the set duration, and the monitoring object is judged to have performed one or more of the first set behaviors based on the second video corresponding to the monitoring object corresponding to the second behavioral feature; the second spectrogram is the spectrogram of the second audio corresponding to the monitoring object corresponding to the second behavioral feature; the first set behaviors include one or more of the following: open-mouth panting, drooling from the mouth and nose, dog sitting posture, and abdominal breathing.
2. The behavior monitoring method according to claim 1, characterized in that the method further includes: if the second behavior monitoring result indicates that the second behavior feature matches the second set behavior feature, the monitoring object corresponding to the second behavior feature is determined based on the audio encoding of the second audio of the monitoring object corresponding to the second behavior feature.
3. The behavior monitoring method according to claim 1, characterized in that, after extracting at least one second audio from the first audio, the method further includes: the monitoring object corresponding to the second audio is determined based on the audio encoding of the second audio; and the second video corresponding to the monitored object is obtained.
4. The behavior monitoring method according to claim 2 or 3, characterized in that, before extracting at least one second audio from the first audio, the method further includes: the sound emitted by each monitored object is input into a set voice encoder to obtain the audio code of the sound emitted by each monitored object; and the correspondence between each monitored object and the audio code of the emitted sound is stored.
5. The behavior monitoring method according to claim 1, characterized in that the training method for the first set model includes: acquiring an audio sample and a video sample of the monitored object, where the audio sample represents the sound emitted by the monitored object and the video sample represents a video of the monitored object captured simultaneously with the audio sample; inputting the audio features corresponding to the audio sample and the video sample into the first set model to obtain a first output result, where the first output result represents a first behavioral feature corresponding to the monitored object and the first set model is used to identify whether the monitored object has coughing behavioral features; and calculating a loss value based on the first output result and updating the weight parameters of the first set model based on the loss value; wherein the audio features corresponding to the audio sample include cross-correlation matrix features, which represent the correlation coefficients between two adjacent frames in the spectrogram corresponding to the audio sample.
6. The behavior monitoring method according to claim 5, characterized in that the audio features corresponding to the audio sample also include at least one of the following: the spectrogram corresponding to the audio sample; the Mel frequency cepstral features corresponding to the audio sample; the first-order difference features corresponding to the audio sample; the second-order difference features corresponding to the audio sample.
7. A behavior monitoring device, characterized in that the device includes:
An extraction unit, configured to extract at least one second audio from a first audio using a speech separation model and the audio encoding of each monitored object; the first audio represents a sound emitted by at least two monitored objects; each of the at least one second audio corresponds to a sound emitted by one of the at least two monitored objects; the audio encoding of each monitored object is used to uniquely identify the corresponding monitored object;
An input unit, configured to input each of the at least one second audio and the corresponding second video into a first set model to obtain a first behavioral feature corresponding to each of the at least two monitored objects; the first set model is used to identify whether the monitored object has coughing behavioral features;
A matching unit, configured to match the first behavioral feature corresponding to each of the at least two monitored objects with a first set behavioral feature to obtain a first behavioral monitoring result; wherein the second video represents a video of the corresponding monitored object; the first behavioral feature is matched with the first set behavioral feature when the first behavioral feature meets a first condition; the first condition includes: the presence of a speech signal with an amplitude greater than a set threshold in the first spectrogram, and / or, determining that the corresponding monitored object has performed a set behavior based on the second video of the monitored object corresponding to the first behavioral feature; the first spectrogram is the spectrogram of the second audio corresponding to the monitored object corresponding to the first behavioral feature; the set behavior includes one or more of body shaking, back arching, and hind limb shaking;
The matching unit is further configured to, when the first behavioral monitoring result indicates that the first behavioral feature of the monitored object matches the first set behavioral feature, input the corresponding second audio and the corresponding second video into a second set model to obtain the second behavioral feature corresponding to the monitored object; the second set model is used to identify whether the coughing behavior of the monitored object is caused by illness;
The obtained second behavioral feature is matched with the second set behavioral feature to obtain the second behavioral monitoring result of the corresponding monitored object; wherein, the second set behavioral feature represents the abnormal behavior of the corresponding monitored object; the second behavioral feature is matched with the second set behavioral feature when the obtained second behavioral feature meets the second condition; the second condition includes: the time interval of the speech signal with amplitude greater than the set threshold in the second spectrogram is less than the set time interval, the duration of the speech signal with amplitude greater than the set threshold in the second spectrogram is greater than the set duration, and the monitoring object is judged to have one or more of the first set behaviors based on the second video corresponding to the monitoring object corresponding to the second behavioral feature; the second spectrogram is the spectrogram of the second audio corresponding to the monitoring object corresponding to the second behavioral feature; the first set behaviors include one or more of the following: open-mouth panting, nasal and oral drooling, dog sitting posture, and abdominal breathing.
8. An electronic device, characterized in that it includes a processor and a memory for storing a computer program that can run on the processor, wherein the processor, when running the computer program, performs the steps of the method according to any one of claims 1-6.
9. A storage medium having a computer program stored thereon, characterized in that, when the computer program is executed by a processor, it implements the steps of the method according to any one of claims 1-6.