A non-contact co-participant artificial intelligence analysis system

By designing a non-contact artificial intelligence analysis system, and utilizing the SwinTransformer and Conformer architecture for feature extraction and recognition of audio and video data, the system solves the problem of inaccuracy in existing recognition systems. It achieves accurate recognition of information such as speaker identity, age, and gender, and provides an integrated, portable computing system capable of objectively providing scores.

CN116108396B · Active · Publication Date: 2026-06-26 · DUKE KUNSHAN UNIVERSITY +1

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Patents(China)
Current Assignee / Owner
DUKE KUNSHAN UNIVERSITY
Filing Date
2022-10-21
Publication Date
2026-06-26

AI Technical Summary

Technical Problem

Existing technologies lack accurate, lightweight, and portable recognition systems, making it difficult to identify the speaker's identity, age, and gender in audio and video data. Furthermore, manual scoring suffers from inconsistent evaluation standards and insufficient manpower.

Method used

Design a non-contact collaborative artificial intelligence analysis system, including a training data acquisition module, an active range monitoring module, an identity recognition module, an image feature extraction module, a speech feature processing module, and a fusion prediction module. The system utilizes the SwinTransformer and Conformer architectures for feature extraction and recognition, and combines multimodal fusion prediction scores.

Benefits of technology

It achieves accurate identification of speaker identity, age, and gender information in audio and video data, provides an integrated and portable computing system, can objectively provide scores, and improves recognition accuracy and system portability.

✦ Generated by Eureka AI based on patent content.

Abstract

The application provides a non-contact co-participation artificial intelligence analysis system comprising a training data acquisition module, an active range monitoring module, an identity recognition module, an image feature extraction module, a voice feature processing module, a feature sequence processing module and a fusion prediction module. The fusion prediction module fuses the predicted values of the active voice data and the active image data and is trained on the various scores of the experimental personnel to obtain a large fusion prediction model. The application can use artificial intelligence to score the performance of the experimental personnel during the experiment, helping doctors and experts give evaluations and suggestions to parents and children.

Description

Technical Field

[0001] This invention relates to the field of image and speech recognition technology, and more specifically to a non-contact, collaborative artificial intelligence analysis system.

Background Technology

[0002] Currently, tens of millions of children worldwide suffer from autism and other developmental disabilities, causing significant distress to them and their families. Without proper support and intervention, these children face considerable obstacles in their development and social interactions, sometimes even impacting their normal lives. Despite the severity of the problem, most children with developmental disabilities and their families still do not receive the care and support they need.

[0003] For caregivers of children with developmental disabilities, parenting programs are particularly helpful in boosting their confidence and parenting skills. They can also improve the well-being of both caregivers and children. In light of this, the World Health Organization (WHO) and other collaborating agencies have developed a caregiver skills training program for families of children with developmental delays or disabilities. This program employs a family-centered approach, where caregivers conduct activities with the child at home, which are filmed on a mobile phone and then sent online to WHO experts. The experts provide evaluations and recommendations based on human assessment.

[0004] However, manual scoring has a series of limitations, including inconsistent evaluation criteria, insufficient manpower, and difficult-to-operate equipment.

[0005] Currently, various artificial intelligence processing systems lack accurate, lightweight, portable, and mobile recognition systems. There are also some technical shortcomings. For example, in recorded audio and video, it is difficult to identify whether the speaker is an adult or a child, or to recognize specific keywords.

[0006] To address this problem, this invention proposes a non-contact, collaborative artificial intelligence analysis system.

Summary of the Invention

[0007] To address the shortcomings of existing technologies, this invention proposes a non-contact, collaborative artificial intelligence analysis system that can accurately and objectively score experimental data through artificial intelligence, while also accurately identifying the identity, age, gender, and other information of different participants in audio and video data.

[0008] The technical solution of this invention is implemented as follows:

[0009] A non-contact, collaborative AI analysis system includes: a training data acquisition module for collecting audio and video data of participants during activities; an activity range monitoring module for filtering active speech data from the audio data and active image data showing human activity from the video data to reduce subsequent data computation; an identity recognition module for determining the identity and gender of participants; an image feature extraction module, including an interpretable feature extraction module and an end-to-end feature extraction module, wherein the interpretable feature extraction module extracts interpretable human behavior features from the video data; a speech feature processing module, including a basic feature processing module and a manual feature extraction module, wherein the basic feature processing module performs basic feature processing on the active speech data, and the manual feature extraction module extracts interpretable manual features from the active speech data; a feature sequence processing module for combining the active speech data and active image data obtained by the activity range detection module to obtain a feature time series with activity information; and a fusion prediction module for fusing the predicted values of the active speech data and active image data and training on the various scores of the participants to obtain a large-scale fusion prediction model.

[0010] Furthermore, the training data acquisition module includes several video recording devices and audio recording devices, which are arranged around the experimenter.

[0011] Furthermore, the active range detection module includes a video active range detection module and an audio active range detection module. The video active range detection module includes a human image detection function and an age detection function. The age detection function runs a prediction on the bounding box of each experimenter to determine whether the experimenter is an adult or a minor and to label the experimenter accordingly. The human image detection function is used to detect the minimum active range matrix corresponding to different experimenters in the image. The audio active range detection module includes a specific speaker recognition system, which first obtains preliminary speaker diarization results through clustering and then improves their accuracy through specific speaker detection technology, identifying overlapping regions where multiple people speak simultaneously.

[0012] Furthermore, the identity recognition module includes a voice age recognition function and an image age recognition function. The voice age recognition function is used to convert the input one-dimensional audio signal into two-dimensional acoustic features, convert the acoustic features into a high-dimensional feature map, and then pass the high-dimensional feature map through two different attention mechanism encoding layers to obtain the speaker's identity information, gender information, and age information.

[0013] Furthermore, the basic feature processing module includes speech-to-text recognition, speech keyword detection, and speech emotion recognition. The speech-to-text recognition function is based on a speech recognition model with a Conformer architecture, which is used to transcribe speech into text and perform word segmentation. The speech keyword detection function is used to detect keywords that are important for rating and evaluation. The speech emotion recognition function uses an emotional speech dataset to train a speech emotion recognizer, which is used to classify the emotions expressed by the experimental subjects.

[0014] Furthermore, the interpretable character behavior features include skeletal behavior features and / or relative position features.

[0015] This invention also proposes a screening and scoring method for developmental disorders in children, comprising a non-contact, collaborative artificial intelligence analysis system and a lightweight module as described in any of the preceding embodiments, including the following steps:

[0016] Training phase: Collect experimental data, input the experimental data into the non-contact collaborative artificial intelligence analysis system for training to obtain a fusion prediction large model, and obtain a small model through the lightweight module;

[0017] Prediction phase: Observation is carried out using an integrated data collection table, with a hidden camera installed in the middle of the data collection table, and / or a side camera that can observe the activities of the caregiver and the child is placed within 1-3 meters outside the data collection table;

[0018] The caregiver and the child sit facing each other and play. Toys are placed on the collection table, including: picture books, cardboard boxes, animal models, paintbrushes, drawing paper, and hand puppets.

[0019] Once recording mode is activated, caregivers and children can use a small table as their activity space and provided toys as their tools to conduct experimental activities.

[0020] The scoring data obtained during the recording process is input into the small model to obtain a prediction of the interaction score between the two people.

[0021] Furthermore, the experimental activities include: caregivers can tell stories to children using picture books and interact with the children about the story content; caregivers can play cardboard stacking games with children; caregivers can race with children using animal models; caregivers can accompany children to draw with paintbrushes; caregivers can take turns telling stories with children using hand puppets; and caregivers and children can engage in a series of free play activities.

[0022] Compared with the prior art, the present invention has the following advantages.

[0023] This invention proposes an integrated, portable, scalable, and non-invasive computing system that requires no constraints and can be freely used by caregivers and children.

[0024] A specific speaker recognition system is also proposed;

[0025] A wake word detection method is also proposed.

[0026] A binary classification prediction method for speech age, targeting adults and children, is also proposed.

[0027] A lightweight knowledge distillation method for converting multi-camera data into single-camera data is also proposed.

[0028] The specific beneficial effects are described below.

Attached Figure Description

[0029] To more clearly illustrate the technical solutions in the embodiments of the present invention or the prior art, the drawings used in the description of the embodiments or the prior art will be briefly introduced below. Obviously, the drawings described below are only some embodiments of the present invention. For those skilled in the art, other drawings can be obtained based on these drawings without creative effort.

[0030] Figure 1 is a framework diagram of a non-contact collaborative artificial intelligence analysis system according to the present invention;

[0031] Figure 2 is a flowchart of the data changes in a non-contact collaborative artificial intelligence analysis system according to the present invention;

[0032] Figure 3 is a schematic diagram of the workflow of the basic feature extraction module in this invention;

[0033] Figure 4 is a diagram of the deep learning network framework for the speech keyword detection function in this invention.

Detailed Implementation

[0034] The technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only a part of the embodiments of the present invention, and not all of them. Based on the embodiments of the present invention, all other embodiments obtained by those skilled in the art without creative effort are within the scope of protection of the present invention.

[0035] In the description of this invention, it should be noted that the terms "center," "upper," "lower," "left," "right," "vertical," "horizontal," "inner," and "outer," etc., indicate the orientation or positional relationship based on the orientation or positional relationship shown in the accompanying drawings. They are used only for the convenience of describing the invention and for simplifying the description, and do not indicate or imply that the device or element referred to must have a specific orientation, or be constructed and operated in a specific orientation. Therefore, they should not be construed as limitations on the invention. Furthermore, the terms "first," "second," "third," and "fourth," etc., are used for descriptive purposes only and should not be construed as indicating or implying relative importance.

[0036] In the description of this invention, it should be noted that, unless otherwise explicitly specified and limited, the terms "installation," "connection," and "linking" should be interpreted broadly. For example, they can refer to a fixed connection, a detachable connection, or an integral connection; they can refer to a mechanical connection or an electrical connection; they can refer to a direct connection or an indirect connection through an intermediate medium; and they can refer to the internal connection of two components. Those skilled in the art can understand the specific meaning of the above terms in this invention based on the specific circumstances.

[0037] Referring to Figures 1 and 2, this invention discloses a non-contact collaborative artificial intelligence analysis system, comprising:

[0038] The training data acquisition module is used to collect audio and video data from the participants during the activity.

[0039] An active range monitoring module is used to filter out active voice data in the audio data and active image data with human activity in the video data, so as to reduce the amount of subsequent data calculation.

[0040] The identity recognition module is used to determine the identity and gender of the experimenters;

[0041] The image feature extraction module includes an interpretable feature extraction module and an end-to-end feature extraction module. The interpretable feature extraction module is used to extract interpretable human behavior features from the video data. The end-to-end feature extraction module is an end-to-end feature extraction network obtained by modifying an action recognition model. Specifically, SwinTransformer is used as the backbone network to extract features in an end-to-end manner. A specific number of frames is clipped into a short video, and features are extracted from this short video. For a valid complete video, the features extracted from its multiple short videos form a feature sequence, and temporal pooling over the temporal channel yields the action features representing the entire video. In this embodiment, the input size is 5×32×3×244×244, and five 64-frame clips are cut from a 10-second video, with each frame being 3×244×244 in size. The output size is 1×768. For a valid video, the multiple action features are combined into a sequence, and temporal average pooling over the temporal channel finally yields a 1×768 action embedding representing the entire video. The Swin Transformer backbone is pre-trained on the Kinetics-400 dataset, which allows it to learn motion features from approximately 300K action videos (each about 10 seconds long). The videos in the dataset must be pre-processed: wild videos are filtered into valid videos based on time labels, and the 10-second activity-segmented videos are then fed into the motion feature extraction module.
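The temporal average pooling step in [0041] can be illustrated with a minimal sketch. It assumes the per-clip features have already been produced by the backbone; the function name `temporal_average_pool` and the tiny 3-dimensional toy features are illustrative assumptions (the embodiment uses 768 dimensions), not part of the patent.

```python
from typing import List


def temporal_average_pool(clip_features: List[List[float]]) -> List[float]:
    """Average per-clip feature vectors over the time axis.

    clip_features holds one feature vector per short clip (e.g. 5 clips of
    768 values each in the embodiment). The result is a single vector of the
    same dimensionality that represents the whole video.
    """
    if not clip_features:
        raise ValueError("at least one clip feature is required")
    dim = len(clip_features[0])
    if any(len(f) != dim for f in clip_features):
        raise ValueError("all clip features must have the same length")
    n = len(clip_features)
    # Mean over clips, computed independently for every feature dimension.
    return [sum(f[d] for f in clip_features) / n for d in range(dim)]


# Toy example: two clips with 3-dimensional features.
video_embedding = temporal_average_pool([[1.0, 2.0, 3.0], [3.0, 4.0, 5.0]])
```

Average pooling keeps the embedding size independent of the number of clips, which is why videos of different lengths can share one downstream regressor.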

[0042] The speech feature processing module includes a basic feature processing module and a manual feature extraction module. The basic feature processing module is used to perform basic feature processing on the active speech data, and the manual feature extraction module extracts interpretable manual features from the active speech data.

[0043] The feature sequence processing module is used to combine the active speech data and active image data obtained by the active range detection module to obtain a feature time series with active information.

[0044] The fusion prediction module is used to fuse the predicted values of the active speech data and the active image data, and is trained on the scores of the experimenters to obtain a large-scale fusion prediction model. Specifically, for each input sample, the audio system extracts multi-dimensional text features for both the child and the caregiver (30 dimensions in this embodiment), while the video system compresses the entire visual content into a 768-dimensional embedding vector. Principal component analysis (PCA) is then used to reduce the visual embedding vector to 64 dimensions. A linear regression model directly predicts the scores from the extracted feature vectors. First, separate models are trained independently on the audio features and on the video features. Then, the predictions of the two systems are fused at the score level.

[0045] The final multimodal fusion achieves performance gains by combining the advantages of both modes.
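The score-level fusion described in [0044] can be sketched as follows. The patent does not state the fusion rule, so the weighted average used here, the function name `fuse_scores`, and the `audio_weight` parameter are assumptions made purely for illustration.

```python
from typing import List, Sequence


def fuse_scores(
    audio_pred: Sequence[float],
    video_pred: Sequence[float],
    audio_weight: float = 0.5,
) -> List[float]:
    """Fuse per-item score predictions from two independently trained models.

    audio_pred and video_pred hold one predicted value per scoring item.
    audio_weight is a hypothetical tuning parameter in [0, 1]; the video
    system receives the remaining weight.
    """
    if len(audio_pred) != len(video_pred):
        raise ValueError("both systems must predict the same number of scores")
    if not 0.0 <= audio_weight <= 1.0:
        raise ValueError("audio_weight must lie in [0, 1]")
    w = audio_weight
    return [w * a + (1.0 - w) * v for a, v in zip(audio_pred, video_pred)]


# Equal weighting of the two modalities for two scoring items.
fused = fuse_scores([2.0, 4.0], [4.0, 2.0])
```

In practice the weight would be chosen on held-out data; equal weighting is only a neutral starting point.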

[0046] Specifically, there are a total of 10 scores.

[0047] Scores for children include:

[0048] (1) Non-participation level:

[0049] (2) Participation:

[0050] (3) Stereotyped behaviors and repetitive behaviors:

[0051] (4) Attention to caregivers:

[0052] (5) Ability to initiate social interactions:

[0053] (6) Language expression ability and pragmatic ability

[0054] The scores for caregivers include:

[0055] (7) Support

[0056] (8) Follow the child's attention

[0057] (9) The influence of caregivers

[0058] The shared score for the interaction between the two includes:

[0059] (10) Smoothness and adhesion

[0060] In a specific implementation, the training data acquisition module includes several video recording devices and audio recording devices, which are arranged around the experimenters.

[0061] In a further implementation, the active range detection module includes a video active range detection module and an audio active range detection module. The video active range detection module includes a human image detection function and an age detection function. The age detection function runs a prediction on the bounding box of each experimenter to determine whether the experimenter is an adult or a minor and to label the experimenter accordingly. The human image detection function is used to detect the minimum active range matrix corresponding to different experimenters in the image. Specifically, the human image detection function obtains the coordinates of the upper left and lower right corners of the bounding rectangles of the two experimenters in the image: (x11, y11) and (x12, y12) for the first experimenter, and (x21, y21) and (x22, y22) for the second. From these, the minimum rectangle that completely covers the active ranges of both people can be obtained. The upper left corner of the minimum rectangle is (xmin, ymin), and the lower right corner is (xmax, ymax).

[0062] $x_{\min} = \min(x_{11}, x_{12}, x_{21}, x_{22})$,

[0063] $x_{\max} = \max(x_{11}, x_{12}, x_{21}, x_{22})$,

[0064] $y_{\min} = \min(y_{11}, y_{12}, y_{21}, y_{22})$,

[0065] $y_{\max} = \max(y_{11}, y_{12}, y_{21}, y_{22})$.
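The four formulas above translate directly into code. The sketch below assumes each bounding box is given as (x_left, y_top, x_right, y_bottom) in pixel coordinates; the function name `min_covering_rect` and the sample coordinates are illustrative only.

```python
from typing import Tuple

# (x_left, y_top, x_right, y_bottom) in image pixel coordinates.
Box = Tuple[float, float, float, float]


def min_covering_rect(box_a: Box, box_b: Box) -> Box:
    """Return the smallest axis-aligned rectangle covering both boxes."""
    x11, y11, x12, y12 = box_a
    x21, y21, x22, y22 = box_b
    x_min = min(x11, x12, x21, x22)
    x_max = max(x11, x12, x21, x22)
    y_min = min(y11, y12, y21, y22)
    y_max = max(y11, y12, y21, y22)
    return (x_min, y_min, x_max, y_max)


caregiver_box = (40, 60, 220, 400)
child_box = (200, 150, 380, 420)
print(min_covering_rect(caregiver_box, child_box))  # (40, 60, 380, 420)
```

Cropping every frame to this rectangle discards background pixels before feature extraction, which matches the stated goal of reducing subsequent computation.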

[0066] Based on the age detection function, the bounding boxes of the two detected individuals are predicted separately, and the two individuals are classified as adults or minors. This allows the two individuals to be labeled separately, resulting in activity image ranges for the caregiver and the child with identification tags.

[0067] In a further implementation, human image detection uses the HOG+Cascade model.

[0068] This invention also proposes a speaker recognition system based on a neural network. The audio active range detection module includes a specific speaker recognition system. This system first obtains preliminary speaker diarization results through clustering and then improves their accuracy through specific speaker detection technology, identifying overlapping intervals where multiple people speak simultaneously. Specifically, the clustering method first removes silent parts through Voice Activity Detection (VAD), then uniformly segments the speech into 1.28 s segments and extracts the voiceprint of each segment. This invention proposes using a neural network to predict the similarity between the voiceprints of the segments, yielding a similarity matrix. Finally, spectral clustering is used to predict the number of speakers and cluster the segments to obtain preliminary speaker diarization results. Since traditional clustering methods can usually only assign each sample to a single class, when a segment contains multiple speakers, they are usually not all correctly identified: only one speaker can be identified.

[0069] To address this problem, this invention proposes a speaker-specific recognition system. Based on the speaker diarization results obtained through clustering, the voiceprint of each individual, also known as the speaker-specific voiceprint, can be extracted. Next, a front end identical to that of the voiceprint extractor extracts frame-level voiceprints from the speech, and these are concatenated with the speaker-specific voiceprint. Finally, a neural network based on a Transformer or Long Short-Term Memory (LSTM) is used to predict the result for each speaker-specific voiceprint. Since this method can output results for multiple speakers at each timestamp, the speaker-specific recognition system can detect different speakers even when multiple people are speaking simultaneously, significantly improving the accuracy of the results.
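The concatenation step in [0069] can be sketched as below. The voiceprints are assumed to be plain lists of floats already produced by some extractor; the function name `build_backend_inputs` and the speaker keys are hypothetical, and the backend network itself is not shown.

```python
from typing import Dict, List, Sequence


def build_backend_inputs(
    frame_voiceprints: Sequence[Sequence[float]],
    speaker_voiceprints: Dict[str, Sequence[float]],
) -> Dict[str, List[List[float]]]:
    """Pair every frame-level voiceprint with each speaker-specific voiceprint.

    For each enrolled speaker, the result holds one sequence whose elements
    are [frame voiceprint | speaker voiceprint]. A backend (Transformer or
    LSTM in the patent) would then predict, per frame, whether that speaker
    is talking, so overlapping speech yields activity for several speakers.
    """
    inputs: Dict[str, List[List[float]]] = {}
    for speaker, target in speaker_voiceprints.items():
        inputs[speaker] = [list(frame) + list(target) for frame in frame_voiceprints]
    return inputs


frames = [[0.1, 0.2], [0.3, 0.4]]
speakers = {"caregiver": [1.0, 0.0], "child": [0.0, 1.0]}
backend_inputs = build_backend_inputs(frames, speakers)
```

Because each speaker gets an independent input sequence, the per-frame decisions are not mutually exclusive, unlike a clustering assignment.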

[0070] This invention further introduces speech recognition technology into the speaker-specific recognition system to provide information that the speaker diarization system cannot learn on its own. A speech recognition system trained on a large-scale dataset is used to extract features containing textual information, and two different methods are employed to improve the results of the speaker diarization system. First, the textual information features are concatenated with the frame-level voiceprints before the backend predicts the results, so that the backend can improve performance by learning from the textual information features. Second, using the features extracted by the voiceprint network as input, a separate small-scale network predicts the textual information features. This small-scale network is trained simultaneously with the speaker-specific detection network, so that the features learned by the voiceprint network include some textual information, thereby improving the diarization results.

[0071] In another preferred embodiment, to ensure the real-time performance of the speaker-specific recognition system, an online speaker-specific recognition system is proposed. This system uses the same neural network as the offline speaker-specific detection and diarization system, but employs a different training method to achieve online operation. During training, the input consists of two different speech segments: the first segment is used to extract the speaker-specific voiceprint, and the second segment is used to extract the frame-level voiceprint. The voiceprint concatenation method and backend prediction remain the same as in the offline system. During inference, a buffer of a certain length is used to receive speech. Whenever a fixed-length speech segment is received, its frame-level voiceprint is extracted, and the previously extracted speaker-specific voiceprint and the frame-level voiceprint are used to predict the result. Finally, the predicted result and the frame-level voiceprint are used to update the speaker-specific voiceprint, which is then used as the input for the next speech segment. Speech recognition technology can also be integrated into this online speaker-specific detection and diarization system in a manner similar to that of the offline system.
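The final update step of the online system in [0071] might look like the sketch below. The patent does not specify the update rule, so the running mean over frames whose predicted activity exceeds a threshold is an assumption; `update_voiceprint`, `count`, and `min_prob` are hypothetical names.

```python
from typing import List, Sequence, Tuple


def update_voiceprint(
    profile: Sequence[float],
    count: int,
    frame_voiceprints: Sequence[Sequence[float]],
    activity: Sequence[float],
    min_prob: float = 0.5,
) -> Tuple[List[float], int]:
    """Refresh a speaker-specific voiceprint with newly attributed frames.

    profile is the current voiceprint, averaged over `count` frames so far.
    activity holds the backend's predicted probability, per frame, that the
    target speaker is talking. Frames below min_prob are ignored.
    """
    updated = list(profile)
    for frame, prob in zip(frame_voiceprints, activity):
        if prob >= min_prob:
            count += 1
            # Incremental mean: move the profile toward the new frame.
            updated = [old + (new - old) / count for old, new in zip(updated, frame)]
    return updated, count


# The second frame is attributed to someone else and leaves the profile unchanged.
profile, seen = update_voiceprint([1.0, 0.0], 1, [[3.0, 2.0], [100.0, 100.0]], [0.9, 0.1])
```

The returned profile would be fed back as the speaker-specific voiceprint for the next buffered segment.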

[0072] This invention also proposes a binary classification prediction method for voice age (adult / child). Specifically, the identity recognition module includes voice age recognition and image age recognition functions. Referring to Figure 3, the voice age recognition function works as follows:

[0073] S1. Transform the one-dimensional audio signal into two-dimensional acoustic features through feature extraction;

[0074] S2. Acoustic features are processed through a backbone network to obtain a high-dimensional feature map;

[0075] S3. The high-dimensional feature map is passed through two encoding layers with different attention mechanisms. One encoding layer is responsible for finding feature information related to speaker identity; the other is responsible for finding information related to gender and age. Since age and gender are strongly correlated with speaker information, and speaker recognition is a difficult task, using speaker recognition as an auxiliary task prevents the overall network from getting stuck in local optima of the dataset, which would otherwise reduce its generalization ability.

[0076] During testing, cosine similarity scoring was used to determine speaker identity. The test audio data was scored against the gender weights of the gender identifier and the adult / child weights of the age identifier. The gender similarity score vector and the adult / child similarity vector were multiplied by a dot product to obtain a 4x4 similarity matrix, where the highest probability indicates the attribute of the test speech (adult male, adult female, minor male, minor female). A speaker recognition network was also obtained. When multiple speakers with the same attributes but different identity information are mixed together, the learned speaker information can be used to perform speaker classification and speaker diarization.
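The attribute decision in [0076] can be sketched as follows. This sketch assumes one weight vector per class (male, female, adult, minor), combines the two 2-class score vectors with an outer product so that each of the four attribute combinations gets one entry, and rescales cosine scores to [0, 1] so that two negative scores cannot multiply into a large value. The rescaling, the function names, and the toy vectors are assumptions, not details taken from the patent.

```python
import math
from typing import List, Sequence

# Rows follow the age classes, columns follow the gender classes.
LABELS = [["adult male", "adult female"], ["minor male", "minor female"]]


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 when either vector has zero norm."""
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return sum(a * b for a, b in zip(u, v)) / norm if norm else 0.0


def classify_attribute(
    embedding: Sequence[float],
    gender_weights: Sequence[Sequence[float]],  # [male, female]
    age_weights: Sequence[Sequence[float]],  # [adult, minor]
) -> str:
    """Pick the age/gender combination with the highest joint score."""
    gender = [(cosine(embedding, w) + 1.0) / 2.0 for w in gender_weights]
    age = [(cosine(embedding, w) + 1.0) / 2.0 for w in age_weights]
    # Outer product: joint[a][g] scores age class a together with gender class g.
    joint: List[List[float]] = [[a * g for g in gender] for a in age]
    best_age, best_gender = max(
        ((a, g) for a in range(2) for g in range(2)),
        key=lambda idx: joint[idx[0]][idx[1]],
    )
    return LABELS[best_age][best_gender]


label = classify_attribute(
    [0.1, 0.9, 0.2, 0.8],
    gender_weights=[[1, 0, 0, 0], [0, 1, 0, 0]],
    age_weights=[[0, 0, 1, 0], [0, 0, 0, 1]],
)
```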

[0077] In a specific implementation, the basic feature processing module includes speech-to-text recognition, speech keyword detection, and speech emotion recognition.

[0078] Specifically, the speech-to-text recognition function uses the WeNet toolkit to build a speech recognition model with a Conformer architecture, trained on a Chinese speech database of approximately 20,000 hours. In this implementation, the database includes WenetSpeech, the AISHELL-2 corpus, MAGICDATA, and 33 hours of internally collected CPEP-3 data. The resulting transcripts are then segmented into words using the Jieba toolkit and the MDBG Chinese-English dictionary.

[0079] The speech emotion recognition function uses the Emotional Speech Dataset (ESD) to train the speech emotion recognizer. Log filter-bank (log-fbank) features are extracted from the ESD using librosa. Next, ResNet-18 is used to encode the audio features. Emotion classification is performed by stacking several fully connected layers on top of the feature encoder.

[0080] Furthermore, the manual feature extraction module, based on the data analysis results obtained from the basic feature processing module, can acquire the following data:

[0081] (1) The total length of the entire event audio.

[0082] (2) The total number of sentences spoken by the caregiver.

[0083] (3) The total number of sentences spoken by children.

[0084] (4) The average number of words per sentence for caregivers.

[0085] (5) The average number of words per sentence for children.

[0086] (6) The average number of words per sentence spoken by the caregiver.

[0087] (7) The average number of words per sentence for children.

[0088] (8) The length of words used by caregivers.

[0089] (9) Word length of children's vocabulary

[0090] (10) Number of speaker changes

[0091] (11) Average number of words per sentence for caregivers

[0092] (12) Median number of words per sentence spoken by caregivers

[0093] (13) The maximum number of words in the caregiver's 5 sentences

[0094] (14) Caregiver's vocabulary

[0095] (15) Average vocabulary per sentence for caregivers

[0096] (16) Median vocabulary size per sentence for caregivers

[0097] (17) The five largest words in each sentence spoken by the caregiver

[0098] (18) The ratio of caregiver's vocabulary to total vocabulary

[0099] (19) The number of words spoken by the caregiver before the caregiver is replaced

[0100] (20) Average number of words per sentence for children

[0101] (21) Median number of words per sentence for children

[0102] (22) The maximum number of words in a child's 5 longest sentences

[0103] (23) Children's vocabulary

[0104] (24) Average vocabulary size per sentence for children

[0105] (25) Median vocabulary size per sentence for children

[0106] (26) The 5 largest vocabulary numbers for each sentence in a child

[0107] (27) The ratio of children's vocabulary size to total word count

[0108] (28) Number of words a child speaks before switching speakers

[0109] (29) Frequency of occurrence of caregiver keywords

[0110] (30) Frequency of children's keywords
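A few of the manual features listed above can be computed from a diarized, word-segmented transcript as in the sketch below. The input layout (a list of `(speaker, words)` pairs with speaker labels "caregiver" and "child"), the function name `interaction_features`, and the feature key names are assumptions for illustration; only a subset of the 30 features is covered.

```python
from statistics import mean, median
from typing import Dict, List, Set, Tuple

Utterance = Tuple[str, List[str]]  # (speaker label, segmented words)


def interaction_features(utterances: List[Utterance], keywords: Set[str]) -> Dict[str, float]:
    """Compute per-speaker sentence, word, vocabulary and keyword statistics."""
    feats: Dict[str, float] = {}
    for role in ("caregiver", "child"):
        sentences = [words for speaker, words in utterances if speaker == role]
        lengths = [len(words) for words in sentences]
        vocabulary = {w for words in sentences for w in words}
        feats[f"{role}_sentences"] = len(sentences)
        feats[f"{role}_mean_words"] = mean(lengths) if lengths else 0.0
        feats[f"{role}_median_words"] = median(lengths) if lengths else 0.0
        feats[f"{role}_vocabulary"] = len(vocabulary)
        feats[f"{role}_keyword_hits"] = sum(w in keywords for words in sentences for w in words)
    # A speaker change occurs whenever consecutive utterances differ in speaker.
    feats["speaker_changes"] = sum(a[0] != b[0] for a, b in zip(utterances, utterances[1:]))
    return feats


transcript = [
    ("caregiver", ["look", "at", "the", "dog"]),
    ("child", ["dog"]),
    ("caregiver", ["yes", "a", "big", "dog"]),
    ("child", ["big", "dog"]),
]
features = interaction_features(transcript, keywords={"dog"})
```

Features of this kind stay interpretable: an expert can trace a predicted score back to, for example, how often the child took a turn.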

[0111] This invention also proposes a wake-word detection method for detecting keywords important for scoring and evaluation. In traditional methods, a trained wake-word model can only detect the wake word it was trained on, and the wake word cannot be replaced. In contrast, this invention employs a query-by-example keyword detection scheme: users only need to register template wake-word audio, and subsequent wake-word detection is performed against the current template. Building upon basic data augmentation, this invention utilizes multi-channel information to remove noise and reverberation-related content from deep network features, retaining only deep information relevant to wake-word recognition. 3D convolution is used to perform self-fusion learning on the multi-channel data, embedding the location of the sound source and filtering out noise. This method uses the deep learning network framework shown in Figure 4.

[0112] Specifically, in the data preparation stage before training, a large-scale speech recognition model is used to align a large amount of labeled audio data, obtaining the start and end times of the audio corresponding to each word. This allows short audio segments to be associated with words or characters. Because the length of these short audio segments varies, this method pads or trims them to a uniform length before passing them to the subsequent neural network training stage. During training, data augmentation is first performed on multi-channel audio from complex acoustic scenes, acoustic features are then extracted, and a front-end 3D convolutional layer fuses the multi-channel information. The fused information is passed through a CRNN neural network that can model both local and global information and that outputs an acoustic word embedding. This embedding is then fed to a multi-layer fully connected network that classifies the current short audio segment, that is, determines which word it belongs to. After multiple rounds of training, the acoustic word embedding incorporates semantic information from the text and can be used to compare the speech stream under detection with a registered template in the feature space. By using 3D convolutions, the network can learn the spatial location information between the channels, which mitigates the performance degradation caused by audio with a low signal-to-noise ratio and yields a highly robust wake-word detection network. In the testing phase, a custom-recorded registration audio is required, which the user registers in advance when the device is activated. This registration audio is fed into the neural network to extract the high-level semantic embedding of the acoustic word, and this embedding serves as the current user's template. The smart device then continuously listens to the audio stream using a sliding window. Each step of the window selects a short audio segment, which is fed into the neural network to extract its high-level semantic acoustic word embedding as the query vector. The similarity between the query embedding and the template embedding is calculated to estimate the probability that the currently registered wake word is present. When this probability exceeds a threshold, the wake word is considered detected.
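The test-phase procedure described in [0112] can be illustrated with a minimal sketch. It assumes cosine similarity as the similarity measure and uses the similarity score directly in place of a calibrated probability; the patent does not specify either choice. The names `pad_or_trim`, `cosine_similarity`, `detect_wake_word`, and `toy_embed` are hypothetical, and `toy_embed` merely stands in for the trained 3D-convolution and CRNN network that would produce the acoustic word embedding.

```python
import math
from typing import Callable, List, Sequence, Tuple

Vector = List[float]


def pad_or_trim(samples: Sequence[float], target_len: int, pad_value: float = 0.0) -> Vector:
    """Bring a variable-length segment to a uniform length by trimming or padding."""
    trimmed = list(samples[:target_len])
    return trimmed + [pad_value] * (target_len - len(trimmed))


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 when either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm > 0.0 else 0.0


def detect_wake_word(
    stream: Sequence[float],
    template: Sequence[float],
    embed: Callable[[Sequence[float]], Vector],
    window: int,
    hop: int,
    threshold: float,
) -> List[Tuple[int, float]]:
    """Slide a window over the stream and report (start, score) where the score exceeds the threshold."""
    hits = []
    for start in range(0, len(stream) - window + 1, hop):
        query = embed(stream[start:start + window])  # query embedding of the current segment
        score = cosine_similarity(query, template)   # compare with the registered template
        if score > threshold:
            hits.append((start, score))
    return hits


def toy_embed(segment: Sequence[float]) -> Vector:
    # Assumption: placeholder for the trained network; it only normalizes the length.
    return pad_or_trim(segment, 4)


# Registration: the user's template audio is embedded once.
template = toy_embed([1.0, -1.0, 1.0, -1.0])

# Detection: the same pattern appears at offset 4 in an otherwise silent stream.
stream = [0.0] * 4 + [1.0, -1.0, 1.0, -1.0] + [0.0] * 2
hits = detect_wake_word(stream, template, toy_embed, window=4, hop=1, threshold=0.9)
print(hits)
```

Because the template is only an embedding, replacing the wake word in this sketch amounts to calling the embedding function on new registration audio, without retraining anything.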

[0113] In a specific implementation, the interpretable human behavioral characteristics include skeletal behavioral characteristics and/or relative positional characteristics.

[0114] This invention also proposes a screening and scoring method for developmental disorders in children, which uses the non-contact collaborative artificial intelligence analysis system described in any of the above, together with a lightweight module, and comprises the following steps:

[0115] Training phase: Experimental data is collected and input into the non-contact collaborative artificial intelligence analysis system for training to obtain a large fusion prediction model. The lightweight module then derives a smaller model from this fusion prediction model by knowledge distillation: the predictions of the trained fusion prediction model are used as labels for supervised training of the smaller model, which yields a more lightweight model.
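The distillation step in [0115] can be sketched as follows. This is a toy illustration under stated assumptions, not the patented implementation: `teacher_predict` is a hypothetical stand-in for the trained fusion prediction model, the student is a plain linear regressor trained by gradient descent on mean squared error, and the input features are random numbers. In practice the student would be a smaller neural network.

```python
import random
from typing import Callable, List, Sequence, Tuple


def teacher_predict(features: Sequence[float]) -> float:
    # Assumption: stand-in for the trained (large) fusion prediction model.
    return 2.0 * features[0] - 1.0 * features[1] + 0.5


def distill_student(
    inputs: List[List[float]],
    teacher: Callable[[Sequence[float]], float],
    lr: float = 0.1,
    epochs: int = 500,
) -> Tuple[List[float], float]:
    """Train a linear student using the teacher's predictions as labels."""
    n_features = len(inputs[0])
    n_samples = len(inputs)
    weights = [0.0] * n_features
    bias = 0.0
    # The teacher's outputs replace human-provided scores as training targets.
    labels = [teacher(x) for x in inputs]
    for _ in range(epochs):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for x, y in zip(inputs, labels):
            error = sum(w * xi for w, xi in zip(weights, x)) + bias - y
            for i in range(n_features):
                grad_w[i] += 2.0 * error * x[i] / n_samples
            grad_b += 2.0 * error / n_samples
        weights = [w - lr * g for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b
    return weights, bias


rng = random.Random(0)  # fixed seed so the run is reproducible
inputs = [[rng.uniform(-1.0, 1.0), rng.uniform(-1.0, 1.0)] for _ in range(50)]
student_weights, student_bias = distill_student(inputs, teacher_predict)
print(student_weights, student_bias)
```

No manually assigned scores are needed in this loop, since the student only ever sees the teacher's predictions. This is what allows additional unlabeled recordings to be used when training the lightweight model.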

[0116] Prediction phase: Observation is carried out using an integrated data collection table, with a hidden camera installed in the middle of the data collection table and/or a side camera, able to observe the activities of the caregiver and the child, placed 1 to 3 meters away from the data collection table.

[0117] The caregiver and the child sit facing each other and play. Toys are placed on the collection table, including: picture books, cardboard boxes, animal models, paintbrushes, drawing paper, and hand puppets.

[0118] Once recording mode is activated, the caregiver and the child use the collection table as their activity space and the provided toys as their tools to carry out the experimental activities.

[0119] The scoring data obtained during the recording process is input into the small model to obtain a prediction of the interaction score between the two people.

[0120] In a preferred embodiment, the experimental activities include: the caregiver telling stories to the child using picture books and interacting with the child about the story content; the caregiver and the child playing a cardboard stacking game; the caregiver racing with the child using animal models; the caregiver accompanying the child to draw with paintbrushes; the caregiver and the child taking turns telling stories using hand puppets; and the caregiver and the child engaging in a series of free play activities.

[0121] The beneficial effects of this invention are as follows:

[0122] 1. A portable computer system is provided that can accurately and objectively score children's performance so that experts and doctors can provide evaluations and suggestions.

[0123] 2. A specific speaker recognition system is proposed, which effectively improves the accuracy of speaker recognition.

[0124] 3. A wake word detection method is proposed; it can automatically determine the wake words that need to be detected.

[0125] 4. A binary classification prediction method for speech age (adult / child) is proposed; this method identifies the speaker's identity information more accurately and prevents the overall network from getting stuck in local optima of the dataset, which would otherwise reduce its generalization ability.

[0126] 5. A lightweight knowledge distillation method for converting multi-camera data into single-camera data is proposed, which can expand the application scenarios of the non-contact collaborative artificial intelligence analysis system of this invention.

[0127] The above description is only a preferred embodiment of the present invention and is not intended to limit the present invention. Any modifications, equivalent substitutions, improvements, etc., made within the spirit and principles of the present invention should be included within the protection scope of the present invention.

Claims

1. A non-contact collaborative artificial intelligence analysis system, characterized in that it comprises:
a training data acquisition module, used to collect audio and video data from the participants during the activity;
an active range detection module, used to select the active speech data in the audio data and the active image data containing human activity in the video data, wherein the active range detection module includes a video active range detection module and an audio active range detection module; the video active range detection module includes a human image detection function and an age detection function; the age detection function predicts the bounding box of the experimenter, determines whether the experimenter is an adult or a minor, and labels the experimenter accordingly; the human image detection function detects the minimum active range matrix corresponding to each experimenter in the image; the audio active range detection module includes a specific speaker recognition system, used to obtain speaker diarization results through clustering methods and to identify overlapping regions where multiple people are speaking simultaneously;
an identity recognition module, used to determine the identity and gender of the experimenter, wherein the identity recognition module includes a voice age recognition function and an image age recognition function; the voice age recognition function is used to convert the input one-dimensional audio signal into two-dimensional acoustic features, convert the acoustic features into a high-dimensional feature map, and then pass the high-dimensional feature map through two different attention mechanism encoding layers to obtain the speaker's identity information, gender information, and age information;
an image feature extraction module, including an interpretable feature extraction module and an end-to-end feature extraction module, wherein the interpretable feature extraction module is used to extract interpretable human behavior features from the video data;
a speech feature processing module, including a basic feature processing module and a manual feature extraction module, wherein the basic feature processing module is used to perform basic feature processing on the active speech data, and the manual feature extraction module is used to extract interpretable manual features from the active speech data; the basic feature processing module includes a speech-to-text recognition function, a speech keyword detection function, and a speech-to-expression recognition function; the speech-to-text recognition function is based on a speech recognition model with a Conformer architecture, which is used to transcribe speech to text and perform word segmentation; the speech keyword detection function is used to detect keywords that are important for rating and evaluation; the speech-to-expression recognition function uses an emotional speech dataset to train a speech emotion recognizer, which is used to collect and classify the expressions of the experimental subjects;
a feature sequence processing module, used to combine the active speech data and the active image data obtained by the active range detection module to obtain a feature time series with active information; and
a fusion prediction module, used to fuse the predicted values of the active speech data and the active image data, and to train the model based on the scores of the experimenters to obtain a large fusion prediction model.

2. The non-contact collaborative artificial intelligence analysis system according to claim 1, characterized in that the training data acquisition module includes several video recording devices and audio recording devices, which are arranged around the experimenters.

3. The non-contact collaborative artificial intelligence analysis system according to claim 1, characterized in that the interpretable human behavior features include skeletal behavior features and/or relative position features.

4. A screening and scoring method for developmental disorders in children, characterized in that it uses the non-contact collaborative artificial intelligence analysis system according to any one of claims 1 to 3 and comprises the following steps:
Training phase: collect experimental data, input the experimental data into the non-contact collaborative artificial intelligence analysis system for training to obtain a large fusion prediction model, and obtain a small model through a lightweight module;
Prediction phase: observation is carried out using an integrated data collection table, with a camera installed in the middle of the table and/or a side camera placed outside the table that can observe the activities of the caregiver and the child;
the caregiver and the child sit facing each other and play, with toys placed on the collection table, including: picture books, cardboard boxes, animal models, paintbrushes, drawing paper, and hand puppets;
once recording mode is activated, the caregiver and the child use the collection table as their activity area and the provided toys as their tools to conduct experimental activities;
the scoring data obtained during the recording process is input into the small model to obtain a prediction of the interaction score between the two people.

5. The method for screening and scoring developmental disorders in children according to claim 4, characterized in that the experimental activities include: the caregiver telling stories to the child using picture books and interacting with the child about the story content; the caregiver and the child playing a cardboard stacking game; the caregiver racing with the child using animal models; the caregiver accompanying the child to draw with paintbrushes; and the caregiver and the child taking turns telling stories using hand puppets.