An intelligent assessment method for autism based on multimodal perception and rule-based reasoning

The intelligent autism assessment method based on multimodal perception and rule-based reasoning solves the problems of strong subjectivity, long assessment cycle and insufficient robustness in existing autism diagnosis, and realizes the generation of more accurate and interpretable assessment reports, which are applicable to a variety of medical scenarios.

CN122245794A · Pending · Publication Date: 2026-06-19 · ZHEJIANG UNIV

Patent Information

Authority / Receiving Office: CN · China
Patent Type: Applications(China)
Current Assignee / Owner: ZHEJIANG UNIV
Filing Date: 2026-05-25
Publication Date: 2026-06-19

IPC: G16H50/30; G16H20/70; G16H15/00; G06N5/025; G06N5/04

AI Tagging

Application Domain

Health-index calculation; Mental therapies

AI Technical Summary

Technical Problem

Current methods for diagnosing autism rely on the subjective observation of professional doctors, which are characterized by long assessment cycles, high costs, limited reproducibility, difficulty in fully utilizing multimodal behavioral information, insufficient robustness, lack of clear scoring criteria in black-box models, and susceptibility to noise in complex environments.

Method used

The invention uses an intelligent assessment method for autism based on multimodal perception and rule-based reasoning. It simultaneously collects visual, auditory, and interactive behavioral data through cameras and microphones, generates item-level scores through rule-based reasoning, and outputs an assessment report with explicit scoring evidence.

Benefits of technology

It improves the accuracy and interpretability of autism assessment, reduces manual processing costs, enhances the credibility and clinical applicability of the assessment process, and is suitable for hospital, child development screening, and remote diagnosis scenarios.

✦ Generated by Eureka AI based on patent content.

Abstract

This invention discloses an intelligent assessment method for autism based on multimodal perception and rule-based reasoning. It simultaneously collects audio, video, and interaction data of the assessed individual during the assessment process using a camera and microphone. The audio and video data are preprocessed to construct a multimodal behavioral sequence along a unified timeline. Behavioral features are then extracted, including the percentage of time the assessed individual spends gazing at the assessor, the average response delay, the proportion of repeated phrases, the target following success rate, the number of times proactive communication is initiated, and the magnitude of facial expression changes. Based on basic rules, combination rules, and correction rules, scoring reasoning and overall risk assessment are performed for three assessment items: social interaction ability, language communication ability, and emotional interaction ability. Finally, a structured assessment report containing the scoring basis and explanatory information is generated by combining key evidence fragments. This invention overcomes the problems of over-reliance on textual information alone, lack of nonverbal behavior analysis, and opaque assessment logic.

Description

Technical Field

[0001] This invention relates to the field of artificial intelligence medical diagnostic technology, and in particular to an intelligent assessment method for autism based on multimodal perception and rule-based reasoning.

Background Technology

[0002] Autism spectrum disorders are a common group of neurodevelopmental disorders, whose core symptoms typically manifest as impairments in social interaction, communication, and repetitive, stereotyped behaviors. Clinically, autism diagnosis usually relies on professional physicians observing and scoring children's behavior using standardized scales in specific tasks and interactive scenarios. This process demands a high level of professional experience and on-site observation skills from the physician, involves a degree of subjectivity, and is characterized by long assessment cycles, high costs, and limited reproducibility.

[0003] While existing technologies attempt to assist in autism diagnosis using methods such as automatic speech recognition, machine learning, deep learning, or large language models, the following problems still exist:

[0004] 1. Some solutions are mainly based on speech-to-text analysis, which makes it difficult to make full use of non-verbal behavioral information such as eye contact, facial expressions, body movements and interaction rhythm;

[0005] 2. Although some black-box models can output prediction results, they cannot readily provide clear scoring criteria, resulting in insufficient clinical acceptability;

[0006] 3. Existing systems are susceptible to factors such as noise, occlusion, changes in perspective, and interruption of dialogue in complex clinical environments, resulting in insufficient robustness.

[0007] 4. Existing methods mostly rely on model predictions as the main basis, but do not fully integrate clinical scale logic to construct a controllable and interpretable rule paradigm;

[0008] 5. In real-world diagnostic scenarios, autism assessment is essentially a multimodal behavioral analysis task. Relying solely on text or single speech features is insufficient to fully reflect the social interaction status and nonverbal expression abilities of the assessed individual.

[0009] Existing research has shown that relying solely on speech-to-text transcription results in the loss of crucial information for autism assessment, such as tone and emotion. Furthermore, autism diagnosis itself involves multimodal cues related to behavior, social interaction, and communication. Therefore, there is a need for an intelligent assessment system and method that can comprehensively utilize visual, auditory, and interactive behavioral information in real-world assessment scenarios, and employ rule-based reasoning to complete item-level scoring, thereby improving the system's accuracy, interpretability, and clinical applicability.

Summary of the Invention

[0010] The purpose of this invention is to address the shortcomings of existing technologies by proposing an intelligent assessment method for autism based on multimodal perception and rule-based reasoning.

[0011] The objective of this invention is achieved through the following technical solution: an intelligent assessment method for autism based on multimodal perception and rule-based reasoning, the method comprising:

[0012] Collect video data, audio signals, and time points of interaction events during the communication process between the evaluated subject and the evaluator;

[0013] Preprocess the collected data: using a speaker separation model pre-trained with temporal continuity constraints and voiceprint difference constraints, the speech of the evaluator and that of the evaluated subject are distinguished and transcribed into text; facial expression and posture features are extracted from the video stream; and the preprocessed audio and video data and the interaction event data are mapped to a unified timeline;

[0014] Extract, from the multimodal behavioral sequence on the unified time axis, the behavioral features corresponding to the evaluation rules;

[0015] Based on the aforementioned behavioral characteristics, hierarchical reasoning is performed using basic rules, combined rules, and modified rules to generate item-level scoring results and overall risk assessment results;

[0016] Explanatory information is generated based on the item-level scoring results, rule hit results, and corresponding evidence fragments;

[0017] The output includes an assessment report containing item-level scoring results, overall risk assessment results, and explanatory information.

[0018] Furthermore, the collection of video data, voice signals, and interaction event time points during the communication process between the subject of evaluation and the evaluator includes: collecting facial images, head postures, and body movements of the subject of evaluation; collecting voice signals between the subject of evaluation and the evaluator; and recording time points of questions, responses, task switching, and other events.

[0019] Furthermore, the speaker separation model pre-trained with temporal continuity constraints and voiceprint difference constraints includes:

[0020] The speaker separation model is trained using an objective function that implements temporal smoothing and role constraints. The objective function includes a basic classification loss, which indicates whether the speaker in the current frame is correctly identified by comparing the true speaker label at time t with the speaker label predicted by the model for time t, together with a temporal smoothing term and a role constraint loss.

[0021] Furthermore, the extraction of facial expression and posture features from the video stream specifically includes: performing frame-by-frame face detection and key point localization on the video stream, estimating head orientation and gaze direction; simultaneously, obtaining facial expression states in different time periods through an expression recognition model, and identifying the hand movements, body swaying, and repetitive behaviors of the evaluated subject through a posture estimation model.

[0022] Furthermore, extracting the behavioral features corresponding to the evaluation rules from the multimodal behavioral sequence on the unified time axis includes:

[0023] Extract, from the video stream, the percentage of time the evaluated subject spends gazing at the evaluator, the success rate of the subject's gaze target following, and the magnitude of facial expression changes;

[0024] Extract, from the speech data, the average response delay of the evaluated subject, the proportion of repeated phrases, and the speaker separation confidence;

[0025] Extract, from the interaction behavior, the number of proactively initiated interactions, the response stability index of the evaluated subject within the corresponding response window, and the duration of the non-frontal view;

[0026] All extracted features are normalized.

[0027] Furthermore, the basic rules include:

[0028] Rule R1: When the percentage of time the evaluated subject spends gazing at the evaluator during the preset interaction phase is less than the corresponding threshold, basic rule R1 is determined to be hit, and the social attention abnormality score is output;

[0029] Rule R2: When the average response delay of the evaluated subject is greater than the corresponding threshold, basic rule R2 is determined to be hit, and the social response abnormality score is output;

[0030] Rule R3: When the proportion of repeated phrases of the evaluated subject exceeds the corresponding threshold, basic rule R3 is determined to be hit, and the stereotyped language abnormality score is output;

[0031] Rule R4: When the target following success rate in the joint attention task is lower than the corresponding threshold, basic rule R4 is determined to be hit, and the joint attention abnormality score is output;

[0032] Rule R5: When the number of times the evaluated subject proactively initiates communication is below the corresponding threshold, basic rule R5 is determined to be hit, and the insufficient proactive communication score is output;

[0033] Rule R6: When the magnitude of the evaluated subject's facial expression change is below the corresponding threshold, basic rule R6 is determined to be hit, and the restricted emotional expression abnormality score is output.
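The six basic rules above are single-feature threshold comparisons. The following is a minimal Python sketch of that reading, not the patented implementation: the feature keys, the threshold values, and the score values are all hypothetical placeholders, since the text does not state numeric values.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class BasicRule:
    name: str         # rule identifier, e.g. "R1"
    feature: str      # hypothetical key of the normalized feature
    direction: str    # "lt": hit when value < threshold; "gt": hit when value > threshold
    threshold: float  # placeholder value, not taken from the patent
    score: float      # placeholder anomaly score, not taken from the patent


BASIC_RULES = [
    BasicRule("R1", "gaze_ratio", "lt", 0.30, 1.0),
    BasicRule("R2", "mean_response_delay", "gt", 0.60, 1.0),
    BasicRule("R3", "repeated_phrase_ratio", "gt", 0.40, 1.0),
    BasicRule("R4", "target_follow_rate", "lt", 0.50, 1.0),
    BasicRule("R5", "initiation_count", "lt", 0.20, 1.0),
    BasicRule("R6", "expression_amplitude", "lt", 0.25, 1.0),
]


def apply_basic_rules(features: Dict[str, float]) -> Dict[str, float]:
    """Return the anomaly score per rule; 0.0 means the rule was not hit."""
    scores = {}
    for rule in BASIC_RULES:
        value = features[rule.feature]
        if rule.direction == "lt":
            hit = value < rule.threshold
        else:
            hit = value > rule.threshold
        scores[rule.name] = rule.score if hit else 0.0
    return scores
```

Keeping each rule as data (feature, comparison, threshold, score) mirrors the four-part rule structure described later in the embodiment and makes every hit traceable in a report.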

[0034] Furthermore, the combination rules include:

[0035] The average consistency between the abnormal scores of each basic rule and the expert scores is calculated, and the weight of each basic rule is obtained after normalization.

[0036] Several basic rules are selected to form three combination rules: insufficient social attention, slow response, and limited emotional expression.

[0037] The scores of the selected basic rules are weighted and summed. The joint enhancement coefficient is multiplied by the unweighted sum of the scores of the selected basic rules. The weighted sum and the enhancement result are then added together to obtain the combination rule enhancement score.
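The weighting and enhancement steps can be sketched as follows. This is an illustrative Python sketch under stated assumptions: the function names, the example member rules, and the `min_hits` trigger (enhancement only when at least two member rules are hit, and a plain weighted sum otherwise) are assumptions, not details given in the text.

```python
from typing import Dict, Sequence


def normalize_weights(consistency: Dict[str, float]) -> Dict[str, float]:
    """Turn per-rule average consistency with expert scores into weights summing to 1."""
    total = sum(consistency.values())
    if total <= 0:
        raise ValueError("consistency values must sum to a positive number")
    return {rule: value / total for rule, value in consistency.items()}


def combined_rule_score(
    scores: Dict[str, float],
    weights: Dict[str, float],
    members: Sequence[str],
    gamma: float,
    min_hits: int = 2,  # assumed trigger condition
) -> float:
    """Weighted sum of member scores plus gamma times their unweighted sum."""
    member_scores = [scores[rule] for rule in members]
    weighted = sum(weights[rule] * scores[rule] for rule in members)
    hits = sum(1 for s in member_scores if s > 0)
    if hits < min_hits:
        return weighted  # assumption: no enhancement when too few rules are hit
    return weighted + gamma * sum(member_scores)
```

For example, with member scores R1 = 1, R4 = 1, R5 = 0, weights 0.5, 0.3, 0.2 and a joint enhancement coefficient of 0.1, the weighted sum is 0.8 and the enhancement adds 0.1 × 2 = 0.2.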

[0038] Furthermore, the correction rules are used to correct the credibility of the aforementioned candidate scores based on the acquisition quality and scene reliability, and to adjust the influence weight of the corresponding modal features in the final score, including the following rules:

[0039] Rule F1: When the duration of the non-frontal view exceeds the corresponding threshold, reduce the influence of visual social features on the scoring results;

[0040] Rule F2: When the speaker separation confidence is below the corresponding threshold, reduce the influence of language expression features on the scoring results;

[0041] Rule F3: When the overall stability of the subject's response to interactive stimuli is below the corresponding threshold, reduce the influence of emotional interaction-related features on the scoring results.
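One way to read the three correction rules is as per-modality confidence coefficients that stay at 1.0 unless a quality check fails. The sketch below assumes a single fixed `penalty` factor and hypothetical threshold keys; the text does not specify how much the influence is reduced.

```python
from typing import Dict


def correction_coefficients(
    non_frontal_duration: float,
    separation_confidence: float,
    response_stability: float,
    thresholds: Dict[str, float],
    penalty: float = 0.5,  # assumed down-weighting factor, not from the patent
) -> Dict[str, float]:
    """Return a confidence coefficient per correction rule (1.0 = no correction)."""
    return {
        # F1: long non-frontal view lowers trust in visual social features
        "F1": penalty if non_frontal_duration > thresholds["non_frontal"] else 1.0,
        # F2: low speaker separation confidence lowers trust in language features
        "F2": penalty if separation_confidence < thresholds["separation"] else 1.0,
        # F3: low response stability lowers trust in emotional interaction features
        "F3": penalty if response_stability < thresholds["stability"] else 1.0,
    }
```

The coefficients could then multiply the candidate scores of the matching modality, which keeps the correction visible as a separate, explainable step.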

[0042] Furthermore, the threshold conditions are determined by combining historical sample statistics with expert annotation calibration. For each behavioral feature, firstly, historical samples that have completed manual evaluation are collected and divided into different level sample sets according to the expert scoring results; then, the distribution of the behavioral feature in different level sample sets is statistically analyzed to generate a candidate threshold set; then, the consistency between the automatic scoring results and the expert scoring results corresponding to each candidate threshold is calculated; finally, the candidate threshold with the highest consistency is selected as the scoring threshold corresponding to the behavioral feature.
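The calibration procedure above can be illustrated with a simplified Python sketch. It assumes only two expert levels (abnormal or not), uses midpoints between observed feature values as the candidate threshold set, and uses plain agreement rate as the consistency measure; the text does not fix any of these choices.

```python
from typing import List, Optional, Sequence, Tuple


def candidate_thresholds(values: Sequence[float]) -> List[float]:
    """Midpoints between consecutive distinct feature values (assumed candidate set)."""
    uniq = sorted(set(values))
    return [(a + b) / 2 for a, b in zip(uniq, uniq[1:])]


def calibrate_threshold(
    values: Sequence[float],
    expert_abnormal: Sequence[bool],
    lower_is_abnormal: bool = True,
) -> Tuple[Optional[float], float]:
    """Pick the candidate threshold whose automatic labels agree most with the experts."""
    best_threshold, best_agreement = None, -1.0
    for t in candidate_thresholds(values):
        if lower_is_abnormal:
            predicted = [v < t for v in values]
        else:
            predicted = [v > t for v in values]
        matches = sum(p == e for p, e in zip(predicted, expert_abnormal))
        agreement = matches / len(values)
        if agreement > best_agreement:
            best_threshold, best_agreement = t, agreement
    return best_threshold, best_agreement
```

With fewer than two distinct feature values there are no candidates, and the function returns `(None, -1.0)`.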

[0043] Furthermore, the generation of explanatory information based on the item-level scoring results, rule-hitting results, and corresponding evidence fragments includes:

[0044] Set up assessment items for social interaction ability, language communication ability, and emotional interaction ability;

[0045] The social interaction ability assessment item is used to evaluate the behavior of the assessed individual in terms of social attention, joint attention, and proactive communication, and reflects the degree of abnormality in social interaction ability. It corresponds to basic rules R1, R4, and R5, correction rule F1, and the insufficient social attention combination rule.

[0046] The language communication ability assessment item is used to evaluate the behavior of the assessed individual in terms of response, language expression, and initiative in communication, and reflects the degree of abnormality in language communication ability. It corresponds to basic rules R2, R3, and R5, correction rule F2, and the slow response combination rule.

[0047] The emotional interaction ability assessment item is used to evaluate the behavior of the assessed individual in terms of social attention, responsiveness, and emotional expression, and reflects the degree of abnormality in emotional interaction ability. It corresponds to basic rules R1, R2, and R6, correction rule F3, and the limited emotional expression combination rule.
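The item-to-rule correspondence described above can be written down as a small lookup table. The following Python sketch encodes only the mapping stated in the text; the dictionary keys and the helper name `hit_rules_by_item` are illustrative choices.

```python
from typing import Dict, List

# Mapping of assessment items to their basic, correction, and combination rules.
ITEM_RULES = {
    "social_interaction": {
        "basic": ("R1", "R4", "R5"),
        "correction": "F1",
        "combination": "insufficient social attention",
    },
    "language_communication": {
        "basic": ("R2", "R3", "R5"),
        "correction": "F2",
        "combination": "slow response",
    },
    "emotional_interaction": {
        "basic": ("R1", "R2", "R6"),
        "correction": "F3",
        "combination": "limited emotional expression",
    },
}


def hit_rules_by_item(basic_scores: Dict[str, float]) -> Dict[str, List[str]]:
    """List, per assessment item, the basic rules with a non-zero anomaly score."""
    return {
        item: [rule for rule in spec["basic"] if basic_scores.get(rule, 0.0) > 0]
        for item, spec in ITEM_RULES.items()
    }
```

The per-item list of hit rules is the kind of "rule hit result" that can be attached to evidence fragments when explanatory information is generated.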

[0048] The beneficial effects of this invention are:

[0049] 1. By introducing cameras and microphones, multimodal synchronous acquisition of visual, auditory and interactive behaviors is achieved. Combined with interactive events such as questioning and response, time alignment is performed to form structured behavioral fragments for evaluation items, thereby more accurately identifying nonverbal abnormal behaviors.

[0050] 2. By constructing a scoring mechanism centered on rule-based reasoning, the assessment process has a clear logical path and controllability, thereby enhancing clinical credibility;

[0051] 3. Through item-level multimodal feature analysis, abnormal patterns such as social interaction, language communication, and stereotyped behavior can be identified more precisely, thereby improving assessment sensitivity;

[0052] 4. The interpretation and generation module automatically organizes evidence and conclusions, and can output interpretable reports for doctors, reducing the cost of manual processing.

[0053] 5. This invention can be deployed in hospital assessment rooms, child development screening scenarios, remote assisted diagnosis scenarios, and portable terminals, and has good practical application value.

Attached Figure Description

[0054] Figure 1 is a block diagram of the overall structure of the intelligent assessment method for autism based on visual and auditory multimodal perception and rule reasoning in an embodiment of the present invention.

[0055] Figure 2 is a schematic diagram illustrating the process of multimodal data acquisition, preprocessing, behavioral feature extraction, rule reasoning, and result output in an embodiment of the present invention.

[0056] Figure 3 is a schematic diagram of the item-level rule reasoning mechanism in an embodiment of the present invention.

[0057] Figure 4 is a schematic diagram illustrating the explanation generation and doctor review interaction interface in an embodiment of the present invention.

[0058] Figure 5 is a comparison chart of item-level accuracy in an embodiment of the present invention.

[0059] Figure 6 is a comparison chart of the accuracy of ablation experiments in an embodiment of the present invention.

[0060] Figure 7 is a graph showing the accuracy variation under non-frontal view interference conditions in an embodiment of the present invention.

[0061] Figure 8 is a confusion matrix diagram of overall risk classification in an embodiment of the present invention.

Detailed Implementation

[0062] The specific embodiments of the present invention will be further described in detail below with reference to the accompanying drawings.

[0063] As shown in Figure 1, this embodiment provides an intelligent assessment method for autism based on visual and auditory multimodal perception and rule reasoning, including: multimodal data acquisition, preprocessing, behavioral feature extraction, rule reasoning and scoring, interpretation generation, and result output.

[0064] Step 1: Data Acquisition and Preprocessing

[0065] In this embodiment, multimodal data acquisition is configured on a clinical assessment terminal. The terminal includes a high-definition camera, an array microphone, and an interactive recording unit. The camera is used to acquire facial images, head posture, and limb movements of the subject being assessed; the microphone is used to acquire voice signals between the subject and the assessor; and the interactive recording unit is used to record time points of questions, responses, task switching, and other events.

[0066] Preprocessing begins with noise reduction and speech activity detection of the audio signal to remove ambient background noise and invalid silence segments. Then, a speaker separation model distinguishes between the evaluator's and the evaluated subject's speeches, and the evaluated subject's speech segments are input into an automatic speech recognition model to obtain sentence-by-sentence transcribed text.

[0067] Among them, the speaker separation model is not a general multi-person speech separation model, but is adapted and optimized for the dual-subject interaction features of the evaluator and the evaluated in the evaluation scenario. By introducing temporal continuity constraints and voiceprint difference constraints, it achieves stable differentiation in short sentence alternation and overlapping speech scenarios.

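[0068] Let the audio feature sequence be:

[0069] $$X = \{x_1, x_2, \ldots, x_T\}$$

[0070] The model outputs the probability that the speaker at time t belongs to speaker k:

[0071] $$p_t(k) = P(\hat{y}_t = k \mid X)$$

[0072] where $y_t$ is the true speaker label at time t, $\hat{y}_t$ is the speaker label predicted by the model, and the two values of $k$ represent the evaluator and the evaluated subject, respectively.

[0073] This method incorporates an objective function that achieves temporal smoothing and role constraints:

[0074] $$\mathcal{L} = \mathcal{L}_{cls}(y_t, \hat{y}_t) + \lambda_1 \mathcal{L}_{smooth} + \lambda_2 \mathcal{L}_{role}$$

[0075] where $\mathcal{L}_{cls}$ is the basic classification loss, indicating whether the speaker classification in the current frame is correct; $y_t$ is the true speaker label at time t; $\hat{y}_t$ is the speaker label predicted by the model; $\mathcal{L}_{smooth}$ is the temporal smoothing term, which avoids frequent jitter of speaker labels; $\mathcal{L}_{role}$ is the role constraint loss, which reflects the characteristics of the dual-role "evaluator and evaluated subject" scenario; and $\lambda_1$ and $\lambda_2$ are the weighting coefficients.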

[0076] The objective function is the training objective of the speaker separation model, used to jointly constrain classification accuracy, temporal continuity, and role consistency during training.
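To make the three-part training objective concrete, here is a pure-Python sketch of one possible instantiation. The specific forms are assumptions, since the text names the terms but does not define them: cross-entropy for the classification loss, mean squared change of adjacent-frame posteriors for the smoothing term, and cosine similarity between the two roles' posterior-weighted mean embeddings as a stand-in for the role (voiceprint difference) constraint.

```python
import math
from typing import Sequence


def classification_loss(probs: Sequence[Sequence[float]], labels: Sequence[int]) -> float:
    """Mean cross-entropy between per-frame posteriors and true speaker labels."""
    eps = 1e-12
    return -sum(math.log(p[y] + eps) for p, y in zip(probs, labels)) / len(labels)


def smoothness_loss(probs: Sequence[Sequence[float]]) -> float:
    """Mean squared change of posteriors between adjacent frames (penalizes jitter)."""
    if len(probs) < 2:
        return 0.0
    total = 0.0
    for prev, cur in zip(probs, probs[1:]):
        total += sum((c - q) ** 2 for q, c in zip(prev, cur))
    return total / (len(probs) - 1)


def role_loss(probs: Sequence[Sequence[float]], embeddings: Sequence[Sequence[float]]) -> float:
    """Assumed role term: cosine similarity of the two roles' mean voice embeddings."""
    dim = len(embeddings[0])
    means = []
    for k in (0, 1):  # 0 = evaluator, 1 = evaluated subject (assumed encoding)
        weight = sum(p[k] for p in probs) + 1e-12
        means.append(
            [sum(p[k] * e[d] for p, e in zip(probs, embeddings)) / weight for d in range(dim)]
        )
    dot = sum(a * b for a, b in zip(means[0], means[1]))
    norm = math.sqrt(sum(a * a for a in means[0])) * math.sqrt(sum(b * b for b in means[1]))
    return dot / (norm + 1e-12)


def total_loss(probs, labels, embeddings, lam_smooth: float = 0.1, lam_role: float = 0.1) -> float:
    """Classification loss plus weighted smoothing and role terms."""
    return (
        classification_loss(probs, labels)
        + lam_smooth * smoothness_loss(probs)
        + lam_role * role_loss(probs, embeddings)
    )
```

In a real training setup these terms would be computed on tensors with automatic differentiation; the list-based version here only shows how the terms are combined.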

[0077] The automatic speech recognition model is enhanced in a contextualized way to address the characteristics of children's speech, such as unclear pronunciation, frequent pauses, repetitive expressions, and short responses. The model is retrained using the collected children's speech dataset to improve the accuracy of recognizing the speech content and abnormal language patterns of the evaluated subjects.

[0078] In this embodiment, the automatic speech recognition model is used to convert the speech signal of the evaluated subject into a text sequence. Let the input speech feature be X and the output text sequence be Y; the decoding objective of the automatic speech recognition model can then be expressed as:

[0079] $$\hat{Y} = \arg\max_{Y} \left[ \log P(Y \mid X) + \alpha \log P(Y \mid C) + \beta \log P(Y \mid K) \right]$$

[0080] where $P(Y \mid X)$ represents the probability of recognizing the text sequence Y based on the input speech features X; $P(Y \mid C)$ represents the context constraint probability of the text sequence Y after combining the evaluation context information C; $P(Y \mid K)$ represents the keyword enhancement probability of the text sequence Y after combining the keyword set K of the evaluation scenario; and $\alpha$ and $\beta$ are weighting coefficients used to adjust the influence of the context constraint and the keyword enhancement during recognition.

[0081] In this way, the automatic speech recognition model not only utilizes acoustic features of speech for text decoding, but also combines contextual information and preset keywords from the evaluation scenario to constrain and correct the recognition results, thereby improving the accuracy of recognizing speech features such as unclear pronunciation, frequent pauses, repetitive expressions, and short sentence responses in children. Furthermore, the text results output by the automatic speech recognition model can serve as input for subsequent language behavior analysis, rule reasoning, and scoring.
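The idea of constraining recognition with context and keywords can be illustrated as n-best rescoring. The sketch below assumes the three probabilities are combined log-linearly with two weights (here `alpha` and `beta`) and that per-hypothesis log-probabilities are already available; the `Hypothesis` type and its field names are hypothetical.

```python
from typing import NamedTuple, Sequence


class Hypothesis(NamedTuple):
    text: str
    acoustic_logp: float  # log-probability of the text given the speech features
    context_logp: float   # log-probability of the text given the evaluation context
    keyword_logp: float   # log-probability of the text given the scenario keyword set


def rescore(hypotheses: Sequence[Hypothesis], alpha: float, beta: float) -> Hypothesis:
    """Pick the hypothesis with the highest weighted sum of the three log-probabilities."""
    return max(
        hypotheses,
        key=lambda h: h.acoustic_logp + alpha * h.context_logp + beta * h.keyword_logp,
    )
```

With both weights set to zero the choice falls back to the acoustic score alone, which makes the contribution of the context and keyword terms easy to inspect.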

[0082] The video preprocessing section performs frame-by-frame face detection and key point localization on the video stream, and further estimates head orientation and gaze direction. Simultaneously, it uses an expression recognition model to obtain facial expression states at different time intervals, and a pose estimation model to identify the subject's hand movements, body swaying, and repetitive behaviors. Finally, the audio analysis results, video analysis results, and interaction events are mapped onto a unified timeline to form a multimodal behavior sequence.
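Mapping the three streams onto a unified timeline amounts to merging timestamped records and then slicing them by time window. A minimal Python sketch follows; the `(timestamp, payload)` input format and the function names are assumptions, and all timestamps are assumed to share one clock.

```python
from typing import Any, List, Sequence, Tuple

Event = Tuple[float, str, Any]  # (timestamp in seconds, modality, payload)


def build_timeline(
    audio: Sequence[Tuple[float, Any]],
    video: Sequence[Tuple[float, Any]],
    interaction: Sequence[Tuple[float, Any]],
) -> List[Event]:
    """Merge per-modality (timestamp, payload) records into one time-ordered sequence."""
    merged: List[Event] = []
    merged += [(t, "audio", p) for t, p in audio]
    merged += [(t, "video", p) for t, p in video]
    merged += [(t, "interaction", p) for t, p in interaction]
    # sorted() is stable, so records with equal timestamps keep the order above
    return sorted(merged, key=lambda event: event[0])


def slice_window(timeline: Sequence[Event], start: float, end: float) -> List[Event]:
    """Return the events with start <= timestamp < end."""
    return [event for event in timeline if start <= event[0] < end]
```

Slicing by interaction event boundaries (for example from a question to the next question) yields the behavioral segments used for feature extraction.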

[0083] Step 2: Behavioral Feature Extraction

[0084] As shown in Figure 2, behavioral feature extraction involves segmenting the behavioral segments of the evaluated subject from the multimodal behavioral sequence on the unified time axis according to a preset time window and interaction event boundaries, and extracting the language expression features, speech prosody features, visual social features, and interactive behavior features corresponding to the subsequent basic rules, combination rules, and correction rules to form a structured feature vector for rule-oriented reasoning.

[0085] In terms of visual social feature extraction, the system analyzes the social attention behavior, joint attention behavior, and emotional expression behavior of the evaluated subjects based on the results of face detection, facial landmark localization, head pose estimation, gaze direction estimation, and facial expression recognition. Regarding the percentage of time the subject gazes at the evaluator, the system first determines the evaluator's location in the video frame and, combined with the subject's head orientation and gaze direction, determines whether the subject gazes at the evaluator at various times. Then, it calculates the total time the subject's gaze falls within the evaluator's area during a preset interaction phase and compares this with the total interaction time of that phase to obtain the percentage of time the subject gazes at the evaluator. This feature serves as the input to basic rule R1. For the target following success rate in the joint attention task, the system identifies the starting time of the evaluator's target indication based on the event markers of the joint attention task and detects whether the subject's gaze shifts from the evaluator's area to the target object area within a preset time window. It calculates the ratio of the number of successful target followings to the total number of times in the joint attention task to obtain the target following success rate. This feature serves as the input to basic rule R4. Regarding the amplitude of facial expression changes, the system uses an expression recognition model to extract the intensity values corresponding to different expression states within adjacent time windows, and calculates the amplitude of facial expression changes based on the range or fluctuation of the expression intensity sequence. This feature serves as the input to the basic rule R6.
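The three visual social features described above reduce to simple ratios and ranges once per-frame detections are available. The Python sketch below assumes uniformly sampled frames, precomputed gaze decisions, and the "range" variant of expression amplitude; the function and parameter names are illustrative.

```python
from typing import Sequence


def gaze_ratio(on_evaluator: Sequence[bool]) -> float:
    """Share of frames in the interaction phase where gaze falls on the evaluator."""
    return sum(on_evaluator) / len(on_evaluator) if on_evaluator else 0.0


def target_follow_rate(
    cue_times: Sequence[float],
    target_gaze_times: Sequence[float],
    window: float,
) -> float:
    """Share of evaluator cues followed by a gaze on the target within the time window."""
    if not cue_times:
        return 0.0
    successes = sum(
        1
        for cue in cue_times
        if any(cue <= g <= cue + window for g in target_gaze_times)
    )
    return successes / len(cue_times)


def expression_amplitude(intensities: Sequence[float]) -> float:
    """Range (max minus min) of the expression intensity sequence."""
    return max(intensities) - min(intensities) if intensities else 0.0
```

These values would feed basic rules R1, R4, and R6 respectively after normalization.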

[0086] In terms of speech prosody feature extraction, the system analyzes the response behavior of the evaluated subject in a question-and-answer scenario based on speech activity detection results, speaker separation results, and interaction event timestamps. For the average response delay, the system first determines the end time of the question based on the interaction event records, then identifies the start time of the first valid vocal segment of the evaluated subject by combining the speaker separation results, and calculates the time difference between the two. The time differences across multiple question-and-answer segments are then averaged to obtain the average response delay, which serves as the input to basic rule R2.
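The average response delay computation can be sketched as follows. The sketch assumes that a question with no subject speech before the next question is skipped rather than penalized, which the text does not specify; the function name and argument layout are illustrative.

```python
from typing import Optional, Sequence


def average_response_delay(
    question_end_times: Sequence[float],
    subject_speech_starts: Sequence[float],
) -> Optional[float]:
    """Mean gap between each question end and the subject's first following speech start."""
    starts = sorted(subject_speech_starts)
    delays = []
    for i, q_end in enumerate(question_end_times):
        # A response must begin before the next question ends (assumed segment boundary)
        if i + 1 < len(question_end_times):
            next_q = question_end_times[i + 1]
        else:
            next_q = float("inf")
        first = next((s for s in starts if q_end <= s < next_q), None)
        if first is not None:
            delays.append(first - q_end)
    return sum(delays) / len(delays) if delays else None
```

Returning `None` when no question receives a response keeps "no data" distinct from a zero delay.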

[0087] In terms of language expression feature extraction, the system first obtains a sentence-by-sentence text sequence corresponding only to the evaluated subject based on the speaker separation results and automatic speech recognition results, and then aligns the text by combining the questioning time, response time, and interaction round information. For the proportion of repeated phrases, the system detects repeated identical or highly similar phrases within adjacent sentences or preset time windows, and calculates the ratio of the number of repeated phrase occurrences to the total number of phrases. This feature serves as the input to basic rule R3. Furthermore, the system outputs a speaker separation confidence score to characterize the reliability of the current speech separation results. A low speaker separation confidence score indicates reduced reliability in distinguishing the evaluator's speech from the evaluated subject's speech. This index serves as the input to correction rule F2 to reduce the influence of language expression features on the scoring results.
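A simple way to approximate the repeated-phrase ratio is to compare each phrase with the few phrases before it. The sketch below uses `difflib.SequenceMatcher` from the Python standard library as the similarity measure; the window size and the 0.9 similarity cutoff are assumed values, and a phrase-count window stands in for the time window mentioned in the text.

```python
from difflib import SequenceMatcher
from typing import Sequence


def repeated_phrase_ratio(
    phrases: Sequence[str],
    window: int = 2,          # assumed: compare with the previous `window` phrases
    similarity: float = 0.9,  # assumed cutoff for "highly similar"
) -> float:
    """Share of phrases that repeat an identical or near-identical recent phrase."""
    if not phrases:
        return 0.0
    repeated = 0
    for i, phrase in enumerate(phrases):
        recent = phrases[max(0, i - window):i]
        if any(
            SequenceMatcher(None, phrase.lower(), prev.lower()).ratio() >= similarity
            for prev in recent
        ):
            repeated += 1
    return repeated / len(phrases)
```

The first occurrence of a phrase is not counted as a repetition, so four phrases with two echoes give a ratio of 0.5.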

[0088] In terms of interactive behavior feature extraction, the system combines interactive event sequences, speech-to-text results, and action recognition results to analyze the proactive communication behavior and interactive response quality of the evaluated subjects. For the number of proactively initiated interactions, the system counts the occurrences of interactive behaviors such as vocalization, questioning, greeting, pointing, or gesturing initiated by the evaluated subject without prompting. Preferably, only behaviors that occur before the evaluator's guidance and meet the minimum duration or minimum semantic validity conditions are recorded as proactive communication events. The total number of proactive communication events within the preset evaluation phase is taken as the number of proactively initiated interactions, and this feature serves as the input to the basic rule R5.
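Counting proactive communication events can be sketched as a filter over the subject's behavior events. The sketch assumes that "unprompted" means the event does not start inside a response window following any evaluator prompt, and it applies only the minimum-duration condition; semantic validity checks are out of scope here.

```python
from typing import Sequence, Tuple


def count_initiations(
    subject_events: Sequence[Tuple[float, float]],  # (start time, duration) in seconds
    prompt_times: Sequence[float],
    response_window: float,
    min_duration: float,
) -> int:
    """Count subject events that are unprompted and long enough to be valid."""
    count = 0
    for start, duration in subject_events:
        prompted = any(p <= start <= p + response_window for p in prompt_times)
        if not prompted and duration >= min_duration:
            count += 1
    return count
```

The resulting count would be normalized before being compared against the threshold of basic rule R5.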

[0089] Furthermore, the system extracts the overall response stability of the assessed individual to interactive stimuli to characterize the consistency of their response and the continuity of their interaction during continuous interaction. Specifically, based on interactive stimuli such as asking questions, teasing, calling names, or gesturing, the system detects the assessed individual's attention shift, facial expression feedback, vocal feedback, or motor feedback within the corresponding response window. It also calculates the overall response stability index by combining the consistency of response timing, the continuity of response types, and the occurrence of missing responses across multiple interaction rounds. When the overall response stability is low, it indicates that the assessed individual's feedback quality fluctuates significantly or lacks consistency during the interaction. This index serves as input to correction rule F3 to reduce the influence of emotional interaction features on the scoring results.

[0090] In terms of visual quality assessment, the system further calculates the duration of non-frontal viewpoints to evaluate the usability of visual modality data. Specifically, the system determines whether the subject is in a non-frontal viewpoint state based on the subject's head posture angle and accumulates the duration of the non-frontal viewpoint state to obtain the non-frontal viewpoint duration. This index serves as input to the correction rule F1, used to reduce the influence of visual social features on the scoring results when the subject deviates from a frontal viewpoint for an extended period.
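The non-frontal viewpoint duration can be accumulated from per-frame head pose angles. In this sketch the yaw and pitch limits (30 and 20 degrees) and the fixed frame interval are assumed values chosen for illustration.

```python
from typing import Sequence


def non_frontal_duration(
    yaw_deg: Sequence[float],
    pitch_deg: Sequence[float],
    frame_interval: float,
    max_yaw: float = 30.0,    # assumed limit for a frontal view
    max_pitch: float = 20.0,  # assumed limit for a frontal view
) -> float:
    """Total time (seconds) in which the head pose deviates from a frontal view."""
    frames = sum(
        1
        for yaw, pitch in zip(yaw_deg, pitch_deg)
        if abs(yaw) > max_yaw or abs(pitch) > max_pitch
    )
    return frames * frame_interval
```

The accumulated duration is the quantity that correction rule F1 compares against its threshold.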

[0091] To facilitate subsequent rule reasoning, this embodiment normalizes the aforementioned features and establishes threshold conditions for each feature based on historical sample statistics and expert annotation results. Specifically, the proportion of time spent gazing at the evaluator, the target following success rate, the number of times proactive communication is initiated, and the amplitude of facial expression changes are features for which "the lower the feature value, the more abnormal it is"; the average response delay and the proportion of repeated phrases are features for which "the higher the feature value, the more abnormal it is". The normalized feature values are input into the corresponding basic rules R1 to R6 to generate the corresponding basic anomaly scores. When multiple related basic rules are hit simultaneously, combination rules C1 to C3 are further triggered, outputting the corresponding enhanced scores. Simultaneously, the duration of non-frontal viewpoints, the speaker separation confidence, and the overall response stability of the evaluated subject to interactive stimuli are input into correction rules F1 to F3, respectively, to generate the corresponding confidence correction coefficients. This provides a basis for scoring subsequent items.
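The normalization step and the two anomaly directions can be sketched together. Min-max scaling with clipping is an assumption, since the text says only that features are normalized; the feature keys are hypothetical, while the direction table follows the paragraph above.

```python
# True: lower values are more abnormal; False: higher values are more abnormal.
LOWER_IS_ABNORMAL = {
    "gaze_ratio": True,
    "target_follow_rate": True,
    "initiation_count": True,
    "expression_amplitude": True,
    "mean_response_delay": False,
    "repeated_phrase_ratio": False,
}


def min_max_normalize(value: float, low: float, high: float) -> float:
    """Scale value into [0, 1] using bounds taken, for example, from historical samples."""
    if high <= low:
        raise ValueError("high must be greater than low")
    return min(1.0, max(0.0, (value - low) / (high - low)))


def is_abnormal(feature: str, normalized_value: float, threshold: float) -> bool:
    """Apply the threshold in the direction that marks this feature as abnormal."""
    if LOWER_IS_ABNORMAL[feature]:
        return normalized_value < threshold
    return normalized_value > threshold
```

Normalizing first lets all six basic rules share a common 0 to 1 scale, so thresholds calibrated on historical samples remain comparable across features.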

[0092] Step 3: Rule-based reasoning and scoring:

[0093] As shown in Figure 3, the rule reasoning and scoring in this embodiment adopt a hierarchical reasoning method of "basic rules + combination rules + correction rules" to score multimodal behavioral features at the item level and further generate a total score out of ten.

[0094] Among them, the basic rules are used to output basic anomaly scores based on whether a single behavioral feature exceeds the corresponding threshold; the combination rules are used to enhance the candidate scores when multiple related basic rules are hit at the same time; and the correction rules are used to adjust the credibility weight of the candidate scores by combining the collection quality and the reliability of the scene.

[0095] In this embodiment, the basic rules are used to judge a single behavioral feature and output the corresponding basic anomaly score. Preferably, each basic rule includes four parts: feature object, comparison relationship, threshold condition, and anomaly score mapping relationship. Specifically, it includes the following rules:

[0096] Rule R1: When the proportion of time the evaluated subject spends gazing at the evaluator during the preset interaction phase is less than the corresponding threshold, basic rule R1 is determined to be hit, and the social attention anomaly score is output;

[0097] Rule R2: When the average response delay of the evaluated subject is greater than the corresponding threshold, basic rule R2 is determined to be hit, and the social response anomaly score is output;

[0098] Rule R3: When the proportion of repeated phrases of the evaluated subject exceeds the corresponding threshold, basic rule R3 is determined to be hit, and the stereotyped language anomaly score is output;

[0099] Rule R4: When the target following success rate in the joint attention task is lower than the corresponding threshold, basic rule R4 is determined to be hit, and the joint attention anomaly score is output;

[0100] Rule R5: When the number of times the evaluated subject proactively initiates communication is below the corresponding threshold, basic rule R5 is determined to be hit, and the insufficient proactive communication anomaly score is output;

[0101] Rule R6: When the amplitude of the evaluated subject's facial expression changes is below the corresponding threshold, basic rule R6 is determined to be hit, and the restricted emotional expression anomaly score is output.

[0102] For rules of the type "the larger the feature value, the more abnormal" (such as the proportion of repeated phrases and the average response delay), the score is defined as follows:

[0103]

[0104] For rules of the type "the smaller the feature value, the more abnormal" (such as the proportion of gaze duration, the target following success rate in joint attention, the number of times communication is proactively initiated, and the amplitude of facial expression changes), the score is defined as follows:

[0105]

[0106] The threshold conditions are determined by combining historical sample statistics with expert annotation calibration. Specifically, for each behavioral feature, historical samples that have undergone manual evaluation are first collected and divided into different level sample sets according to the expert scoring results; then, the distribution of the behavioral feature in different level sample sets is statistically analyzed to generate a candidate threshold set; next, the consistency between the automatic scoring results and the expert scoring results corresponding to each candidate threshold is calculated; finally, the candidate threshold with the highest consistency is selected as the scoring threshold corresponding to the behavioral feature.

[0107] For a given behavioral feature, consider its value in the n-th sample, the expert score of that sample, a candidate threshold, and the predicted score obtained from the rule mapping under that candidate threshold. A threshold evaluation function can then be constructed:

[0108]

[0109] Here, N represents the total number of historical samples, and the indicator function takes the value 1 when the condition within the parentheses is true and 0 otherwise. Further, the candidate threshold that maximizes the evaluation function is selected as the target threshold, i.e.:

[0110]

[0111] Here, the maximization is taken over the set of candidate thresholds, and its result is the finally determined target threshold. To obtain the candidate threshold set, for each behavioral feature, the values of that feature are first collected from the historical samples and sorted by magnitude; the midpoint between each pair of adjacent sorted values is then taken as a candidate threshold, thus generating the candidate threshold set. With the behavioral feature values sorted and N denoting the number of historical samples, the candidate threshold set can be represented as:

[0112]
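
The threshold calibration in paragraphs [0106] to [0112] can be sketched as follows. The sketch takes midpoints of adjacent sorted values as candidates, scores each candidate by the fraction of samples on which the rule outcome agrees with the expert label, and keeps the best candidate. It simplifies the expert score and the rule mapping to a binary hit/no-hit label, which is an assumption. The function names are hypothetical.

```python
from typing import Sequence


def candidate_thresholds(values: Sequence[float]) -> list[float]:
    """Midpoints between adjacent sorted feature values."""
    ordered = sorted(values)
    return [(a + b) / 2 for a, b in zip(ordered, ordered[1:])]


def agreement(
    values: Sequence[float],
    expert_hits: Sequence[bool],
    threshold: float,
    higher_is_abnormal: bool,
) -> float:
    """Fraction of samples whose rule outcome matches the expert label."""
    matches = 0
    for x, y in zip(values, expert_hits):
        predicted = x > threshold if higher_is_abnormal else x < threshold
        matches += int(predicted == y)  # indicator: 1 when the two agree
    return matches / len(values)


def select_threshold(
    values: Sequence[float],
    expert_hits: Sequence[bool],
    higher_is_abnormal: bool,
) -> float:
    """Candidate threshold with the highest agreement (first one on ties)."""
    candidates = candidate_thresholds(values)
    if not candidates:
        raise ValueError("need at least two historical samples")
    return max(
        candidates,
        key=lambda t: agreement(values, expert_hits, t, higher_is_abnormal),
    )
```

With N samples this evaluates at most N - 1 candidates, each in O(N) time, which is acceptable for calibration on a historical sample set.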

[0113] The baseline anomaly score is not a fixed binary result, but is determined by classifying the degree of deviation of the behavioral feature value from the corresponding threshold. That is, when the behavioral feature value only slightly exceeds the threshold, a lower baseline anomaly score is output; when the behavioral feature value significantly deviates from the normal range, a higher baseline anomaly score is output, thereby characterizing the strength of the anomaly.
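
A basic rule with a graded score can be sketched as below. The four fields mirror the four parts named in paragraph [0095]: feature object, comparison relationship, threshold condition, and score mapping. The three-level mapping (0, 1, 2) and the `mild_margin` band are assumptions chosen for illustration, as the text does not give the grading boundaries.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BasicRule:
    name: str                  # e.g. "R1"
    feature: str               # feature object (hypothetical feature key)
    higher_is_abnormal: bool   # comparison relationship
    threshold: float           # threshold condition on the normalized feature
    mild_margin: float = 0.2   # assumed width of the "slight deviation" band

    def deviation(self, value: float) -> float:
        """Distance past the threshold; a value <= 0 means the rule is not hit."""
        if self.higher_is_abnormal:
            return value - self.threshold
        return self.threshold - value

    def score(self, value: float) -> int:
        """Graded mapping: 0 = not hit, 1 = slight deviation, 2 = marked deviation."""
        d = self.deviation(value)
        if d <= 0:
            return 0
        return 1 if d <= self.mild_margin else 2
```

For example, `BasicRule("R1", "gaze_ratio", False, 0.4)` is not hit at a gaze ratio of 0.5, scores 1 at 0.3, and scores 2 at 0.1.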

[0114] The combined rules select two or more basic rules from the above six basic rules for joint reasoning to identify complex anomaly patterns. The enhanced score output by the combined rules is used to characterize the combined impact when multiple basic anomalies occur together.

[0115] For the k-th combination rule, let the number of associated basic rules be M_k, and consider the anomaly score of each associated basic rule together with its corresponding weight. The enhanced score output by this combination rule can be represented as:

[0116]

[0117] Here, the joint enhancement coefficient is used to characterize the synergistic enhancement effect when multiple basic anomalies occur simultaneously. The weight coefficient corresponding to the u-th basic rule in the k-th combination rule is used to characterize the contribution of this basic rule to the composite anomaly pattern recognition result. The weight coefficient can be determined based on the correlation between the anomaly scores of each basic rule and the expert scores in historical evaluation samples. Preferably, the correlation coefficient between the anomaly scores of each basic rule and the expert scores of the target item is calculated first. Given the anomaly score of the u-th basic rule and the expert score for the n-th sample, with N samples in total, the relevance can be represented as:

[0118]

[0119] This formula represents the average degree of consistency between the abnormal scores of the basic rules and the expert scores. The larger the value, the greater the contribution of the basic rule to the target item's score.

[0120] Then, the corresponding weight coefficients are obtained through normalization:

[0121]

[0122] Here, M_k represents the number of basic rules included in the k-th combination rule. This allows basic rules that contribute more to the overall score to have a higher weight, thereby improving the accuracy and interpretability of the combination rules in identifying complex anomaly patterns.
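
The weight normalization and the enhanced score can be sketched as follows. The normalization divides each relevance value by the sum over the M_k rules of the combination, as described above. The enhancement term follows the wording of claim 7 as translated (joint enhancement coefficient times the sum of the selected rule scores); the exact formula is not present in this text, so that form is an assumption, as are the function names.

```python
from typing import Sequence


def normalize_weights(relevance: Sequence[float]) -> list[float]:
    """Scale relevance values so the weights of one combination rule sum to 1."""
    total = sum(relevance)
    if total <= 0:
        raise ValueError("relevance values must sum to a positive number")
    return [r / total for r in relevance]


def enhanced_score(
    scores: Sequence[float],
    weights: Sequence[float],
    joint_coeff: float,
) -> float:
    """Weighted sum of basic rule scores plus an assumed joint enhancement term."""
    if len(scores) != len(weights):
        raise ValueError("scores and weights must have the same length")
    weighted = sum(w * s for w, s in zip(weights, scores))
    # Assumed form of the synergy term: coefficient times the plain score sum.
    return weighted + joint_coeff * sum(scores)
```

A rule with higher relevance to the expert scores therefore contributes more to the weighted part, while the joint term grows with the overall severity of the co-occurring anomalies.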

[0123] Preferably, the following combination rules can be set:

[0124] Rule C1: When basic rules R1, R4, and R5 are all hit simultaneously, the social interaction quality degradation combination rule is triggered, and an enhanced score is output. That is, a decline in social interaction ability is characterized by the joint occurrence of insufficient social attention, abnormal joint attention, and insufficient proactive communication.

[0125] Rule C2: When basic rules R2, R3, and R5 are all hit simultaneously, the language communication anomaly combination rule is triggered, and an enhanced score is output. That is, abnormal language communication is characterized by the joint occurrence of slow social response, stereotyped language, and insufficient proactive communication.

[0126] Rule C3: When basic rules R1, R2, and R6 are all hit simultaneously, the emotional interaction restriction combination rule is triggered, and an enhanced score is output. That is, restricted emotional interaction is characterized by the joint occurrence of insufficient social attention, slow response, and restricted emotional expression.
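
The trigger conditions of C1 to C3 amount to subset checks on the set of basic rules that were hit. A minimal sketch, in which the dictionary and function names are hypothetical:

```python
from typing import Iterable

# Basic rules that must all be hit for each combination rule to trigger.
COMBINATION_RULES = {
    "C1": frozenset({"R1", "R4", "R5"}),  # social interaction quality degradation
    "C2": frozenset({"R2", "R3", "R5"}),  # language communication anomaly
    "C3": frozenset({"R1", "R2", "R6"}),  # emotional interaction restriction
}


def triggered_combinations(hit_rules: Iterable[str]) -> list[str]:
    """Return the combination rules whose required basic rules were all hit."""
    hits = set(hit_rules)
    return [name for name, required in COMBINATION_RULES.items() if required <= hits]
```

Because R1, R2, and R5 each appear in two combinations, a single session can trigger more than one combination rule.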

[0127] The correction rules are used to adjust the credibility of the aforementioned candidate scores based on the acquisition quality and scene reliability. Their function is not to change the anomaly type, but rather to adjust the influence weight of the corresponding modal features in the final score. Specifically, they include the following rules:

[0128] Rule F1: When the duration of the non-frontal viewpoint exceeds the corresponding threshold, the influence of visual social features on the scoring results is reduced;

[0129] Rule F2: When the speaker separation confidence is below the corresponding threshold, the influence of language expression features on the scoring results is reduced;

[0130] Rule F3: When the overall response stability of the evaluated subject to interactive stimuli is below the corresponding threshold, the influence of emotional interaction-related features on the scoring results is reduced.

[0131] In this embodiment, each evaluation item j has a corresponding quality evaluation index and a corresponding threshold. The quality deviation is determined from the degree to which the quality evaluation index deviates from its threshold, and can be expressed as:

[0132]

[0133] Furthermore, the correction coefficient of the j-th evaluation item can be represented as:

[0134]
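
One possible shape for the correction coefficient is sketched below, purely as an illustration. The relative-deviation form, the linear decrease, and the lower bound `floor` are all assumptions; the text only states that the coefficient depends on how far the quality index deviates from its threshold.

```python
def correction_coefficient(
    quality: float,
    threshold: float,
    higher_is_worse: bool,
    floor: float = 0.5,  # assumed lower bound on the coefficient
) -> float:
    """Return 1.0 for acceptable quality, shrinking toward `floor` as it degrades."""
    if threshold <= 0:
        raise ValueError("threshold must be positive")
    # F1 uses a "higher is worse" index (non-frontal duration);
    # F2 and F3 use "lower is worse" indices (confidence, stability).
    deviation = quality - threshold if higher_is_worse else threshold - quality
    relative = max(0.0, deviation) / threshold
    return max(floor, 1.0 - relative)
```

Keeping the coefficient above a positive floor means a low-quality modality is down-weighted rather than discarded, which matches the stated aim of reducing, not removing, its influence.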

[0135] In this embodiment, the system sets three assessment items: social interaction ability assessment item, language communication ability assessment item, and emotional interaction ability assessment item.

[0136] The social interaction ability assessment item is used to evaluate the evaluated subject's behavioral performance in social attention, joint attention, and proactive communication, and to reflect the degree of abnormality in their social interaction ability. This assessment item corresponds to basic rules R1, R4, and R5, combination rule C1, and correction rule F1.

[0137] The language communication ability assessment item is used to evaluate the evaluated subject's performance in response, language expression, and proactive communication, and to reflect the degree of abnormality in their language communication ability. This assessment item corresponds to basic rules R2, R3, and R5, combination rule C2, and correction rule F2.

[0138] The emotional interaction ability assessment item is used to evaluate the evaluated subject's behavioral performance in social attention, responsiveness, and emotional expression, and to reflect the degree of abnormality in their emotional interaction ability. This assessment item corresponds to basic rules R1, R2, and R6, combination rule C3, and correction rule F3.

[0139] For the j-th evaluation item, let m be the number of its basic rules and n be the number of its combination rules. Consider the anomaly score output by the i-th basic rule and its corresponding weight, the enhanced score output by the k-th combination rule and its corresponding weight, and the reliability coefficient generated by the correction rules. The candidate score for the j-th evaluation item can then be expressed as:

[0140]

[0141] Here, the basic anomaly score can be determined by classifying the degree of deviation of the corresponding behavioral feature value from the threshold; the enhanced score is used to characterize the combined effect of multiple anomalous patterns occurring simultaneously; and the reliability coefficient is used to reduce the interference of background noise, occlusion, non-frontal viewpoints, or insufficient speaker separation confidence on the scoring results.

[0142] Furthermore, to facilitate alignment with clinical scoring standards, this embodiment maps the candidate score of each item to an item-level discrete score. Preferably, if a single item is scored on a 0-2 point scale in line with clinical experience, the mapping can be represented as follows:

[0143]

[0144] Here, the two thresholds in the formula are the grading thresholds for the j-th evaluation item.
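
The item-level computation can be sketched as follows. The candidate score is taken here as the reliability coefficient multiplied by the weighted sum of basic and enhanced scores, and the discrete score uses two grading thresholds. Both forms are assumptions consistent with the definitions above, as is the restriction of the reliability coefficient to the interval (0, 1]; the function and parameter names are hypothetical.

```python
from typing import Sequence


def candidate_score(
    basic_scores: Sequence[float],
    basic_weights: Sequence[float],
    enhanced_scores: Sequence[float],
    enhanced_weights: Sequence[float],
    reliability: float,
) -> float:
    """Reliability-scaled weighted sum of basic and enhanced scores (assumed form)."""
    if not 0.0 < reliability <= 1.0:
        raise ValueError("reliability is assumed to lie in (0, 1]")
    basic = sum(w * s for w, s in zip(basic_weights, basic_scores))
    enhanced = sum(w * s for w, s in zip(enhanced_weights, enhanced_scores))
    return reliability * (basic + enhanced)


def item_level_score(candidate: float, lower: float, upper: float) -> int:
    """Map a candidate score onto the 0-2 scale using two grading thresholds."""
    if lower >= upper:
        raise ValueError("lower threshold must be below upper threshold")
    if candidate < lower:
        return 0
    return 1 if candidate < upper else 2
```

Which side of each threshold is inclusive is a design choice; the sketch assigns a score that equals a threshold to the higher grade.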

[0145] After item-level scores are obtained for all evaluation items, the system aggregates them and maps them to a total score out of ten. Let there be p evaluation items in total, each with its own aggregation weight. The overall original score can then be expressed as:

[0146]

[0147] Furthermore, to output a uniform total score out of ten, the overall original score can be normalized as follows:

[0148]

[0149] Here, the result is the final total score out of ten, and the two bounds are the minimum and maximum values of the overall original score under the historical samples or scoring rules, respectively.

[0150] Based on the total score out of ten, an overall risk level is further generated: depending on the score range into which the total score falls, the evaluated subject is judged to be at low risk, medium risk, or high risk. This establishes a complete reasoning chain of "rule hit - item scoring - total score output - risk classification".
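
The aggregation and risk grading can be sketched as below. Min-max scaling to the range 0 to 10 is an assumed reading of the normalization step, and the risk cut-offs `low_cut` and `high_cut` are placeholder values, since the actual ranges are not given in this text.

```python
from typing import Sequence


def total_score_out_of_ten(
    item_scores: Sequence[float],
    weights: Sequence[float],
    raw_min: float,
    raw_max: float,
) -> float:
    """Weighted sum of item-level scores, min-max scaled to 0-10 (assumed form)."""
    if raw_max <= raw_min:
        raise ValueError("raw_max must exceed raw_min")
    raw = sum(w * d for w, d in zip(weights, item_scores))
    clipped = min(max(raw, raw_min), raw_max)  # guard against out-of-range input
    return 10.0 * (clipped - raw_min) / (raw_max - raw_min)


def risk_level(total: float, low_cut: float = 4.0, high_cut: float = 7.0) -> str:
    """Three-level risk grading; both cut-offs are assumed placeholder values."""
    if total < low_cut:
        return "low"
    return "medium" if total < high_cut else "high"
```

With three items scored 0 to 2 and unit weights, the raw score ranges from 0 to 6, so item scores of 2, 1, and 0 map to a total of 5.0.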

[0151] Compared with existing methods that directly draw conclusions based on a single threshold, this embodiment uses a hierarchical collaborative mechanism of basic rules, combination rules, and correction rules to first generate item-level candidate scores, and then obtains a total score out of ten through normalization mapping, thereby making the scoring process more interpretable, hierarchical, and clinically adaptable.

[0152] Step 4: Explanation generation and result output:

[0153] As shown in Figure 4, the explanation generation in this embodiment includes an evidence retrieval unit and a text generation unit.

[0154] The evidence retrieval unit uses a multimodal large model to locate key evidence fragments from the original audio, video, and text based on item-level scoring results and rule hits. Specifically, when basic rules R1, R2, R3, R4, R5, or R6 are hit, the system extracts the corresponding eye trajectory fragments, response audio fragments, repeated phrase text fragments, joint attention task trajectory fragments, active communication event records, and facial expression change video fragments as evidence, respectively. When combined rules C1, C2, or C3 are hit, the system further associates evidence fragments corresponding to multiple basic rules to form a composite evidence set.
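
The rule-to-evidence association described above can be written as a lookup table. The sketch only models which evidence types are requested for a hit rule; locating the actual fragments with a multimodal model is outside its scope, and the names used are hypothetical.

```python
# Evidence type extracted when each basic rule is hit (taken from the text).
EVIDENCE_BY_RULE = {
    "R1": "eye trajectory fragment",
    "R2": "response audio fragment",
    "R3": "repeated phrase text fragment",
    "R4": "joint attention task trajectory fragment",
    "R5": "active communication event record",
    "R6": "facial expression change video fragment",
}

# Basic rules associated with each combination rule.
COMBINATION_MEMBERS = {
    "C1": ("R1", "R4", "R5"),
    "C2": ("R2", "R3", "R5"),
    "C3": ("R1", "R2", "R6"),
}


def evidence_for(hit_rule: str) -> list[str]:
    """Evidence types for a hit basic rule, or the composite set for a combination rule."""
    if hit_rule in COMBINATION_MEMBERS:
        return [EVIDENCE_BY_RULE[r] for r in COMBINATION_MEMBERS[hit_rule]]
    return [EVIDENCE_BY_RULE[hit_rule]]
```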

[0155] The text generation unit generates structured explanatory text based on the item scoring results, rule hit descriptions, and key evidence fragments. Preferably, the text generation unit can use a large language model to organize the rule results into natural language, but the model is only used to generate explanations and reports and does not directly participate in the final scoring. This utilizes the expressive power of natural language models while avoiding excessive reliance on black-box models for diagnostic scoring, thus improving overall interpretability and controllability.

[0156] The system integrates item-level scores, overall risk levels, rule hit information, key evidence snippets, and explanatory text into a structured assessment report, which is then displayed on the physician's terminal. Physicians can click on any item to view the corresponding original audio, keyframes, and text evidence for manual review.

[0157] Step 5: System Verification

[0158] After completing multimodal perception, rule reasoning, risk scoring, and explanation generation, the effectiveness of the method of the present invention is tested using a validation dataset. During the validation process, the sample to be evaluated is input into the intelligent evaluation system, and the system outputs the judgment result of each evaluation item, the overall risk level, and the corresponding explanation information. The output results are then compared with manually labeled results to evaluate the accuracy and stability of the method of the present invention.

[0159] As shown in Figure 5, comparing the accuracy of the text baseline, the best single-modal method, the method with combination rules removed, and the system of the present invention on the different evaluation items shows that the system of the present invention achieves higher recognition accuracy on all three items: social interaction ability, language communication ability, and emotional interaction ability. This indicates that the present invention can effectively improve the accuracy of item-level evaluation results by integrating multimodal behavioral features and introducing a rule-based reasoning mechanism.

[0160] As shown in Figure 6, in the ablation experiment, the system accuracy decreased to varying degrees after removing visual features, language features, combination rules, or correction rules, respectively. The performance decline was more significant after removing key modal features, indicating that the multimodal feature extraction, combination rule reasoning, and correction mechanism in this invention all contribute to the evaluation results, thus verifying the effectiveness of the overall method design.

[0161] As shown in Figure 7, under interference conditions where the duration of the non-frontal viewpoint gradually increases, the system with correction rules maintains higher accuracy at every interference intensity, and its performance degradation is significantly smaller than that of the system without correction rules. This indicates that the correction rules proposed in this invention can effectively suppress the adverse effects of complex acquisition conditions on the evaluation results, thereby improving the robustness of the system in practical application scenarios.

[0162] As shown in Figure 8, the confusion matrix of the system of the present invention in the overall risk classification task indicates that most of the low-risk, medium-risk, and high-risk samples are correctly classified, with the diagonal cells accounting for a markedly larger proportion than the off-diagonal cells. This indicates that the present invention has good overall discrimination ability and classification consistency in the three-class risk assessment task.
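
For reference, the diagonal proportion of a three-class confusion matrix can be computed as in the sketch below. It is a generic illustration with made-up labels in the usage note and does not reproduce the data behind Figure 8; the function names are hypothetical.

```python
from typing import Sequence

LEVELS = ("low", "medium", "high")


def confusion_matrix(expert: Sequence[str], predicted: Sequence[str]) -> list[list[int]]:
    """Rows are expert risk levels, columns are system risk levels."""
    index = {level: i for i, level in enumerate(LEVELS)}
    matrix = [[0] * len(LEVELS) for _ in LEVELS]
    for truth, guess in zip(expert, predicted):
        matrix[index[truth]][index[guess]] += 1
    return matrix


def diagonal_proportion(matrix: Sequence[Sequence[int]]) -> float:
    """Share of samples on the diagonal, i.e. the overall classification accuracy."""
    total = sum(sum(row) for row in matrix)
    if total == 0:
        raise ValueError("confusion matrix is empty")
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    return correct / total
```

For instance, with expert labels low, low, medium, high, high and system outputs low, medium, medium, high, high, four of the five samples lie on the diagonal.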

[0163] In summary, through item-level accuracy comparison, ablation experiments, robustness testing in complex scenarios, and overall risk grading confusion matrix analysis, it can be demonstrated that the method of the present invention has good performance in terms of assessment accuracy, module effectiveness, environmental adaptability, and overall classification performance, thus verifying the effectiveness and practical value of the present invention.

[0164] This invention's system can be applied to hospital pediatric developmental and behavioral clinics, rehabilitation center screening rooms, remote assisted diagnostic platforms, and portable smart terminals. In hospital settings, the system can serve as an auxiliary tool for clinicians, providing quantitative scoring and evidence support; in screening settings, the system can be used for early warning of high-risk children; in remote settings, the system can be deployed on edge terminals or cloud platforms to automatically evaluate and generate reports from uploaded audio and video data.

[0165] Other embodiments of this application will readily occur to those skilled in the art upon consideration of the specification and practice of the disclosure herein. This application is intended to cover any variations, uses, or adaptations of this application that follow the general principles of this application and include common knowledge or customary techniques in the art not disclosed herein. The specification and embodiments are to be considered exemplary only, and the true scope and spirit of this application are indicated by the claims.

[0166] It should be understood that the foregoing general description and the following detailed description are exemplary and explanatory only, and are not intended to limit this application. This application is not limited to the precise structures described above and shown in the accompanying drawings, and various modifications and changes can be made without departing from its scope. The scope of this application is limited only by the appended claims.

Claims

1. An autism intelligence assessment method based on multimodal perception and rule reasoning, characterized in that the method includes: collecting video data, audio signals, and interaction event time points during the communication process between the evaluated subject and the evaluator; preprocessing the collected data, wherein a speaker separation model pre-trained with temporal continuity constraints and voiceprint difference constraints is used to distinguish the speech of the evaluator from that of the evaluated subject and to recognize it as text, facial expression and posture features are extracted from the video stream, and the preprocessed audio and video data and interaction event data are mapped to a unified time axis; extracting the behavioral features corresponding to the evaluation rules from the multimodal behavior sequences on the unified time axis; based on the behavioral features, performing hierarchical reasoning using basic rules, combination rules, and correction rules to generate item-level scoring results and an overall risk assessment result; generating explanatory information based on the item-level scoring results, the rule hit results, and the corresponding evidence fragments; and outputting an assessment report containing the item-level scoring results, the overall risk assessment result, and the explanatory information.

2. The autism intelligence assessment method based on multimodal perception and rule reasoning according to claim 1, characterized in that collecting the video data, audio signals, and interaction event time points during the communication process between the evaluated subject and the evaluator includes: collecting facial images, head posture, and body movements of the evaluated subject; collecting voice signals of the evaluated subject and the evaluator; and recording the time points of questions, responses, task switching, and other events.

3. The autism intelligence assessment method based on multimodal perception and rule reasoning according to claim 1, characterized in that the speaker separation model pre-trained with temporal continuity constraints and voiceprint difference constraints is trained using an objective function with temporal smoothing and role constraints, the objective function including a basic classification loss indicating whether the speaker discrimination for the current frame is correct, a temporal smoothing term, and a role constraint loss, the objective function being defined in terms of the real speaker label at time t and the model-predicted speaker label at time t.

4. The autism intelligence assessment method based on multimodal perception and rule reasoning according to claim 1, characterized in that, The extraction of facial expression and posture features from the video stream specifically includes: performing frame-by-frame face detection and key point localization on the video stream, estimating head orientation and gaze direction; simultaneously, obtaining facial expression states at different time periods through an expression recognition model, and identifying the hand movements, body swaying, and repetitive behaviors of the evaluated subject through a posture estimation model.

5. The autism intelligence assessment method based on multimodal perception and rule reasoning according to claim 1, characterized in that extracting the behavioral features corresponding to the evaluation rules from the multimodal behavior sequences on the unified time axis includes: extracting, from the video stream, the proportion of time the evaluated subject spends gazing at the evaluator, the success rate of the evaluated subject's gaze target following, and the amplitude of facial expression changes; extracting, from the speech data, the average response delay, the proportion of repeated phrases, and the speaker separation confidence of the evaluated subject; extracting, from the interaction behavior, the number of times communication is proactively initiated, the response stability index of the evaluated subject within the corresponding response window, and the duration of the non-frontal viewpoint; and normalizing all extracted features.

6. The autism intelligence assessment method based on multimodal perception and rule reasoning according to claim 1, characterized in that the basic rules include: Rule R1: when the proportion of time the evaluated subject spends gazing at the evaluator during the preset interaction phase is less than the corresponding threshold, basic rule R1 is determined to be hit, and the social attention anomaly score is output; Rule R2: when the average response delay of the evaluated subject is greater than the corresponding threshold, basic rule R2 is determined to be hit, and the social response anomaly score is output; Rule R3: when the proportion of repeated phrases of the evaluated subject exceeds the corresponding threshold, basic rule R3 is determined to be hit, and the stereotyped language anomaly score is output; Rule R4: when the target following success rate in the joint attention task is lower than the corresponding threshold, basic rule R4 is determined to be hit, and the joint attention anomaly score is output; Rule R5: when the number of times the evaluated subject proactively initiates communication is below the corresponding threshold, basic rule R5 is determined to be hit, and the insufficient proactive communication anomaly score is output; Rule R6: when the amplitude of the evaluated subject's facial expression changes is below the corresponding threshold, basic rule R6 is determined to be hit, and the restricted emotional expression anomaly score is output.

7. The autism intelligence assessment method based on multimodal perception and rule reasoning according to claim 6, characterized in that the combination rules include: calculating the average consistency between the anomaly scores of the basic rules and the expert scores, and obtaining the weight of each basic rule after normalization; selecting several basic rules to form three combination rules, namely an insufficient social attention combination rule, a slow response combination rule, and a restricted emotional expression combination rule; weighting and summing the scores of the selected basic rules; multiplying the joint enhancement coefficient by the sum of the scores of the selected basic rules; and adding the weighted summation result and the enhancement result to obtain the enhanced score of the combination rule.

8. The autism intelligence assessment method based on multimodal perception and rule reasoning according to claim 7, characterized in that the correction rules are used to correct the credibility of the aforementioned candidate scores based on the acquisition quality and scene reliability, and to adjust the influence weight of the corresponding modal features in the final score, including the following rules: Rule F1: when the duration of the non-frontal viewpoint exceeds the corresponding threshold, the influence of visual social features on the scoring results is reduced; Rule F2: when the speaker separation confidence is below the corresponding threshold, the influence of language expression features on the scoring results is reduced; Rule F3: when the overall response stability of the evaluated subject to interactive stimuli is below the corresponding threshold, the influence of emotional interaction-related features on the scoring results is reduced.

9. The autism intelligence assessment method based on multimodal perception and rule reasoning according to claim 8, characterized in that, The threshold conditions are determined by combining historical sample statistics with expert annotation calibration. For each behavioral feature, historical samples that have been manually evaluated are first collected and divided into different level sample sets according to the expert scoring results. Then, the distribution of the behavioral feature in different level sample sets is statistically analyzed to generate a candidate threshold set. Next, calculate the degree of consistency between the automatic scoring results and the expert scoring results corresponding to each candidate threshold; finally, select the candidate threshold with the highest degree of consistency as the scoring threshold corresponding to the behavioral feature.

10. The autism intelligence assessment method based on multimodal perception and rule reasoning according to claim 8, characterized in that generating the explanatory information based on the item-level scoring results, the rule hit results, and the corresponding evidence fragments includes: setting up assessment items for social interaction ability, language communication ability, and emotional interaction ability; the social interaction ability assessment item is used to evaluate the behavior of the evaluated subject in terms of social attention, joint attention, and proactive communication, and to reflect the degree of abnormality in their social interaction ability, and corresponds to basic rules R1, R4, and R5, correction rule F1, and the insufficient social attention combination rule; the language communication ability assessment item is used to evaluate the behavior of the evaluated subject in terms of response, language expression, and proactive communication, and to reflect the degree of abnormality in their language communication ability, and corresponds to basic rules R2, R3, and R5, correction rule F2, and the slow response combination rule; the emotional interaction ability assessment item is used to evaluate the behavior of the evaluated subject in terms of social attention, responsiveness, and emotional expression, and to reflect the degree of abnormality in their emotional interaction ability, and corresponds to basic rules R1, R2, and R6, correction rule F3, and the restricted emotional expression combination rule.