Artificial intelligence-based structured interview intelligent training evaluation system and method

By constructing a job knowledge graph and a causal illumination adaptive cascade architecture, combined with multimodal data analysis, the problem of scoring misjudgment caused by job semantic drift in the existing system was solved, and a job-adaptive evaluation standard was realized, improving the accuracy and efficiency of recruitment.

CN122243696A · Pending · Publication Date: 2026-06-19 · SHANDONG XUEYICHENG EDUCATION TECHNOLOGY CO LTD

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Applications(China)
Current Assignee / Owner
SHANDONG XUEYICHENG EDUCATION TECHNOLOGY CO LTD
Filing Date
2026-03-17
Publication Date
2026-06-19

AI Technical Summary

Technical Problem

Existing AI-powered interview assessment systems are unable to distinguish the language expression characteristics of different positions, leading to technical candidates being misjudged as having poor logic due to their use of technical jargon and pauses in thought. This renders the scoring criteria ineffective and affects the accuracy of recruitment.

Method used

A job knowledge graph is constructed, multimodal data is collected, facial and cognitive state features are extracted through a causal illumination adaptive cascade architecture, and adaptive modulation is performed by combining semantic decoupling and implicit association analysis to obtain the evaluation results of candidates.

Benefits of technology

It has implemented job-adaptive evaluation criteria, accurately identifies candidates' true state and verbal expression, improves the person-job fit and recruitment screening efficiency, and reduces the false screening rate of high-quality talents.

✦ Generated by Eureka AI based on patent content.

Figure CN122243696A_ABST

Abstract

This application provides an AI-based intelligent training and evaluation system and method for structured interviews, relating to the field of AI-based interview evaluation. The method includes: constructing a job knowledge graph and collecting multimodal data on candidates' target jobs and the interview process; extracting facial features and cognitive state features from video data using a causal illumination adaptive cascade architecture; semantically decoupling language data based on the job type of the target job and combining it with corresponding competencies and basic weights to obtain language features; extracting speech features from speech data and combining language features, facial features, and cognitive state features; capturing temporal correlations through a latent association analysis module and verifying them through environmental causal intervention to obtain latent association anomaly scores; and adaptively modulating the basic weights, cognitive state features, and latent association anomaly scores to obtain the candidate's evaluation results. This method solves the technical problem of scoring criteria failure caused by job semantic drift.

Description

Technical Field

[0001] This application relates to the field of artificial intelligence interview assessment, and in particular to an AI-based intelligent training and assessment system and method for structured interviews.

Background Technology

[0002] In the field of AI-powered interview assessment, existing systems can collect and analyze language, voice, and video data from interviews. Relying on large language models, they automatically score candidates' expressions, logic, and emotions, becoming an important tool for corporate recruitment. Their language assessment modules are typically trained on general corpora, using general semantics such as "structural completeness, fluent expression, and rich emotion" as standards for excellent answers.

[0003] However, in practice, when technical development candidates answer professional questions, their language generally exhibits loose structure, a large amount of industry technical jargon, and frequent pauses for thought. For example, they may mention professional terms such as "PR merging," "CI / CD pipeline," and "microservice circuit breaker." When explaining algorithm design and project development processes, they may pause naturally while thinking through professional details, and such pauses are not a sign of disfluency. By contrast, the language of candidates for sales, operations, and similar positions is more likely to show clear structure, explicit logical connectors, fluent and uninterrupted delivery, and rich emotional expression. Because the language scoring module of the evaluation system does not differentiate between job types, it relies on a general corpus for training, in which the semantic representation of excellent answers consists mainly of structured, emotional, and fluent general expression. When faced with the professional expression patterns of technical candidates, it mistakenly maps the loosely structured expression, technical jargon, and thinking pauses that are common in technical positions to the negative scoring space. As a result, technical candidates receive low scores and are labeled as having poor logic, lacking fluency, and having weak language organization skills. Such evaluations are not directly related to the candidates' actual technical expertise or job suitability, and ultimately lead to the mis-screening of candidates with outstanding technical abilities.

Summary of the Invention

[0004] This application provides an AI-based intelligent training and evaluation system and method for structured interviews, which solves the technical problem in the existing technology of scoring criteria failure caused by job semantic drift.

[0005] To achieve the above objectives, this application adopts the following technical solution: In a first aspect, an AI-based intelligent training and evaluation method for structured interviews is provided, including: constructing a job knowledge graph, which includes job types, competencies, and basic weights; collecting the candidate's target job and multimodal data from the interview process, the multimodal data including language data, speech data, and video data; extracting facial features and cognitive state features from the video data using a causal illumination adaptive cascade architecture; semantically decoupling the language data based on the job type of the target job and combining it with the corresponding competencies and basic weights to obtain language features; extracting speech features from the speech data, combining them with the language features, facial features, and cognitive state features, capturing temporal correlations through a latent association analysis module, and verifying them through environmental causal intervention to obtain latent association anomaly scores; and adaptively modulating the basic weights, cognitive state features, and latent association anomaly scores to obtain the candidate's evaluation results.

[0006] In conjunction with the first aspect mentioned above, in one possible implementation, the construction and feature extraction process of the causal illumination adaptive cascade architecture specifically includes: constructing a frequency domain decomposition module to perform a Fast Fourier Transform on each frame of the video data, decomposing it into amplitude and phase components; constructing an illumination mask generation network and an expression mask generation network, where the illumination mask generation network takes the amplitude component as input and outputs an illumination-sensitive channel mask, and the expression mask generation network takes the phase component as input and outputs an expression-sensitive channel mask; constructing a channel selection normalization module, with its input connected to the outputs of the frequency domain decomposition module, the illumination mask generation network, and the expression mask generation network, used to perform channel selection normalization processing on the input feature map based on the illumination-sensitive channel mask and the expression-sensitive channel mask, outputting illumination-adaptive features; constructing and freezing a DINOv2 teacher network, which is used to extract illumination-invariant high-level semantic features of the image as the distillation target; constructing a feature distillation module, with its input connected to the outputs of the channel selection normalization module and the DINOv2 teacher network, which constrains the student network to align with a pre-defined set of illumination-independent feature channels in the DINOv2 teacher network, outputting environment-decoupled facial features; and constructing a cognitive-physical decoupling module to perform head pose estimation and temporal decomposition on the environment-decoupled facial features, obtaining low-frequency and high-frequency components, and, based on the statistical features of the high-frequency components, outputting the candidate's cognitive state features through a cognitive state classifier.

[0007] In conjunction with the first aspect mentioned above, in one possible implementation, the operation of the cognitive-physical decoupling module specifically includes: predicting head pose for each frame of images corresponding to the facial features decoupled from the environment using the 6DRepNet model, and outputting the head rotation matrix in each frame. The rotation matrix sequence is then filtered using a low-pass filter to obtain low-frequency and high-frequency components. Statistical features of the high-frequency components within a preset window are extracted and combined with a multilayer perceptron classifier to obtain the cognitive state features of the candidate, where the cognitive state features include the probabilities of at least the thinking state, recall state, paused state, and normal state.

[0008] In conjunction with the first aspect mentioned above, in one possible implementation, the environment-decoupled facial features specifically include: projecting features onto the expression-sensitive channel mask to generate illumination-independent expression features; extracting eye gaze features from each frame of the video data and modulating them with cognitive state probabilities to obtain cognitively perceived eye gaze features; and extracting head features from each frame of the video data and combining them with physical pose mapping features to obtain pose-independent head features.

[0009] In conjunction with the first aspect mentioned above, one possible implementation involves semantically decoupling language data and combining it with corresponding competencies and basic weights to obtain language features. Specifically, this includes: performing syntactic analysis on the language data and calling an industry terminology dictionary for term matching to identify structured component vectors, slang component vectors, and sentiment component vectors. Based on the target job type, a job knowledge graph is queried to obtain the corresponding job perception weight configuration table, which includes fusion weight coefficients for structured components, slang components, and sentiment components corresponding to the job type. The structured component vectors, slang component vectors, and sentiment component vectors are then weighted and fused according to the fusion weight coefficients to generate preliminary language feature vectors. A linear transformation is used to project these preliminary language feature vectors onto a space composed of competencies, outputting job-aligned language features. Each dimension of the job-aligned language features corresponds one-to-one with the competency score data.
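The weighted fusion and linear projection described in paragraph [0009] can be sketched as follows. This is a minimal illustration under stated assumptions, not the application's implementation: the table `JOB_FUSION_WEIGHTS`, its coefficient values, the two-dimensional component vectors, and the projection matrix are all hypothetical, and the syntactic analysis and dictionary matching that would produce the component vectors are assumed to have already run.

```python
from typing import Dict, List

# Hypothetical job perception weight configuration table: fusion weight
# coefficients per job type (values are illustrative assumptions).
JOB_FUSION_WEIGHTS: Dict[str, Dict[str, float]] = {
    "technical": {"structured": 0.2, "slang": 0.6, "sentiment": 0.2},
    "sales": {"structured": 0.4, "slang": 0.1, "sentiment": 0.5},
}


def fuse_components(components: Dict[str, List[float]],
                    weights: Dict[str, float]) -> List[float]:
    """Weighted sum of the structured, slang, and sentiment component vectors."""
    dim = len(next(iter(components.values())))
    fused = [0.0] * dim
    for name, vector in components.items():
        for i, value in enumerate(vector):
            fused[i] += weights[name] * value
    return fused


def project_to_competencies(vector: List[float],
                            matrix: List[List[float]]) -> List[float]:
    """Linear transformation: one output dimension per competency (one matrix row each)."""
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]


# Hypothetical component vectors for one answer.
components = {
    "structured": [0.2, 0.1],
    "slang": [0.9, 0.8],
    "sentiment": [0.3, 0.4],
}
# Hypothetical projection onto three competencies
# (technical depth, communication skills, logical thinking).
projection = [[1.0, 0.5], [0.2, 0.3], [0.6, 0.6]]

preliminary = fuse_components(components, JOB_FUSION_WEIGHTS["technical"])
job_aligned = project_to_competencies(preliminary, projection)
```

With these illustrative coefficients, a jargon-heavy answer yields a larger preliminary vector under the "technical" configuration than under the "sales" configuration, which is the job-dependent behavior the paragraph describes.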

[0010] In conjunction with the first aspect mentioned above, one possible implementation involves capturing temporal correlations through the latent association analysis module and verifying them through environmental causal intervention to obtain the latent association anomaly score. Specifically, this process includes: temporally aligning the language features, speech features, facial features, and cognitive state features; capturing latent associations among the modal features using a multi-head cross-attention mechanism to obtain cross-modal joint representations and modality consistency scores; performing anomaly detection and classification on the cross-modal joint representations and modality consistency scores to obtain the original anomaly probability; obtaining the environmental context vector, which includes illumination type encoding and pose offset encoding, and using an environmental interference prediction network to obtain the environmental anomaly probability; and calibrating the original anomaly probability against the environmental anomaly probability to output the environmentally calibrated latent association anomaly score.

[0011] In conjunction with the first aspect mentioned above, in one possible implementation, the environmental calibration calculation process specifically includes: obtaining the original anomaly probability and the environmental anomaly probability, and obtaining the latent association anomaly score using the environmental calibration function (the formula and its symbols are missing from this text). The quantities involved are: the latent association anomaly score; the original anomaly probability, with a value range of [0, 1]; the environmental anomaly probability, with a value range of [0, 1]; and a preset environmental calibration coefficient, used to control the degree to which environmental factors attenuate the anomaly score, with a value range of [0, 1].
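The environmental calibration function itself is not legible in this text, so the sketch below is not the application's formula. It shows one hypothetical form that is merely consistent with the description: all three quantities lie in [0, 1], and the coefficient controls how strongly a likely environmental disturbance attenuates the anomaly score. The function name, the parameter names `p_raw`, `p_env`, and `lam`, and the multiplicative form are assumptions made for illustration.

```python
def calibrate_anomaly(p_raw: float, p_env: float, lam: float = 0.5) -> float:
    """Hypothetical calibration: attenuate p_raw in proportion to lam * p_env.

    p_raw: original anomaly probability, in [0, 1]
    p_env: environmental anomaly probability, in [0, 1]
    lam:   preset environmental calibration coefficient, in [0, 1]
    """
    for name, value in (("p_raw", p_raw), ("p_env", p_env), ("lam", lam)):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must lie in [0, 1], got {value}")
    # Assumed form: no attenuation when lam or p_env is 0,
    # full attenuation when both are 1. The result stays in [0, 1].
    return p_raw * (1.0 - lam * p_env)
```

Under this assumed form, an anomaly observed in a clean environment (`p_env` near 0) keeps its original probability, while the same anomaly observed under strong lighting or pose disturbance is discounted.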

[0012] In conjunction with the first aspect mentioned above, one possible implementation involves adaptively modulating the basic weights, cognitive state features, and latent association anomaly scores to obtain the candidate's evaluation result. Specifically, this includes: dynamically modulating the basic weights based on cognitive state features to obtain cognitive modulation weights; modulating the cognitive modulation weights using latent association anomaly scores to obtain the final evaluation weights; and using the competency score data, calculating the candidate's total job fit score using the final evaluation weights, and recording this as the evaluation result.

[0013] In a second aspect, this application provides an AI-based intelligent training and evaluation system for structured interviews, including: a job knowledge graph construction module for building a job knowledge graph containing job types, competencies, and basic weights; a multimodal data acquisition module for collecting the candidate's target job and multimodal data from the interview process, the multimodal data including language data, speech data, and video data; a feature extraction module for extracting facial features and cognitive state features from the video data using a causal illumination adaptive cascade architecture, semantically decoupling the language data based on the job type of the target job and combining it with the corresponding competencies and basic weights to obtain language features, and extracting speech features from the speech data, combining them with the language features, facial features, and cognitive state features, capturing temporal correlations through a latent association analysis module, and verifying them through environmental causal intervention to obtain latent association anomaly scores; and a comprehensive evaluation module for adaptively modulating the basic weights, cognitive state features, and latent association anomaly scores to obtain the candidate's evaluation results.

[0014] In conjunction with the second aspect mentioned above, in one possible implementation, the causal illumination adaptive cascade architecture includes: a frequency domain decomposition module, used to perform a Fast Fourier Transform on each frame of the video data, decomposing it into amplitude and phase components; an illumination mask generation network, connected to the frequency domain decomposition module, used to output an illumination-sensitive channel mask with the amplitude component as input; an expression mask generation network, connected to the frequency domain decomposition module, used to output an expression-sensitive channel mask with the phase component as input; a channel selection normalization module, connected to the frequency domain decomposition module, the illumination mask generation network, and the expression mask generation network respectively, used to perform channel selection normalization processing on the input feature map based on the illumination-sensitive channel mask and the expression-sensitive channel mask, outputting illumination-adaptive features; and a DINOv2 teacher network, used to extract illumination-invariant high-level semantic features of the image as the distillation target. The feature distillation module, connected to both the channel selection normalization module and the DINOv2 teacher network, constrains the student network to align with a pre-defined set of illumination-independent feature channels in the DINOv2 teacher network, outputting environment-decoupled facial features. The cognitive-physical decoupling module, connected to the feature distillation module, performs head pose estimation and temporal decomposition on the environment-decoupled facial features, obtaining low-frequency and high-frequency components. Based on the statistical features of the high-frequency components, it outputs candidate cognitive state features through a cognitive state classifier.

[0015] This application provides an AI-based intelligent training and evaluation system and method for structured interviews. It constructs a job knowledge graph containing job types, competencies, and basic weights, transforming abstract job requirements into structured, quantitative evaluation benchmarks. This addresses the generation-level problems of traditional evaluation standards being disconnected from actual job requirements and lacking job-appropriate evaluation standards, laying the foundation for customized job evaluation throughout the entire process. At the same time, it extracts facial features and cognitive state features that are free from environmental interference from the multimodal data through the causal illumination adaptive cascade architecture, solving the application-level problem that visual features are distorted by lighting conditions and cannot accurately reflect the candidate's true state, and providing a precise basis for evaluation in the visual dimension. Furthermore, it semantically decouples the language data based on the target job and combines competencies and basic weights to obtain language features, solving the application-level problem of misjudged language expression and invalid scoring standards caused by job semantic drift, and achieving precise job-specific interpretation of language features. Subsequently, it applies latent association analysis and environmental causal intervention verification to the fused multimodal features to obtain latent association anomaly scores, solving the management-level problem that independent multimodal scoring loses latent association information and is prone to false anomaly judgments, and achieving objective verification of cross-modal behavioral consistency. Finally, the evaluation results are obtained by adaptively modulating the basic weights, cognitive state features, and latent association anomaly scores. This solves the management-level problems of static mapping between general evaluation dimensions and job competencies and the lack of job-specific interpretation of the same behavior. It achieves job-adaptive and accurate output of evaluation results, improves person-job matching and the efficiency of recruitment screening, and reduces the mis-screening rate of high-quality talent.

Attached Figure Description

[0016] Figure 1 is a flowchart of the AI-based intelligent training and evaluation method for structured interviews provided in the embodiments of this application; Figure 2 is a flowchart of the step of extracting facial features and cognitive state features from video data using the causal illumination adaptive cascade architecture in that method; Figure 3 is a flowchart of the step of semantically decoupling the language data and combining it with the corresponding competencies and basic weights to obtain language features in that method; Figure 4 is a flowchart of the step of capturing temporal correlations through the latent association analysis module and verifying them through environmental causal intervention to obtain the latent association anomaly score in that method; Figure 5 is a schematic structural diagram of the AI-based intelligent training and evaluation system for structured interviews provided in the embodiments of this application.

Detailed Implementation

[0017] In the description of this application, unless otherwise stated, " / " means "or," for example, A / B can mean A or B. The "and / or" in this document is merely a description of the relationship between related objects, indicating that three relationships can exist. For example, A and / or B can represent: A alone, A and B simultaneously, and B alone. Furthermore, "at least one" means one or more, and "multiple" means two or more. The terms "first," "second," etc., do not limit the quantity or order of execution, and "first," "second," etc., do not necessarily imply differences.

[0018] As shown in Figure 1, the AI-based intelligent training and evaluation method for structured interviews provided in the embodiments of this application includes: Step 101: Construct a job knowledge graph, which includes job types, competencies, and basic weights.

[0019] Among them, job type is a category label obtained by summarizing and clustering job descriptions and industry classification standards. Competency is an ability element extracted from massive recruitment data using natural language processing technology to describe and evaluate whether a candidate is suitable for a particular job, such as technical depth, communication skills, and logical thinking. Basic weight is a quantitative value that represents the importance of each competency to a specific job type, calculated through statistical analysis or machine learning models.

[0020] In some implementations, when a new job posting is initiated or received, a massive amount of publicly available job description information is first collected from mainstream recruitment websites and internal corporate databases via a web crawler module or API interface. Natural language processing technology can then be used to preprocess the collected job descriptions, perform word segmentation, and extract key information. Through a pre-trained semantic role labeling model and industry dictionary, specific job names and corresponding job requirements are identified and extracted from the text. Cluster analysis can then be performed on the extracted job names to group similar jobs into a unified job type. At the same time, semantic parsing is performed on the job requirements descriptions, and through latent semantic analysis or topic modeling, they are summarized into pre-defined or dynamically generated competencies.

[0021] By using statistical analysis methods, the frequency and context weight of each competency appearing in all job descriptions of its corresponding job type are calculated, and then quantified using algorithms such as TF-IDF or TextRank to obtain the basic weight of each competency under each job type.

[0022] Finally, the extracted job types, summarized competencies, and calculated basic weights are stored in a graph database in the form of triples to construct a job knowledge graph containing nodes (job types, competencies) and edges (weight relationships), providing an adaptive evaluation benchmark for candidate assessment.
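The weighting and storage steps in paragraphs [0021] and [0022] can be sketched as follows. This is a simplified illustration, assuming competencies have already been extracted from each job description; the smoothed TF-IDF variant, the per-job-type normalization, the `(job type, weight, competency)` triple layout, and the sample postings are assumptions, and an in-memory list stands in for the graph database.

```python
import math
from collections import Counter
from typing import Dict, List, Tuple

# (job type node, edge weight, competency node); layout is an assumption.
Triple = Tuple[str, float, str]


def base_weights(postings: Dict[str, List[List[str]]]) -> Dict[str, Dict[str, float]]:
    """TF-IDF-style basic weight of each competency under each job type."""
    n_types = len(postings)
    counts = {jt: Counter(c for doc in docs for c in doc)
              for jt, docs in postings.items()}
    # Number of job types in which each competency appears at least once.
    type_freq = Counter(c for cnt in counts.values() for c in cnt)
    weights: Dict[str, Dict[str, float]] = {}
    for job_type, cnt in counts.items():
        total = sum(cnt.values())
        if total == 0:
            weights[job_type] = {}
            continue
        raw = {c: (n / total) * (math.log((1 + n_types) / (1 + type_freq[c])) + 1.0)
               for c, n in cnt.items()}
        norm = sum(raw.values())
        # Normalize so the weights of one job type sum to 1.
        weights[job_type] = {c: v / norm for c, v in raw.items()}
    return weights


def to_triples(weights: Dict[str, Dict[str, float]]) -> List[Triple]:
    """Flatten the weight table into triples ready for a graph store."""
    return [(jt, w, c) for jt, comps in weights.items() for c, w in comps.items()]


# Hypothetical extracted competencies, one inner list per job description.
postings = {
    "backend developer": [["technical depth", "logical thinking"],
                          ["technical depth", "communication skills"]],
    "sales": [["communication skills", "emotional expression"],
              ["communication skills", "logical thinking"]],
}
triples = to_triples(base_weights(postings))
```

In this toy input, "technical depth" is both frequent within backend postings and absent from sales postings, so it receives the largest basic weight for that job type.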

[0023] It's worth noting that the job knowledge graph has a built-in mechanism for regular updates. It can automatically re-collect the latest recruitment data monthly or quarterly for incremental learning, thereby dynamically adjusting competency and basic weights to adapt to changes in industry trends and the evolution of corporate hiring standards. It also provides a manual calibration interface, allowing corporate HR or business experts to fine-tune the automatically generated competency and basic weights according to their specific hiring standards, achieving a fusion of the general model and the company's personalized needs.

[0024] Step 102: Collect the candidate's target position and multimodal data from the interview process. Multimodal data includes language data, voice data, and video data.

[0025] Multimodal data refers to the data set reflecting different dimensions of a candidate's performance, acquired simultaneously through different sensors and acquisition channels during a complete structured interview. Language data refers to the textual information of the candidate's answers, acquired through a high-sensitivity microphone and converted using automatic speech recognition technology; it carries the candidate's knowledge reserves, logical structure, and semantic expression. Voice data refers to the raw audio signal, derived from the same source as language data but without text conversion; it contains paralinguistic information such as speech rate, tone, pauses, and energy, reflecting the candidate's emotional state and confidence. Video data refers to a continuous sequence of image frames of the candidate's face and upper body, captured in real-time by a camera, recording the candidate's micro-expressions, eye contact, head posture, and other non-verbal behaviors.

[0026] In some implementations, when a candidate enters the interview system through a front-end interview page (such as an H5 page or a dedicated client), the candidate will be prompted to select or confirm the target position they are applying for. The selection result will be recorded and transmitted to the subsequent evaluation process for reference in the job knowledge graph.

[0027] After a candidate confirms the start of the interview, multiple data acquisition threads are simultaneously initiated, utilizing real-time communication technologies such as WebRTC to access the microphone and camera of the terminal device for synchronous data acquisition. During the acquisition process, the microphone captures the candidate's voice signal at a sampling rate of 16kHz or higher. This signal is transmitted in real-time to the speech processing module for acoustic feature extraction, and simultaneously converted into text in real-time by a streaming automatic speech recognition engine, forming language data. The camera captures facial video of the candidate at a frame rate of no less than 25 frames per second, and each frame is marked with a precise timestamp to ensure strict temporal alignment of subsequent multimodal data.

[0028] Finally, all the collected raw data (raw audio stream, converted text stream, and raw video frames) are initially cached and compressed on the terminal, and then uploaded to the backend server in real time via an encrypted transmission protocol, providing raw input for subsequent causal illumination adaptive cascade architecture processing and multimodal feature extraction.

[0029] Step 103: Run the causal illumination adaptive cascade architecture to extract facial features and cognitive state features from the video data.

[0030] The causal-illumination adaptive cascade architecture is a multi-level interconnected neural network processing pipeline used to extract robust facial features from video data that are unaffected by ambient lighting, while simultaneously identifying the candidate's cognitive state. Facial features, after illumination decoupling processing, are high-dimensional feature vectors that accurately reflect the candidate's facial muscle movements and expression changes, primarily eliminating false texture information caused by lighting conditions (such as top lighting and shadows, side lighting and contrast). Cognitive state features refer to the state labels and probabilities reflecting the candidate's internal thought processes, identified by analyzing the candidate's dynamic head behavior. These include states such as thinking, recalling, paused, and normal, which help determine the candidate's mental activity and fluency when answering questions.

[0031] In some implementations, when frames of video data are input into the causal illumination adaptive cascade architecture, each frame is first converted from the spatial domain to the frequency domain by a Fast Fourier Transform, and then decomposed into amplitude and phase components. The amplitude component encodes the lighting information of the scene, while the phase component preserves the facial expression and structural information.
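The frequency domain decomposition in paragraph [0031] can be illustrated with a small, dependency-free sketch. It assumes a single-channel frame whose side lengths are powers of two and uses a textbook radix-2 FFT; the function names and the 4x4 sample frame are hypothetical, and a production system would more likely call an optimized FFT library.

```python
import cmath
from typing import List, Sequence, Tuple


def fft(x: Sequence[complex]) -> List[complex]:
    """Recursive radix-2 Cooley-Tukey FFT; length must be a power of two."""
    n = len(x)
    if n == 1:
        return list(x)
    if n % 2:
        raise ValueError("length must be a power of two")
    even, odd = fft(x[0::2]), fft(x[1::2])
    out = [0j] * n
    for k in range(n // 2):
        twiddle = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + twiddle
        out[k + n // 2] = even[k] - twiddle
    return out


def fft2(image: Sequence[Sequence[float]]) -> List[List[complex]]:
    """2D FFT: transform every row, then every column."""
    rows = [fft([complex(v) for v in row]) for row in image]
    cols = [fft(list(col)) for col in zip(*rows)]
    return [list(r) for r in zip(*cols)]


def amplitude_phase(image: Sequence[Sequence[float]]
                    ) -> Tuple[List[List[float]], List[List[float]]]:
    """Split the spectrum of one frame into amplitude and phase components."""
    spectrum = fft2(image)
    amplitude = [[abs(v) for v in row] for row in spectrum]
    phase = [[cmath.phase(v) for v in row] for row in spectrum]
    return amplitude, phase


# Hypothetical 4x4 grayscale patch and a uniformly brighter copy of it.
frame = [[10, 20, 30, 40], [15, 25, 35, 45], [5, 10, 20, 30], [0, 5, 10, 20]]
brighter = [[2 * v for v in row] for row in frame]
amp, pha = amplitude_phase(frame)
amp_b, pha_b = amplitude_phase(brighter)
```

Uniformly doubling the brightness doubles every amplitude value and leaves the phase unchanged, a simple special case of the separation between lighting and structure that the paragraph relies on.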

[0032] The amplitude component can then be fed into a lighting mask generation network consisting of three convolutional neural networks. This network outputs a lighting-sensitive channel mask with the same dimension as the number of channels in the feature map through convolution, global average pooling, and fully connected layers. This is used to identify which feature channels primarily respond to changes in illumination. Simultaneously, the phase component is fed into a structurally symmetrical expression mask generation network, outputting an expression-sensitive channel mask. This is used to identify which feature channels primarily respond to facial expression changes.

[0033] The original input image is processed by a lightweight convolutional neural network to extract a basic feature map F. This feature map F, together with the two masks (the illumination-sensitive channel mask and the expression-sensitive channel mask), is fed into the channel selection and normalization module, which performs the core operations and produces an illumination-adaptive feature map. This feature map can then be input into the student network. At the same time, the same frame is fed into a DINOv2 teacher network that has been pre-trained on a large-scale image dataset and whose parameters are completely frozen. The feature distillation module calculates the mean squared error loss between the student network features and the teacher network features on the illumination-independent feature channel set S, which has been determined in advance through illumination perturbation experiments. The student network is then optimized through backpropagation so that the environment-decoupled facial features it outputs are aligned with the teacher network in high-level semantics.
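The distillation objective in paragraph [0033] reduces to a mean squared error restricted to the channel set S. A minimal sketch, assuming the student and teacher features have already been pooled to one value per channel and share the same channel indexing (both are simplifying assumptions, as are the function name and the sample values):

```python
from typing import Sequence


def distillation_loss(student: Sequence[float],
                      teacher: Sequence[float],
                      invariant_channels: Sequence[int]) -> float:
    """Mean squared error over the illumination-independent channel set S only."""
    if len(student) != len(teacher):
        raise ValueError("student and teacher must have the same number of channels")
    if not invariant_channels:
        raise ValueError("channel set S must not be empty")
    squared_errors = [(student[c] - teacher[c]) ** 2 for c in invariant_channels]
    return sum(squared_errors) / len(squared_errors)


# Hypothetical per-channel features; channels 1 and 3 are outside S and ignored.
student_features = [0.5, 1.0, 2.0, 4.0]
teacher_features = [0.5, 3.0, 2.5, 0.0]
loss = distillation_loss(student_features, teacher_features, invariant_channels=[0, 2])
```

Because channels outside S do not contribute, the student is free to differ from the teacher wherever the teacher's features still respond to lighting.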

[0034] Finally, the facial features decoupled from the environment are fed into the cognitive-physical decoupling module. First, the head rotation matrix of each frame is estimated using 6DRepNet. Then, it is decomposed into low-frequency components representing physical sitting posture and high-frequency components representing cognitive activity by a low-pass filter with a cutoff frequency of 0.5Hz. Based on the statistical features such as variance and peak frequency of the high-frequency components, a two-layer multilayer perceptron classifier outputs the cognitive state probability of the candidate at that moment, including thinking, recalling, pausing, or normal state, thereby completing the extraction of facial features and cognitive state features.
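The temporal decomposition in paragraph [0034] can be sketched on a single head-angle series. Several simplifications are assumed here: the rotation matrices from 6DRepNet are taken to have been reduced to one yaw angle per frame, a first-order low-pass filter stands in for whatever filter design the application uses, and the peak frequency is found with a direct discrete Fourier transform over the window. All function names and the synthetic signal are hypothetical.

```python
import math
from statistics import pvariance
from typing import Dict, List, Sequence, Tuple


def split_low_high(signal: Sequence[float], fps: float,
                   cutoff_hz: float = 0.5) -> Tuple[List[float], List[float]]:
    """First-order low-pass filter; the residual is the high-frequency component."""
    dt = 1.0 / fps
    rc = 1.0 / (2.0 * math.pi * cutoff_hz)
    alpha = dt / (rc + dt)
    low = [signal[0]]
    for x in signal[1:]:
        low.append(low[-1] + alpha * (x - low[-1]))
    high = [x - l for x, l in zip(signal, low)]
    return low, high


def peak_frequency(window: Sequence[float], fps: float) -> float:
    """Frequency (Hz) of the strongest non-DC bin, via a direct DFT."""
    n = len(window)
    best_k, best_mag = 0, -1.0
    for k in range(1, n // 2 + 1):
        re = sum(v * math.cos(2 * math.pi * k * i / n) for i, v in enumerate(window))
        im = sum(v * math.sin(2 * math.pi * k * i / n) for i, v in enumerate(window))
        mag = math.hypot(re, im)
        if mag > best_mag:
            best_k, best_mag = k, mag
    return best_k * fps / n


def high_frequency_stats(yaw: Sequence[float], fps: float) -> Dict[str, float]:
    """Window statistics that a cognitive state classifier could take as input."""
    _, high = split_low_high(yaw, fps)
    return {"variance": pvariance(high), "peak_hz": peak_frequency(high, fps)}


# Synthetic 8-second window at 25 fps: slow posture drift plus a 3 Hz micro-movement.
FPS = 25.0
yaw = [2.0 * math.sin(2 * math.pi * 0.05 * i / FPS)
       + 1.0 * math.sin(2 * math.pi * 3.0 * i / FPS) for i in range(200)]
stats = high_frequency_stats(yaw, FPS)
```

The resulting statistics would then be passed to the two-layer multilayer perceptron classifier, which is not sketched here.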

[0035] For example, candidate Xiao Li is having an interview in his dormitory, where the overhead light is the only light source, causing noticeable shadows on his face around his eyes and under his nose. When his video frame is input into the causal illumination adaptive cascade architecture, the frequency domain decomposition module first decomposes the image containing the shadows into amplitude and phase components. The amplitude component encodes the light and dark distribution caused by the overhead light, while the phase component preserves the true facial muscle texture when he is focused on thinking.

[0036] At this point, the illumination mask generation network identifies shadow-sensitive channels based on the amplitude component and generates the illumination-sensitive channel mask. The facial expression mask generation network identifies channels sensitive to focused facial expressions based on the phase component and generates the expression-sensitive channel mask. In the basic feature map F, the channel selection normalization module applies instance normalization to the channels marked by the illumination-sensitive channel mask to erase shadow differences, leaves the channels marked by the expression-sensitive channel mask unchanged to preserve the details of focused facial expressions, and applies batch normalization to the remaining channels. The output illumination-adaptive features are then aligned to the illumination-invariant feature space of the DINOv2 teacher network through the feature distillation module, ensuring the purity of high-level semantics.

[0037] Finally, the cognitive-physical decoupling module analyzes the head movements and finds that the low-frequency components are stable while the high-frequency components exhibit regular, slight swaying. The cognitive state is correctly classified as "thinking," and a high probability is output. In this way, the architecture extracts both the environment-independent facial features of focused thinking and the corresponding cognitive state features.

[0038] Step 104: Based on the job type of the target position, semantically decouple the language data and combine it with the corresponding competencies and basic weights to obtain language features.

[0039] In some implementations, after obtaining the candidate's language data text, the text is input into a pre-trained large language model for semantic role labeling and syntactic analysis. Simultaneously, a pre-built industry terminology dictionary is used to identify structured frame markers, industry-specific terms, and emotional words in the text. Based on the identification results, the language data can be decoupled into three independent component vectors: a structured component vector (containing logical connectors and frame structures), a jargon component vector (containing industry terms and technical jargon), and an emotional component vector (containing emotional words and tone expressions). Based on the target job type selected by the candidate, the corresponding competency list and its basic weights are retrieved from the job knowledge graph. For example, for technical positions, competencies include technical depth, logical thinking, and learning ability, each with a different basic weight.

[0040] This allows for the establishment of a job-aware semantic reweighting mechanism. The contribution of each semantic component to the final language features can be dynamically configured according to the job type. For example, for technical positions, the weight of the jargon component is set to 0.5, the weight of the structured component is set to 0.3, and the weight of the emotional component is set to 0.2. For sales positions, the weight of the structured component is set to 0.6, the weight of the emotional component is set to 0.3, and the weight of the jargon component is set to 0.1.

[0041] Finally, the three decoupled component vectors are weighted and fused according to the job perception weight to obtain a preliminary language feature vector. This feature vector is then aligned and mapped with the competencies in the job knowledge graph. A linear transformation layer is used to project the preliminary language feature vector onto the competency space, outputting the final job-aligned language feature. Each dimension of this feature vector corresponds to a language performance score for a competency.
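The job-aware fusion and projection described above can be sketched as follows. This is an illustrative example only: the component weights are the ones given for technical and sales positions, with the industry-terminology component stored under the key "jargon", while the vector dimension, the one-hot component vectors, and the randomly initialized projection matrix are hypothetical placeholders for values that would be learned or extracted in practice.

```python
import numpy as np

# Job-aware component weights taken from the examples in the text.
JOB_WEIGHTS = {
    "technical": {"structured": 0.3, "jargon": 0.5, "emotional": 0.2},
    "sales": {"structured": 0.6, "jargon": 0.1, "emotional": 0.3},
}

def fuse_language_components(components, job_type):
    """Weighted sum of the three decoupled component vectors."""
    weights = JOB_WEIGHTS[job_type]
    return sum(weights[name] * np.asarray(vec, dtype=float)
               for name, vec in components.items())

def project_to_competencies(fused, projection, bias):
    """Linear layer mapping the fused vector onto the competency space."""
    return projection @ fused + bias

# Hypothetical 4-dimensional component vectors (one-hot for readability).
components = {
    "structured": [1.0, 0.0, 0.0, 0.0],
    "jargon": [0.0, 1.0, 0.0, 0.0],
    "emotional": [0.0, 0.0, 1.0, 0.0],
}
fused_technical = fuse_language_components(components, "technical")
fused_sales = fuse_language_components(components, "sales")

# Hypothetical projection onto three competencies; a trained layer is assumed.
rng = np.random.default_rng(0)
projection = rng.normal(size=(3, 4))
bias = np.zeros(3)
language_feature = project_to_competencies(fused_technical, projection, bias)
```

With the same utterance, the technical configuration emphasizes the terminology dimension, while the sales configuration emphasizes the structured dimension.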

[0042] Step 105: Extract speech features from the speech data, combine language features, facial features and cognitive state features, capture temporal correlations through the latent association analysis module, and verify through environmental causal intervention to obtain latent association anomaly scores.

[0043] Among them, speech features refer to acoustic feature vectors reflecting paralinguistic information extracted from the original audio signal, including prosodic features (speech rate, pitch, volume), spectral features, and emotion curves. The latent association analysis module is a cross-modal attention network based on the Transformer architecture, used to capture the potential correlations and consistency of language, speech, and visual modal features over time. Environmental causal intervention verification refers to a counterfactual reasoning mechanism used to determine whether detected cross-modal anomalies are caused by environmental factors (such as abnormal lighting or background noise) rather than actual behavioral abnormalities in the candidate. The latent association anomaly score is a probability value that quantifies the degree of inconsistency between the candidate's words and actions after environmental calibration; a higher score indicates that the cross-modal behavior is more likely to involve concealment, fabrication, or extreme tension.

[0044] In some implementations, the acquired raw audio signal is preprocessed, and the Mel frequency cepstral coefficients are extracted by short-time Fourier transform. At the same time, the speech rate curve, fundamental frequency profile and energy envelope are calculated to obtain a speech feature vector sequence containing prosodic and emotional information.

[0045] The job-aligned language feature sequence, the environmentally decoupled facial feature sequence output by the causal illumination adaptive cascade architecture, and the cognitive state feature sequence can be input together with the speech feature sequence into the latent association analysis module. In the latent association analysis module, the feature sequences of the four modalities are first temporally aligned and fused using a Transformer-based multi-head cross-attention network. Within a sliding time window T, the average cosine similarity between the dominant language features and the audio-video fusion features is calculated as the modality consistency score. Then, the cross-modal joint representation and the modality consistency score are input into a pre-trained anomaly detection classifier. This classifier is a three-layer multilayer perceptron trained on a large-scale labeled dataset. It can recognize abnormal patterns, such as positive language accompanied by tense facial expressions or fluent language accompanied by a trembling voice, and outputs the original anomaly probability.
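The sliding-window consistency computation can be sketched as follows. This is a minimal example under the assumption that the language features and the audio-video fusion features have already been temporally aligned and projected to a common dimension; the function name, the window length, and the random test data are hypothetical, and the cross-attention network and the classifier are omitted.

```python
import numpy as np

def windowed_consistency(lang_seq, av_seq, window):
    """Average cosine similarity of two aligned sequences per sliding window.

    lang_seq, av_seq: arrays of shape (time_steps, feature_dim).
    Returns an array of length time_steps - window + 1.
    """
    lang_seq = np.asarray(lang_seq, dtype=float)
    av_seq = np.asarray(av_seq, dtype=float)
    dots = np.sum(lang_seq * av_seq, axis=1)
    norms = np.linalg.norm(lang_seq, axis=1) * np.linalg.norm(av_seq, axis=1)
    cosine = dots / np.maximum(norms, 1e-12)  # guard against zero vectors
    kernel = np.ones(window) / window
    return np.convolve(cosine, kernel, mode="valid")

rng = np.random.default_rng(0)
language = rng.normal(size=(10, 8))
consistent = windowed_consistency(language, language, window=4)
contradictory = windowed_consistency(language, -language, window=4)
```

Identical sequences yield a score of 1 in every window and opposed sequences yield -1; real interview data would fall between these extremes.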

[0046] Next, the environmental causal intervention verification stage is entered. The environmental context vector (including illumination type encoding, pose offset encoding, etc.) obtained from the causal illumination adaptive cascade architecture is input into an environmental disturbance prediction network to obtain the probability that "the detected anomaly is caused by the environment". Finally, the environmental calibration operation is performed to output the final, environmentally calibrated latent association anomaly score.

[0047] Step 106: Adaptively modulate the basic weights, cognitive state features, and latent association anomaly scores to obtain the evaluation results of the candidates.

[0048] Specifically, this includes: dynamically modulating the basic weights based on cognitive state features to obtain cognitive modulation weights; modulating the cognitive modulation weights using latent association anomaly scores to obtain the final evaluation weights; and calling the competency score data (quantitative performance values extracted from candidates' multimodal data such as language, speech, and vision and mapped to each competency in the job knowledge graph, representing the candidate's actual performance level in each dimension) to calculate the candidate's total job matching score using the final evaluation weights, which is recorded as the evaluation result.

[0049] In some implementations, the basic weight vector is received from the job knowledge graph. This vector pre-stores the importance coefficients of each competency for the target position. Simultaneously, the modulator obtains the cognitive state feature vector from the cognitive-physical decoupling module of the causal illumination adaptive cascade architecture. This vector contains the probability distribution of cognitive states such as thinking, recalling, pausing, and normal.

[0050] A job cognition mapping matrix is obtained by analyzing the correlation between cognitive state and competency performance across different job types in historical interview data. This matrix defines which cognitive states should generate weight gains for which competencies.

[0051] The first dynamic modulation can then be performed. Calculate the cognitive state bias $\Delta = C - C_{0}$, where $C$ is the cognitive state feature vector and $C_{0}$ is the cognitive state baseline preset for the job. Then calculate the cognitive modulation weights $W_{c} = W_{b} \odot (1 + \gamma \cdot M \cdot \Delta)$, where $W_{b}$ is the basic weight vector, $M$ is the job cognition mapping matrix, $\gamma$ is the preset modulation intensity coefficient, and $\odot$ represents element-wise multiplication.

[0052] After the first modulation is completed, the environmentally calibrated latent association anomaly score $A$ (value range 0 to 1) is obtained from the latent association analysis module, together with a predefined anomaly attenuation function $f(x) = 1 - \delta \cdot x$, where $\delta$ is the attenuation coefficient.

[0053] The second modulation can then be performed: calculate the final evaluation weights $W_{f} = W_{c} \cdot f(A) = W_{c} \cdot (1 - \delta \cdot A)$.

[0054] Finally, the score data vector S of each competency is obtained from the multimodal feature extraction and decoupling layer, the total job matching score is calculated, and the total score is output as the evaluation result.

[0055] Take the hiring of an algorithm researcher at a technology company as an example. The basic weights provided by the job knowledge graph are [Mathematical foundation: 0.35, Programming ability: 0.30, Creative thinking: 0.25, Teamwork: 0.10].

[0056] During the interview, the candidate often pauses for extended periods when answering complex algorithm questions. The cognitive state features show a thinking state probability as high as 0.85, a pausing state probability of 0.10, and a normal state probability of 0.05.

[0057] The cognitive state bias Δ is calculated (assuming the baseline of the thinking state is 0.4, so that Δ = 0.85 - 0.4 = 0.45), and, with a modulation intensity coefficient γ of 0.3, the cognitive modulation weights obtained after modulation are [0.35×(1+0.3×0.45)=0.397, 0.30 remains unchanged, 0.25×(1+0.3×0.45)=0.284, 0.10 remains unchanged].

[0058] Meanwhile, cross-modal analysis reveals a high degree of consistency between the candidate's language content and micro-expressions, and the latent association anomaly score is only 0.08. The attenuation factor is therefore 1 - δ×0.08 = 0.96, which corresponds to δ = 0.5. After the second modulation, the final evaluation weights are approximately [0.397×0.96=0.381, 0.30×0.96=0.288, 0.284×0.96=0.273, 0.10×0.96=0.096].

[0059] The candidate's competency scores S are obtained as follows: [Mathematical foundation: 92, Programming ability: 88, Creative thinking: 95, Teamwork: 75]. The weighted total job matching score is calculated as: Total = 92 × 0.381 + 88 × 0.288 + 95 × 0.273 + 75 × 0.096 ≈ 93.53 points. This reflects the candidate's excellent potential as an algorithm researcher: the pauses for thought are correctly converted into weight gains in the core dimensions rather than deductions.
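The two-stage modulation in this worked example can be reproduced with the following sketch. The basic weights, the competency scores, the thinking-state baseline of 0.4, the anomaly score of 0.08, and the coefficients 0.3 and 0.5 (the latter inferred from the factor 0.96) come from the example. The mapping matrix, which lets only the thinking state boost mathematical foundation and creative thinking, and the baselines for the pausing and normal states are assumptions chosen so that the sketch matches the example.

```python
import numpy as np

def modulate_weights(base_w, cog_state, cog_baseline, mapping,
                     gamma, anomaly, delta):
    """Apply cognitive modulation, then anomaly attenuation."""
    bias = cog_state - cog_baseline                      # cognitive state bias
    w_cog = base_w * (1.0 + gamma * (mapping @ bias))    # first modulation
    w_final = w_cog * (1.0 - delta * anomaly)            # second modulation
    return w_cog, w_final

# Order: mathematical foundation, programming, creative thinking, teamwork.
base_w = np.array([0.35, 0.30, 0.25, 0.10])
scores = np.array([92.0, 88.0, 95.0, 75.0])

# Order: thinking, pausing, normal.
cog_state = np.array([0.85, 0.10, 0.05])
cog_baseline = np.array([0.40, 0.30, 0.30])  # last two values are assumed

# Assumed mapping: only "thinking" raises competencies 0 and 2.
mapping = np.array([
    [1.0, 0.0, 0.0],
    [0.0, 0.0, 0.0],
    [1.0, 0.0, 0.0],
    [0.0, 0.0, 0.0],
])

w_cog, w_final = modulate_weights(base_w, cog_state, cog_baseline, mapping,
                                  gamma=0.3, anomaly=0.08, delta=0.5)
total = float(w_final @ scores)
```

Because the sketch carries unrounded weights through both stages, its total is about 93.51 rather than the 93.53 obtained from the three-decimal weights in the example.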

[0060] Based on the above technical solution, by constructing a job knowledge graph, abstract job requirements are transformed into structured competencies and their basic weights, which solves the problem of the lack of job adaptive benchmarks in traditional evaluation systems. This provides a dynamically configurable evaluation scale for all subsequent evaluation stages, enabling the evaluation standards to be automatically updated with industry trends and corporate staffing needs, thus reducing manual maintenance costs.

[0061] Secondly, a causal illumination adaptive cascade architecture was used to extract facial features and cognitive state features. By decoupling illumination information from facial expression information at the frequency domain level and using the DINOv2 teacher network for feature distillation, the problem of visual feature semantic drift caused by illumination interference in real interview environments was solved. This ensured that even under non-ideal lighting conditions such as top lighting and side lighting, the candidate's real micro-expressions and cognitive state could still be accurately identified, improving the reliability and validity of visual modality assessment and providing pure visual feature input for subsequent analysis.

[0062] Furthermore, by semantically decoupling language data and combining it with job type to obtain language features, language expression is decomposed into structured, industry jargon, and emotional components, and the weights are dynamically adjusted according to the job. This solves the semantic drift problem in which candidates for specific positions, such as technical positions, are misjudged as having poor logic because of their distinctive manner of expression. Language understanding thereby becomes job-adaptive, which improves the accuracy of technical talent selection.

[0063] Meanwhile, by combining speech features, language features, facial features, and cognitive state features, the latent association analysis module captures temporal correlations and verifies them through environmental causal intervention, solving the problem of lost latent association information caused by scoring each modality independently. It can effectively identify abnormal patterns in which words and behavior are inconsistent, such as positive language accompanied by tense facial expressions, and it eliminates false anomalies caused by environmental factors such as lighting and posture through environmental calibration, providing key quantitative indicators of the candidate's true credibility for the evaluation.

[0064] Finally, adaptive modulation is applied to the basic weights, cognitive state features, and latent association anomaly scores. By dynamically enhancing the weights of relevant dimensions through cognitive state and globally attenuating the reliability assessment through latent association anomalies, the problem of assessment calibration where the same behavior should be interpreted differently in different job contexts is solved. This ensures that the final job matching score comprehensively reflects the inherent requirements of the job, the candidate's real-time cognitive performance, and the credibility of their behavior, thereby improving the accuracy of person-job matching and the depth validity of the assessment results.

[0065] In another possible implementation of the embodiments of this application, as shown in Figure 1-2, the process of extracting facial features and cognitive state features from video data using a causal illumination adaptive cascade architecture can be achieved through the following steps 201 to 206, which are explained in detail below.

Step 201: Construct a frequency domain decomposition module to perform fast Fourier transform on each frame of the video data, decomposing it into amplitude components and phase components.

[0066] The Fast Fourier Transform (FFT) is an algorithm that transforms an image from the spatial domain (pixel coordinates) to the frequency domain (frequency coordinates), decomposing the grayscale variations of an image into sinusoidal components of different frequencies. The amplitude component represents the intensity or energy of different frequency components in the image, primarily encoding information such as global illumination, contrast, and brightness. The phase component represents the positional offset information of different frequency components in the image, mainly preserving edge, texture, and structural details, and is the primary carrier of facial expressions and identity features.

[0067] In some implementations, consecutive image frames from video data are received, and each frame is input into a pre-constructed frequency domain decomposition module. This module then converts the original RGB image to grayscale or processes it directly, and then applies a two-dimensional Fast Fourier Transform (FFT) to transform the image from the spatial domain to the frequency domain.

[0068] After transformation, for each frame of the image, the corresponding complex spectrum is calculated, and the amplitude and phase components are separated through mathematical operations. That is, the amplitude component is obtained by taking the modulus of the complex spectrum, while the phase component is obtained by calculating the complex angle. Subsequently, these two decoupled components are encapsulated separately and passed in parallel as independent outputs to downstream illumination mask generation networks and facial expression mask generation networks specifically designed for them.

[0069] It should be noted that the frequency domain decomposition module simulates, through mathematical calculations within a computer system, the initial decomposition of visual information by the early visual cortex in a biological visual system. In digital image processing, an image f(x,y) of size M×N is processed by a two-dimensional discrete Fourier transform and decomposed into a series of complex numbers: $F(u,v) = \sum_{x=0}^{M-1} \sum_{y=0}^{N-1} f(x,y)\, e^{-j 2\pi (ux/M + vy/N)}$, where f(x,y) is the pixel value at spatial position (x,y) of the frame, and F(u,v) is the intensity of the frequency component at frequency position (u,v) of the frame. u and v are the coordinates in the frequency domain, where u determines the frequency in the horizontal direction (how fast the image changes in the horizontal direction), and v determines the frequency in the vertical direction (how fast the image changes in the vertical direction).

[0070] The algorithm can be implemented through programming, iterating over the pixels to obtain a new complex matrix F(u,v). Then, the amplitude component is calculated by taking the modulus of F(u,v), and the phase component is calculated by an arctangent operation on its imaginary and real parts.
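A minimal sketch of the frequency domain decomposition is given below, using NumPy's built-in two-dimensional FFT instead of an explicit loop over pixels. The function names and the random test frame are hypothetical; the frame is assumed to be a single-channel grayscale image.

```python
import numpy as np

def decompose_frame(gray):
    """Return the amplitude and phase components of a grayscale frame."""
    spectrum = np.fft.fft2(np.asarray(gray, dtype=float))
    amplitude = np.abs(spectrum)   # modulus of the complex spectrum
    phase = np.angle(spectrum)     # argument (arctangent of imag/real)
    return amplitude, phase

def recompose_frame(amplitude, phase):
    """Rebuild the spatial-domain frame from amplitude and phase."""
    return np.real(np.fft.ifft2(amplitude * np.exp(1j * phase)))

rng = np.random.default_rng(0)
frame = rng.random((8, 8))
amplitude, phase = decompose_frame(frame)
restored = recompose_frame(amplitude, phase)

# Uniform dimming scales the amplitude but leaves the phase unchanged.
dim_amplitude, dim_phase = decompose_frame(0.5 * frame)
```

The dimming check illustrates, for the simplest case of a global brightness change, why illumination is handled on the amplitude path while structure is read from the phase path. Spatially varying shadows affect both components to some degree, so this is an idealization.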

[0071] For example, a candidate is interviewed in a study with complex lighting, and a frame of their facial video is fed into a frequency domain decomposition module. This module first decomposes the image using a fast Fourier transform, and the resulting amplitude components clearly show the brightness distribution pattern of the face, which is bright on the left and dark on the right, caused by the side lighting of the desk lamp. The details in the dark area on the right are represented by low energy in the amplitude.

[0072] Meanwhile, the obtained phase components are unaffected by this difference in brightness, and completely preserve the structural information of the candidate's facial muscle texture, wrinkles at the corners of the eyes, and the edge contour of the glasses frame while the candidate is smiling.

[0073] At this point, the amplitude component, carrying illumination information, is sent to the illumination mask generation network, while the phase component, which preserves the pure facial expression structure, is sent to the facial expression mask generation network, laying the foundation for extracting facial expression features without illumination interference.

[0074] Step 202: Construct a lighting mask generation network and an expression mask generation network. The lighting mask generation network takes the amplitude component as input and outputs a lighting-sensitive channel mask; the expression mask generation network takes the phase component as input and outputs an expression-sensitive channel mask.

[0075] The illumination mask generation network is a lightweight convolutional neural network that takes the amplitude component output from the frequency domain decomposition module as input. It analyzes the illumination distribution patterns in the image and outputs an illumination-sensitive channel mask vector with the same dimension as the number of feature map channels, used to identify which feature channels primarily respond to illumination changes. The facial expression mask generation network is a lightweight convolutional neural network with a structure symmetrical to the illumination mask generation network, taking the phase component as input. It analyzes facial expression and structural information in the image and outputs a facial expression-sensitive channel mask vector, used to identify which feature channels primarily respond to facial expression changes.

[0076] In some implementations, both amplitude and phase components are received. The amplitude component is then fed into a lighting mask generation network, which extracts hierarchical features of the lighting through three convolutional layers. A global average pooling layer then compresses the feature map into a feature vector. Finally, a fully connected layer and a sigmoid activation function output a lighting-sensitive channel mask vector with values ranging from 0 to 1, where each element represents the sensitivity of the corresponding feature channel to changes in lighting.

[0077] The phase components are fed into a structurally symmetrical expression mask generation network, and processed through the same three convolutional layers, global average pooling, and fully connected layers. Similarly, a sigmoid activation function is used to output an expression-sensitive channel mask vector, which is used to characterize the sensitivity of each feature channel to changes in expression.
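The forward pass shared by both mask generation networks (three convolutional layers, global average pooling, a fully connected layer, and a sigmoid) can be sketched in plain NumPy as follows. This is an illustration under several assumptions that are not stated in the text: 3×3 kernels without padding, ReLU activations between layers, channel counts of 4, 8, and 8, a 16-channel feature map, and random untrained weights.

```python
import numpy as np

def conv2d_relu(x, kernels):
    """Valid 2-D convolution (cross-correlation) followed by ReLU.

    x: (C_in, H, W); kernels: (C_out, C_in, kH, kW).
    """
    c_out, c_in, kh, kw = kernels.shape
    h, w = x.shape[1] - kh + 1, x.shape[2] - kw + 1
    out = np.zeros((c_out, h, w))
    for i in range(kh):
        for j in range(kw):
            patch = x[:, i:i + h, j:j + w]
            out += np.einsum("oc,chw->ohw", kernels[:, :, i, j], patch)
    return np.maximum(out, 0.0)

def channel_mask(component, conv_kernels, fc_weight, fc_bias):
    """Map an amplitude or phase component to a channel mask in (0, 1)."""
    x = np.asarray(component, dtype=float)[None, :, :]  # add channel axis
    for kernels in conv_kernels:                        # three conv layers
        x = conv2d_relu(x, kernels)
    pooled = x.mean(axis=(1, 2))                        # global average pooling
    logits = fc_weight @ pooled + fc_bias               # fully connected layer
    return 1.0 / (1.0 + np.exp(-logits))                # sigmoid

rng = np.random.default_rng(0)
conv_kernels = [rng.normal(scale=0.1, size=shape)
                for shape in [(4, 1, 3, 3), (8, 4, 3, 3), (8, 8, 3, 3)]]
fc_weight = rng.normal(scale=0.1, size=(16, 8))  # 16 feature-map channels
fc_bias = np.zeros(16)

amplitude = rng.random((16, 16))  # stand-in for a real amplitude component
mask = channel_mask(amplitude, conv_kernels, fc_weight, fc_bias)
```

The same function would be applied to the phase component with a separate set of weights to obtain the expression-sensitive mask.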

[0078] During training, the two mask generation networks learn to identify the functional biases of different channels by jointly optimizing the loss function. For the illumination mask generation network, the training data contains image pairs of the same face under different lighting conditions; through contrastive learning, the network learns to identify channels that respond strongly to changes in lighting. For the facial expression mask generation network, the training data includes images of the same person in different facial expression states, enabling the network to identify channels that respond strongly to changes in facial expression.

[0079] Finally, the outputs of the two networks are then passed in parallel to the channel selection normalization module, providing accurate channel-level prior knowledge for feature modulation.

[0080] It should be noted that the two mask generation networks can simulate, through differentiable convolutional operations, the detection of specific stimulus features by neurons: the first convolutional layer learns to detect basic edge and texture orientations, the second layer learns to combine them into more complex patterns, and the third layer abstracts high-level semantics related to lighting or facial expressions. The fully connected layer simulates the integration and decision-making process of neurons, while the sigmoid activation function simulates the firing probability of neurons: $m_i = \sigma(w_i^{\top} h(x) + b_i) = 1 / (1 + e^{-(w_i^{\top} h(x) + b_i)})$, where $m_i$ indicates the degree to which the i-th channel is activated to process specific information; $h(x)$ is the feature vector extracted and flattened after passing through the convolutional layers; $x$ represents the amplitude component (for the illumination mask generation network) or the phase component (for the facial expression mask generation network); $w_i$ is the i-th row of the weight matrix $W$; and the bias term $b_i$ is an adjustable threshold that determines how readily the i-th channel is activated.

[0081] By using the backpropagation algorithm, the weight parameters of the convolution kernel can be automatically adjusted, enabling the illumination masking network to learn to focus on channel patterns that change significantly under different illuminations but remain relatively stable under the same illumination, while the facial expression masking network learns to focus on channel patterns that change significantly under different facial expressions, thus realizing a function-specific channel selection mechanism in the computer system.

[0082] For example, in an interview scenario with a candidate under overhead lighting, the amplitude component output by the frequency domain decomposition module clearly encodes the deep shadow patterns under the eye sockets and nose. After the illumination mask generation network receives this amplitude component, the first convolutional layer detects the low-brightness edges of the shadow region; the second layer combines them into the overall outline of the shadow area; and the third layer abstracts a high-level pattern of "top light and shadow". Finally, a light-sensitive channel mask vector is output through a fully connected layer and a sigmoid activation function, in which the feature channels related to shadow response are assigned weights close to 1.

[0083] Meanwhile, the phase component preserves the real facial muscle texture of the candidate when they are focused on thinking. After the expression mask generation network processes the phase component, it outputs an expression-sensitive channel mask vector, in which the feature channels related to focused expressions are assigned a weight value close to 1.

[0084] Finally, these two mask vectors are then fed into the channel selection normalization module, which guides the system to normalize the shadow-sensitive channels to eliminate the influence of lighting, while preserving the expression-sensitive channels, thereby ensuring that the facial features extracted later truly reflect the candidate's psychological state rather than environmental interference.

[0085] Step 203: Construct a channel selection normalization module. The input of the channel selection normalization module is connected to the output of the frequency domain decomposition module, the illumination mask generation network, and the expression mask generation network. It is used to perform channel selection normalization processing on the input feature map according to the illumination-sensitive channel mask and the expression-sensitive channel mask, and output illumination-adaptive features.

[0086] The input feature map refers to the basic feature representation extracted from the original image by a lightweight convolutional neural network. Instance normalization is an operation that normalizes each channel of a single sample independently, effectively eliminating sample-specific style information such as illumination differences.

[0087] In some implementations, the system receives a base feature map F extracted from the original video frame by a lightweight convolutional neural network, an illumination-sensitive channel mask from an illumination mask generation network, and an expression-sensitive channel mask from an expression mask generation network.

[0088] At this point, the channel selection normalization module first performs parallel computation on each channel of the basic feature map F: For channels with high weights in the expression-sensitive channel mask, perform an identity mapping, that is, preserve the original features without modification, to ensure that expression-related details are completely transmitted. For channels with high weights in the light-sensitive channel mask, instance normalization is performed, which removes stylistic differences caused by the lighting environment by subtracting the channel mean and dividing by the standard deviation. For the remaining channels that are neither expression-sensitive nor light-sensitive, batch normalization is performed to preserve the statistical regularities learned by the model on a general dataset.

[0089] The processing results of these three parts are then summed element-wise according to the mask weights, and finally fused to output a feature map that adapts to illumination.
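The channel selection normalization can be sketched as a mask-weighted sum of three branches. The exact fusion rule is not spelled out here, so the sketch assumes that the weight of the batch normalization branch is the remainder 1 minus the two mask values (clipped to the range 0 to 1), and that batch normalization uses stored running statistics. All names and test values are hypothetical.

```python
import numpy as np

def channel_select_norm(feat, illum_mask, expr_mask, bn_mean, bn_var, eps=1e-5):
    """Fuse identity, instance-norm and batch-norm branches per channel.

    feat: (C, H, W); illum_mask, expr_mask, bn_mean, bn_var: (C,).
    """
    feat = np.asarray(feat, dtype=float)
    m_illum = np.asarray(illum_mask, dtype=float)[:, None, None]
    m_expr = np.asarray(expr_mask, dtype=float)[:, None, None]
    # Assumption: the remaining weight goes to the batch-norm branch.
    m_rest = np.clip(1.0 - m_illum - m_expr, 0.0, 1.0)

    # Instance normalization: per-channel statistics of this sample only.
    mu = feat.mean(axis=(1, 2), keepdims=True)
    sigma = feat.std(axis=(1, 2), keepdims=True)
    inst = (feat - mu) / (sigma + eps)

    # Batch normalization with stored running statistics (inference mode).
    mean = np.asarray(bn_mean, dtype=float)[:, None, None]
    var = np.asarray(bn_var, dtype=float)[:, None, None]
    batch = (feat - mean) / np.sqrt(var + eps)

    return m_expr * feat + m_illum * inst + m_rest * batch

rng = np.random.default_rng(0)
feat = rng.normal(loc=2.0, scale=3.0, size=(3, 4, 4))
out = channel_select_norm(
    feat,
    illum_mask=[0.0, 1.0, 0.0],   # channel 1 is illumination-sensitive
    expr_mask=[1.0, 0.0, 0.0],    # channel 0 is expression-sensitive
    bn_mean=np.zeros(3),
    bn_var=np.ones(3),
)
```

In this toy setting, channel 0 passes through unchanged, channel 1 is re-centered and re-scaled using its own statistics, and channel 2 is normalized with the stored statistics.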

[0090] For example, in an interview scenario, overhead lighting creates deep shadows around the candidate's eyes and under the nose. In the base feature map F, some channels encode the light and dark boundaries created by these shadows, while others encode the actual facial muscle textures of the candidate when focused on thinking. At this point, the lighting mask generation network outputs a light-sensitive channel mask in which the channels encoding the shadow boundaries are assigned a high weight close to 1, while the expression mask generation network outputs an expression-sensitive channel mask in which the channels encoding the focused facial expression are assigned a high weight close to 1.

[0091] Therefore, after the channel selection normalization module receives these two masks: the focused facial expression channels, which have high weights in the expression-sensitive channel mask, undergo identity mapping, fully preserving facial expression details; the shadow boundary channels, which have high weights in the light-sensitive channel mask, undergo instance normalization, which eliminates the difference in brightness caused by shadows; and the remaining channels undergo batch normalization to maintain the stability of the feature distribution.

[0092] Ultimately, in the illumination-adaptive feature map output by the weighted fusion, the candidate's focused expression is clearly preserved, while the shadow interference caused by top lighting is effectively eliminated, providing clean input features for feature distillation.

[0093] Step 204: Construct and freeze the DINOv2 teacher network. The DINOv2 teacher network is used to extract illumination-invariant high-level semantic features of the image as the distillation target.

[0094] The DINOv2 teacher network is a visual Transformer model pre-trained on a large-scale natural image dataset based on self-supervised learning, capable of extracting high-level semantic features with illumination invariance. "Freezing" refers to fixing the weight parameters of all layers after network construction, preventing updates during subsequent training and preserving the knowledge learned during pre-training. Illumination-invariant high-level semantic features refer to the high similarity of feature vectors output in specific feature channels when the same face is input into the network under different lighting conditions. These features primarily encode essential attributes unrelated to illumination, such as facial identity and expression. The distillation objective refers to the output features of the teacher network that the student network needs to imitate and align with during training, providing optimization direction for the feature distillation module.

[0095] In some implementations, a DINOv2 teacher network pre-trained on a large-scale image dataset is loaded, and the parameters of all its layers are set to an untrainable state, i.e., frozen.

[0096] To predetermine the set of illumination-invariant feature channels, an illumination perturbation dataset is constructed. The same face image is subjected to n different types and intensities of illumination transformations, resulting in n images under different illumination conditions. These images are then fed into the DINOv2 teacher network for forward propagation, recording the output value of each feature channel across all n images. The variance of each channel is calculated. A smaller variance indicates that the channel is less sensitive to illumination changes; a variance threshold can then be set. Channels with variances less than the threshold are identified as illumination-invariant feature channels and included in the set.

[0097] During the actual training process, two paths are executed simultaneously for each input image frame: the image is fed into a lightweight student network for forward propagation to obtain student features, and the same image is fed into the frozen DINOv2 teacher network for forward propagation to obtain teacher features.

[0098] These two features can then be fed into the feature distillation module, but the distillation loss only calculates the differences in the illumination-independent channels in the set, ensuring that the student network aligns with the teacher network only in the illumination-invariant dimension.

[0099] For example, when constructing the illumination-independent feature channel set of the DINOv2 teacher network, a face image under standard lighting is selected and five different illumination transformations are applied to it: top lighting enhancement, side lighting enhancement, back lighting enhancement, low lighting, and high contrast.

[0100] These five images are then input into the frozen DINOv2 teacher network to obtain five feature vectors, and the variance of each feature channel is calculated. Suppose the output values of the 10th channel across the five images are 0.82, 0.81, 0.83, 0.80, and 0.82. The variance is extremely small, indicating that the channel is insensitive to illumination changes, so it is classified as an illumination-independent channel and added to set S. The output values of the 25th channel are 0.95, 0.45, 0.32, 0.78, and 0.91. The large variance indicates that this channel is sensitive to illumination, so it is not included in set S.
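The channel screening in the example above can be sketched in a few lines of Python. This is only an illustrative sketch: the function name `select_invariant_channels` and the threshold value are assumptions (the text does not state a numeric threshold), and each channel is reduced to one scalar output per image for brevity.

```python
from statistics import pvariance


def select_invariant_channels(channel_outputs, threshold):
    """Return the channel indices whose output variance across the
    illumination-perturbed images is below the threshold."""
    return {
        channel
        for channel, values in channel_outputs.items()
        if pvariance(values) < threshold
    }


# Channel outputs from the worked example (five illumination transformations).
outputs = {
    10: [0.82, 0.81, 0.83, 0.80, 0.82],
    25: [0.95, 0.45, 0.32, 0.78, 0.91],
}

VARIANCE_THRESHOLD = 1e-3  # hypothetical value; the source gives no number
S = select_invariant_channels(outputs, VARIANCE_THRESHOLD)
print(S)  # {10}
```

With these values, channel 10 has a population variance of about 0.0001 and channel 25 of about 0.064, so only channel 10 passes the assumed threshold.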

[0101] Therefore, in subsequent training, when the candidate is in a top-lit environment in the interview video frame, the features extracted by the student network are constrained to align with the teacher network on the channels in the set, ensuring that these channels encode the candidate's real facial expressions rather than shadow artifacts caused by top lighting, thus achieving illumination-robust feature learning.

[0102] Step 205: Construct a feature distillation module. The input of the feature distillation module is connected to the output of the channel selection normalization module and the DINOv2 teacher network. It is used to constrain the student network to align with the pre-determined set of illumination-independent feature channels in the DINOv2 teacher network and output environmentally decoupled facial features.

[0103] Specifically, the environmentally decoupled facial features include: projecting features onto the expression-sensitive channel mask to generate illumination-independent expression features (feature vectors that contain only expression-related information while filtering out the influence of illumination); extracting eye features from each frame of the video data and modulating them with cognitive state probabilities to obtain cognitive perception eye features (eye features that can reflect the influence of cognitive activities); and extracting head features from each frame of the video data and combining them with physical pose mapping features to obtain pose-independent head features (features that only reflect the inherent attributes of the head and do not include the influence of physical factors such as sitting posture).

[0104] In some implementations, the illumination-adaptive feature map output by the channel selection normalization module is fed into the student network for further forward propagation to obtain student features. Simultaneously, the same original image is fed into the frozen DINOv2 teacher network to obtain teacher features. The set S of illumination-independent feature channels, determined in advance through the illumination perturbation experiments, is then retrieved, and the distillation loss is computed as the mean squared error between the student and teacher features on the channels in set S only. The weight parameters of the student network are optimized through backpropagation so that the student features align with the teacher features on these illumination-invariant dimensions, while the illumination-sensitive channels are allowed to vary freely for subsequent causal interventions. The final output is the environmentally decoupled facial features.
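As a minimal sketch of a loss restricted to set S, the following Python function computes the mean squared error only over the selected channels. The function name, the toy feature values, and the use of one scalar per channel are assumptions made for illustration; a real implementation would operate on feature tensors inside a training framework.

```python
def distillation_loss(student, teacher, invariant_channels):
    """Mean squared error between student and teacher features,
    computed only over the channels listed in invariant_channels."""
    if not invariant_channels:
        raise ValueError("the channel set S must not be empty")
    squared_errors = [
        (student[c] - teacher[c]) ** 2 for c in invariant_channels
    ]
    return sum(squared_errors) / len(squared_errors)


# Toy per-channel features (one scalar per channel for brevity).
teacher = [0.80, 0.10, 0.50, 0.90]
student = [0.70, 0.90, 0.50, 0.20]
S = [0, 2]  # hypothetical illumination-independent channel indices

loss = distillation_loss(student, teacher, S)
print(round(loss, 6))  # 0.005
```

Channels 1 and 3 differ strongly between student and teacher, but they lie outside S and therefore do not contribute to the loss, which is the behavior described for the illumination-sensitive channels.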

[0105] Specifically, the expression-sensitive channel mask is used as an attention weight to extract, by weighted extraction, the expression-related channels of the student features and generate the illumination-independent expression features. This amounts to a feature selector based on an attention mechanism. The mask is an expression information map that specifies which channels of the feature map carry facial expression information, and it serves as a set of weights. The mask is multiplied element-wise with the student features, the weighted channels are then spatially pooled or linearly transformed, and the resulting feature vector is output.

[0106] Simultaneously, a dedicated eye feature extraction network extracts the raw eye features E_raw from the video frames, and the cognitive state probability P_cog is obtained from the cognitive-physical decoupling module. The two are fused by the modulation E_cog = E_raw · (1 + α · P_cog) to generate the cognitive perception eye features E_cog, where α is the job-adjustable cognitive modulation coefficient.

[0107] In addition, the raw head features H_raw are obtained through a head feature extraction network, and the physical pose mapping features P_pose are obtained from the cognitive-physical decoupling module. The physical pose component is removed by the subtraction H_ind = H_raw − MLP(P_pose) to generate the pose-independent head features H_ind, where MLP denotes a trainable multilayer perceptron (neural network).

[0108] Finally, these three features together constitute a complete set of environmentally decoupled facial features.
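The eye modulation and head pose removal can be illustrated with a short Python sketch. It reads the modulation as the raw eye features multiplied by (1 + α · cognitive state probability), and the pose removal as the raw head features minus the MLP output for the physical pose mapping features. All function names and numeric values are hypothetical, and `identity_mlp` is only a placeholder for the trainable MLP.

```python
def modulate_eye_features(raw_eye, p_cognitive, alpha):
    """Scale every raw eye feature by the gain (1 + alpha * p_cognitive)."""
    gain = 1.0 + alpha * p_cognitive
    return [value * gain for value in raw_eye]


def remove_pose_component(raw_head, pose_mapping, mlp):
    """Subtract the MLP output for the pose mapping features from the
    raw head features, element by element."""
    correction = mlp(pose_mapping)
    if len(correction) != len(raw_head):
        raise ValueError("MLP output must match the head feature dimension")
    return [h - c for h, c in zip(raw_head, correction)]


def identity_mlp(pose_mapping):
    # Hypothetical placeholder: a trained multilayer perceptron would go here.
    return list(pose_mapping)


# Toy values; alpha = 0.5 is an assumed modulation coefficient.
eye = modulate_eye_features([0.2, 0.4], p_cognitive=0.85, alpha=0.5)
head = remove_pose_component([-0.30, 0.10], [-0.25, 0.05], identity_mlp)
print(eye, head)
```

With a thinking-state probability of 0.85 and the assumed α of 0.5, the gain is 1.425, so the eye features are amplified rather than penalized when the candidate is likely thinking.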

[0109] For example, a candidate is having an interview in a dormitory where overhead lighting causes shadows to appear around their eyes. After processing by the channel selection normalization module, their illumination-adaptive features are sent to the feature distillation module.

[0110] On the set S of illumination-independent feature channels, the student features are constrained to align with the DINOv2 teacher features, ensuring that these channels encode the candidate's genuine expression of focus rather than shadow artifacts.

[0111] At this point, the feature projection module uses the expression-sensitive channel mask to extract the illumination-independent expression features from the student features. These features accurately capture the candidate's focused expression with a slight frown.

[0112] The eye feature extraction network detects a brief shift of the candidate's gaze toward the window, as if the candidate were thinking about a problem. Taken alone, these raw eye features yield a low score. However, the cognitive state probability shows a high value of 0.85 for the thinking state, and after modulation the cognitive perception eye features are generated. The gaze shift is therefore correctly interpreted as a normal shift during thinking rather than a lack of concentration.

[0113] The head feature extraction network detects a slight forward tilt of the candidate's head caused by sitting posture and extracts the raw head features, which taken alone register as a negative feature. After the MLP output for the physical pose mapping features is subtracted, the resulting pose-independent head features accurately reflect the inherent properties of the candidate's head pose. These three environmentally decoupled facial features then collectively provide clean input data for latent association analysis.

[0114] Step 206: Construct a cognitive physics decoupling module to perform head pose estimation and temporal decomposition on facial features decoupled from the environment, obtain low-frequency and high-frequency components, and output the cognitive state features of candidates through a cognitive state classifier based on the statistical features of the high-frequency components.

[0115] The operation of the cognitive-physical decoupling module specifically includes: predicting head pose for each frame of facial features decoupled from the environment using the 6DRepNet model, and outputting the head rotation matrix in each frame; filtering the rotation matrix sequence using a low-pass filter to obtain low-frequency and high-frequency components; extracting statistical features of high-frequency components within a preset window, and combining them with a multilayer perceptron classifier to obtain the cognitive state features of the candidate, where the cognitive state features include at least the probabilities of cognitive states such as thinking state, recall state, stuck state, and normal state.

[0116] The cognitive-physical decoupling module is a functional unit used to separate physical and cognitive pose components from facial features decoupled from the environment and to identify the cognitive states of candidates. The 6DRepNet model is a head pose estimation network based on a six-dimensional continuous rotation representation, capable of directly predicting the head's rotation matrix in three-dimensional space. The cognitive state classifier is a neural network model based on a multilayer perceptron, used to output the probability distribution of the candidate's cognitive state based on the statistical characteristics of high-frequency components.

[0117] In some implementations, each frame of facial features decoupled from the environment is input into the 6DRepNet model for head pose prediction. The 6DRepNet model, through its unique six-dimensional rotation representation, directly regresses the head rotation matrix R(t)∈SO(3) in each frame, thus describing the head's orientation in three-dimensional space. This yields a continuous sequence of rotation matrices R(1), R(2), ..., R(t).

[0118] The rotation matrix sequence is then input into a pre-designed low-pass filter for time-series decomposition. The low-pass filter can be a Butterworth filter or a moving average filter with a cutoff frequency of 0.5 Hz. Each component of the rotation matrix is filtered separately to obtain the smoothed low-frequency component. This component primarily characterizes slow posture changes caused by physical factors such as camera position and candidate posture.

[0119] Subsequently, the high-frequency component is calculated from the original rotation matrix sequence and the low-frequency component through matrix operations. This component reflects rapid micro-movements of the head caused by cognitive activities such as thinking and recalling.
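The decomposition can be sketched for a single rotation matrix component treated as a scalar time series. The sketch below uses a moving average as the low-pass filter, which is one of the two options named above. Two points are assumptions rather than statements from the text: the high-frequency part is taken as the simple residual (original minus low-frequency), since the exact matrix operation is not given, and the window length is an arbitrary illustrative value rather than one derived from the 0.5 Hz cutoff.

```python
def moving_average(signal, window):
    """Centered moving average; the window shrinks at the sequence edges."""
    if window < 1:
        raise ValueError("window must be at least 1")
    half = window // 2
    smoothed = []
    for i in range(len(signal)):
        lo = max(0, i - half)
        hi = min(len(signal), i + half + 1)
        smoothed.append(sum(signal[lo:hi]) / (hi - lo))
    return smoothed


def decompose(signal, window):
    """Split one component series into a low-frequency part and a residual.

    Assumption: the high-frequency part is the element-wise difference
    between the original series and its smoothed version.
    """
    low = moving_average(signal, window)
    high = [x - l for x, l in zip(signal, low)]
    return low, high


# Toy series for one rotation matrix entry over eight frames.
series = [0.10, 0.12, 0.09, 0.13, 0.10, 0.14, 0.11, 0.15]
low, high = decompose(series, window=3)
print(low)
print(high)
```

By construction the two parts add back up to the original series, so no information is lost by the split.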

[0120] Various statistical features of the high-frequency component are then extracted within a fixed-length sliding time window, including variance, peak frequency, energy, and zero-crossing rate, and concatenated into a feature vector. This feature vector is input into a pre-trained multilayer perceptron classifier with two hidden layers of 128 neurons each. The output layer uses the Softmax activation function and outputs a four-dimensional probability vector corresponding to the probabilities of the thinking state, recall state, stuck state, and normal state.

[0121] It should be noted that 6DRepNet avoids the singularity problem in traditional Euler angle representation through its six-dimensional continuous rotation representation, simulating the precise motion parameter encoding mechanism in the cerebellum. Simultaneously, the decomposition of the rotation matrix sequence by a low-pass filter simulates the vestibular system's stable perception of head posture in the nervous system. The low-frequency components correspond to the cerebellum's function in maintaining posture and balance, while the high-frequency components correspond to the basal ganglia's function in initiating and regulating voluntary movement. The statistical feature calculation of the high-frequency components simulates the cortical process of perceiving and integrating motion signals; variance corresponds to the perception of motion amplitude, and peak frequency corresponds to the perception of motion rhythm. Finally, the multilayer perceptron classifier simulates the premotor cortex's function of recognizing and classifying action intentions, achieving the mapping from motion features to cognitive states through forward and backward propagation algorithms.

[0122] For example, if a candidate for an algorithm engineer position pauses for a long time while answering complex technical questions, the video frames corresponding to his environmentally decoupled facial features are fed into the 6DRepNet model frame by frame to obtain the head rotation matrix for each frame.

[0123] After processing with a low-pass filter, the low-frequency components show that the candidate's overall posture is stable, with no significant body movement. The high-frequency components, however, exhibit regular, slight swaying at approximately 0.8 Hz with moderate amplitude.

[0124] Within a 5-second sliding window, the statistical features of the high-frequency component are extracted, yielding a variance of 0.25, a peak frequency of 0.8 Hz, and an energy value of 12.5. These statistical features are then fed into the multilayer perceptron classifier, which outputs the following cognitive state probabilities: thinking state 0.85, recall state 0.08, stuck state 0.05, and normal state 0.02. Based on this, the candidate is determined to be in a state of deep thinking, and the pauses in speech should be interpreted as a thought process rather than a sign of incompetence.
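The window statistics in this example can be computed as in the following Python sketch. The synthetic signal is an assumption chosen for illustration: a pure 0.8 Hz sinusoid sampled at an assumed 10 frames per second for 5 seconds, with an amplitude picked so that the variance, peak frequency, and energy come out close to the values quoted above. The text does not state the frame rate or how energy is defined; here energy is taken as the sum of squared samples. The classifier itself is not reproduced, only its final Softmax step with hypothetical logits.

```python
import cmath
import math
from statistics import pvariance


def peak_frequency(window, sample_rate):
    """Frequency in Hz of the largest non-DC bin of a naive DFT."""
    n = len(window)
    best_bin, best_mag = 0, 0.0
    for k in range(1, n // 2 + 1):
        coeff = sum(
            x * cmath.exp(-2j * math.pi * k * t / n)
            for t, x in enumerate(window)
        )
        if abs(coeff) > best_mag:
            best_bin, best_mag = k, abs(coeff)
    return best_bin * sample_rate / n


def zero_crossing_rate(window):
    """Fraction of adjacent sample pairs whose signs differ."""
    crossings = sum(
        1 for a, b in zip(window, window[1:]) if (a < 0) != (b < 0)
    )
    return crossings / (len(window) - 1)


def window_features(window, sample_rate):
    """Return [variance, peak frequency, energy, zero-crossing rate]."""
    energy = sum(x * x for x in window)  # assumed definition of energy
    return [
        pvariance(window),
        peak_frequency(window, sample_rate),
        energy,
        zero_crossing_rate(window),
    ]


def softmax(logits):
    """Numerically stable Softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


SAMPLE_RATE = 10  # assumed frames per second
AMPLITUDE = math.sqrt(0.5)  # chosen so the variance is about 0.25
signal = [
    AMPLITUDE * math.sin(2 * math.pi * 0.8 * t / SAMPLE_RATE)
    for t in range(5 * SAMPLE_RATE)
]
variance, peak_hz, energy, zcr = window_features(signal, SAMPLE_RATE)
print(round(variance, 3), peak_hz, round(energy, 3))  # 0.25 0.8 12.5

# Hypothetical logits for (thinking, recall, stuck, normal).
probabilities = softmax([3.0, 0.6, 0.2, -0.8])
```

In a full system the four statistics would be concatenated and passed to the trained classifier; the logits above merely stand in for its pre-Softmax output.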

[0125] Based on the above technical solution, by setting a frequency domain decomposition module to perform Fast Fourier Transform on video frames, the image can be decoupled into amplitude components representing illumination and phase components representing structure. This solves the technical problems of unclear entanglement between illumination and facial expression features and contamination of visual input sources in the spatial domain at the generation level. Furthermore, by generating illumination masks with amplitude components as input and facial expression masks with phase components as input, respectively outputting illumination-sensitive channel masks and facial expression-sensitive channel masks, this solves the technical problems of traditional methods being unable to distinguish the functional tendencies of feature channels and the difficulty in simultaneously achieving illumination removal and facial expression preservation, thus realizing refined functional division of feature channels. Simultaneously, by constructing a channel selection normalization module, the basic feature maps can be differentiated based on the two masks. This preserves details through identity mapping of the facial expression channel, eliminates differences through instance normalization of the illumination channel, and maintains stability through batch normalization of the remaining channels. This solves the technical problems of multi-source normalization methods being unable to adaptively select features and lacking targeted feature control, thus outputting pure features with illumination adaptation.

[0126] Meanwhile, the DINOv2 teacher network can predetermine the set of illumination-independent feature channels through illumination perturbation experiments, solving the technical problems of lightweight networks being unable to autonomously learn illumination invariance and having unclear distillation targets from the benchmark level, and providing a stable semantic benchmark for feature alignment.

[0127] Subsequently, the feature distillation module is used to constrain the student network, aligning it only with the teacher network on the set of illumination-independent channels. Then, through facial expression masking projection, cognitive state modulation, and physical pose removal, illumination-independent facial expression features, cognitive perception eye features, and pose-independent head features are generated respectively. This solves the technical problem of coupling multiple environmental factors with real behavior and the difficulty of a single feature to fully represent the candidate's state, thus obtaining a complete set of environmentally decoupled facial features.

[0128] Finally, a cognitive-physical decoupling module was constructed. This module uses 6DRepNet to estimate head posture, a low-pass filter to separate low-frequency physical components from high-frequency cognitive components, and a multilayer perceptron to classify cognitive states based on high-frequency statistical features. This solves the technical problems of distinguishing between physical head movements and cognitive head micro-movements, and the inability of machines to perceive candidates' internal thought processes. It enables the accurate identification of deep cognitive states such as thinking, recalling, and pausing from dynamic head behavior, providing key feature inputs that reflect the candidate's true thought process for subsequent adaptive job evaluation.

[0129] In another possible implementation of the embodiments of this application, as shown in Figure 1-3, the process of semantically decoupling language data and combining it with the corresponding competencies and basic weights to obtain language features can be achieved through the following steps 301 to 304, which are explained in detail below. Step 301: Perform syntactic analysis on the language data and call the industry terminology dictionary for term matching to identify structured component vectors, slang component vectors, and emotional component vectors.

[0130] In some implementations, the collected language data text is input into a pre-trained semantic role labeling model that has been fine-tuned for interview scenarios. The model then performs word-by-word analysis of the text, identifying the logical framework and semantic relationships in the sentences. For example, the model can accurately label structures such as "firstly...secondly...finally" and "because...therefore..." as highly structured expressions and quantize them as structured component vectors.

[0131] Simultaneously, a pre-built industry terminology dictionary is invoked, and the text is matched against the pre-built industry terminology dictionary in parallel. Through efficient string matching algorithms or semantic retrieval based on embedding, all industry slang appearing in the text is identified, and a slang component vector representing terminology density and professionalism is generated based on its predefined weight in the dictionary.

[0132] Furthermore, a sentiment classification head is integrated within the semantic role labeling model or its parallel branches. This head analyzes the sentiment tendency and tone intensity of sentences to output the emotional component vector. The three vectors are thus computed synchronously and independently through the forward propagation of the neural network and the mapping of the feature extraction layer, which completes the fine-grained decoupling of the original language data in three dimensions.

[0133] The industry terminology dictionary is a pre-built database of key-value pairs. The keys are industry-specific terms, technical jargon, and abbreviations mined, cleaned, and deduplicated from recruitment documents, technical blogs, and professional literature in specific industries (such as internet, finance, and healthcare). The values are the industry domain tag and technical level tag to which the term belongs. During its use, the dictionary will be continuously iterated by regularly mining new terms from the latest industry news, recruitment needs, and internal interview records.

[0134] It's important to note that the semantic role labeling model is a deep learning-based natural language processing model. Internally, it encodes the contextual information of text through a multi-layered Transformer structure and uses linear layers and a Softmax function to assign semantic role labels relative to core predicates (such as verbs) to each word in a sentence, such as agent, patient, time, and place. This transforms natural language expressions into a structured set of predicate and argument relations. During its training, a corpus containing rich logical structure annotations is required, and targeted fine-tuning is performed in interview scenarios to ensure its ability to recognize logical frameworks in colloquial and informal expressions. Syntactic analysis here specifically refers to using this model to identify logical connectors, subject-verb-object structures, and modification relations in sentences to quantify the completeness of the language expression's framework.

[0135] For example, when a candidate applying for a backend development position answered "Please introduce your project experience", the language data was "I was mainly responsible for that PR. After merging it into the main branch, the CI / CD pipeline was automatically triggered. However, I encountered a circuit breaker problem once. It took me a long time to troubleshoot and finally solved it. It was quite difficult."

[0136] At this point, the semantic role labeling model analyzes predicates such as "mainly responsible for", "merged into...", "encountered...", and "solved" and their relationships, and identifies logical connectors such as "then...", "however...", which encodes this structural information into structured component vectors (e.g., a score of 0.6 in the "logical integrity" dimension and a score of 0.5 in the "causal clarity" dimension).

[0137] The industry terminology dictionary matching module identified three slang terms from the text: “PR”, “CI / CD pipeline”, and “circuit breaker”. After weighted calculation, the slang component vector scored 0.9 in the “backend technology” dimension.
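The dictionary lookup in this example can be sketched with plain string matching. The dictionary contents below are hypothetical: the three terms come from the example, while the domain and level tags are made-up placeholders, and the function `match_terms` is an illustrative name. The weighted scoring that turns the matches into a slang component vector is not reproduced, because the text does not give the weights.

```python
import re

# Hypothetical entries: term -> (industry domain tag, technical level tag).
TERM_DICTIONARY = {
    "PR": ("backend technology", "basic"),
    "CI / CD pipeline": ("backend technology", "intermediate"),
    "circuit breaker": ("backend technology", "advanced"),
}


def match_terms(text, dictionary):
    """Return the dictionary terms that occur in text as whole words."""
    found = []
    for term in dictionary:
        # Lookarounds keep "PR" from matching inside words such as "problem".
        pattern = r"(?<!\w)" + re.escape(term) + r"(?!\w)"
        if re.search(pattern, text, flags=re.IGNORECASE):
            found.append(term)
    return found


answer = (
    "I was mainly responsible for that PR. After merging it into the main "
    "branch, the CI / CD pipeline was automatically triggered. However, I "
    "encountered a circuit breaker problem once."
)
print(match_terms(answer, TERM_DICTIONARY))
# ['PR', 'CI / CD pipeline', 'circuit breaker']
```

Exact matching is the simpler of the two retrieval options mentioned in the text; embedding-based semantic retrieval would additionally catch paraphrased or misspelled terms.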

[0138] Meanwhile, the sentiment analysis model identified "quite difficult" as carrying a slight sense of exclamation and self-affirmation, and output an emotional component vector (scoring 0.3 in the "positive emotion" dimension and 0.4 in the "confidence" dimension).

[0139] This successfully decouples the mixed language data into three feature vectors with different semantic focuses.

[0140] Step 302: Query the job knowledge graph based on the job type of the target job to obtain the corresponding job perception weight configuration table. The job perception weight configuration table includes the fusion weight coefficients of the structured components, slang components and emotional components corresponding to the job type.

[0141] The job perception weight configuration table consists of one or more relational data tables embedded in the job knowledge graph. Each row corresponds to a specific job type, and the columns define the fusion weight coefficients applied to the structured, slang, and emotional components during language feature fusion. These coefficients are either preset according to the job characteristics or learned from recruitment data through machine learning algorithms.

[0142] In some implementations, after obtaining the job type of the target job, a query request is sent to a pre-built and continuously updated job knowledge graph using that job type as the query key.

[0143] When the job knowledge graph receives the request, it performs a precise search in the job perception weight configuration table stored internally. This configuration table is generated during the knowledge graph construction phase by learning from massive recruitment data through statistical analysis methods. For each defined job type, the configuration table stores a set of fusion weight coefficients that are strongly related to that job. For example, for the "backend development engineer" job, its configuration record is [structure weight: 0.3, slang weight: 0.5, emotional weight: 0.2]; for the "new media operation" job, its configuration record is [structure weight: 0.6, slang weight: 0.1, emotional weight: 0.3].
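A minimal sketch of this lookup, assuming the configuration table is held in memory as a dictionary keyed by job type. The two records are the ones quoted above; the names `WEIGHT_CONFIG` and `get_fusion_weights` are illustrative, and a real system would query the knowledge graph store instead.

```python
# Configuration records quoted in the text, keyed by job type.
WEIGHT_CONFIG = {
    "backend development engineer": {
        "structure": 0.3, "slang": 0.5, "emotion": 0.2,
    },
    "new media operation": {
        "structure": 0.6, "slang": 0.1, "emotion": 0.3,
    },
}


def get_fusion_weights(job_type, config=WEIGHT_CONFIG):
    """Return the fusion weight coefficients for the given job type."""
    if job_type not in config:
        raise KeyError(f"no weight configuration for job type: {job_type!r}")
    return config[job_type]


weights = get_fusion_weights("backend development engineer")
print(weights)  # {'structure': 0.3, 'slang': 0.5, 'emotion': 0.2}
```

Raising an error for an unknown job type is a design choice of this sketch; falling back to a default weight set would be an equally plausible behavior that the text does not specify.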

[0144] The knowledge graph encapsulates the set of fusion weight coefficients that uniquely corresponds to the target job type and returns it to the semantic decoupling module, where it serves as the basis for the subsequent weighted fusion of the structured component vectors, slang component vectors, and emotional component vectors.

[0145] It's worth noting that the system incorporates a weight self-learning mechanism. It can periodically analyze recruitment results (such as candidate hiring status and post-employment performance) and use models like linear or logistic regression to optimize the weight coefficients for each job type. For example, if data shows that candidates with higher emotional weights perform better in sales, the emotional weighting coefficient for sales positions will be adjusted upwards. Furthermore, the configuration table also provides a manual calibration interface for HR professionals and business experts. This allows experts to fine-tune the automatically generated weight coefficients based on their company's specific culture and hiring preferences. For instance, the weights related to communication and expression can be appropriately increased in companies that value a "teamwork" culture, achieving a deep integration of the general model with the company's personalized needs.

[0146] Take, for example, a large internet company that is simultaneously recruiting for "Algorithm Researcher" and "User Growth Specialist." When a candidate applies for the Algorithm Researcher position and begins the interview, the job type is identified as "Algorithm Researcher." The corresponding job perception weight configuration table in the job knowledge graph is then queried, and the obtained fusion weight coefficient is [Structured: 0.2, Slang: 0.7, Emotion: 0.1]. This indicates that for this position, the accurate use of professional terminology (slang) is far more important than the emotional tone and structural integrity of the language.

[0147] Another candidate applying for the User Growth Specialist position has a configuration table coefficient of [Structured: 0.5, Slang: 0.1, Emotion: 0.4], which reflects that the position requires candidates to clearly and systematically explain solutions (structured) and to use persuasive language to move others (emotional), while industry slang is relatively less important.

[0148] This query process allows for the precise allocation of completely different language evaluation criteria to two candidates in different assessment contexts.

[0149] Step 303: The structured component vector, the slang component vector, and the emotional component vector are weighted and fused according to the fusion weight coefficient to generate a preliminary language feature vector.

[0150] In some implementations, the structured component vector V_struct, the slang component vector V_slang, and the emotional component vector V_emo are obtained, together with the corresponding weight coefficients w_struct, w_slang, and w_emo for the target position. A dimension alignment check is first performed on the three component vectors to ensure that they have the same number of dimensions or can be combined through a broadcast mechanism. Each vector is then multiplied element-wise by its scalar coefficient and the results are summed, yielding the preliminary language feature vector V_lang = w_struct · V_struct + w_slang · V_slang + w_emo · V_emo.
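The weighted fusion can be sketched as a scalar-weighted sum of the three component vectors. The function name and the two-dimensional toy vectors are assumptions for illustration; the weights are the backend development record quoted earlier. Broadcasting is not covered here: the sketch simply requires equal dimensions.

```python
def fuse_components(v_struct, v_slang, v_emotion,
                    w_struct, w_slang, w_emotion):
    """Return the weighted sum of three equally sized component vectors."""
    if not (len(v_struct) == len(v_slang) == len(v_emotion)):
        raise ValueError("component vectors must have the same dimension")
    return [
        w_struct * s + w_slang * j + w_emotion * e
        for s, j, e in zip(v_struct, v_slang, v_emotion)
    ]


# Toy two-dimensional component vectors (illustrative values only).
v_lang = fuse_components(
    v_struct=[0.6, 0.5],
    v_slang=[0.9, 0.9],
    v_emotion=[0.3, 0.4],
    w_struct=0.3, w_slang=0.5, w_emotion=0.2,
)
print([round(x, 2) for x in v_lang])  # [0.69, 0.68]
```

Because the slang weight dominates for this job type, the high slang scores contribute most of the fused value, which is the intended job-specific emphasis.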

[0151] Step 304: Project the preliminary language feature vectors onto the space composed of competencies using linear transformation, and output the job-aligned language features. Each component of the job-aligned language features corresponds one-to-one with the competency score data.

[0152] Linear transformation refers to the mathematical operation of mapping an input vector from one space to another in a neural network through a fully connected layer by multiplying it with a trainable weight matrix and optionally adding a bias term. The initial language feature vector is a one-dimensional or multi-dimensional tensor generated by weighted fusion. It retains the quantitative information of the original language data in the three dimensions of structure, professionalism, and emotion, while the weight coefficients reflect the differentiated attention paid to these dimensions by specific job types. The competency space is an abstract high-dimensional vector space, whose coordinate axes are defined by the various competencies (such as technical depth, logical thinking, and communication) defined in the job knowledge graph. Each point in the space represents a specific combination of competency scores. The job-aligned language feature is a multi-dimensional vector whose number of dimensions is completely consistent with the number of competencies defined for a specific job in the job knowledge graph. The value of each component in the vector represents the quantitative performance score of the candidate in that competency, realizing a direct correspondence between language features and job evaluation standards.

[0153] In some implementations, a linear transformation layer is first constructed as a fully connected neural network layer. Its core parameter is a weight matrix W of dimension K × d, where K is the number of competencies defined for the target job type in the job knowledge graph and d is the dimension of the preliminary language feature vector. The preliminary language feature vector V_lang (a dense tensor of dimension d) is then passed through the constructed and trained linear transformation layer, which computes F_job = W · V_lang to obtain the job-aligned language features F_job.
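The projection onto the competency space is a matrix-vector product with an optional bias, which can be sketched as follows. The weight matrix below is a made-up example with three competencies and two input features; in the described system the matrix would come from supervised training, and the function name is illustrative.

```python
def project_to_competencies(weight_matrix, features, bias=None):
    """Multiply the weight matrix by the feature vector, optionally adding
    a bias, and return one score per competency (one per matrix row)."""
    scores = []
    for row_index, row in enumerate(weight_matrix):
        if len(row) != len(features):
            raise ValueError("each row of W must match the feature dimension")
        score = sum(w * x for w, x in zip(row, features))
        if bias is not None:
            score += bias[row_index]
        scores.append(score)
    return scores


# Hypothetical 3 x 2 weight matrix: three competencies, two input features.
W = [
    [1.0, 0.0],
    [0.5, 0.5],
    [0.0, 1.0],
]
v_lang = [0.69, 0.68]
scores = project_to_competencies(W, v_lang)
print([round(s, 3) for s in scores])  # [0.69, 0.685, 0.68]
```

The number of rows fixes the number of output scores, which is how the output stays in one-to-one correspondence with the competencies defined for the job.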

[0154] It should be noted that the weight matrix of the linear transformation layer is obtained through supervised training on a large-scale labeled dataset. The dataset is built by collecting a large number of interview response texts and having domain experts (such as senior HR professionals and business managers) score them on the preset competencies for each job type.

[0155] During training, the preliminary language feature vectors are used as input and the expert scores as the target output. The mean squared error loss is minimized, and the parameters are iteratively optimized with the backpropagation algorithm so that the model's predicted scores continuously approach the experts' actual scores.

[0156] Based on the above technical solution, by using semantic role annotation and an industry terminology database, interview language is precisely decoupled into three independent semantic component vectors: structured, professional terminology (jargon), and emotional. This addresses the problem of semantic drift in general models that misjudge unstructured expressions and professional terminology inherent in technical positions as having poor logic. Furthermore, based on the fusion weights configured according to the target job type, these components are dynamically weighted to achieve job-adaptive calibration of the feature vectors. Finally, a linear transformation projects the calibrated features onto a defined competency dimension, directly outputting an interpretable quantitative score. This effectively constructs an end-to-end mapping link from raw language to job-specific competency assessment, improving the accuracy of technical talent selection.

[0157] In another possible implementation of the embodiments of this application, as shown in Figure 1-4, the process of capturing temporal correlations through the latent association analysis module and obtaining latent association anomaly scores through environmental causal intervention verification can be achieved through the following steps 401 to 404, which are explained in detail below. Step 401: Perform temporal alignment on the language features, speech features, facial features, and cognitive state features, and capture the latent associations among the modality features through a multi-head cross-attention mechanism to obtain a cross-modal joint representation and a modality consistency score.

[0158] The multi-head cross-attention mechanism, based on a Transformer architecture neural network layer, uses multiple parallel attention heads to calculate the mutual attention between feature sequences of different modalities from different perspectives, thereby capturing the complex and non-linear potential relationships between them. Cross-modal joint representation is a unified feature vector containing comprehensive information from all modalities, generated by fusing the modal interaction information mined by the multi-head cross-attention mechanism. Modality consistency score is a quantitative indicator calculated based on the cross-modal joint representation, reflecting the degree of fit between language, speech, and visual behavior patterns within a specific time window; a higher score indicates greater consistency between speech and behavior.

[0159] In some implementations, the job-aligned language feature sequence, speech feature sequence, environment-decoupled facial feature sequence, and cognitive state feature sequence are strictly time-aligned based on the timestamp generated synchronously during data collection, to ensure that the features at each time point represent the candidate's performance at the same moment.
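One way to picture the timestamp-based alignment is nearest-timestamp matching against a reference timeline. The sketch below assumes that each modality stream is a sorted list of timestamps with one value per timestamp, and that the nearest sample is an acceptable match; the application does not specify the matching rule, and the names `nearest_index` and `align_streams` are hypothetical.

```python
from bisect import bisect_left
from typing import Any, Dict, List, Sequence, Tuple


def nearest_index(timestamps: Sequence[float], t: float) -> int:
    """Index of the sample whose timestamp is closest to t (timestamps sorted)."""
    i = bisect_left(timestamps, t)
    if i == 0:
        return 0
    if i == len(timestamps):
        return len(timestamps) - 1
    # Choose between the neighbours on either side of t.
    return i if timestamps[i] - t < t - timestamps[i - 1] else i - 1


def align_streams(
    reference_ts: Sequence[float],
    streams: Dict[str, Tuple[Sequence[float], Sequence[Any]]],
) -> List[Dict[str, Any]]:
    """For each reference time, collect the nearest sample from every modality."""
    aligned = []
    for t in reference_ts:
        row: Dict[str, Any] = {"t": t}
        for name, (ts, values) in streams.items():
            row[name] = values[nearest_index(ts, t)]
        aligned.append(row)
    return aligned
```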

[0160] The aligned multimodal feature sequences can then be input into the latent association analysis module based on the Transformer architecture. A multi-head cross-attention network can then use a query vector dominated by language features to query speech and visual feature sequences separately, and calculate the attention weights between language and audio / video features. At this point, the weight matrix quantifies the degree of matching between a candidate's speech tone and facial micro-expressions when the candidate speaks specific content.
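The language-queried attention step can be illustrated with a scaled dot-product sketch. For brevity it uses a single head and no learned projection matrices; both are simplifying assumptions, since the application describes a multi-head Transformer layer. The function names and toy vectors are hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def softmax(xs: Sequence[float]) -> List[float]:
    """Numerically stable softmax."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(
    queries: Sequence[Sequence[float]],  # language features, one per time step
    keys: Sequence[Sequence[float]],     # speech or visual features
    values: Sequence[Sequence[float]],   # speech or visual features
) -> Tuple[List[List[float]], List[List[float]]]:
    """Return (attended outputs, attention weights) for language queries."""
    d = len(keys[0])
    outputs, weights = [], []
    for q in queries:
        scores = [
            sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in keys
        ]
        w = softmax(scores)
        weights.append(w)
        outputs.append(
            [sum(wi * v[j] for wi, v in zip(w, values))
             for j in range(len(values[0]))]
        )
    return outputs, weights
```

Each row of the returned weight matrix sums to 1 and indicates which audio or video time steps a given language time step attends to most strongly.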

[0161] Finally, by aggregating the outputs of these attention mechanisms, a cross-modal joint representation incorporating cross-modal temporal correlation information is generated. Simultaneously, based on this joint representation, the mean cosine similarity between language-dominant features and audio-video fusion features is calculated within a sliding time window and used as a modal consistency score to reflect the strength of candidates' consistency between words and actions.
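The modality consistency score described above, a mean cosine similarity over a sliding time window, can be sketched as follows. The window length, the zero-vector handling, and the function names are assumptions made for the example.

```python
import math
from typing import List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def consistency_scores(
    language: Sequence[Sequence[float]],
    audio_video: Sequence[Sequence[float]],
    window: int,
) -> List[float]:
    """Mean per-step cosine similarity inside each sliding window."""
    sims = [cosine(a, b) for a, b in zip(language, audio_video)]
    return [
        sum(sims[i:i + window]) / window
        for i in range(len(sims) - window + 1)
    ]
```

A window in which the language-dominant and audio-video vectors point in similar directions scores close to 1; mismatched windows score lower.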

[0162] For example, a candidate is answering a question about teamwork, and their linguistic features show that they are describing "successfully mediating a conflict" (semantically positive). At this point, a multi-head cross-attention mechanism captures that when the phrase "mediating a conflict" appears in the linguistic features, their synchronous speech features (provided by the speech module) show a slightly faster speech rate and higher pitch, while facial features from the cascaded architecture capture a brief and asymmetrical slight twitch at the corner of their mouth. This cross-modal correlation information is captured by the attention network and aggregated to generate a joint representation.

[0163] The modal consistency score calculated based on this representation is low, thus indicating that there is non-linguistic tension or potential inconsistency behind the positive linguistic content, providing key evidence for subsequent anomaly detection.

[0164] Step 402: Perform anomaly detection and classification on the cross-modal joint representation and modality consistency score to obtain the original anomaly probability.

[0165] Here, anomaly detection classification refers to using a pre-trained machine learning model that takes the cross-modal joint representation and modality consistency score as input features, determines whether the current interview segment shows abnormal patterns of inconsistency between speech and behavior, and outputs the corresponding probability value. The original anomaly probability is the probability value output directly by the anomaly detection classifier, without environmental factor calibration. The value ranges from 0 to 1. The higher the value, the more likely the system considers the current candidate's cross-modal behavior to be abnormal, for example due to concealment, fabrication, or extreme nervousness.

[0166] In some implementations, the cross-modal joint representation and modality consistency score are concatenated into a comprehensive feature vector, which serves as the input to an anomaly detection classifier. In this case, the classifier employs a three-layer Multilayer Perceptron (MLP) architecture, containing two hidden layers with 128 neurons each. The hidden layers use the ReLU activation function, and the output layer uses a Sigmoid activation function to map the result to the 0-1 range. During model training, a training dataset containing samples from three classes is constructed: positive samples, consisting of 5,000 interview segments labeled by HR professionals as showing "consistency between words and actions"; negative samples, consisting of 2,000 segments labeled as showing "inconsistent words and actions" (such as concealment, fabrication, or extreme tension); and augmented samples, generated by splicing the language of positive samples with the audio and video of negative samples, thereby improving the model's sensitivity to cross-modal mismatches.

[0167] During training, a binary cross-entropy loss function is used, with AdamW as the optimizer and an initial learning rate of 1e-4. The network weights are iteratively updated through the backpropagation algorithm, enabling the model to learn to recognize cross-modal abnormal patterns such as "positive language but tense expression" and "fluent expression but trembling voice" from the input joint representation. Finally, an uncalibrated original anomaly probability is output.
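The classifier shape and loss described above can be sketched as a forward pass only. The sketch assumes random initial weights, an 8-dimensional joint representation, and hypothetical function names; the AdamW update and the training loop are omitted.

```python
import math
import random
from typing import List, Sequence, Tuple

Layer = Tuple[List[List[float]], List[float]]  # (weights, bias)


def init_layer(n_in: int, n_out: int, rng: random.Random) -> Layer:
    """Small uniform random weights, zero bias (an illustrative choice)."""
    scale = 1.0 / math.sqrt(n_in)
    weights = [[rng.uniform(-scale, scale) for _ in range(n_in)]
               for _ in range(n_out)]
    return weights, [0.0] * n_out


def dense(x: Sequence[float], layer: Layer) -> List[float]:
    weights, bias = layer
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def relu(v: Sequence[float]) -> List[float]:
    return [max(0.0, a) for a in v]


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def build_classifier(input_dim: int, hidden: int = 128, seed: int = 0) -> List[Layer]:
    """Two hidden layers of 128 units and one sigmoid output unit."""
    rng = random.Random(seed)
    return [init_layer(input_dim, hidden, rng),
            init_layer(hidden, hidden, rng),
            init_layer(hidden, 1, rng)]


def raw_anomaly_probability(joint_repr: Sequence[float],
                            consistency: float,
                            layers: List[Layer]) -> float:
    """Concatenate the inputs, apply ReLU hidden layers, then a sigmoid."""
    x = list(joint_repr) + [consistency]
    h = relu(dense(x, layers[0]))
    h = relu(dense(h, layers[1]))
    return sigmoid(dense(h, layers[2])[0])


def binary_cross_entropy(p: float, y: float, eps: float = 1e-12) -> float:
    """Loss for one sample with label y in {0, 1}."""
    p = min(max(p, eps), 1.0 - eps)
    return -(y * math.log(p) + (1.0 - y) * math.log(1.0 - p))
```

With untrained weights the output is not meaningful; the sketch only shows how the concatenated feature vector flows to a single probability in the 0-1 range.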

[0168] For example, a candidate who answers "I am very willing to accept this challenge" has a cross-modal joint representation that includes positive language content, slightly trembling speech features, and a fleeting asymmetrical micro-expression at the corners of the mouth, while the modality consistency score is only 0.62. These features can be concatenated and input into an anomaly detection classifier. After layer-by-layer calculation by a three-layer neural network, the final Sigmoid output layer gives an original anomaly probability of 0.85.

[0169] This high probability value indicates that the system judges the segment to have a significant pattern of inconsistency between words and actions, suggesting that the candidate may be putting on a brave face or concealing their true thoughts, providing an initial basis for subsequent environmental causal intervention verification.

[0170] Step 403: Obtain the environmental context vector. Through the environmental interference prediction network, obtain the environmental anomaly probability. The environmental context vector contains illumination type encoding and pose offset encoding.

[0171] The environmental context vector is a set of quantified features obtained from the causal illumination adaptive cascade architecture, used to characterize the current interview environment. The environmental interference prediction network is a small neural network based on a multilayer perceptron. It takes the environmental context vector as input and outputs the environmental anomaly probability after forward propagation. The environmental anomaly probability is the probability that a detected cross-modal anomaly is caused by environmental factors (such as non-ideal lighting or physical posture deviation); it ranges from 0 to 1 and is used for environmental calibration of the original anomaly probability. The illumination type encoding is a numerical representation obtained by classifying the lighting conditions of video frames: by analyzing the amplitude component patterns output by the frequency domain decomposition module, the lighting environment is classified into types such as top lighting, side lighting, backlighting, or uniform lighting, and assigned a corresponding encoding value. The pose offset encoding is a quantified feature representation of the low-frequency components of the candidate's head pose caused by physical factors, extracted from the low-frequency physical pose components output by the cognitive-physical decoupling module.

[0172] In some implementations, environmental information is collected in parallel from multiple intermediate modules of a causal illumination adaptive cascade architecture. Specifically, in the frequency domain decomposition module, the illumination distribution pattern obtained from amplitude component analysis is fed into a pre-trained illumination classification head. This classification head then uses Softmax to output the probability distribution of the current frame belonging to one of four types: top lighting, side lighting, backlighting, or uniform lighting. The type code corresponding to the highest probability is taken as the illumination type code.
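The final step of the illumination classification head, Softmax followed by selection of the most probable type, can be sketched as follows. Using the class index as the type code is an assumption, as are the label order in `LIGHTING_TYPES` and the example logits.

```python
import math
from typing import List, Sequence, Tuple

# Assumed label order; the index is used as the illumination type code.
LIGHTING_TYPES = ("top", "side", "back", "uniform")


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def illumination_type_code(logits: Sequence[float]) -> Tuple[int, List[float]]:
    """Return (code of the most probable lighting type, full distribution)."""
    probs = softmax(logits)
    code = max(range(len(probs)), key=probs.__getitem__)
    return code, probs
```

For example, `illumination_type_code([2.0, 0.1, -1.0, 0.3])` selects index 0, which corresponds to top lighting under the assumed label order.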

[0173] Meanwhile, the low-frequency physical pose components separated by a low-pass filter are obtained from the cognitive-physical decoupling module. Statistics of these components (such as the mean and standard deviation) are computed within a fixed time window and then encoded into a fixed-dimensional pose offset encoding vector through a linear mapping layer.
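The windowed statistics and linear mapping can be sketched as below. The sketch assumes a single scalar low-frequency component per frame, non-overlapping windows, and a hand-written weight matrix; the function names and dimensions are hypothetical.

```python
import statistics
from typing import List, Sequence, Tuple


def window_stats(component: Sequence[float], window: int) -> List[Tuple[float, float]]:
    """(mean, population standard deviation) for each non-overlapping window."""
    stats = []
    for start in range(0, len(component) - window + 1, window):
        chunk = component[start:start + window]
        stats.append((statistics.fmean(chunk), statistics.pstdev(chunk)))
    return stats


def pose_offset_encoding(
    stat: Sequence[float],
    weights: Sequence[Sequence[float]],
    bias: Sequence[float],
) -> List[float]:
    """Linear mapping from window statistics to a fixed-dimensional vector."""
    return [
        sum(w * s for w, s in zip(row, stat)) + b
        for row, b in zip(weights, bias)
    ]
```

A stable sitting posture yields statistics near zero, so the resulting encoding is dominated by the bias term of the mapping layer.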

[0174] Subsequently, the illumination type encoding and pose offset encoding are concatenated to form the complete environmental context vector. This vector is then input into the environmental interference prediction network, which employs a two-layer MLP architecture (64-dimensional hidden layers with ReLU activation function). The output layer uses a Sigmoid activation function to map the environmental anomaly probability to the range 0 to 1.

[0175] The network is trained through supervised learning. The training data consists of manually labeled samples of whether the abnormal phenomenon is caused by the environment (such as misjudgment of facial expression caused by top lighting and shadow, misjudgment of facial asymmetry caused by side lighting, etc.). The loss function is binary cross-entropy, the optimizer is AdamW, and the learning rate is set to 5e-5. Through backpropagation, the network learns to predict the probability of environmental factors causing abnormalities from the environmental context vector.

[0176] For example, a candidate undergoing an interview in a dormitory is under overhead lighting. After frequency domain decomposition, the amplitude components of his video frame show deep shadows around his eyes and under his nose. The lighting classification head outputs an overhead lighting type code based on the amplitude pattern. Simultaneously, the cognitive-physical decoupling module detects that the candidate's posture is stable, with low-frequency physical posture components close to zero, and generates a corresponding posture offset code.

[0177] The illumination type encoding (top lighting) and pose offset encoding (stable sitting posture) are concatenated and then input into the environmental interference prediction network, which outputs an environmental anomaly probability of 0.85.

[0178] This high probability value indicates that there is an 85% chance that abnormal expressions such as "frowning" and "nervousness" that may be detected in the current segment are caused by shadow artifacts created by top lighting, rather than reflecting the candidate's true emotional state. This provides a key basis for attenuation in subsequent environmental calibration.

[0179] Step 404: Perform environmental calibration calculation on the original anomaly probability based on the environmental anomaly probability, and output the implicit correlation anomaly score after environmental calibration.

[0180] The environmental calibration process specifically includes: obtaining the original anomaly probability $P_{\text{raw}}$ and the environmental anomaly probability $P_{\text{env}}$, and obtaining the latent association anomaly score using the environmental calibration function, which is:

$$S_{\text{anom}} = P_{\text{raw}} \cdot \left(1 - \lambda \cdot P_{\text{env}}\right)$$

where $S_{\text{anom}}$ is the latent association anomaly score; $P_{\text{raw}}$ is the original anomaly probability, with a value range of [0, 1]; $P_{\text{env}}$ is the environmental anomaly probability, with a value range of [0, 1]; and $\lambda$ is a preset environmental calibration coefficient used to control the degree to which environmental factors attenuate the anomaly score, with a value range of [0, 1].

[0181] For example, when a candidate is interviewed under overhead lighting, the anomaly detection classifier outputs a raw anomaly probability of 0.90 (indicating a high degree of anomaly), but the environmental interference prediction network outputs an environmental anomaly probability as high as 0.85 (indicating that the anomaly is likely caused by overhead lighting shadows).

[0182] The preset environmental calibration coefficient of 0.8 is then applied, and the environmental calibration calculation is performed: $0.90 \times (1 - 0.8 \times 0.85) = 0.288$. The final output latent association anomaly score is approximately 0.29, significantly lower than the original 0.90.

[0183] This low score indicates that the detected "abnormality" is mainly due to environmental factors, and the candidate's actual behavior is highly consistent, avoiding the erroneous conclusion that the candidate's words and actions are inconsistent due to lighting issues.
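The worked example can be reproduced with a short function. The multiplicative form used here, raw probability times one minus the coefficient times the environmental probability, is inferred from the example's numbers (0.90, 0.85, 0.8, and 0.288); the function and parameter names are hypothetical.

```python
def calibrate_anomaly_score(raw_prob: float, env_prob: float,
                            coefficient: float = 0.8) -> float:
    """Attenuate the original anomaly probability by the environmental one.

    All three inputs are expected to lie in [0, 1], so the result does too.
    """
    for name, value in (("raw_prob", raw_prob),
                        ("env_prob", env_prob),
                        ("coefficient", coefficient)):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be in [0, 1], got {value}")
    return raw_prob * (1.0 - coefficient * env_prob)


# Overhead-lighting example: the score drops from 0.90 to about 0.29.
score = calibrate_anomaly_score(0.90, 0.85, 0.8)
```

When the environmental anomaly probability is 0, the original anomaly probability passes through unchanged; a larger coefficient makes the attenuation stronger.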

[0184] Based on the above technical solution, a multi-head cross-attention mechanism is used to align language, speech, facial, and cognitive state features, generating a joint representation and consistency score that includes cross-modal temporal correlations. This solves the technical problems of traditional methods that analyze each modality independently and lose interaction information. Simultaneously, an environmental context vector containing information such as illumination and pose is introduced, and the probability of environmental anomalies is predicted, thereby enabling the tracing of the causes of anomalies. Furthermore, this probability is used to calibrate the original behavioral anomaly probability detected based on the joint representation, filtering out feature contamination caused by environmental interference, and finally outputting a clean, environmentally calibrated anomaly score that represents the true credibility of the candidate. This allows for the construction of a complete credibility assessment chain from multimodal fusion and anomaly identification to environmental decoupling calibration, improving the objectivity and accuracy of the assessment results.

[0185] In the embodiments of this application, the artificial intelligence-based structured interview intelligent training and evaluation system can be divided into functional units according to the above method example. For example, each function can be assigned to a separate functional unit, or two or more functions can be integrated into the same processing unit. The integrated unit can be implemented in hardware or as a software functional unit. It should be noted that the unit division in the embodiments of this application is illustrative and represents only one logical functional division. In actual implementation, there may be other division methods.

[0186] When integrated units are used, Figure 5 illustrates the AI-based structured interview intelligent training and evaluation system of the above embodiments, which includes the following modules. The job knowledge graph construction module is used to build a job knowledge graph, which includes job type, competency, and basic weights.

[0187] The multimodal data acquisition module is used to collect the candidate's target position and multimodal data from the interview process. The multimodal data includes language data, voice data, and video data.

[0188] The feature extraction module is used to extract facial and cognitive state features from video data using a causal illumination adaptive cascade architecture. Based on the job type of the target position, the language data is semantically decoupled and combined with corresponding competencies and basic weights to obtain language features. Speech features are extracted from the speech data. Combined with language, facial, and cognitive state features, a latent association analysis module captures temporal correlations, and environmental causal intervention is used for verification to obtain a latent association anomaly score.

[0189] The comprehensive evaluation module is used to adaptively modulate the basic weights, cognitive state features, and latent association anomaly scores to obtain the evaluation results of the candidates.

[0190] Although this application has been described in conjunction with specific features and embodiments, it is obvious that various modifications and combinations can be made thereto without departing from the spirit and scope of this application. Accordingly, this specification and drawings are merely exemplary illustrations of this application as defined by the appended claims, and are considered to cover any and all modifications, variations, combinations, or equivalents within the scope of this application. Clearly, those skilled in the art can make various alterations and modifications to this application without departing from the spirit and scope of this application. Thus, if such modifications and variations of this application fall within the scope of the claims of this application and their equivalents, this application is also intended to include such modifications and variations.

Claims

1. An artificial intelligence-based structured interview intelligent training evaluation method, characterized in that it includes: Construct a job knowledge graph, which includes job type, competency, and basic weights; Collect the candidate's target position and multimodal data from the interview process, the multimodal data including language data, voice data, and video data; A causal illumination adaptive cascade architecture is used to extract facial features and cognitive state features from the video data; Based on the job type of the target position, the language data is semantically decoupled, and language features are obtained by combining the corresponding competencies and basic weights; Speech features are extracted from the speech data and, combined with the language features, facial features, and cognitive state features, temporal correlations are captured through a latent association analysis module, and the latent association anomaly score is obtained through environmental causal intervention verification; The evaluation results of the candidates are obtained by adaptively modulating the basic weights, cognitive state features, and latent association anomaly scores.

2. The AI-based structured interview intelligent training evaluation method according to claim 1, characterized in that the construction and feature extraction process of the causal illumination adaptive cascade architecture specifically includes: A frequency domain decomposition module is constructed to perform a fast Fourier transform on each frame of the video data, decomposing it into amplitude components and phase components; An illumination mask generation network and an expression mask generation network are constructed, wherein the illumination mask generation network takes the amplitude component as input and outputs an illumination-sensitive channel mask, and the expression mask generation network takes the phase component as input and outputs an expression-sensitive channel mask; A channel selection normalization module is constructed, the input of which is connected to the outputs of the frequency domain decomposition module, the illumination mask generation network, and the expression mask generation network, and which is used to perform channel selection normalization processing on the input feature map according to the illumination-sensitive channel mask and the expression-sensitive channel mask, and output illumination-adaptive features; A DINOv2 teacher network is constructed and frozen, which is used to extract illumination-invariant high-level semantic features of the image as distillation targets; A feature distillation module is constructed, the input of which is connected to the outputs of the channel selection normalization module and the DINOv2 teacher network, and which is used to constrain the student network to align with a pre-determined set of illumination-independent feature channels in the DINOv2 teacher network and output environmentally decoupled facial features; A cognitive-physical decoupling module is constructed to perform head pose estimation and temporal decomposition on the environmentally decoupled facial features, obtain low-frequency components and high-frequency components, and output the cognitive state features of the candidate through a cognitive state classifier based on the statistical features of the high-frequency components.

3. The AI-based structured interview intelligent training evaluation method according to claim 2, characterized in that the operation process of the cognitive-physical decoupling module specifically includes: The 6DRepNet model is used to predict the head pose for each frame of the environmentally decoupled facial features, and the rotation matrix of the head in each frame is output; The rotation matrix sequence is filtered using a low-pass filter to obtain low-frequency and high-frequency components; Statistical features of the high-frequency components within a preset window are extracted and combined with a multilayer perceptron classifier to obtain the cognitive state features of the candidate, the cognitive state features including the probabilities of cognitive states comprising at least a thinking state, a recall state, a stuck state, and a normal state.

4. The structured interview intelligent training and evaluation method based on artificial intelligence according to claim 2, characterized in that the environmentally decoupled facial features specifically include: Feature projection is performed on the expression-sensitive channel mask to generate illumination-independent expression features; The eye gaze features of each frame in the video data are extracted and modulated with the cognitive state probability to obtain cognition-aware eye gaze features; Head features are extracted from each frame of the video data and combined with physical pose mapping features to obtain pose-independent head features.

5. The structured interview intelligent training and evaluation method based on artificial intelligence according to claim 1, characterized in that the process of semantically decoupling the language data and combining it with the corresponding competencies and basic weights to obtain language features specifically includes: The language data is subjected to syntactic analysis, and an industry terminology dictionary is called to perform term matching to identify structured component vectors, slang component vectors, and emotional component vectors; Based on the job type of the target job, the job knowledge graph is queried to obtain the corresponding job perception weight configuration table, which includes the fusion weight coefficients of the structured components, slang components, and emotional components corresponding to the job type; The structured component vector, the slang component vector, and the emotional component vector are weighted and fused according to the fusion weight coefficients to generate a preliminary language feature vector; The preliminary language feature vector is projected onto the space composed of the competencies using a linear transformation, and job-aligned language features are output, each component of which corresponds one-to-one with the competency score data.

6. The structured interview intelligent training and evaluation method based on artificial intelligence according to claim 1, characterized in that the process of capturing temporal correlations through the latent association analysis module and verifying them through environmental causal intervention to obtain the latent association anomaly score specifically includes: The language features, speech features, facial features, and cognitive state features are temporally aligned, and the potential correlations among the modality features are captured through a multi-head cross-attention mechanism to obtain a cross-modal joint representation and a modality consistency score; Anomaly detection and classification are performed on the cross-modal joint representation and modality consistency score to obtain the original anomaly probability; An environmental context vector is obtained, and an environmental interference prediction network is used to obtain the environmental anomaly probability, the environmental context vector including illumination type encoding and pose offset encoding; The environmental anomaly probability is used to perform an environmental calibration operation on the original anomaly probability, and the environmentally calibrated latent association anomaly score is output.

7. The structured interview intelligent training and evaluation method based on artificial intelligence according to claim 6, characterized in that the environmental calibration calculation process specifically includes: obtaining the original anomaly probability $P_{\text{raw}}$ and the environmental anomaly probability $P_{\text{env}}$, and obtaining the latent association anomaly score using an environmental calibration function, which is: $S_{\text{anom}} = P_{\text{raw}} \cdot \left(1 - \lambda \cdot P_{\text{env}}\right)$; where $S_{\text{anom}}$ is the latent association anomaly score; $P_{\text{raw}}$ is the original anomaly probability, with a value range of [0, 1]; $P_{\text{env}}$ is the environmental anomaly probability, with a value range of [0, 1]; and $\lambda$ is a preset environmental calibration coefficient used to control the degree to which environmental factors attenuate the anomaly score, with a value range of [0, 1].

8. The structured interview intelligent training and evaluation method based on artificial intelligence according to claim 1, characterized in that the process of adaptively modulating the basic weights, cognitive state features, and latent association anomaly scores to obtain the evaluation results of the candidates specifically includes: Based on the cognitive state features, the basic weights are dynamically modulated to obtain cognitive modulation weights; The cognitive modulation weights are modulated using the latent association anomaly score to obtain the final evaluation weights; The competency score data is retrieved, and the candidate's total job suitability score is calculated using the final evaluation weights and recorded as the evaluation result.

9. An AI-based intelligent training and evaluation system for structured interviews, characterized in that the system is applied to the method of any one of claims 1-8 and specifically includes: The job knowledge graph construction module is used to construct a job knowledge graph, which includes job type, competency, and basic weights; The multimodal data acquisition module is used to collect the candidate's target position and multimodal data from the interview process, the multimodal data including language data, voice data, and video data; The feature extraction module is used to extract facial features and cognitive state features from the video data by running a causal illumination adaptive cascade architecture; based on the job type of the target job, the language data is semantically decoupled and combined with the corresponding competencies and basic weights to obtain language features; the speech features of the speech data are extracted and, combined with the language features, facial features, and cognitive state features, the temporal correlations are captured by the latent association analysis module, and the latent association anomaly score is obtained through environmental causal intervention verification; The comprehensive evaluation module is used to adaptively modulate the basic weights, cognitive state features, and latent association anomaly scores to obtain the evaluation results of the candidates.

10. The AI-based intelligent training and evaluation system for structured interviews according to claim 9, characterized in that the causal illumination adaptive cascade architecture includes: The frequency domain decomposition module is used to perform a fast Fourier transform on each frame of the video data, decomposing it into amplitude components and phase components; An illumination mask generation network, connected to the frequency domain decomposition module, is used to output an illumination-sensitive channel mask with the amplitude component as input; An expression mask generation network, connected to the frequency domain decomposition module, is used to output an expression-sensitive channel mask with the phase component as input; The channel selection normalization module is connected to the frequency domain decomposition module, the illumination mask generation network, and the expression mask generation network, respectively, and is used to perform channel selection normalization processing on the input feature map according to the illumination-sensitive channel mask and the expression-sensitive channel mask, and output illumination-adaptive features; The DINOv2 teacher network is used to extract illumination-invariant high-level semantic features of the image as the distillation target; The feature distillation module is connected to the channel selection normalization module and the DINOv2 teacher network, respectively, and is used to constrain the student network to align with a pre-determined set of illumination-independent feature channels in the DINOv2 teacher network, and output environmentally decoupled facial features; The cognitive-physical decoupling module, connected to the feature distillation module, is used to perform head pose estimation and temporal decomposition on the environmentally decoupled facial features to obtain low-frequency and high-frequency components, and, based on the statistical features of the high-frequency components, to output the cognitive state features of the candidate through a cognitive state classifier.