Age-friendly video script generation method and system based on multimodal cognition
By constructing a five-dimensional quantitative indicator system and a multimodal deep network model, age-friendly video scripts are automatically generated, solving the problem of the disconnect between professional knowledge and the cognitive adaptation characteristics of the elderly. This achieves efficient and unified video script generation, improving content adaptability and quality.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- 湖南工商大学 (Hunan University of Technology and Business)
- Filing Date
- 2026-03-18
- Publication Date
- 2026-06-19
Figure CN121885231B_ABST (abstract drawing)
Abstract
Description
Technical Field
[0001] This invention relates to the field of data processing technology, and in particular to a method and system for generating age-friendly video scripts based on multimodal cognition.
Background Technology
[0002] Currently, the industry's approach to transforming professional health knowledge into age-friendly video scripts primarily relies on manual translation, supplemented by simple text simplification techniques. On one hand, collaborative teams comprised of geriatric medicine experts, science writers, and content designers manually interpret and rewrite professional health texts such as medical guidelines, academic papers, and disease encyclopedias, translating professional expressions into language easily understood by the elderly. Simultaneously, they manually design the logical structure and conversational style of the video scripts. On the other hand, some platforms employ general text simplification models, such as using pre-trained language models to replace vocabulary and adjust sentence structures in professional health texts, achieving simple popularization. Subsequent script adaptation and optimization are then performed manually. Furthermore, in the production of existing health science content, some platforms conduct simple qualitative evaluations of content in elderly-specific sections, judging whether the content aligns with the reading and comprehension habits of the elderly based on subjective experience. A few studies have attempted basic quantitative analysis of the text simplification effect, but a systematic adaptation evaluation standard has not yet been established.
[0003] Existing practices have revealed several significant shortcomings in practical applications: there is a serious disconnect between professional health knowledge and cognitive suitability characteristics; the subjective nature of manual conversion is strong; and general text simplification models only focus on superficial modifications of a single text modality without tailoring them to the cognitive patterns of the elderly. The generated content still suffers from problems such as dense technical jargon, excessively long logical chains, and too many abstract concepts, making it difficult for the elderly to understand. The manual conversion-based model is extremely inefficient, with the production cycle for a single age-friendly video script taking several weeks. Furthermore, the varying levels of expertise and experience among different production teams result in inconsistent script quality, making it difficult to establish unified production standards. There is a lack of a quantifiable professional knowledge-age cognitive suitability assessment system. Existing evaluations are mostly qualitative descriptions, lacking calculable quantitative indicators for key suitability dimensions such as conceptual complexity and language affinity. This makes it impossible to objectively measure the conversion effect and to systematically optimize the script generation process.
[0004] Therefore, how to construct a quantitative professional knowledge-age cognitive adaptation assessment system, integrate multimodal features to achieve the automated generation of professional health knowledge into video scripts that conform to cognitive adaptation features, and at the same time take into account the medical accuracy of the content and the cognitive adaptability of the elderly, has become an urgent problem to be solved.
Summary of the Invention
[0005] The main purpose of this application is to provide a method and system for generating age-friendly video scripts based on multimodal cognition, aiming to solve the technical problem of how to achieve the automated generation of video scripts that conform to cognitive adaptation characteristics.
[0006] To achieve the above objectives, this application proposes an age-friendly video script generation method based on multimodal cognition, comprising:
[0007] A professional knowledge-age cognitive adaptation assessment system is constructed and a gold standard sample library is established to obtain sample data labeled with adaptation scores. The adaptation assessment system includes a five-dimensional quantitative indicator system and a method for calculating the total adaptation score. The five-dimensional quantitative indicator system includes a conceptual complexity dimension score, a language affinity dimension score, a logical clarity dimension score, a life relevance dimension score, and a memory friendliness dimension score.
[0008] Based on the sample data, text modal features and cognitive adaptation features are extracted, wherein the text modal features include medical ontology features and linguistic features, and the cognitive adaptation features include colloquialization conversion features, logical conversion features, and metaphor system features;
[0009] The text modal features and the cognitive adaptation features are input into a preset multimodal deep network model to obtain the adaptation score prediction value and the age-appropriate video script. The preset multimodal deep network model includes a text encoder, a cognitive feature encoder, a feature fusion layer, a score prediction head, and a script generation decoder. The text encoder adopts a 3-layer first Transformer encoder, and each first Transformer encoder uses 8 attention heads and a feedforward neural network. The cognitive feature encoder adopts a 2-layer second Transformer encoder, and each second Transformer encoder uses 4 attention heads and a feedforward neural network. The score prediction head adopts a 2-layer fully connected network. The script generation decoder adopts a constraint search strategy.
[0010] Based on the comparison between the predicted adaptation score and the preset adaptation score threshold, the age-friendly video script is iteratively optimized or output to obtain the target age-friendly video script.
[0011] In one embodiment, the step of constructing a professional knowledge-age cognitive adaptation assessment system and establishing a gold standard sample library to obtain sample data labeled with adaptation scores includes:
[0012] Define the mathematical expression of the five-dimensional quantitative indicator system;
[0013] The analytic hierarchy process (AHP) is used to determine the preset dimension weight vector of the five-dimensional quantitative indicator system, and the total adaptation score is calculated based on the preset dimension weight vector and the score vector of each dimension;
[0014] An annotation team composed of geriatric medicine experts, linguistics experts, and elderly user representatives is formed to conduct five-dimensional scoring and total adaptation score annotation on a preset number of sample pairs of original professional texts and manually converted scripts to obtain initial annotation results;
[0015] The intraclass correlation coefficient (ICC) of the initial annotation results is calculated. When the ICC is less than a preset consistency threshold, the corresponding initial annotation results are removed, the remaining initial annotation results are integrated, and the gold standard sample library is established to obtain the sample data.
[0016] In one embodiment, the step of extracting text modality features and cognitive adaptation features based on the sample data includes:
[0017] Medical entity recognition is performed on the original professional health text in the sample data to extract medical entities such as disease name, symptom description, treatment plan and medication guidelines, and medical entity relationship triples are constructed to obtain medical ontological features;
[0018] Word segmentation, part-of-speech tagging, and syntactic analysis are performed on the original professional health text to extract the density of professional terms, the average sentence length, the frequency of causal words, and the proportion of passive voice, thus obtaining language features;
[0019] By concatenating the medical ontology features with the language features, text modal features are obtained;
[0020] Colloquial pattern matching is performed on the manually converted scripts in the sample data to extract the proportion of modal particles, the proportion of short sentences, and the proportion of everyday expressions, thus obtaining colloquialization conversion features;
[0021] The logical structure of the manually converted script is analyzed to extract the proportion of step-by-step expressions, the proportion of general-to-specific structure, and the proportion of key information presented at the beginning, thus obtaining logical conversion features;
[0022] Metaphor expression recognition is performed on the manually converted script, and the mapping relationship between medical concepts and life scenarios and the frequency of metaphors are extracted to obtain the metaphor system features;
[0023] The colloquialization conversion feature, the logical conversion feature, and the metaphor system feature are concatenated to obtain the cognitive adaptation feature.
[0024] In one embodiment, before the step of inputting the text modality features and the cognitive adaptation features into a preset multimodal deep network model to obtain the adaptation score prediction value and the age-appropriate video script, the following steps are included:
[0025] The sample data of a preset number of labeled samples in the gold standard sample library are retrieved as the training set and the validation set, and the validation set includes the real results labeled in the gold standard sample library;
[0026] Construct an initial multimodal deep network model;
[0027] Design a multi-task loss function, which is a weighted sum of the mean squared error of the total adaptation score, the mean absolute error of the dimension scores, the fluency penalty term, and the cognitive constraint regularization term.
[0028] In the first training phase, the initial multimodal deep network model is trained for a preset first number of iterations on professional health texts to which synonym replacement, sentence rearrangement, and professional terminology replacement have been applied, to obtain the first optimization model.
[0029] In the second training phase, which starts from the first optimization model, a reward function that integrates the adaptation score, semantic similarity, and generation diversity is designed, and the first optimization model is trained with a preset batch size and a preset number of iterations to obtain the second optimization model.
[0030] The training set is input into the second optimization model to obtain the prediction result;
[0031] The loss value is calculated based on the multi-task loss function, the prediction result, and the actual result.
[0032] The above training is repeated until the loss value converges or the preset maximum number of iterations is reached, and the current second optimization model is then used as the preset multimodal deep network model.
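The multi-task loss described above can be sketched for a single sample as follows. This is only a minimal illustration: the weight values, the `LossWeights` name, and the way the fluency penalty and cognitive regularization term are supplied as precomputed numbers are assumptions, since the text states only that the loss is a weighted sum of the four terms.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class LossWeights:
    # Hypothetical weights; the source only says the loss is a weighted sum.
    total_mse: float = 1.0
    dim_mae: float = 0.5
    fluency: float = 0.1
    cognitive: float = 0.1


def multi_task_loss(
    pred_total: float,
    true_total: float,
    pred_dims: Sequence[float],
    true_dims: Sequence[float],
    fluency_penalty: float,
    cognitive_reg: float,
    w: LossWeights = LossWeights(),
) -> float:
    """Weighted sum of the four loss terms for one sample."""
    if len(pred_dims) != len(true_dims) or not true_dims:
        raise ValueError("dimension score vectors must be non-empty and equal in length")
    # Squared error of the total adaptation score (averaged over a batch this is the MSE).
    se_total = (pred_total - true_total) ** 2
    # Mean absolute error over the dimension scores.
    mae_dims = sum(abs(p - t) for p, t in zip(pred_dims, true_dims)) / len(true_dims)
    return (
        w.total_mse * se_total
        + w.dim_mae * mae_dims
        + w.fluency * fluency_penalty
        + w.cognitive * cognitive_reg
    )
```

In a real training loop these terms would be computed on tensors and averaged over a batch; the scalar form here only shows how the terms combine.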
[0033] In one embodiment, the step of inputting the text modality features and the cognitive adaptation features into a preset multimodal deep network model to obtain the adaptation score prediction value and the age-appropriate video script includes:
[0034] The text modal features are input into a text encoder, and deep semantic representations are extracted through a multi-layer self-attention mechanism and a feedforward neural network to obtain text encoding features;
[0035] The cognitive adaptation features are input into the cognitive feature encoder, and the cognitive pattern representation is extracted through the Transformer architecture to obtain the cognitive encoding features;
[0036] The text encoding features and the cognitive encoding features are input into the feature fusion layer for cross-modal space alignment and fusion processing to obtain multimodal fusion features;
[0037] The multimodal fusion features are input into the scoring prediction head, and the five-dimensional scores and the total adaptation score are predicted by regression through a two-layer fully connected network to obtain the predicted adaptation score.
[0038] The multimodal fusion features are input into the script generation decoder, and an age-appropriate video script is generated through a constraint search strategy. The constraints of the constraint search include sentence length constraints, terminology constraints, repetition constraints, and metaphor constraints.
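The text names four constraint types for the constrained search but does not define them. The sketch below shows one way a candidate sentence could be checked during decoding. Every rule here is an assumption made for illustration: the character limit, substring matching for terms, exact-match repetition, and an approved metaphor set are all hypothetical.

```python
from typing import Iterable, List, Set


def satisfies_constraints(
    sentence: str,
    previous: List[str],
    banned_terms: Iterable[str],
    allowed_metaphors: Set[str],
    used_metaphors: Iterable[str],
    max_len: int = 15,
) -> bool:
    """Return True if a candidate sentence passes all four (assumed) constraints."""
    # Sentence length constraint: character count must not exceed max_len.
    if len(sentence) > max_len:
        return False
    # Terminology constraint: reject sentences containing listed professional terms.
    if any(term in sentence for term in banned_terms):
        return False
    # Repetition constraint: reject verbatim repeats of earlier sentences.
    if sentence in previous:
        return False
    # Metaphor constraint: every metaphor used must come from the approved set.
    return all(m in allowed_metaphors for m in used_metaphors)
```

A decoder could call such a check on each beam candidate and discard those that fail, which is one common way to realize a constrained search.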
[0039] The step of inputting the text encoding features and the cognitive encoding features into the feature fusion layer for cross-modal space alignment and fusion processing to obtain multimodal fused features includes:
[0040] Obtain the first dimension parameter of the text encoding feature and the second dimension parameter of the cognitive encoding feature, and project the text encoding feature and the cognitive encoding feature onto a preset unified dimension space through linear transformation to obtain the first feature and the second feature;
[0041] Calculate the cosine similarity between the first feature and the second feature, and maximize the mutual information between the first feature and the second feature under the same sample using InfoNCE loss to obtain the semantically aligned first feature and the semantically aligned second feature.
[0042] The semantically aligned first feature is used as the query vector, and the semantically aligned second feature is used as the key vector and value vector, respectively. These are then input into a cross-attention mechanism for feature interaction calculation to obtain the output feature.
[0043] The output features are normalized and dimensionality compressed to obtain the initial fused features;
[0044] The initial fused features are input into a feedforward neural network for feature depth extraction, which strengthens the correlation representation between features and obtains multimodal fused features.
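The alignment step can be illustrated with a small, dependency-free sketch of an InfoNCE loss over cosine similarities. It assumes that row i of each feature list comes from the same sample (the positive pair), that all vectors are non-zero, and that a temperature of 0.1 is used. The temperature and the one-directional form are assumptions, not details given in the text.

```python
import math
from typing import List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def info_nce(
    first: List[Sequence[float]],
    second: List[Sequence[float]],
    temperature: float = 0.1,
) -> float:
    """Mean InfoNCE loss; first[i] and second[i] form the positive pair."""
    n = len(first)
    loss = 0.0
    for i in range(n):
        logits = [cosine(first[i], second[j]) / temperature for j in range(n)]
        # Numerically stable log-sum-exp over all candidates in the batch.
        peak = max(logits)
        log_denominator = peak + math.log(sum(math.exp(v - peak) for v in logits))
        loss += log_denominator - logits[i]
    return loss / n
```

Minimizing this loss pulls features of the same sample together and pushes features of different samples apart, which is how InfoNCE serves as a lower-bound objective for mutual information.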
[0045] In one embodiment, the step of iteratively optimizing or outputting the age-appropriate video script based on the comparison between the predicted adaptation score and the preset adaptation score threshold to obtain the target age-appropriate video script includes:
[0046] The predicted adaptation score is compared with the preset adaptation score threshold to obtain the comparison result;
[0047] When the comparison result is that the predicted adaptation score is greater than or equal to the preset adaptation score threshold, visual cue markers and virtual character parameters are added to the age-friendly video script to obtain the first target age-friendly video script. The visual cue markers are chart insertion position indicators, and the virtual character parameters include facial expression parameters, speech rate parameters, and clothing parameters.
[0048] When the comparison result is that the predicted adaptation score is less than the preset adaptation score threshold, the target optimization dimension is identified based on the low-scoring dimension among the five dimensions, and the cognitive adaptation feature weights corresponding to the target optimization dimension are adjusted to obtain the adjusted cognitive adaptation features.
[0049] The adjusted cognitive adaptation features and the text modality features are then re-aligned and fused across modal space to obtain optimized fused features;
[0050] The optimized fusion features are input into the preset multimodal deep network model to regenerate the age-appropriate video script, thus obtaining the iterative optimization script;
[0051] Regeneration is repeated until the predicted adaptation score of the iterative optimization script is greater than or equal to the preset adaptation score threshold or the number of iterations reaches the preset maximum number of iterations, and the resulting iterative optimization script is used as the second target age-friendly video script.
[0052] The first target age-friendly video script or the second target age-friendly video script is used as the target age-friendly video script.
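The threshold comparison and iterative optimization described in this embodiment amount to a bounded control loop, sketched below. The `generate` callable, the multiplicative `boost` factor, and the keying of feature weights by dimension name are hypothetical; the text states only that the weights for the low-scoring dimension are adjusted and the script is regenerated.

```python
from typing import Callable, Dict, Tuple

Scores = Dict[str, float]


def generate_until_adapted(
    generate: Callable[[Dict[str, float]], Tuple[str, float, Scores]],
    feature_weights: Dict[str, float],
    threshold: float,
    max_iters: int,
    boost: float = 1.2,
) -> Tuple[str, float, int]:
    """Regenerate until the predicted score reaches the threshold or the
    iteration budget is used up. Returns (script, score, iterations)."""
    weights = dict(feature_weights)
    script, total, dims = generate(weights)
    iters = 0
    while total < threshold and iters < max_iters:
        # Pick the lowest-scoring dimension and raise its feature weight.
        target = min(dims, key=dims.get)
        weights[target] = weights.get(target, 1.0) * boost
        script, total, dims = generate(weights)
        iters += 1
    return script, total, iters


def fake_generate(w: Dict[str, float]) -> Tuple[str, float, Scores]:
    """Toy stand-in for the model: the logic score grows with its weight."""
    logic = min(1.0, 0.4 * w["logic"])
    dims = {"logic": logic, "language": 0.9}
    return "script", (logic + 0.9) / 2, dims
```

With the toy generator, `generate_until_adapted(fake_generate, {"logic": 1.0, "language": 1.0}, 0.7, 10)` stops once the averaged score reaches the threshold, and the iteration cap guarantees termination when the threshold is unreachable.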
[0053] Furthermore, to achieve the above objectives, this application also proposes an age-friendly video script generation system based on multimodal cognition, wherein the age-friendly video script generation system based on multimodal cognition includes:
[0054] The acquisition module is used to construct a professional knowledge-age cognitive adaptation assessment system and establish a gold standard sample library to obtain sample data labeled with adaptation scores. The adaptation assessment system includes a five-dimensional quantitative indicator system and a method for calculating the total adaptation score. The five-dimensional quantitative indicator system includes a conceptual complexity dimension score, a language affinity dimension score, a logical clarity dimension score, a life relevance dimension score, and a memory friendliness dimension score.
[0055] The feature extraction module is used to extract text modality features and cognitive adaptation features based on the sample data, wherein the text modality features include medical ontology features and language features, and the cognitive adaptation features include colloquialization conversion features, logical conversion features and metaphor system features;
[0056] The prediction module is used to input the text modal features and the cognitive adaptation features into a preset multimodal deep network model to obtain the adaptation score prediction value and the age-appropriate video script. The preset multimodal deep network model includes a text encoder, a cognitive feature encoder, a feature fusion layer, a score prediction head, and a script generation decoder. The text encoder adopts a 3-layer first Transformer encoder, and each first Transformer encoder uses 8 attention heads and a feedforward neural network. The cognitive feature encoder adopts a 2-layer second Transformer encoder, and each second Transformer encoder uses 4 attention heads and a feedforward neural network. The score prediction head adopts a 2-layer fully connected network. The script generation decoder adopts a constraint search strategy.
[0057] The results module is used to iteratively optimize or output the age-friendly video script based on the comparison between the predicted adaptation score and the preset adaptation score threshold, so as to obtain the target age-friendly video script.
[0058] In one embodiment, the acquisition module is further configured to: define the mathematical expression of the five-dimensional quantitative indicator system; determine the preset dimension weight vector of the five-dimensional quantitative indicator system using the analytic hierarchy process (AHP); calculate the total adaptation score based on the preset dimension weight vector and the score vector of each dimension; assemble an annotation team composed of geriatric medicine experts, linguistics experts, and elderly user representatives to perform five-dimensional scoring and total adaptation score annotation on a preset number of sample pairs of original professional texts and manually converted scripts to obtain initial annotation results; calculate the intraclass correlation coefficient (ICC) of the initial annotation results; when the ICC is less than a preset consistency threshold, remove the corresponding initial annotation results; integrate the remaining initial annotation results; establish the gold standard sample library; and obtain the sample data.
[0059] In one embodiment, the prediction module is further configured to: input the text modal features into a text encoder, extract deep semantic representations through a multi-layer self-attention mechanism and a feedforward neural network to obtain text-encoded features; input the cognitive adaptation features into a cognitive feature encoder, extract cognitive pattern representations through a Transformer architecture to obtain cognitive-encoded features; input the text-encoded features and the cognitive-encoded features into a feature fusion layer for cross-modal space alignment and fusion processing to obtain multimodal fusion features; input the multimodal fusion features into a scoring prediction head, regress and predict five-dimensional scores and total adaptation scores through a two-layer fully connected network to obtain a predicted adaptation score; and input the multimodal fusion features into a script generation decoder, generate an age-appropriate video script through a constraint search strategy, wherein the constraints of the constraint search include sentence length constraints, terminology constraints, repetition constraints, and metaphor constraints.
[0060] This application constructs a professional knowledge-age cognitive adaptation assessment system with five-dimensional quantitative indicators and establishes a gold standard sample library. Then, it extracts text and cognitive adaptation multimodal features and inputs them into a preset multimodal deep network to generate adaptation scores and scripts. Finally, it iteratively optimizes or outputs target scripts based on score thresholds, realizing the automated and standardized generation of age-friendly video scripts. Multimodal fusion and constrained generation take into account both medical accuracy and cognitive adaptation features, improving the efficiency of script generation and content adaptation for elderly users.
Attached Figure Description
[0061] To more clearly illustrate the technical solutions in the embodiments of this application or the prior art, the drawings used in the description of the embodiments or the prior art will be briefly introduced below. Obviously, for those skilled in the art, other drawings can be obtained based on these drawings without creative effort.
[0062] Figure 1 is a flowchart illustrating the first embodiment of the age-friendly video script generation method based on multimodal cognition in this application;
[0063] Figure 2 is a flowchart illustrating the second embodiment of the age-friendly video script generation method based on multimodal cognition in this application;
[0064] Figure 3 is a schematic diagram of the module structure of the age-friendly video script generation system based on multimodal cognition in this application;
[0065] Figure 4 is a schematic diagram of the device structure of the hardware operating environment involved in the age-friendly video script generation method based on multimodal cognition in the embodiments of this application.
[0066] The purpose, features, and advantages of this application will be further explained in conjunction with the embodiments and with reference to the accompanying drawings.
Detailed Implementation
[0067] It should be understood that the specific embodiments described herein are merely illustrative of the technical solutions of this application and are not intended to limit this application.
[0068] To better understand the technical solution of this application, a detailed description will be provided below in conjunction with the accompanying drawings and specific implementation methods.
[0071] Based on the above, this application provides a method for generating age-friendly video scripts based on multimodal cognition. Referring to Figure 1, Figure 1 is a flowchart illustrating the first embodiment of the age-friendly video script generation method based on multimodal cognition in this application.
[0072] In this embodiment, the age-friendly video script generation method based on multimodal cognition includes steps S10 to S40:
[0073] Step S10: Construct a professional knowledge-age cognitive adaptation assessment system and establish a gold standard sample library to obtain sample data labeled with adaptation scores.
[0074] It should be noted that the adaptation assessment system includes a five-dimensional quantitative indicator system and a method for calculating the total adaptation score. The five-dimensional quantitative indicator system includes scores for conceptual complexity, language affinity, logical clarity, relevance to daily life, and memory friendliness.
[0075] Further, step S10 includes the following. First, the mathematical expression of the five-dimensional quantitative indicator system is defined. The conceptual complexity dimension score is calculated from the proportion of technical terms and the proportion of abstract concepts. The formula for the conceptual complexity dimension score is:
[0076]
[0077] Technical terms are professional vocabulary with specific meanings in the medical and health field. They usually come from medical guidelines, academic papers, or professional textbooks, and are difficult for ordinary elderly users to understand directly. Technical terms are identified by matching against the UMLS medical terminology database: the words in the text are compared with UMLS entries to determine the number of technical terms in the text. Abstract concepts are concepts that do not correspond directly to specific things but summarize the essential characteristics of a class of things, such as disease names like "diabetes" and "hypertension," or pathological descriptions like "metabolic disorder" and "decreased immune function." Abstract concepts are determined using a BERT abstraction classification model, a pre-trained language model based on the Transformer architecture that, after specialized training, quantitatively evaluates the degree of abstraction of concepts in text. The trained model scores the abstraction level of each concept in the text; when the abstraction score exceeds a preset threshold, the concept is considered abstract. In this embodiment, abstract concepts are determined by setting the threshold of the BERT abstraction classification model to ...
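The two inputs to the conceptual complexity score can be sketched as simple ratios. The sketch assumes a pre-tokenized text, a plain set standing in for the UMLS lookup, and a caller-supplied `abstraction_score` function standing in for the BERT abstraction classifier. Using the total token count as the denominator is also an assumption, and the threshold is left as a parameter because its value is not given here.

```python
from typing import Callable, Iterable, Sequence, Tuple


def concept_complexity_inputs(
    tokens: Sequence[str],
    term_lexicon: Iterable[str],
    abstraction_score: Callable[[str], float],
    threshold: float,
) -> Tuple[float, float]:
    """Return (technical_term_ratio, abstract_concept_ratio) for a tokenized text."""
    if not tokens:
        return 0.0, 0.0
    lexicon = set(term_lexicon)
    # Count tokens found in the (stand-in) terminology lexicon.
    n_terms = sum(1 for t in tokens if t in lexicon)
    # Count tokens whose abstraction score exceeds the threshold.
    n_abstract = sum(1 for t in tokens if abstraction_score(t) > threshold)
    return n_terms / len(tokens), n_abstract / len(tokens)
```

How the two ratios are combined into the final dimension score depends on the formula in the original filing, so the sketch stops at the ratios.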
[0078] The language affinity dimension score is calculated as a weighted combination of the colloquialism score, the dialect suitability, and the speech rate suitability. The formula for the language affinity dimension score is:
[0079]
[0080] The colloquialism score is calculated by matching against an n-gram colloquial pattern library (containing more than 2,000 n-grams from the daily conversations of the elderly). The dialect suitability is the cosine similarity between the target dialect and the script, computed with Chinese dialect recognition models (such as Cantonese-BERT and Sichuanese-BERT). The speech rate suitability is measured by how well the sentence length distribution matches the optimal auditory sentence length for the elderly (8-15 characters).
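As a rough illustration of the sentence-length component, the sketch below takes the share of sentences whose character count falls inside the 8-15 range. Treating the "matching degree" as this simple in-range fraction, and splitting sentences on a fixed punctuation set, are assumptions.

```python
import re
from typing import List


def split_sentences(script: str) -> List[str]:
    """Split on common Chinese and Western sentence-ending punctuation."""
    parts = re.split(r"[。！？!?.;；]", script)
    return [p.strip() for p in parts if p.strip()]


def sentence_length_fit(script: str, low: int = 8, high: int = 15) -> float:
    """Fraction of sentences whose character length lies in [low, high]."""
    sentences = split_sentences(script)
    if not sentences:
        return 0.0
    in_range = sum(1 for s in sentences if low <= len(s) <= high)
    return in_range / len(sentences)
```

For example, a script with one 9-character sentence and one 3-character sentence would score 0.5 under this measure.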
[0081] The logical clarity dimension score is calculated as a weighted average of causal explicitness and step coherence. The formula for the logical clarity dimension score is:
[0082]
[0083] The explicitness of causality is calculated by the density of causal relation words (such as "because...therefore...", "leading to", "due to"), and the coherence of steps is measured by the entropy value of the Markov chain transition probability of the event time series (the lower the entropy value, the higher the coherence).
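The entropy-based coherence measure can be illustrated with a first-order Markov chain estimated from an event sequence. The sketch assumes events are already extracted as labels and computes the conditional entropy of the next event given the current one, in bits. The choice of a first-order chain and the frequency weighting of rows are assumptions.

```python
import math
from collections import Counter, defaultdict
from typing import Dict, Sequence


def transition_entropy(events: Sequence[str]) -> float:
    """Conditional entropy (bits) of the next event given the current event."""
    counts: Dict[str, Counter] = defaultdict(Counter)
    for cur, nxt in zip(events, events[1:]):
        counts[cur][nxt] += 1
    total = sum(sum(c.values()) for c in counts.values())
    if total == 0:
        return 0.0
    entropy = 0.0
    for nxt_counts in counts.values():
        row_total = sum(nxt_counts.values())
        # Shannon entropy of the transition distribution out of this event.
        row_entropy = -sum(
            (n / row_total) * math.log2(n / row_total) for n in nxt_counts.values()
        )
        # Weight each row by how often its source event occurs.
        entropy += (row_total / total) * row_entropy
    return entropy
```

A script whose steps always follow one another in the same order gives an entropy of zero, while a script that jumps unpredictably between steps gives a higher value, matching the stated rule that lower entropy means higher coherence.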
[0084] The relevance to daily life dimension score is calculated as a weighted combination of metaphor closeness and scene relevance. The formula for the relevance to daily life dimension score is:
[0085]
[0086] The metaphor closeness is calculated from the WordNet semantic distance between the "medical concept" and the "life concept" (the smaller the distance, the higher the closeness). The scene relevance is the matching probability between the script and high-frequency life scenes of the elderly (such as "grocery shopping", "cooking", and "taking a walk"), computed by a scene classification model (such as SceneBERT).
[0087] The memory friendliness dimension score is calculated as a weighted average of key information repetition and visual cue fit. The formula for the memory friendliness dimension score is:
[0088]
[0089] The key information repetition is calculated from the interval distribution of TF-IDF keywords in the script (ideal interval ≤ 3 sentences); the visual cue fit is calculated as the degree of match between the script content and the suggested visual elements (such as "blood glucose curve" and "insulin working diagram"), computed by a text-visual association model (such as CLIP).
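The interval check for key information can be illustrated with a short sketch. It assumes the keyword has already been selected (for example by TF-IDF), measures the interval as the difference between sentence indices, and scores the share of intervals within the ideal limit; all three choices are assumptions for illustration.

```python
from typing import List


def repetition_score(sentences: List[str], keyword: str, max_gap: int = 3) -> float:
    """Share of gaps between consecutive keyword mentions that are <= max_gap sentences."""
    positions = [i for i, s in enumerate(sentences) if keyword in s]
    gaps = [b - a for a, b in zip(positions, positions[1:])]
    if not gaps:
        return 0.0  # keyword appears at most once, so nothing is repeated
    return sum(1 for g in gaps if g <= max_gap) / len(gaps)
```

For a script whose keyword appears in sentences 0, 2, and 7, the gaps are 2 and 5, so half of the repetitions fall within the ideal interval.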
[0090] Next, the Analytic Hierarchy Process (AHP) is used to determine the preset dimension weight vector of the five-dimensional quantitative indicator system, and the total adaptation score is calculated from this preset dimension weight vector and the score vector of the dimensions. Specifically, a judgment matrix construction team composed of geriatric medicine experts, linguistics experts, and elderly user representatives is formed to obtain the preset dimension weight vector. In this embodiment, the preset dimension weight vector $w = (w_1, \dots, w_5)$ was determined through an expert questionnaire (10 geriatric medicine experts, 5 linguistics experts, and 15 elderly user representatives). Combining it with the score vector of the dimensions $s = (s_1, \dots, s_5)$ gives the calculation of the total adaptation score, expressed as follows:
[0091] $$S_{\text{total}} = \sum_{i=1}^{5} w_i \, s_i$$
[0092] where $i$ is the index of the current dimension, ranging over the five core dimensions: conceptual complexity, language affinity, logical clarity, relevance to daily life, and memory friendliness.
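As an illustration of how AHP can turn a judgment matrix into dimension weights, the sketch below uses the row geometric mean method, a common AHP approximation. The patent does not state which AHP variant is used, so this choice, the function names, and the example matrices are assumptions.

```python
import math
from typing import List, Sequence


def ahp_weights(judgment: Sequence[Sequence[float]]) -> List[float]:
    """Row geometric means of the pairwise judgment matrix, normalized to sum to 1."""
    n = len(judgment)
    geo_means = [math.prod(row) ** (1.0 / n) for row in judgment]
    total = sum(geo_means)
    return [g / total for g in geo_means]


def total_adaptation_score(weights: Sequence[float], scores: Sequence[float]) -> float:
    """Weighted sum of the per-dimension scores."""
    return sum(w * s for w, s in zip(weights, scores))
```

If every dimension is judged equally important (a matrix of ones), each of the five weights is 0.2 and the total reduces to the plain average of the five scores. A full AHP workflow would also check the consistency ratio of the judgment matrix, which is omitted here.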
[0093] Then, an annotation team composed of geriatric medicine experts, linguistics experts, and elderly user representatives is formed to annotate a preset number of sample pairs, each consisting of an original professional text and a manually converted script, with five-dimensional scores and a total adaptation score, yielding the initial annotation results. The intraclass correlation coefficient (ICC) of the initial annotation results is then calculated; when the ICC is less than the preset consistency threshold, the corresponding initial annotation results are removed, the remaining initial annotation results are integrated, and a gold standard sample library is established, from which the sample data is obtained. Specifically, the composition of the annotation team is determined first: 3 geriatric medicine experts, 2 linguistics experts, and 5 representatives of elderly users aged 60 to 80 are selected, forming a 10-person annotation team. A training manual for the five-dimensional scoring criteria is developed, and the annotation team is trained to apply the criteria consistently, so that the annotators master the scoring details for the conceptual complexity, language affinity, logical clarity, relevance to daily life, and memory friendliness dimensions, resulting in a trained annotation team. Next, a preset number of original professional health text samples (500 in this embodiment) are retrieved from the medical knowledge base, and the corresponding manually converted scripts are collected, forming the sample pairs. The sample pairs are then assigned to the trained annotation team, and each annotator independently scores the assigned sample pairs on the five dimensions and labels them with a total adaptation score, yielding the initial annotation results. Then, the ICC of the initial annotation results is calculated. The ICC is obtained by dividing the variance between rated objects by the sum of the variance between rated objects, the variance between raters, and the random error variance. Finally, the ICC value is compared with the preset consistency threshold (0.85). If the ICC value is less than the threshold, the corresponding initial annotation result is removed and re-annotated. If the ICC value is greater than or equal to the threshold, the scores of all annotators are integrated, and the average five-dimensional scores and the average total adaptation score of each sample pair are calculated to establish the gold standard sample library. Each entry in the library comprises the original professional text (such as the medical paragraph "Pathological Mechanisms of Type 2 Diabetes"), the manually converted age-friendly script, the 5-dimensional score vector, and the total adaptation score. The sample data is then obtained from the gold standard sample library.
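The variance ratio described above can be sketched with variance components estimated from a two-way layout (rated objects by raters), which corresponds to the single-rater, absolute-agreement form often written ICC(2,1). Treating the patent's coefficient as this particular form is an assumption, as are the function name and the example ratings.

```python
from typing import Sequence


def icc_two_way(ratings: Sequence[Sequence[float]]) -> float:
    """ICC = var_object / (var_object + var_rater + var_error).

    ratings[i][j] is the score given to object i by rater j.
    Requires at least 2 objects, 2 raters, and some variation in the scores.
    """
    n, k = len(ratings), len(ratings[0])
    grand = sum(sum(row) for row in ratings) / (n * k)
    row_means = [sum(row) / k for row in ratings]
    col_means = [sum(ratings[i][j] for i in range(n)) / n for j in range(k)]

    # Mean squares of the two-way ANOVA decomposition.
    ms_objects = k * sum((m - grand) ** 2 for m in row_means) / (n - 1)
    ms_raters = n * sum((m - grand) ** 2 for m in col_means) / (k - 1)
    ss_error = sum((ratings[i][j] - row_means[i] - col_means[j] + grand) ** 2
                   for i in range(n) for j in range(k))
    ms_error = ss_error / ((n - 1) * (k - 1))

    # Variance component estimates.
    var_object = (ms_objects - ms_error) / k
    var_rater = (ms_raters - ms_error) / n
    var_error = ms_error
    return var_object / (var_object + var_rater + var_error)
```

With this form, a systematic offset between raters lowers the coefficient even when their rankings agree, so a value below the 0.85 threshold can indicate either noisy or consistently shifted scoring.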
[0094] Step S20: Extract text modality features and cognitive adaptation features based on sample data.
[0095] It should be noted that text modal features include medical ontological features and linguistic features, while cognitive adaptation features include colloquialization transformation features, logical transformation features, and metaphor system features.
[0096] Further, step S20 includes: performing medical entity recognition on the original professional health text in the sample data, extracting medical entities such as disease names, symptom descriptions, treatment plans, and medication guidelines, constructing medical entity relationship triples, and obtaining medical ontology features; performing word segmentation, part-of-speech tagging, and syntactic analysis on the original professional health text, extracting professional terminology density, average sentence length, frequency of causal relation words, and passive voice proportion, and obtaining language features; concatenating the medical ontology features with the language features to obtain text modality features; performing colloquial pattern matching on the manually converted scripts in the sample data, extracting the proportion of modal particles, short sentences, and everyday expressions, and obtaining colloquial conversion features; performing logical structure analysis on the manually converted scripts, extracting the proportion of step-by-step expressions, total-to-part structure proportion, and key information pre-positioning proportion, and obtaining logical conversion features; performing metaphor expression recognition on the manually converted scripts, extracting the mapping relationship between medical concepts and life scenarios and metaphor frequency, and obtaining metaphor system features; concatenating the colloquial conversion features, logical conversion features, and metaphor system features to obtain cognitive adaptation features.
[0097] Specifically, medical natural language processing tools are first used to perform medical entity recognition on the original professional health text in the sample data, identifying disease name entities, symptom description entities, treatment plan entities, and medication guideline entities. The semantic type labels, entity boundary positions, and inter-entity relationships of the medical entities are extracted, and triples of the form medical concept-relationship-medical concept are constructed. The average depth and density of the medical entities in the semantic hierarchy tree of the Unified Medical Language System (UMLS) are calculated to obtain the medical ontology features. This is done to transform unstructured medical text into a structured semantic representation and to capture the ontological structure of medical knowledge. Secondly, Chinese word segmentation is performed on the original professional health text to obtain a token sequence, part-of-speech tagging is performed on the token sequence to obtain a part-of-speech tag sequence, and syntactic dependency analysis is performed on the part-of-speech tag sequence to obtain a syntactic tree structure. Based on a medical thesaurus, the proportion of professional terms in the total vocabulary is counted to obtain the professional terminology density; the ratio of the number of characters to the number of sentences is calculated to obtain the average sentence length; the frequency of causal conjunctions in the total vocabulary is counted to obtain the causal relation word frequency; and the proportion of passive voice markers in the total vocabulary is counted to obtain the passive voice proportion. The professional terminology density, average sentence length, causal relation word frequency, and passive voice proportion are integrated to obtain the language features. This is done to quantify the linguistic attributes of the text and to identify the linguistic complexity factors that affect language comprehension in the elderly. Then, the medical ontology features and the language features are concatenated end to end along the feature dimension to obtain the text modality features.
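The four language features can be sketched as below. The example assumes the text has already been segmented into tokens by a Chinese word segmenter, and the term, causal word, and passive marker sets are tiny hypothetical stand-ins for the medical thesaurus and marker lists; sentence splitting on punctuation is also a simplification.

```python
import re
from typing import Dict, Sequence, Set


def language_features(text: str, tokens: Sequence[str], medical_terms: Set[str],
                      causal_words: Set[str], passive_markers: Set[str]) -> Dict[str, float]:
    """Terminology density, average sentence length, causal frequency, passive proportion."""
    sentences = [s.strip() for s in re.split(r"[。！？!?.]", text) if s.strip()]
    n_tokens = len(tokens)

    def share(vocabulary: Set[str]) -> float:
        return sum(1 for t in tokens if t in vocabulary) / n_tokens if n_tokens else 0.0

    n_chars = sum(len(s) for s in sentences)
    return {
        "term_density": share(medical_terms),
        "avg_sentence_length": n_chars / len(sentences) if sentences else 0.0,
        "causal_frequency": share(causal_words),
        "passive_proportion": share(passive_markers),
    }


# Illustrative input: two short sentences and their pre-segmented tokens.
example_text = "胰岛素抵抗导致血糖升高。患者被建议运动。"
example_tokens = ["胰岛素抵抗", "导致", "血糖", "升高", "患者", "被", "建议", "运动"]
example = language_features(example_text, example_tokens,
                            medical_terms={"胰岛素抵抗", "血糖"},
                            causal_words={"导致"},
                            passive_markers={"被"})
```

The resulting dictionary could then be flattened into a vector and concatenated with the medical ontology features.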
[0098] Next, colloquial vocabulary pattern matching is performed on the manually converted scripts in the sample data. The proportion of modal particles in the total vocabulary is calculated to obtain the modal particle proportion; the proportion of sentences shorter than a preset short sentence threshold (8 characters in this embodiment) among all sentences is calculated to obtain the short sentence proportion; and the proportion of everyday vocabulary in the total vocabulary is calculated to obtain the everyday expression proportion. The modal particle proportion, short sentence proportion, and everyday expression proportion are integrated to obtain the colloquial conversion features. This is done to quantify how well the script fits the daily spoken language habits of the elderly. Then, the paragraph logical structure of the manually converted scripts is analyzed. Step-by-step expression markers are identified and the proportion of step-by-step paragraphs is calculated to obtain the step-by-step expression proportion; total-to-part structure markers are identified and the proportion of total-to-part structure paragraphs is calculated to obtain the total-to-part structure proportion; and the location of key information is identified and the proportion of key information placed at the beginning of sentences is calculated to obtain the key information pre-positioning proportion. The step-by-step expression proportion, total-to-part structure proportion, and key information pre-positioning proportion are integrated to obtain the logical conversion features. This is done to capture the logical organization of the script and assess how well it fits the cognitive processing preferences of the elderly. Next, a metaphor recognition model is invoked to identify metaphorical expressions in the manually converted scripts. Mapping pairs are extracted, with medical concepts as the target domain and everyday scene concepts as the source domain, and the number of mapping pairs is counted to obtain the metaphor frequency. The mapping pairs and the metaphor frequency are then integrated to obtain the metaphor system features. Finally, the colloquial conversion features, logical conversion features, and metaphor system features are concatenated along the feature dimension to obtain the cognitive adaptation features. This is done to form a unified cognitive representation vector, which is subsequently aligned and fused with the text modality features in the cross-modal space.
[0099] Step S30: Input the text modality features and cognitive adaptation features into the preset multimodal deep network model to obtain the adaptation score prediction value and the age-appropriate video script.
[0100] It should be noted that the preset multimodal deep network model includes a text encoder, a cognitive feature encoder, a feature fusion layer, a rating prediction head, and a script generation decoder. The preset multimodal deep network model will be referred to as the multimodal model of this embodiment. The text encoder adopts a 3-layer first Transformer encoder, and each first Transformer encoder uses 8 attention heads and a feedforward neural network. The cognitive feature encoder adopts a 2-layer second Transformer encoder, and each second Transformer encoder uses 4 attention heads and a feedforward neural network. The rating prediction head adopts a 2-layer fully connected network, and the script generation decoder adopts a constrained search strategy.
[0101] Further, step S30 includes: first, inputting the text modality features into the text encoder and extracting deep semantic representations through a multi-layer self-attention mechanism and feedforward neural networks to obtain the text-encoded features. Specifically, the text modality features are input into the text encoder, and a linear transformation maps the 908-dimensional text modality features to the 512-dimensional encoder input dimension to obtain an initial embedding vector. This is done to match the internal dimensionality of the encoder and prepare for subsequent deep processing. Secondly, the initial embedding vector is input into the first of the 3 Transformer encoder blocks, where a multi-head self-attention mechanism computes the attention weights from the query, key, and value vectors. Eight attention heads operate in parallel on the 512-dimensional input, each head processing a 64-dimensional subspace; the outputs of the eight heads are concatenated and linearly transformed to obtain a 512-dimensional attention output. This is done to capture the global dependencies within the input features and identify the correlations between different feature dimensions. Then, the attention output is added to the initial embedding vector through a residual connection and layer-normalized to obtain a normalized output. The normalized output is input into a feedforward neural network whose first layer expands it to 2048 dimensions and applies a rectified linear unit (ReLU) activation function, and whose second layer compresses it back to 512 dimensions. The feedforward output is added to the normalized output through a residual connection and layer-normalized again to obtain the first-layer encoded features. This is done to enhance the feature representation through nonlinear transformation and to avoid vanishing gradients by means of the residual connections. Next, the first-layer encoded features are passed sequentially through the remaining two Transformer encoder blocks, which repeat the multi-head self-attention, residual connection, layer normalization, feedforward neural network, and residual connection operations to extract progressively more abstract semantic representations, resulting in the third-layer encoded output. This is done to refine the semantic information step by step through deep stacking, forming a hierarchical feature representation. Finally, the third-layer encoded output is used as the text-encoded features.
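The attention, residual, and layer normalization sequence inside one encoder block can be illustrated with a deliberately reduced sketch. It uses a single attention head and identity query, key, and value projections on a tiny input, whereas the embodiment uses 8 heads with learned projections on 512 dimensions; the reduction and all names are assumptions made to keep the example self-contained.

```python
import math
from typing import List

Matrix = List[List[float]]


def softmax(row: List[float]) -> List[float]:
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(x: Matrix) -> Matrix:
    """Scaled dot-product self-attention with identity projections (single head)."""
    d = len(x[0])
    out = []
    for q in x:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in x]
        weights = softmax(scores)
        out.append([sum(weights[j] * x[j][c] for j in range(len(x))) for c in range(d)])
    return out


def layer_norm(vec: List[float], eps: float = 1e-5) -> List[float]:
    mean = sum(vec) / len(vec)
    var = sum((v - mean) ** 2 for v in vec) / len(vec)
    return [(v - mean) / math.sqrt(var + eps) for v in vec]


def attention_sublayer(x: Matrix) -> Matrix:
    """Attention output added to the input (residual), then layer-normalized."""
    attn = self_attention(x)
    return [layer_norm([a + b for a, b in zip(xi, ai)]) for xi, ai in zip(x, attn)]
```

The residual step is an element-wise addition, so the output keeps the input dimensionality; the feedforward sublayer would follow the same add-and-normalize pattern.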
[0102] Next, the cognitive adaptation features are input into the cognitive feature encoder, and cognitive pattern representations are extracted using the Transformer architecture to obtain the cognitive-encoded features. Specifically, the cognitive adaptation features are input into the embedding layer of the cognitive feature encoder, and a linear transformation maps the 205-dimensional cognitive adaptation features to the 256-dimensional encoder input dimension to obtain the initial cognitive embedding vector. This is done to match the internal dimensionality of the cognitive feature encoder and to keep the feature dimension low, which reduces subsequent computational complexity. Secondly, the initial cognitive embedding vector is input into the first of the 2 Transformer encoder blocks, where a multi-head self-attention mechanism computes the attention weights from the query, key, and value vectors. Four attention heads operate in parallel on the 256-dimensional input, each head processing a 64-dimensional subspace; the outputs of the four heads are concatenated and linearly transformed to obtain the 256-dimensional cognitive attention output. This is done to capture the global dependencies within the cognitive adaptation features and identify the association patterns among colloquial expression, logical structure, and the metaphor system. Then, the cognitive attention output is added to the initial cognitive embedding vector through a residual connection and layer-normalized to obtain the cognitive normalized output. This output is input into a feedforward neural network whose first layer expands it to 1024 dimensions and applies a rectified linear unit (ReLU) activation function, and whose second layer compresses it back to 256 dimensions. The feedforward output is added to the cognitive normalized output through a residual connection and layer-normalized again to obtain the first-layer cognitive encoded features. This is done to enhance the expressive power of the cognitive features through nonlinear transformation and to stabilize training by means of the residual connections. Next, the first-layer cognitive encoded features are input into the second Transformer encoder block, which repeats the 4-head self-attention, residual connection, layer normalization, feedforward neural network, and residual connection operations to extract a higher-level abstract cognitive representation, resulting in the second-layer cognitive encoded output. This is done to refine the high-level semantics of the cognitive patterns through deep stacking, forming a deep understanding of the cognitive preferences of the elderly. Finally, the second-layer cognitive encoded output is used as the cognitive-encoded features.
[0103] Next, the text-encoded features and the cognitive-encoded features are input into the feature fusion layer, where a bidirectional cross-attention mechanism achieves deep interaction between medical semantics and cognitive patterns, yielding the multimodal fusion features. Specifically, the first dimension parameter of the text-encoded features and the second dimension parameter of the cognitive-encoded features are first obtained. The text-encoded features and the cognitive-encoded features are then projected into a preset unified dimensional space through linear transformations to obtain the first feature and the second feature. Specifically, linear transformations project the text features $x_t$ (908 dimensions) and the cognitive features $x_c$ (205 dimensions) onto a 512-dimensional space:
[0104] $$z_t = W_t x_t + b_t, \qquad z_c = W_c x_c + b_c$$
[0105] where $W_t$ and $W_c$ are the projection matrices, $b_t$ and $b_c$ are the bias terms, and $z_t$ and $z_c$ are the resulting first feature and second feature.
[0106] Next, the cosine similarity between the first and second features is calculated, and the mutual information between the first and second features in the same sample is maximized using the InfoNCE loss function to obtain the semantically aligned first and second features. Specifically, the cosine similarity between the first and second features is calculated by dividing the dot product of the two feature vectors by the product of their respective magnitudes to obtain the similarity score. The mutual information between the first and second features in the same sample is maximized using the InfoNCE loss function, which treats the current sample as a positive sample and other samples as negative samples within a batch. The contrastive learning loss is calculated, and the projection parameters are updated through backpropagation, so that semantically related sample features are closer together and irrelevant sample features are farther apart. The specific formula for calculating mutual information is as follows:
[0107] $$\mathcal{L}_{\text{InfoNCE}} = -\log \frac{\exp\left(\operatorname{sim}\left(z_t^{(i)}, z_c^{(i)}\right)/\tau\right)}{\sum_{j} \exp\left(\operatorname{sim}\left(z_t^{(i)}, z_c^{(j)}\right)/\tau\right)}$$
[0108] where $z_t^{(i)}$ and $z_c^{(i)}$ are the first feature and the second feature of sample $i$, $\operatorname{sim}(\cdot,\cdot)$ is the cosine similarity, $\tau = 0.1$ is the temperature parameter, and $j$ indexes the samples in the batch, with those other than $i$ serving as negative samples. This yields the semantically aligned first and second features. This is done to force the text modality features and the cognitive adaptation features to form a consistent representation in the semantic space, establishing cross-modal associations between professional knowledge and cognitive patterns. Then, the semantically aligned first feature is used as the query vector, and the semantically aligned second feature is used as both the key vector and the value vector; these are input into a cross-attention mechanism for feature interaction calculation to obtain the output feature. Specifically, the dot product of the query vector and the key vector is calculated, scaled, and normalized to obtain the attention weight matrix, and the attention weight matrix is multiplied by the value vector to obtain the weighted output, which serves as the output feature. This is done to allow the text modality features to actively attend to the relevant information in the cognitive adaptation features, achieving bidirectional cross-modal information interaction. The output feature is then normalized and compressed in dimension to obtain the initial fused feature, and the initial fused feature is input into a feedforward neural network for deep feature extraction, strengthening the association representation between features to obtain the multimodal fusion features. Specifically, the output feature is normalized to stabilize the feature distribution, and a fully connected layer compresses the feature dimension from 512 to 256 to obtain the initial fused feature. This is done to reduce the feature dimensionality and enhance the model's generalization ability. Finally, the initial fused feature is input into a two-layer feedforward neural network: the first layer expands the 256 dimensions to 1024 dimensions and applies a nonlinear transformation through an activation function, and the second layer compresses the features to 512 dimensions, strengthening the association representation between features and yielding the multimodal fusion features. This is done to extract higher-order combinatorial features through nonlinear transformation, thereby enhancing the model's ability to model complex semantic relationships.
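The in-batch contrastive alignment can be illustrated with a small sketch of an InfoNCE-style loss over cosine similarities with temperature 0.1. It is a plain-Python illustration under the assumption that row i of each list belongs to the same sample; the function names are hypothetical and no gradient update is shown.

```python
import math
from typing import List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Dot product divided by the product of the vector magnitudes (non-zero vectors)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce(text_feats: List[Sequence[float]],
             cog_feats: List[Sequence[float]],
             tau: float = 0.1) -> float:
    """Mean InfoNCE loss; pair (i, i) is the positive, other cog_feats are negatives."""
    n = len(text_feats)
    loss = 0.0
    for i in range(n):
        logits = [cosine(text_feats[i], cog_feats[j]) / tau for j in range(n)]
        m = max(logits)  # subtract the maximum for numerical stability
        log_denominator = m + math.log(sum(math.exp(l - m) for l in logits))
        loss -= logits[i] - log_denominator
    return loss / n
```

When each text feature points in the same direction as its own cognitive feature and away from the others, the loss is close to zero; swapping the pairs makes it large, which is the signal that backpropagation would use to update the projection parameters.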
[0109] Then, the multimodal fusion features are input into the rating prediction head, and the five-dimensional scores and the total adaptation score are predicted through a two-layer fully connected network to obtain the predicted adaptation score. Specifically, the multimodal fusion features are input into the first fully connected layer of the rating prediction head, where a weight matrix maps the 512-dimensional multimodal fusion features to a 256-dimensional hidden layer. After a bias vector is added, batch normalization stabilizes the feature distribution, and a rectified linear unit (ReLU) activation function applies a nonlinear transformation to obtain the first hidden layer output. This is done to extract the high-order feature combinations relevant to rating prediction, reduce the feature dimensionality, and enhance nonlinear expressive power. Next, the first hidden layer output is input into the second fully connected layer of the rating prediction head, where a weight matrix maps the 256 dimensions to a 6-dimensional output layer; adding a bias vector gives the raw prediction vector. The first 5 dimensions correspond to the raw scores of the conceptual complexity, language affinity, logical clarity, relevance to daily life, and memory friendliness dimensions, and the 6th dimension corresponds to the raw total adaptation score. The raw scores of the first five dimensions are then mapped to the 0-1 range by a sigmoid activation function, multiplied by a preset scaling factor, and shifted by a preset offset to convert them to the standard scoring range of 1-5, giving the predicted five-dimensional scores. This is done to convert the model output to a scoring range consistent with the human annotation standard. Next, the raw score of the sixth dimension is mapped to the 0-1 range by a sigmoid activation function, multiplied by a preset total score scaling factor, and shifted by a preset total score offset to obtain the predicted total adaptation score. This ensures that the predicted total adaptation score remains numerically consistent with the weighted result of the five dimensions. Finally, the predicted five-dimensional scores and the predicted total adaptation score are combined to form the predicted adaptation score.
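The conversion of raw outputs to the 1-5 scoring range can be sketched as below. A scaling factor of 4 and an offset of 1 follow from mapping the sigmoid's 0-1 range onto 1-5; using the same pair for the total score is an assumption, since the preset total score scaling factor and offset are not given.

```python
import math
from typing import List, Sequence, Tuple


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def to_rating(raw: float, scale: float = 4.0, offset: float = 1.0) -> float:
    """Map a raw network output to the 1-5 scoring range."""
    return sigmoid(raw) * scale + offset


def decode_prediction(raw: Sequence[float]) -> Tuple[List[float], float]:
    """Split the 6-dimensional raw vector into five dimension scores and a total score."""
    if len(raw) != 6:
        raise ValueError("expected a 6-dimensional raw prediction vector")
    dimension_scores = [to_rating(r) for r in raw[:5]]
    total_score = to_rating(raw[5])
    return dimension_scores, total_score
```

A raw output of zero lands on the midpoint 3.0, and strongly negative or positive outputs approach 1 and 5 without reaching them.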
[0110] Finally, the multimodal fusion features are input into the script generation decoder, and an age-friendly video script is generated through a constrained search strategy whose constraints include a sentence length constraint, a terminology constraint, a repetition constraint, and a metaphor constraint. Specifically, the multimodal fusion features are input into the script generation decoder as its initial state. The beam search width is set to 5, and 5 candidate sequences are initialized, each using the start marker as its initial input. This is done to retain multiple candidate paths during generation and thereby improve generation quality. Secondly, at each decoding step, every candidate sequence is expanded: a Softmax layer computes the conditional probability of each word in the vocabulary being the next word, giving the candidate word probability distribution. This is done to obtain the generation probabilities in the current state. Then, sentence length constraint filtering is applied to each candidate word. The number of characters already generated for the current sentence in the candidate sequence is counted; if adding a candidate word would make the sentence length exceed a preset sentence length threshold (15 in this embodiment), the probability of that candidate word is set to zero and the candidate word is removed. This is done to ensure that no generated script sentence exceeds 15 characters, which suits the sentence length tolerance of elderly users. Next, terminology constraint filtering is applied to the candidate words that pass the sentence length constraint. The number of technical terms in the current candidate sequence is counted; if adding a candidate word would make the number of technical terms exceed a preset terminology threshold (5 in this embodiment), the probability of that candidate word is set to zero. This is done to control the density of technical terms and avoid an excessive comprehension burden. Then, repetition constraint filtering is applied to the candidate words that pass the terminology constraint. The number of sentences since the last occurrence of key information is counted; if the number of sentences between the candidate word and the last occurrence of key information exceeds a preset interval threshold (3 in this embodiment), the probability of the candidate word is set to zero. This ensures that key information is repeated within 3 sentences, reinforcing the memory of elderly users. Next, a metaphor constraint check is performed on the candidate words that pass the repetition constraint to determine whether the current paragraph already contains a metaphor. If it does not, and the candidate word is not a metaphorical expression, the probability weight of the candidate word is reduced so that metaphorical expressions are prioritized. This ensures that each paragraph contains at least one metaphor, making the content more vivid. Finally, based on the candidate word probabilities after constraint filtering and the search score, the 5 candidate sequences with the highest overall scores are retained. Expansion and constraint filtering are repeated until an end marker is generated or the maximum length is reached, and the sequence with the highest score is selected as the age-friendly video script. This is done to generate the optimal age-friendly video script while satisfying all cognitive constraints.
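A reduced sketch of the constrained beam search is shown below. It implements only the sentence length and terminology constraints (the repetition and metaphor constraints are omitted for brevity), and the decoder is replaced by a hypothetical `next_probs` callback that returns next-word probabilities; `toy_next_probs` and all other names are illustrative assumptions.

```python
import math
from typing import Callable, Dict, List, Set, Tuple


def constrained_beam_search(next_probs: Callable[[List[str]], Dict[str, float]],
                            terms: Set[str], beam_width: int = 5,
                            max_sentence_chars: int = 15, max_terms: int = 5,
                            max_len: int = 20, eos: str = "<eos>",
                            sentence_end: str = "。") -> List[str]:
    beams: List[Tuple[float, List[str]]] = [(0.0, [])]  # (log-probability, tokens)
    for _ in range(max_len):
        candidates: List[Tuple[float, List[str]]] = []
        for score, seq in beams:
            if seq and seq[-1] == eos:  # finished sequences are carried forward
                candidates.append((score, seq))
                continue
            current_chars = 0  # characters in the sentence being generated
            for tok in reversed(seq):
                if tok == sentence_end:
                    break
                current_chars += len(tok)
            n_terms = sum(1 for t in seq if t in terms)
            for tok, p in next_probs(seq).items():
                if p <= 0.0:
                    continue
                # Sentence length constraint: drop words that overflow the sentence.
                if tok not in (eos, sentence_end) and current_chars + len(tok) > max_sentence_chars:
                    continue
                # Terminology constraint: drop terms beyond the allowed count.
                if tok in terms and n_terms + 1 > max_terms:
                    continue
                candidates.append((score + math.log(p), seq + [tok]))
        if not candidates:
            break
        candidates.sort(key=lambda c: c[0], reverse=True)
        beams = candidates[:beam_width]
        if all(seq and seq[-1] == eos for _, seq in beams):
            break
    return max(beams, key=lambda c: c[0])[1]


def toy_next_probs(seq: List[str]) -> Dict[str, float]:
    """Hypothetical stand-in for the decoder: two words, then end after three tokens."""
    if len(seq) >= 3:
        return {"<eos>": 1.0}
    return {"血糖": 0.6, "吃饭": 0.4}
```

Dropping a candidate word here is equivalent to setting its probability to zero. With `terms={"血糖"}` and `max_terms=1`, the toy decoder can emit the technical term at most once even though it is the most probable word at every step.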
[0111] Step S40: Based on the comparison between the predicted adaptation score and the preset adaptation score threshold, the age-appropriate video script is iteratively optimized or output to obtain the target age-appropriate video script.
[0112] It should be noted that step S40 includes: comparing the predicted adaptation score with a preset adaptation score threshold to obtain a comparison result; when the comparison result shows that the predicted adaptation score is greater than or equal to the preset adaptation score threshold, adding visual cue markers and virtual character parameters to the age-friendly video script to obtain a first target age-friendly video script, wherein the visual cue markers are chart insertion position indicators, and the virtual character parameters include facial expression parameters, speech rate parameters, and clothing parameters; when the comparison result shows that the predicted adaptation score is less than the preset adaptation score threshold, identifying the target optimization dimension based on the low-scoring dimension among the five-dimensional scores, and adjusting the cognitive adaptation feature weights corresponding to the target optimization dimension to obtain adjusted cognitive adaptation features; re-aligning and fusing the adjusted cognitive adaptation features and the text modality features in the cross-modal space to obtain optimized fused features; inputting the optimized fused features into the preset multimodal deep network model to regenerate the age-friendly video script and obtain an iterative optimization script; repeating until the predicted adaptation score of the iterative optimization script is greater than or equal to the preset adaptation score threshold or the number of iterations reaches the preset maximum number of iterations, and then using the iterative optimization script as the second target age-friendly video script; and using the first target age-friendly video script or the second target age-friendly video script as the target age-friendly video script.
[0113] Specifically, first, the predicted adaptation score is compared with a preset adaptation score threshold (3.5 in this embodiment) to obtain a comparison result. This is done to determine whether the currently generated age-friendly video script meets the minimum requirements for cognitive adaptation in the elderly. Secondly, when the comparison result shows that the predicted adaptation score is greater than or equal to the preset adaptation score threshold, visual cue markers and virtual character parameters are added to the age-friendly video script. A chart insertion position indicator is inserted into the script as a visual cue marker, and, as virtual character parameters, the facial expression parameter is set to a smile or concern, the speech rate parameter to 120 words per minute, and the clothing parameter to comfortable home wear. This yields the first target age-friendly video script. This is done to enhance the multimodal presentation effect of the script and to improve the viewing experience and memory retention of elderly users. Then, when the predicted adaptation score is less than the preset adaptation score threshold (3.5 in this embodiment), the lowest score among the five dimensions is analyzed to identify the target optimization dimension. When the conceptual complexity dimension score is the lowest, the weight coefficient of the medical ontology feature is reduced; when the language affinity dimension score is the lowest, the weight coefficient of the colloquialization feature is increased; when the logical clarity dimension score is the lowest, the weight coefficient of the logical transformation feature is increased; when the relevance to daily life dimension score is the lowest, the weight coefficient of the metaphor system feature is increased; and when the memory friendliness dimension score is the lowest, the weight coefficient of the key information repetition feature is increased. This yields the adjusted cognitive adaptation features, which specifically strengthen the feature representation of the low-scoring dimension and guide the model to generate scripts that better meet the requirements of that dimension. Next, the adjusted cognitive adaptation features and the text modality features are input again into the cross-modal alignment module, and the projection transformation and contrastive learning loss are recalculated to obtain re-aligned semantic features. These re-aligned semantic features are then input into the cross-attention fusion layer, where the attention weights and fusion output are recalculated to obtain the optimized fused features. This is done to re-establish the cross-modal association between text and cognition under the adjusted feature weights. Then, the optimized fused features are input into the preset multimodal deep network model, and text encoding, cognitive encoding, bidirectional cross-attention fusion, score prediction, and constrained search decoding are re-executed in turn to regenerate the age-friendly video script, resulting in an iterative optimization script. This is done to generate an improved version of the script based on the optimized features. After that, the predicted adaptation score of the iterative optimization script is compared again with the preset adaptation score threshold (3.5 in this embodiment). If the threshold is not met and the number of iterations has not reached the preset maximum number of iterations (5 in this embodiment), the process returns to the target optimization dimension identification and weight adjustment operations. If the threshold is met or the number of iterations reaches the preset maximum number of iterations, the current iterative optimization script is used as the second target age-friendly video script. This is done to ensure that the final output script meets the adaptation requirements as far as possible while avoiding infinite loops. Finally, the first target age-friendly video script or the second target age-friendly video script is taken as the target age-friendly video script, completing the entire generation and optimization process.
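The threshold-driven loop of step S40 can be illustrated with a minimal Python sketch. The threshold (3.5) and the iteration cap (5) come from this embodiment, and the direction of each weight change follows the description above. The function name `optimize_script`, the `generate` callback, the dictionary keys, and the step size of 0.1 are assumptions made only for illustration; the re-alignment, fusion, and decoding stages are hidden behind the callback.

```python
from typing import Callable, Dict, Tuple

THRESHOLD = 3.5      # preset adaptation score threshold (this embodiment)
MAX_ITERATIONS = 5   # preset maximum number of iterations (this embodiment)

# Lowest-scoring dimension -> (feature weight to adjust, signed step).
# The direction follows the description; the 0.1 step size is an assumption.
ADJUSTMENTS: Dict[str, Tuple[str, float]] = {
    "conceptual_complexity": ("medical_ontology", -0.1),
    "language_affinity": ("colloquialization", +0.1),
    "logical_clarity": ("logical_transformation", +0.1),
    "life_relevance": ("metaphor_system", +0.1),
    "memory_friendliness": ("key_information_repetition", +0.1),
}

Result = Tuple[str, float, Dict[str, float]]  # (script, total score, dimension scores)


def optimize_script(generate: Callable[[Dict[str, float]], Result],
                    weights: Dict[str, float]) -> Tuple[str, float, int]:
    """Regenerate the script until the score reaches the threshold or the cap is hit.

    `generate` is a hypothetical stand-in for re-alignment, fusion, score
    prediction, and constrained decoding under the given feature weights.
    """
    script, total, dims = generate(weights)
    iterations = 0
    while total < THRESHOLD and iterations < MAX_ITERATIONS:
        lowest = min(dims, key=dims.get)           # target optimization dimension
        feature, step = ADJUSTMENTS[lowest]
        weights = dict(weights)                    # do not mutate the caller's weights
        weights[feature] = max(0.0, weights[feature] + step)
        script, total, dims = generate(weights)
        iterations += 1
    return script, total, iterations
```

A return value of zero iterations corresponds to the branch that yields the first target script; any other value corresponds to the second target script. Adding the visual cue markers and virtual character parameters is omitted from this sketch.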
[0114] Furthermore, to verify the effectiveness of the method proposed in this embodiment, systematic experimental verification was conducted. The experiment used a constructed professional knowledge-elderly cognitive adaptation dataset, containing 500 pairs of labeled samples. The samples came from three types of data: medical textbooks and guideline texts (including the 9th edition of *Internal Medicine* and *Guidelines for the Prevention and Treatment of Type 2 Diabetes in China*), manually created elderly health scripts (from the geriatrics department of a tertiary hospital and the "Sunset Red Health" column), and elderly user cognitive characteristic data (daily conversation data and memory test results of elderly people aged 60 to 80). The data covered areas such as chronic diseases (diabetes, hypertension, etc.), nutrition, and exercise rehabilitation, with text lengths ranging from 50 to 300 words per article. Four sets of comparative models were set up in the experiment: a BERT-based simplified model (text-only unimodal), a T5-small text generation model (end-to-end generation without cognitive constraints), a unimodal model (text features only, no cognitive adaptation features), and the model of this embodiment. Evaluation metrics included Pearson correlation coefficient, root mean square error, and semantic consistency. Table 1 shows the performance comparison results of each model.
[0115] Table 1. Performance Comparison Results of Each Model
[0116]
[0117] As can be seen, the model in this embodiment achieves a Pearson correlation coefficient of 0.91, which is 33.8% higher than the BERT-base simplified model, 28.2% higher than the T5-small text generation model, and 11.0% higher than the unimodal model. The root mean square error is reduced to 0.32, which is 55.6%, 52.9%, and 28.9% lower than the three comparison models, respectively. Semantic consistency reaches 0.48, significantly better than the comparison models. These results validate the effectiveness of the multimodal cognitive feature fusion and contrastive learning alignment mechanisms in improving the accuracy of adaptation evaluation.
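For reference, the two score-prediction metrics named above can be computed as in the following sketch. It uses only the standard definitions of the Pearson correlation coefficient and the root mean square error; the function names are illustrative, and the semantic consistency metric is not sketched because its definition is not given in this passage.

```python
import math
from typing import Sequence


def pearson(pred: Sequence[float], true: Sequence[float]) -> float:
    """Pearson correlation between predicted and labeled adaptation scores."""
    n = len(pred)
    mean_p, mean_t = sum(pred) / n, sum(true) / n
    cov = sum((p - mean_p) * (t - mean_t) for p, t in zip(pred, true))
    std_p = math.sqrt(sum((p - mean_p) ** 2 for p in pred))
    std_t = math.sqrt(sum((t - mean_t) ** 2 for t in true))
    return cov / (std_p * std_t)


def rmse(pred: Sequence[float], true: Sequence[float]) -> float:
    """Root mean square error between predicted and labeled scores."""
    return math.sqrt(sum((p - t) ** 2 for p, t in zip(pred, true)) / len(pred))
```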
[0118] This embodiment constructs a professional knowledge-age cognitive adaptation assessment system with five-dimensional quantitative indicators and establishes a gold standard sample library. Then, it extracts text and cognitive adaptation multimodal features and inputs them into a preset multimodal deep network to generate adaptation scores and scripts. Finally, it iteratively optimizes or outputs target scripts based on score thresholds, realizing the automated and standardized generation of age-friendly video scripts. Multimodal fusion and constrained generation take into account both medical accuracy and cognitive adaptation features, improving the efficiency of script generation and content adaptation for elderly users.
[0119] Based on the first embodiment of this application, a second embodiment of this application is provided. For content that is the same as or similar to that of the first embodiment, refer to the description above; it is not repeated here. On this basis, referring to Figure 2, the method for generating age-friendly video scripts based on multimodal cognition further includes steps S201 to S208 before step S30:
[0120] Step S201: Retrieve sample data from the gold standard sample library with a preset number of labeled samples as the training set and validation set.
[0121] It should be noted that all 500 pairs of labeled sample data are first retrieved from the gold standard sample library and divided into a training set and a validation set in an 8:2 ratio. The training set contains 400 pairs of sample data, and the validation set contains 100 pairs of sample data. The validation set includes the real results already labeled in the gold standard sample library; that is, each pair of sample data in the validation set includes the original professional health text, the manually converted script, the real values of the five-dimensional scores, and the real value of the total adaptation score. This is done to ensure that independent validation data is available for evaluating generalization performance during model training. Secondly, the 400 pairs of sample data in the training set are subjected to data augmentation, which includes synonym replacement of medical terms based on a medical thesaurus, sentence reordering, and terminology replacement that retains core concepts, resulting in an augmented training set. This is done to expand the diversity of the training data and improve the model's robustness to variations in medical text. Then, the augmented training set and the unaugmented validation set are stored in a structured data format. The training set is used for model parameter updates, and the validation set is used for performance evaluation and early-stopping checks after each training round. This is done to establish a standardized training process, prevent model overfitting, and ensure the objectivity of the evaluation.
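The 8:2 split and one of the augmentation operations can be sketched as follows. The random shuffle, the fixed seed, the function names, and the thesaurus entries are assumptions for illustration; the text specifies only the ratio and the kinds of augmentation.

```python
import random
from typing import Dict, List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def split_samples(samples: Sequence[T], train_ratio: float = 0.8,
                  seed: int = 42) -> Tuple[List[T], List[T]]:
    """Shuffle (an assumption) and split into training and validation sets."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_ratio)   # 500 samples -> 400 / 100
    return shuffled[:cut], shuffled[cut:]


def replace_synonyms(text: str, thesaurus: Dict[str, str]) -> str:
    """Thesaurus-based synonym replacement; applied to training samples only."""
    for term, synonym in thesaurus.items():
        text = text.replace(term, synonym)
    return text
```

Only the training portion would be passed through `replace_synonyms`, so that the validation set stays unaugmented as described above.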
[0122] Step S202: Construct the initial multimodal deep network model.
[0123] Step S203: Design a multi-task loss function.
[0124] It should be noted that the multi-task loss function is a weighted sum of the mean squared error loss of the total adaptation score, the mean absolute error loss of the dimension scores, the fluency penalty term, and the cognitive constraint regularization term.
[0125] Specifically, the mean squared error loss of the total adaptation score is calculated first: the difference between the predicted total adaptation score output by the score prediction head and the labeled true total adaptation score is taken and squared, and the squared differences are averaged over all training samples. The specific formula is as follows:
[0126] $$L_{\mathrm{MSE}} = \frac{1}{N}\sum_{i=1}^{N}\left(\hat{y}_i - y_i\right)^2$$
[0127] where $N$ represents the total number of samples, $i$ represents the index of the current sample, $\hat{y}_i$ represents the predicted total adaptation score, and $y_i$ represents the true value of the total adaptation score. The mean squared error loss of the total adaptation score quantifies the model's prediction bias in the overall adaptation assessment and imposes a higher penalty on larger errors.
[0128] Secondly, the mean absolute error loss of the dimension scores is calculated. The difference between the predicted five-dimensional scores output by the score prediction head and the labeled true five-dimensional scores is taken in absolute value, and the absolute errors are averaged over the five dimensions and over all training samples. The specific formula is as follows:
[0129] $$L_{\mathrm{MAE}} = \frac{1}{5N}\sum_{i=1}^{N}\sum_{k=1}^{5}\left|\hat{s}_{i,k} - s_{i,k}\right|$$
[0130] where $N$ represents the total number of samples, $i$ represents the index of the current sample, $k$ represents the index of the current dimension, $\hat{s}_{i,k}$ represents the predicted score of the $k$-th dimension, and $s_{i,k}$ represents the true score of the $k$-th dimension. The mean absolute error loss evaluates the prediction accuracy of each dimension independently, making it easier to identify the optimization direction for a specific dimension.
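The two regression losses described above can be written out as a short sketch in plain Python. The function names and the list-based data layout are illustrative assumptions; a real implementation would normally operate on tensors.

```python
from typing import Sequence


def mse_loss(pred_totals: Sequence[float], true_totals: Sequence[float]) -> float:
    """Mean squared error of the total adaptation score over all samples."""
    n = len(pred_totals)
    return sum((p - t) ** 2 for p, t in zip(pred_totals, true_totals)) / n


def mae_loss(pred_dims: Sequence[Sequence[float]],
             true_dims: Sequence[Sequence[float]]) -> float:
    """Mean absolute error averaged over samples and over the score dimensions."""
    n = len(pred_dims)
    k = len(pred_dims[0])   # five dimensions in this embodiment
    total = sum(abs(p - t)
                for ps, ts in zip(pred_dims, true_dims)
                for p, t in zip(ps, ts))
    return total / (n * k)
```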
[0131] Then, the fluency penalty term is calculated. The age-friendly video script generated by the script generation decoder is input into the perplexity calculation module, and the exponential of the negative logarithm of the conditional probabilities of the generated script is calculated to obtain the fluency penalty term. Written in the standard length-normalized perplexity form, the formula is as follows:
[0132] $$L_{\mathrm{PPL}} = \exp\left(-\frac{1}{T}\sum_{t=1}^{T}\log P\left(w_t \mid w_{<t}\right)\right)$$
[0133] where $\exp$ represents the exponential function; $T$ represents the sequence length of the age-friendly video script; $t$ represents the position index in the age-friendly video script, with a value ranging from 1 to $T$; $\log$ represents the logarithmic function; $P$ represents the conditional probability, that is, the probability of the candidate word output by the script generation decoder; $w_t$ represents the current word at the $t$-th position of the age-friendly video script; $w_{<t}$ represents all words preceding the $t$-th position; and $P(w_t \mid w_{<t})$ represents the probability of generating the current word at the $t$-th position given all words preceding that position. The higher this probability, the better the current word conforms to language habits. The fluency penalty term penalizes awkward expressions with low generation probabilities, ensuring that the script language is fluent and natural.
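A minimal sketch of the fluency penalty, assuming the per-position conditional probabilities of the generated script are already available from the decoder. The length normalization follows the standard definition of perplexity, which is an assumption about the exact form used in this embodiment; the function name is illustrative.

```python
import math
from typing import Sequence


def fluency_penalty(token_probs: Sequence[float]) -> float:
    """Perplexity of a generated script from its per-token conditional probabilities.

    token_probs[t] is P(w_t | w_<t); lower probabilities give a larger penalty.
    """
    seq_len = len(token_probs)
    neg_log_likelihood = -sum(math.log(p) for p in token_probs)
    return math.exp(neg_log_likelihood / seq_len)
```

A script whose every word is predicted with probability 1 has the minimum penalty of 1, and the penalty grows as the decoder assigns lower probabilities to the words it emits.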
[0134] Next, the cognitive constraint regularization term is calculated. The number of segments in the generated script that violate the sentence length constraint, the terminology constraint, the repetition constraint, and the metaphor constraint is counted for each constraint, and the penalties for exceeding the threshold of each constraint are accumulated to obtain the cognitive constraint regularization term. The formula is as follows:
[0135] $$L_{\mathrm{CR}} = \sum_{c \in C}\max\left(0,\; n_c - \tau_c\right)$$
[0136] where $C$ represents the set of constraints, $c$ represents the current constraint type, $\tau_c$ represents the threshold corresponding to constraint type $c$, and $n_c$ represents the number of segments in the generated script that violate constraint type $c$. The cognitive constraint regularization term transforms hard cognitive constraints into optimizable soft constraints, guiding the model to gradually meet the cognitive requirements of the elderly.
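The accumulation of over-threshold violations can be sketched as a hinge-style sum. The hinge form, the function name, and the example thresholds are assumptions for illustration; the text states only that violations are counted per constraint type and that the excess over each threshold is penalized cumulatively.

```python
from typing import Dict


def constraint_penalty(violations: Dict[str, int], thresholds: Dict[str, int]) -> int:
    """Sum, over constraint types, of the violation count in excess of its threshold."""
    return sum(max(0, violations.get(c, 0) - limit) for c, limit in thresholds.items())


# Hypothetical thresholds: one tolerated violation per constraint type.
example_thresholds = {"sentence_length": 1, "terminology": 1, "repetition": 1, "metaphor": 1}
```

Counts at or below the threshold contribute nothing, so a script that stays within every limit incurs no regularization penalty.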
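[0137] Finally, the multi-task loss function is obtained by multiplying the mean squared error loss of the total adaptation score by a preset first weight coefficient, the mean absolute error loss of the dimension scores by a preset second weight coefficient, the fluency penalty term by a preset third weight coefficient, and the cognitive constraint regularization term by a preset fourth weight coefficient, and then summing them up. The specific formula is as follows:
[0138] $$L = \lambda_1 L_{\mathrm{MSE}} + \lambda_2 L_{\mathrm{MAE}} + \lambda_3 L_{\mathrm{PPL}} + \lambda_4 L_{\mathrm{CR}}$$
[0139] where $L_{\mathrm{MSE}}$, $L_{\mathrm{MAE}}$, $L_{\mathrm{PPL}}$, and $L_{\mathrm{CR}}$ denote the mean squared error loss of the total adaptation score, the mean absolute error loss of the dimension scores, the fluency penalty term, and the cognitive constraint regularization term, respectively; $\lambda_1$ represents the first weight coefficient, $\lambda_2$ the second weight coefficient, $\lambda_3$ the third weight coefficient, and $\lambda_4$ the fourth weight coefficient.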
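The weighted combination itself is a one-line computation, sketched below. The default weight values are purely hypothetical placeholders, since the preset coefficients are not stated in this passage.

```python
from typing import Tuple


def multi_task_loss(mse: float, mae: float, fluency: float, constraint: float,
                    weights: Tuple[float, float, float, float] = (1.0, 0.5, 0.1, 0.1)) -> float:
    """Weighted sum of the four loss terms; the default weights are hypothetical."""
    w1, w2, w3, w4 = weights
    return w1 * mse + w2 * mae + w3 * fluency + w4 * constraint
```

In practice the four coefficients would be tuned so that no single term dominates; the fluency term in particular is on a different scale from the two score-regression terms.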
[0140] Step S204: Based on the initial multimodal deep network model, perform the first training phase, perform synonym replacement, sentence rearrangement and professional terminology replacement on the professional health text, train for a preset first number of iterations, and obtain the first optimized model.
[0141] It should be noted that, based on the initial multimodal deep network model, the pre-training stage of the scoring prediction model is performed. A preset optimizer is used, and a preset initial learning rate and weight decay coefficient are set. The learning rate is adjusted by linear warm-up followed by cosine decay. Synonym replacement, sentence rearrangement, and professional terminology replacement are performed on the professional health text. The model is trained for a preset number of first iterations to obtain the first optimized model.
[0142] Specifically, first, in the pre-training stage of the score prediction model based on the initial multimodal deep network model, all parameters of the script generation decoder are frozen, and only the parameters of the text encoder, cognitive feature encoder, feature fusion layer, and score prediction head remain trainable. This allows the model to learn the evaluation rules first and avoids interference from the generation task. Secondly, an adaptive moment estimation (Adam) optimizer is used as the preset optimizer, with a preset initial learning rate of 0.00002 and a weight decay coefficient of 0.01. This controls the step size and magnitude of parameter updates and prevents overfitting. Then, the learning rate is adjusted. For the first 1000 steps, a linear warm-up strategy gradually increases the learning rate from 0 to the preset initial learning rate. After 1000 steps, a cosine decay strategy makes the learning rate decrease with the number of training rounds, resulting in the adjusted learning rate. This stabilizes the model in the early stage of training and fine-tunes it in the later stage. Next, data augmentation is performed on the professional health texts in the training set, including synonym replacement based on a medical thesaurus, sentence reordering, and professional terminology replacement that retains core concepts, to obtain augmented training samples. This is done to expand data diversity and improve model robustness. Then, the augmented training samples are input into the initial multimodal deep network model. Features are extracted through the text encoder and the cognitive feature encoder, cross-modal fusion is performed through the feature fusion layer, and the predicted adaptation score is output through the score prediction head. The multi-task loss function value is calculated, and the trainable parameters are updated through backpropagation.
The forward calculation and backpropagation are repeated, and the model is trained until the loss function converges or the preset first number of iterations (30 rounds in this embodiment) is reached, to obtain the first optimized model. This is done to enable the model to master the mapping from multimodal features to adaptation scores, laying the foundation for subsequent joint training.
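The learning-rate schedule of the first training phase can be sketched as a pure function of the step index. The initial learning rate (0.00002) and the warm-up length (1000 steps) come from this embodiment. The `total_steps` parameter and the use of a single half-cosine that decays to zero are assumptions, since the text does not state the total step count or the cosine period.

```python
import math

BASE_LR = 2e-5        # preset initial learning rate (this embodiment)
WARMUP_STEPS = 1000   # linear warm-up length (this embodiment)


def learning_rate(step: int, total_steps: int) -> float:
    """Linear warm-up from 0 to BASE_LR, then cosine decay towards 0."""
    if step < WARMUP_STEPS:
        return BASE_LR * step / WARMUP_STEPS
    # Fraction of the decay phase completed, clamped to [0, 1].
    progress = min(1.0, (step - WARMUP_STEPS) / max(1, total_steps - WARMUP_STEPS))
    return 0.5 * BASE_LR * (1.0 + math.cos(math.pi * progress))
```

The function could be wrapped in whatever scheduler interface the training framework provides; only the shape of the curve is shown here.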
[0143] Step S205: Based on the first optimized model, perform the second training phase: design a reward function that integrates adaptation score, semantic similarity, and generation diversity, and train for a preset second number of iterations with a preset batch size to obtain the second optimized model.
[0144] It should be noted that, based on the first optimized model, the second training phase is performed. A reward function that integrates adaptation score, semantic similarity, and generation diversity is designed. A proximal policy optimization (PPO) algorithm is adopted, a preset clipping parameter is set, and the model is trained for a preset second number of iterations with a preset batch size to obtain the second optimized model.
[0145] Specifically, firstly, based on the first optimized model, a second training phase is performed. All parameters of the script generation decoder are unfrozen, and all parameters of the text encoder, cognitive feature encoder, feature fusion layer, rating prediction head, and script generation decoder are opened for end-to-end training. This is done to jointly optimize the rating prediction and script generation tasks, achieving a synergistic improvement in adaptation assessment and content generation. Secondly, a reward function is designed, with the specific formula as follows:
[0146]
[0147] In the reward function, the first quantity is the predicted adaptation score output by the first optimized model; the second is the word-embedding-based semantic similarity between the generated script and the original professional health text; and the third is the generation diversity, that is, a measure of how distinguishable the generated script is from the other samples in the training set. Then, a proximal policy optimization algorithm is used as the optimization strategy. A preset clipping parameter (0.2 in this embodiment) is set to limit the magnitude of policy updates, and a preset batch size (32 in this embodiment) is set. Training samples are input into the model in batches. This is done to stabilize the reinforcement learning training process and avoid performance collapse caused by excessive policy updates. Next, the augmented training samples are input into the first optimized model. An age-friendly video script is generated through the script generation decoder, the comprehensive reward function value is calculated, the policy gradient is calculated based on the proximal policy optimization algorithm, and all trainable parameters are updated. Generation, reward calculation, and parameter updates are repeated until the reward function converges or the preset second number of iterations (50 rounds in this embodiment) is reached, resulting in the second optimized model. This is done to further improve generation quality through reinforcement learning fine-tuning, enabling the model to learn to generate highly adaptive scripts while satisfying cognitive constraints.
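As a rough illustration of the second training phase, the sketch below combines the three reward components and evaluates the standard PPO clipped surrogate for a single sample. The clipping parameter (0.2) comes from this embodiment. The linear form of the reward and its weights are hypothetical, because the exact way the three components are combined is not recoverable from this passage; the function names are also illustrative.

```python
from typing import Tuple

CLIP_EPS = 0.2  # preset clipping parameter (this embodiment)


def reward(adaptation_score: float, semantic_similarity: float, diversity: float,
           weights: Tuple[float, float, float] = (0.5, 0.3, 0.2)) -> float:
    """Hypothetical linear combination of the three reward components."""
    a, b, c = weights
    return a * adaptation_score + b * semantic_similarity + c * diversity


def ppo_clipped_objective(ratio: float, advantage: float,
                          clip_eps: float = CLIP_EPS) -> float:
    """Standard PPO clipped surrogate for one sample.

    ratio is new_policy_prob / old_policy_prob for the generated action.
    """
    clipped_ratio = max(1.0 - clip_eps, min(ratio, 1.0 + clip_eps))
    return min(ratio * advantage, clipped_ratio * advantage)
```

Taking the minimum of the clipped and unclipped terms removes any incentive to move the policy ratio outside the interval from 0.8 to 1.2, which is how the clipping parameter limits the update magnitude.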
[0148] Step S206: Input the training set into the second optimization model to obtain the prediction results.
[0149] It should be noted that the preprocessed training set data is retrieved from the gold standard sample library. Following the input format required by the second optimization model, the multimodal fusion features and original professional health text in the training set are organized into input tensors of a fixed batch size. This is done to ensure that the input data format matches the model's input layer, improving model inference efficiency and avoiding format errors. Then, the organized input tensors are input into the second optimization model. The model sequentially encodes the input features through a text encoder and a cognitive feature encoder, completes feature interaction through a feature fusion layer, performs adaptation score prediction through a scoring prediction head, and generates age-appropriate video scripts through a script generation decoder. This is done to allow the model to fully utilize the learned medical semantics and cognitive adaptation rules for aging, outputting predictive content that meets the requirements. Finally, the model's five-dimensional score predictions, total adaptation score predictions, and age-appropriate video scripts are collected and integrated to form a complete prediction result. This is done to compare the prediction results with the labeled data in the training set, calculate the loss to judge the model's training effect, and provide a basis for further model optimization.
[0150] Step S207: Calculate the loss value based on the multi-task loss function, the prediction result, and the actual result.
[0151] It should be noted that the preset parameters of the multi-task loss function are first retrieved to determine the weight coefficients of the total adaptation score mean squared error loss, the dimension score mean absolute error loss, the fluency penalty term, and the cognitive constraint regularization term. At the same time, the prediction results and their corresponding true results are extracted and aligned by batch and by dimension. This ensures the accuracy of the loss calculation and avoids errors caused by data misalignment or missing parameters. Then, the aligned prediction results and true results are input into the corresponding calculation modules of the multi-task loss function, and the value of each loss term is calculated in turn. Finally, the loss terms are weighted by the preset weight coefficients and summed to obtain the final loss value. Weighting the terms accounts for the model's errors in both score prediction and script generation, balances medical accuracy against cognitive adaptability for the elderly, and avoids the optimization imbalance that a single loss term would cause. The final loss value quantifies the model's current prediction error and provides the basis for subsequent parameter updates and gradient adjustments, driving continued improvement in prediction accuracy and script generation quality.
[0152] Step S208: Repeat the training until the loss value converges or the preset maximum number of iterations is reached, and then use the current second optimized model as the preset multimodal deep network model.
[0153] It's important to note that the loss value calculated after each training round is monitored in real-time. The difference between the current loss value and the previous loss value is compared to determine if the difference is less than a preset convergence threshold. Simultaneously, the current training iteration count is recorded. This is done to monitor the model's optimization progress and determine if the model has reached a stable performance state. Then, if the change in the loss value is less than the convergence threshold for several consecutive iterations, the loss value is considered converged, and model training is immediately terminated. If the loss value has not converged, but the current iteration count has reached the preset maximum number of iterations (80 in this example), model training is also terminated. This is to avoid overfitting due to overtraining and to prevent the ineffective consumption of computing power and time due to the loss value not converging for a long period. Finally, the second optimized model at the time of training termination is saved as a preset multimodal deep network model, completing the entire model training process. This is done to determine the final model version used for actual inference, ensuring that the model has stable adaptive score prediction capabilities and age-appropriate video script generation capabilities.
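The stopping rule of step S208 can be sketched as a check over the recorded loss history. The iteration cap (80) comes from this embodiment; the convergence threshold `tol`, the number of consecutive stable rounds `patience`, and the function name are hypothetical, since the text says only "a preset convergence threshold" and "several consecutive iterations".

```python
from typing import Sequence


def should_stop(loss_history: Sequence[float], tol: float = 1e-4,
                patience: int = 3, max_iterations: int = 80) -> bool:
    """Stop when the loss has changed by less than tol for `patience`
    consecutive rounds, or when the iteration cap has been reached."""
    if len(loss_history) >= max_iterations:
        return True
    if len(loss_history) <= patience:
        return False
    recent = loss_history[-(patience + 1):]
    return all(abs(a - b) < tol for a, b in zip(recent, recent[1:]))
```

The check would be called once per training round with the losses recorded so far, and the model saved at the moment it first returns true.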
[0154] This embodiment retrieves samples from the gold standard sample library to divide them into training and validation sets, constructs an initial model and designs a multi-task loss function, obtains a second optimized model through two-stage training, and then determines the final model by the convergence of the loss value or the upper limit of the iteration, thereby improving the model training accuracy, taking into account both the adaptation score prediction and script generation effects, and ensuring that the model output is stable and reliable.
[0155] Based on the first embodiment of this application, this application also provides an age-friendly video script generation system based on multimodal cognition. Referring to Figure 3, the system includes:
[0156] The acquisition module 10 is used to construct a professional knowledge-elderly cognitive adaptation assessment system and establish a gold standard sample library to obtain sample data labeled with adaptation scores. The adaptation assessment system includes a five-dimensional quantitative indicator system and a method for calculating the total adaptation score. The five-dimensional quantitative indicator system includes a conceptual complexity dimension score, a language affinity dimension score, a logical clarity dimension score, a life relevance dimension score, and a memory friendliness dimension score.
[0157] The feature extraction module 20 is used to extract text modality features and cognitive adaptation features based on sample data. The text modality features include medical ontology features and language features, and the cognitive adaptation features include colloquialization conversion features, logical conversion features and metaphor system features.
[0158] The prediction module 30 is used to input the text modal features and the cognitive adaptation features into a preset multimodal deep network model to obtain the adaptation score prediction value and the age-appropriate video script. The preset multimodal deep network model includes a text encoder, a cognitive feature encoder, a feature fusion layer, a score prediction head, and a script generation decoder. The text encoder adopts a 3-layer first Transformer encoder, and each first Transformer encoder uses 8 attention heads and a feedforward neural network. The cognitive feature encoder adopts a 2-layer second Transformer encoder, and each second Transformer encoder uses 4 attention heads and a feedforward neural network. The score prediction head adopts a 2-layer fully connected network, and the script generation decoder adopts a constraint search strategy.
[0159] The results module 40 is used to iteratively optimize or output the age-appropriate video script based on the comparison between the predicted adaptation score and the preset adaptation score threshold, so as to obtain the target age-appropriate video script.
[0160] The aging-friendly video script generation system based on multimodal cognition provided in this application, employing the aging-friendly video script generation method based on multimodal cognition in the above embodiments, can solve the technical problem of how to automatically generate video scripts that conform to cognitive adaptation characteristics. Compared with the prior art, the beneficial effects of the aging-friendly video script generation system based on multimodal cognition provided in this application are the same as those of the aging-friendly video script generation method based on multimodal cognition provided in the above embodiments, and other technical features of the aging-friendly video script generation system based on multimodal cognition are the same as those disclosed in the methods of the above embodiments, and will not be repeated here.
[0161] In one embodiment, the acquisition module 10 is further configured to: define the mathematical expression of the five-dimensional quantitative indicator system; determine the preset dimension weight vector of the five-dimensional quantitative indicator system using the analytic hierarchy process (AHP); calculate the total adaptation score based on the preset dimension weight vector and the score vector of the dimensions; form an annotation team composed of geriatric medicine experts, linguistics experts, and elderly user representatives to perform five-dimensional scoring and total adaptation score annotation on a preset number of sample pairs, each consisting of an original professional text and a manually converted script, to obtain initial annotation results; calculate the intraclass correlation coefficient (ICC) of the initial annotation results; when the intraclass correlation coefficient is less than a preset consistency threshold, remove the corresponding initial annotation results; and integrate the remaining initial annotation results to establish the gold standard sample library and obtain the sample data.
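The weighting step can be illustrated with a small sketch. It derives AHP weights from a pairwise comparison matrix using the row geometric mean approximation and then computes the total adaptation score as a weighted sum. Both choices are assumptions: the text names AHP but not the weight-derivation variant, and it says only that the total score is calculated from the weight vector and the score vector. The function names are illustrative.

```python
import math
from typing import List, Sequence


def ahp_weights(pairwise: Sequence[Sequence[float]]) -> List[float]:
    """Approximate AHP priority weights by normalized row geometric means.

    pairwise[i][j] states how much more important dimension i is than j.
    """
    geo_means = [math.prod(row) ** (1.0 / len(row)) for row in pairwise]
    total = sum(geo_means)
    return [g / total for g in geo_means]


def total_adaptation_score(weights: Sequence[float], scores: Sequence[float]) -> float:
    """Weighted sum of the five dimension scores (assumed combination rule)."""
    return sum(w * s for w, s in zip(weights, scores))
```

For a perfectly consistent comparison matrix the geometric mean approximation recovers the underlying weights exactly; for inconsistent judgments a consistency check would normally accompany it.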
[0162] In one embodiment, the prediction module 30 is further configured to: input the text modal features into a text encoder, extract deep semantic representations through a multi-layer self-attention mechanism and a feedforward neural network to obtain text-encoded features; input the cognitive adaptation features into a cognitive feature encoder, extract cognitive pattern representations through a Transformer architecture to obtain cognitive-encoded features; input the text-encoded features and the cognitive-encoded features into a feature fusion layer for cross-modal space alignment and fusion processing to obtain multi-modal fusion features; input the multi-modal fusion features into a scoring prediction head, regress and predict five-dimensional scores and total adaptation scores through a two-layer fully connected network to obtain a predicted adaptation score; and input the multi-modal fusion features into a script generation decoder, generate an age-appropriate video script through a constraint search strategy, wherein the constraints of the constraint search include sentence length constraints, terminology constraints, repetition constraints, and metaphor constraints.
[0163] This application provides an aging-friendly video script generation device based on multimodal cognition. The aging-friendly video script generation device based on multimodal cognition includes: at least one processor; and a memory communicatively connected to the at least one processor; wherein the memory stores instructions that can be executed by the at least one processor, and the instructions are executed by the at least one processor to enable the at least one processor to execute the aging-friendly video script generation method based on multimodal cognition in the above embodiment 1.
[0164] The following is for reference. Figure 4 The diagram illustrates a structural schematic of an age-friendly video script generation device based on multimodal cognition, suitable for implementing embodiments of this application. The age-friendly video script generation device based on multimodal cognition in the embodiments of this application may include, but is not limited to, mobile terminals such as mobile phones, laptops, digital broadcast receivers, PDAs (Personal Digital Assistants), PMPs (Portable Media Players), in-vehicle terminals (e.g., in-vehicle navigation terminals), and fixed terminals such as digital TVs and desktop computers. Figure 4 The aging-friendly video script generation device based on multimodal cognition shown is merely an example and should not impose any limitations on the functionality and scope of use of the embodiments of this application.
[0165] like Figure 4As shown, the aging-friendly video script generation device based on multimodal cognition may include a processing unit 1001 (e.g., a central processing unit, a graphics processing unit, etc.), which can perform various appropriate actions and processes according to a program stored in read-only memory (ROM) 1002 or a program loaded from storage device 1003 into random access memory (RAM) 1004. The RAM 1004 also stores various programs and data required for the operation of the aging-friendly video script generation device based on multimodal cognition. The processing unit 1001, ROM 1002, and RAM 1004 are interconnected via a bus 1005. An input / output (I / O) interface 1006 is also connected to the bus. Typically, the following can be connected to I / O interface 1006: input devices 1007 including, for example, touchscreens, touchpads, keyboards, mice, image sensors, microphones, accelerometers, gyroscopes, etc.; output devices 1008 including, for example, liquid crystal displays (LCDs), speakers, vibrators, etc.; storage devices 1003 including, for example, magnetic tapes, hard disks, etc.; and communication devices 1009. Communication device 1009 allows the multimodal cognition-based aging-friendly video script generation device to communicate wirelessly or wiredly with other devices to exchange data. Although various multimodal cognition-based aging-friendly video script generation devices are shown in the figures, it should be understood that implementation or possession of all of them is not required. More or fewer may be implemented alternatively.
[0166] Specifically, according to the embodiments disclosed in this application, the processes described above with reference to the flowcharts can be implemented as computer software programs. For example, embodiments disclosed in this application include a computer program product comprising a computer program carried on a computer-readable medium, the computer program containing program code for performing the methods shown in the flowcharts. In such embodiments, the computer program can be downloaded and installed from a network via the communication device, or installed from storage device 1003, or installed from ROM 1002. When the computer program is executed by processing unit 1001, it performs the functions defined in the methods of the embodiments disclosed in this application.
[0167] The aging-friendly video script generation device based on multimodal cognition provided in this application, employing the aging-friendly video script generation method based on multimodal cognition in the above embodiments, can solve the technical problem of how to automatically generate video scripts that conform to cognitive adaptation characteristics. Compared with the prior art, the beneficial effects of the aging-friendly video script generation device based on multimodal cognition provided in this application are the same as those of the aging-friendly video script generation method based on multimodal cognition provided in the above embodiments, and other technical features in this aging-friendly video script generation device are the same as those disclosed in the previous embodiment method, and will not be repeated here.
[0168] It should be understood that the various parts disclosed in this application can be implemented using hardware, software, firmware, or a combination thereof. In the description of the above embodiments, specific features, structures, materials, or characteristics can be combined in any suitable manner in one or more embodiments or examples.
[0169] The above description is merely a specific embodiment of this application, but the scope of protection of this application is not limited thereto. Any variations or substitutions that can be easily conceived by those skilled in the art within the scope of the technology disclosed in this application should be included within the scope of protection of this application. Therefore, the scope of protection of this application should be determined by the scope of the claims.
[0170] This application provides a computer-readable medium having computer-readable program instructions (i.e., a computer program) stored thereon, the computer-readable program instructions being used to execute the age-friendly video script generation method based on multimodal cognition in the above embodiments.
[0171] The computer-readable medium provided in this application may be, for example, a USB flash drive, and more generally may be, but is not limited to, an electrical, magnetic, optical, electromagnetic, infrared, or semiconductor device, or any combination thereof. More specific examples of computer-readable media may include, but are not limited to: electrical connections with one or more wires, portable computer disks, hard disks, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), optical fibers, portable compact disk read-only memory (CD-ROM), optical storage devices, magnetic storage devices, or any suitable combination thereof. In this embodiment, the computer-readable medium may be any tangible medium that contains or stores a program for use by, or in conjunction with, an instruction execution system, apparatus, or device. The program code contained on the computer-readable medium may be transmitted using any suitable medium, including but not limited to: wires, optical cables, RF (Radio Frequency), etc., or any suitable combination thereof.
[0172] The aforementioned computer-readable medium may be included in the age-friendly video script generation device based on multimodal cognition, or it may exist independently without being assembled into that device.
[0173] The aforementioned computer-readable medium carries one or more programs that, when executed by the age-friendly video script generation device based on multimodal cognition, cause the device to perform the operations of this application. Computer program code for performing these operations may be written in one or more programming languages or a combination thereof. These programming languages include object-oriented programming languages, such as Java, Smalltalk, and C++, and conventional procedural programming languages, such as the "C" language or similar programming languages. The program code can be executed entirely on the user's computer, partially on the user's computer, as a standalone software package, partially on the user's computer and partially on a remote computer, or entirely on a remote computer or server. In cases involving a remote computer, the remote computer can be connected to the user's computer via any type of network, including a local area network (LAN) or a wide area network (WAN), or can be connected to an external computer (e.g., via the Internet using an Internet service provider).
[0174] The flowcharts and block diagrams in the accompanying drawings illustrate the architecture, functionality, and operation of possible implementations of methods and computer program products according to various embodiments of this application. In this regard, each block in the flowcharts or block diagrams may represent a module, segment, or portion of code containing one or more executable instructions for implementing the specified logical function. It should also be noted that in some alternative implementations, the functions indicated in the blocks may occur in a different order than indicated in the drawings. For example, two blocks shown in succession may actually be executed substantially in parallel, or they may sometimes be executed in reverse order, depending on the functions involved. It should also be noted that each block in the block diagrams and/or flowcharts, and combinations of blocks in the block diagrams and/or flowcharts, may be implemented using dedicated hardware-based systems that perform the specified functions or operations, or using a combination of dedicated hardware and computer instructions.
[0175] The modules described in the embodiments of this application can be implemented in software or hardware. The name of a module does not, in itself, limit the functionality of that module.
[0176] The readable medium provided in this application is a computer-readable medium that stores computer-readable program instructions (i.e., a computer program) for executing the above-described method for generating age-friendly video scripts based on multimodal cognition. This solves the technical problem of how to automatically generate video scripts that conform to cognitive adaptation characteristics. Compared with the prior art, the beneficial effects of the computer-readable medium provided in this application are the same as those of the age-friendly video script generation method based on multimodal cognition provided in the above embodiments, and will not be repeated here.
[0177] This application also provides a computer program product, including a computer program that, when executed by a processor, implements the steps of the above-described method for generating age-friendly video scripts based on multimodal cognition.
[0178] The computer program product provided in this application solves the technical problem of how to automatically generate video scripts that conform to cognitive adaptation characteristics. Compared with the prior art, the beneficial effects of the computer program product provided in this application are the same as those of the age-friendly video script generation method based on multimodal cognition provided in the above embodiments, and will not be repeated here.
[0179] The above description is only a part of the embodiments of this application and does not limit the patent scope of this application. All equivalent structural transformations made under the technical concept of this application and using the contents of the specification and drawings of this application, or direct / indirect applications in other related technical fields, are included in the patent protection scope of this application.
Claims
1. An age-friendly video script generation method based on multimodal cognition, characterized in that the method includes:
constructing a professional knowledge-elderly cognition adaptation assessment system and establishing a gold standard sample library to obtain sample data labeled with adaptation scores, wherein the adaptation assessment system includes a five-dimensional quantitative indicator system and a method for calculating the total adaptation score; the five-dimensional quantitative indicator system includes a conceptual complexity score, a language affinity score, a logical clarity score, a life relevance score, and a memory friendliness score; the conceptual complexity score is calculated from the proportion of professional terms and the proportion of abstract concepts, where professional terms are identified by matching against the UMLS medical terminology database and abstract concepts are determined by a BERT abstraction classification model; the language affinity score is calculated by weighting a colloquialism score, dialect compatibility, and speech rate suitability, where the colloquialism score is calculated from the matching rate against an n-gram colloquial pattern library, dialect compatibility is calculated as the cosine similarity between the target dialect and the script using a Chinese dialect recognition model, and speech rate suitability is measured by the degree of match between the sentence length distribution and the optimal auditory sentence length for the elderly; the logical clarity score is calculated by weighting causal explicitness and step coherence, where causal explicitness is calculated from the density of causal relation words and step coherence is measured by the entropy of the Markov chain transition probabilities of the event time series; the life relevance score is calculated by weighting metaphorical relevance and scene relevance, where metaphorical relevance is calculated from the semantic distance between "medical concept" and "life concept" using WordNet, and scene relevance is calculated as the matching probability between the script and high-frequency life scenes of the elderly using a scene classification model; the memory friendliness score is calculated by weighting key information repetition and visual cue suitability, where key information repetition is calculated from the interval distribution of TF-IDF keywords in the script, and visual cue suitability is calculated as the degree of match between the script content and suggested visual elements using a text-visual association model;
extracting text modal features and cognitive adaptation features based on the sample data, wherein the text modal features include medical ontology features and language features, and the cognitive adaptation features include colloquialization conversion features, logical conversion features, and metaphor system features;
inputting the text modal features and the cognitive adaptation features into a preset multimodal deep network model to obtain a predicted adaptation score and an age-friendly video script, wherein the preset multimodal deep network model includes a text encoder, a cognitive feature encoder, a feature fusion layer, a score prediction head, and a script generation decoder; the text encoder adopts a 3-layer first Transformer encoder, each layer of which uses 8 attention heads and a feedforward neural network; the cognitive feature encoder adopts a 2-layer second Transformer encoder, each layer of which uses 4 attention heads and a feedforward neural network; the score prediction head adopts a 2-layer fully connected network; and the script generation decoder adopts a constrained search strategy;
iteratively optimizing or outputting the age-friendly video script based on a comparison between the predicted adaptation score and a preset adaptation score threshold, to obtain the target age-friendly video script.
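By way of illustration only, the two components of the logical clarity score named in claim 1 can be sketched as follows. This is a minimal sketch under stated assumptions, not the claimed implementation: the causal-word lexicon, the event labels, and the function names `causal_density` and `transition_entropy` are hypothetical, the entropy is averaged over source states weighted by how often each state is left, and the mapping from entropy to a final coherence score is not specified in the source and is therefore omitted.

```python
import math
from collections import Counter, defaultdict


def causal_density(tokens, causal_words):
    """Share of tokens that are causal relation words (lexicon is hypothetical)."""
    if not tokens:
        return 0.0
    return sum(t in causal_words for t in tokens) / len(tokens)


def transition_entropy(events):
    """Entropy (bits) of first-order Markov transitions in an event sequence.

    Each source state's transition entropy is weighted by the share of
    transitions that start in that state (an assumed aggregation rule).
    """
    counts = defaultdict(Counter)
    for prev, nxt in zip(events, events[1:]):
        counts[prev][nxt] += 1
    total = sum(sum(c.values()) for c in counts.values())
    if total == 0:
        return 0.0
    entropy = 0.0
    for c in counts.values():
        n = sum(c.values())
        h = -sum((k / n) * math.log2(k / n) for k in c.values())
        entropy += (n / total) * h
    return entropy


# A strictly ordered sequence of steps has zero transition entropy.
linear_steps = ["measure", "record", "take_pill", "rest"]
# A sequence that keeps jumping back and forth is less predictable.
tangled_steps = ["a", "b", "a", "c", "a", "b", "a", "c"]
```

Under this sketch, lower entropy corresponds to a more predictable ordering of steps, which is one plausible reading of "step coherence".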
2. The method of claim 1, wherein the step of constructing the professional knowledge-elderly cognition adaptation assessment system and establishing the gold standard sample library to obtain sample data labeled with adaptation scores includes: defining the mathematical expression of the five-dimensional quantitative indicator system; determining the preset dimension weight vector of the five-dimensional quantitative indicator system using the analytic hierarchy process, and calculating the total adaptation score based on the preset dimension weight vector and the score vector of each dimension; forming an annotation team composed of geriatric medicine experts, linguistics experts, and elderly user representatives to perform five-dimensional scoring and total adaptation score annotation on a preset number of sample pairs, each consisting of an original professional text and a manually converted script, to obtain initial annotation results; and calculating the intraclass correlation coefficient of the initial annotation results, removing the corresponding initial annotation results when the intraclass correlation coefficient is less than a preset consistency threshold, and integrating the remaining initial annotation results to establish the gold standard sample library and obtain the sample data.
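The weighting step in claim 2 can be illustrated with a short sketch. It assumes the row geometric-mean approximation of the analytic hierarchy process priority vector, which is one common variant and is not stated in the source; the pairwise comparison matrix, the example scores, and the names `ahp_weights` and `total_adaptation_score` are hypothetical. The total score is taken to be the dot product of the weight vector and the dimension score vector.

```python
import math


def ahp_weights(pairwise):
    """Approximate AHP priorities using the row geometric-mean method.

    pairwise[i][j] states how much more important dimension i is than j.
    """
    n = len(pairwise)
    geo_means = [math.prod(row) ** (1.0 / n) for row in pairwise]
    total = sum(geo_means)
    return [g / total for g in geo_means]


def total_adaptation_score(weights, dimension_scores):
    """Weighted sum of the five dimension scores."""
    return sum(w * s for w, s in zip(weights, dimension_scores))


# Hypothetical, perfectly consistent judgments derived from known priorities.
assumed_priorities = [0.4, 0.3, 0.1, 0.1, 0.1]
pairwise_matrix = [[a / b for b in assumed_priorities] for a in assumed_priorities]

weights = ahp_weights(pairwise_matrix)
# Order: complexity, affinity, clarity, life relevance, memory friendliness.
example_total = total_adaptation_score(weights, [80, 70, 90, 60, 75])
```

For a perfectly consistent comparison matrix the geometric-mean method recovers the underlying priorities exactly; with real expert judgments a consistency check would normally precede use of the weights.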
3. The method of claim 1, wherein the step of extracting text modal features and cognitive adaptation features based on the sample data includes: performing medical entity recognition on the original professional health text in the sample data to extract medical entities such as disease names, symptom descriptions, treatment plans, and medication guidelines, and constructing medical entity relationship triples to obtain the medical ontology features; performing word segmentation, part-of-speech tagging, and syntactic analysis on the original professional health text to extract the density of professional terms, the average sentence length, the frequency of causal words, and the proportion of passive voice, to obtain the language features; concatenating the medical ontology features with the language features to obtain the text modal features; performing colloquial pattern matching on the manually converted scripts in the sample data to extract the proportion of modal particles, the proportion of short sentences, and the proportion of everyday expressions, to obtain the colloquialization conversion features; analyzing the logical structure of the manually converted scripts to extract the proportion of step-by-step expressions, the proportion of general-to-specific structures, and the proportion of key information presented at the beginning, to obtain the logical conversion features; performing metaphor expression recognition on the manually converted scripts to extract the mapping relationships between medical concepts and life scenarios and the frequency of metaphors, to obtain the metaphor system features; and concatenating the colloquialization conversion features, the logical conversion features, and the metaphor system features to obtain the cognitive adaptation features.
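As a rough illustration of the language features listed in claim 3, the sketch below computes three of them from text that has already been segmented into sentences and tokens. The term lexicon, the causal-word list, and the function name `language_features` are hypothetical, English tokens stand in for segmented Chinese text, and the passive-voice proportion is left out because it depends on a syntactic parser that the source does not name.

```python
def language_features(sentences, term_lexicon, causal_words):
    """Compute simple language features from pre-tokenized sentences.

    sentences: list of token lists, one list per sentence.
    """
    tokens = [tok for sentence in sentences for tok in sentence]
    if not tokens:
        return {"term_density": 0.0, "avg_sentence_length": 0.0, "causal_word_freq": 0.0}
    n = len(tokens)
    return {
        # Share of tokens found in the professional-term lexicon.
        "term_density": sum(t in term_lexicon for t in tokens) / n,
        # Mean number of tokens per sentence.
        "avg_sentence_length": n / len(sentences),
        # Share of tokens that are causal connectives.
        "causal_word_freq": sum(t in causal_words for t in tokens) / n,
    }


sample_sentences = [
    ["hypertension", "can", "cause", "stroke"],
    ["take", "medicine", "daily", "because", "it", "helps"],
]
features = language_features(
    sample_sentences,
    term_lexicon={"hypertension", "stroke"},
    causal_words={"cause", "because"},
)
```

The resulting dictionary could then be flattened and concatenated with the medical ontology features to form the text modal feature vector.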
4. The method of claim 1, wherein before the step of inputting the text modal features and the cognitive adaptation features into the preset multimodal deep network model to obtain the predicted adaptation score and the age-friendly video script, the method further includes: retrieving the sample data of a preset number of labeled samples from the gold standard sample library as a training set and a validation set, the validation set including the ground-truth results labeled in the gold standard sample library; constructing an initial multimodal deep network model; designing a multi-task loss function that is a weighted sum of the mean squared error of the total adaptation score, the mean absolute error of the dimension scores, a fluency penalty term, and a cognitive constraint regularization term; performing a first training phase on the initial multimodal deep network model, in which synonym replacement, sentence rearrangement, and professional terminology replacement are applied to the professional health texts and the model is trained for a preset first number of iterations to obtain a first optimized model; performing a second training phase based on the first optimized model, in which a reward function integrating adaptation score, semantic similarity, and generation diversity is designed and training is carried out with a preset batch size and a preset number of iterations to obtain a second optimized model; inputting the training set into the second optimized model to obtain a prediction result; and calculating a loss value based on the multi-task loss function, the prediction result, and the ground-truth results, and, when the loss value converges or the maximum preset number of iterations is reached, taking the current second optimized model as the preset multimodal deep network model.
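The multi-task loss described in claim 4 is a weighted sum of four terms, which can be sketched numerically as follows. The weight values, the argument layout, and the function name `multitask_loss` are assumptions made for illustration; the fluency penalty and the cognitive constraint regularization term are passed in as precomputed numbers because the source does not define how they are calculated.

```python
def multitask_loss(total_pred, total_true, dim_pred, dim_true,
                   fluency_penalty, cognitive_reg,
                   weights=(1.0, 1.0, 0.1, 0.1)):
    """Weighted sum of MSE (total score), MAE (dimension scores) and two penalties.

    total_pred / total_true: per-sample total adaptation scores.
    dim_pred / dim_true: per-sample lists of dimension scores.
    weights: hypothetical coefficients for the four terms, in order.
    """
    mse = sum((p - t) ** 2 for p, t in zip(total_pred, total_true)) / len(total_true)
    abs_errors = [
        abs(p - t)
        for preds, trues in zip(dim_pred, dim_true)
        for p, t in zip(preds, trues)
    ]
    mae = sum(abs_errors) / len(abs_errors)
    w_mse, w_mae, w_fluency, w_reg = weights
    return w_mse * mse + w_mae * mae + w_fluency * fluency_penalty + w_reg * cognitive_reg


# Two samples with two dimension scores each, purely for illustration.
example_loss = multitask_loss(
    total_pred=[80, 70], total_true=[78, 74],
    dim_pred=[[1, 2], [3, 4]], dim_true=[[1, 3], [5, 4]],
    fluency_penalty=2.0, cognitive_reg=4.0,
    weights=(1.0, 2.0, 0.5, 0.25),
)
```

In an actual training loop the same combination would be expressed with differentiable tensor operations so that gradients flow to both the score prediction head and the decoder.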
5. The method of claim 1, wherein the step of inputting the text modal features and the cognitive adaptation features into the preset multimodal deep network model to obtain the predicted adaptation score and the age-friendly video script includes: inputting the text modal features into the text encoder, and extracting deep semantic representations through a multi-layer self-attention mechanism and a feedforward neural network to obtain text encoding features; inputting the cognitive adaptation features into the cognitive feature encoder, and extracting cognitive pattern representations through the Transformer architecture to obtain cognitive encoding features; inputting the text encoding features and the cognitive encoding features into the feature fusion layer for cross-modal space alignment and fusion processing to obtain multimodal fusion features; inputting the multimodal fusion features into the score prediction head, and predicting the five dimension scores and the total adaptation score by regression through a two-layer fully connected network to obtain the predicted adaptation score; and inputting the multimodal fusion features into the script generation decoder, and generating the age-friendly video script through a constrained search strategy, wherein the constraints of the constrained search include sentence length constraints, terminology constraints, repetition constraints, and metaphor constraints.
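One way to picture the constrained search in claim 5 is as a filter applied to candidate sentences at each decoding step, followed by ordinary top-k selection. The sketch below is an assumption-laden illustration rather than the claimed decoder: candidates are hypothetical pre-scored token lists, the terminology constraint is read as a list of banned professional terms, the repetition constraint is read as a cap on identical sentences, and the metaphor constraint is omitted because the source does not say how it is checked.

```python
def satisfies_constraints(tokens, history, max_len, banned_terms, max_repeats):
    """Return True if a candidate sentence passes all assumed constraints."""
    if len(tokens) > max_len:                      # sentence length constraint
        return False
    if any(t in banned_terms for t in tokens):     # terminology constraint
        return False
    if history.count(tokens) >= max_repeats:       # repetition constraint
        return False
    return True


def constrained_beam_step(candidates, history, beam_width,
                          max_len, banned_terms, max_repeats=2):
    """Keep the highest-scoring candidates that satisfy every constraint.

    candidates: list of (tokens, score) pairs proposed by a decoder.
    history: list of token lists already emitted in the script.
    """
    allowed = [
        (tokens, score) for tokens, score in candidates
        if satisfies_constraints(tokens, history, max_len, banned_terms, max_repeats)
    ]
    allowed.sort(key=lambda pair: pair[1], reverse=True)
    return [tokens for tokens, _ in allowed[:beam_width]]


proposals = [
    (["take", "antihypertensive", "daily"], 0.9),
    (["take", "your", "pill", "every", "morning"], 0.8),
    (["a"] * 30, 0.95),
    (["eat", "less", "salt"], 0.5),
]
kept = constrained_beam_step(
    proposals, history=[], beam_width=2,
    max_len=12, banned_terms={"antihypertensive"},
)
```

Filtering before ranking means a high-scoring but non-compliant candidate never displaces a compliant one, which is the main design point of a hard-constraint search.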
6. The method of claim 5, wherein the step of inputting the text encoding features and the cognitive encoding features into the feature fusion layer for cross-modal space alignment and fusion processing to obtain multimodal fusion features includes: obtaining a first dimension parameter of the text encoding features and a second dimension parameter of the cognitive encoding features, and projecting the text encoding features and the cognitive encoding features into a preset unified dimension space through linear transformation to obtain a first feature and a second feature; calculating the cosine similarity between the first feature and the second feature, and maximizing the mutual information between the first feature and the second feature of the same sample using the InfoNCE loss to obtain a semantically aligned first feature and a semantically aligned second feature; using the semantically aligned first feature as the query vector and the semantically aligned second feature as both the key vector and the value vector, and inputting them into a cross-attention mechanism for feature interaction calculation to obtain output features; normalizing the output features and compressing their dimensionality to obtain initial fused features; and inputting the initial fused features into a feedforward neural network for deep feature extraction, which strengthens the correlation representation between features, to obtain the multimodal fusion features.
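The alignment step in claim 6 combines cosine similarity with the InfoNCE loss. The following sketch shows the standard form of that loss on plain Python lists, treating the text feature and cognitive feature of the same sample as the positive pair and the other samples in the batch as negatives. The temperature value and the function names are assumptions, only the text-to-cognitive direction is computed, and feature vectors are assumed to be non-zero.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce(text_feats, cog_feats, temperature=0.1):
    """Mean InfoNCE loss where sample i's positive is cog_feats[i]."""
    n = len(text_feats)
    loss = 0.0
    for i in range(n):
        logits = [cosine(text_feats[i], cog_feats[j]) / temperature for j in range(n)]
        peak = max(logits)  # subtract the max for numerical stability
        log_denominator = peak + math.log(sum(math.exp(x - peak) for x in logits))
        loss += log_denominator - logits[i]
    return loss / n


# Matching pairs point in the same direction; mismatched pairs are swapped.
text_batch = [[1.0, 0.0], [0.0, 1.0]]
aligned_loss = info_nce(text_batch, [[1.0, 0.0], [0.0, 1.0]])
misaligned_loss = info_nce(text_batch, [[0.0, 1.0], [1.0, 0.0]])
```

Minimizing this loss pulls features of the same sample together and pushes features of different samples apart, which is the sense in which it maximizes a lower bound on their mutual information.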
7. The method of claim 1, wherein the step of iteratively optimizing or outputting the age-friendly video script based on the comparison between the predicted adaptation score and the preset adaptation score threshold to obtain the target age-friendly video script includes: comparing the predicted adaptation score with the preset adaptation score threshold to obtain a comparison result; when the comparison result is that the predicted adaptation score is greater than or equal to the preset adaptation score threshold, adding visual cue markers and virtual character parameters to the age-friendly video script to obtain a first target age-friendly video script, wherein the visual cue markers are chart insertion position indicators, and the virtual character parameters include facial expression parameters, speech rate parameters, and clothing parameters; when the comparison result is that the predicted adaptation score is less than the preset adaptation score threshold, identifying a target optimization dimension based on the low-scoring dimension among the five dimensions, adjusting the cognitive adaptation feature weights corresponding to the target optimization dimension to obtain adjusted cognitive adaptation features, and performing cross-modal space alignment and fusion again on the adjusted cognitive adaptation features and the text modal features to obtain optimized fused features; inputting the optimized fused features into the preset multimodal deep network model to regenerate the age-friendly video script, thereby obtaining an iteratively optimized script; repeating the optimization until the predicted adaptation score of the iteratively optimized script is greater than or equal to the preset adaptation score threshold or the number of iterations reaches the preset maximum number of iterations, and taking the resulting iteratively optimized script as a second target age-friendly video script; and using the first target age-friendly video script or the second target age-friendly video script as the target age-friendly video script.
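The threshold comparison and iteration limit in claim 7 amount to a bounded refinement loop, sketched below. The callables `generate` and `score`, the dimension names, the weight increment `step`, and the toy stand-ins are all hypothetical; the real method would regenerate the script through the fused features of the multimodal model, and the addition of visual cue markers and virtual character parameters on success is not shown.

```python
DIMENSIONS = ["complexity", "affinity", "clarity", "life_relevance", "memory"]


def optimize_script(generate, score, features, threshold, max_iters, step=0.1):
    """Regenerate a script until its total score meets the threshold or the
    iteration budget is exhausted, boosting the weakest dimension each round."""
    weights = {d: 1.0 for d in DIMENSIONS}
    script = generate(features, weights)
    dim_scores, total = score(script)
    iterations = 0
    while total < threshold and iterations < max_iters:
        weakest = min(dim_scores, key=dim_scores.get)  # target optimization dimension
        weights[weakest] += step                       # adjust its feature weight
        script = generate(features, weights)
        dim_scores, total = score(script)
        iterations += 1
    return script, total, iterations


def toy_generate(features, weights):
    """Stand-in generator: the 'script' simply records the weights used."""
    return dict(weights)


def toy_score(script):
    """Stand-in scorer: each dimension gains 5 points per unit of extra weight."""
    dims = {d: 70 + 5 * (script[d] - 1.0) for d in DIMENSIONS}
    return dims, sum(dims.values()) / len(dims)
```

With the toy stand-ins, calling `optimize_script(toy_generate, toy_score, None, threshold=73, max_iters=10, step=1)` raises the weakest dimension once per round and stops as soon as the mean score reaches the threshold.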
8. An age-friendly video script generation system based on multimodal cognition, characterized in that the system includes:
an acquisition module, used to construct a professional knowledge-elderly cognition adaptation assessment system and establish a gold standard sample library to obtain sample data labeled with adaptation scores, wherein the adaptation assessment system includes a five-dimensional quantitative indicator system and a method for calculating the total adaptation score; the five-dimensional quantitative indicator system includes a conceptual complexity dimension score, a language affinity dimension score, a logical clarity dimension score, a life relevance dimension score, and a memory friendliness dimension score; the conceptual complexity dimension score is calculated from the proportion of professional terms and the proportion of abstract concepts, where professional terms are identified by matching against the UMLS medical terminology database and abstract concepts are determined by a BERT abstraction classification model; the language affinity dimension score is calculated as a weighted average of a colloquialism score, dialect compatibility, and speech rate suitability, where the colloquialism score is calculated from the matching rate against an n-gram colloquial pattern library, dialect compatibility is calculated as the cosine similarity between the target dialect and the script using a Chinese dialect recognition model, and speech rate suitability is measured by the degree of match between the sentence length distribution and the optimal auditory sentence length for the elderly; the logical clarity dimension score is calculated by weighting causal explicitness and step coherence, where causal explicitness is calculated from the density of causal relation words and step coherence is measured by the entropy of the Markov chain transition probabilities of the event time series; the life relevance dimension score is calculated by weighting metaphorical relevance and scene relevance, where metaphorical relevance is calculated from the semantic distance between "medical concept" and "life concept" using WordNet, and scene relevance is calculated as the matching probability between the script and high-frequency life scenes of the elderly using a scene classification model; the memory friendliness dimension score is calculated by weighting key information repetition and visual cue suitability, where key information repetition is calculated from the interval distribution of TF-IDF keywords in the script, and visual cue suitability is calculated as the degree of match between the script content and suggested visual elements using a text-visual association model;
a feature extraction module, used to extract text modal features and cognitive adaptation features based on the sample data, wherein the text modal features include medical ontology features and language features, and the cognitive adaptation features include colloquialization conversion features, logical conversion features, and metaphor system features;
a prediction module, used to input the text modal features and the cognitive adaptation features into a preset multimodal deep network model to obtain a predicted adaptation score and an age-friendly video script, wherein the preset multimodal deep network model includes a text encoder, a cognitive feature encoder, a feature fusion layer, a score prediction head, and a script generation decoder; the text encoder adopts a 3-layer first Transformer encoder, each layer of which uses 8 attention heads and a feedforward neural network; the cognitive feature encoder adopts a 2-layer second Transformer encoder, each layer of which uses 4 attention heads and a feedforward neural network; the score prediction head adopts a 2-layer fully connected network; and the script generation decoder adopts a constrained search strategy;
a results module, used to iteratively optimize or output the age-friendly video script based on a comparison between the predicted adaptation score and a preset adaptation score threshold, to obtain the target age-friendly video script.
9. The system of claim 8, wherein the acquisition module is further used to: define the mathematical expression of the five-dimensional quantitative indicator system; determine the preset dimension weight vector of the five-dimensional quantitative indicator system using the analytic hierarchy process; calculate the total adaptation score based on the preset dimension weight vector and the score vector of each dimension; form an annotation team composed of geriatric medicine experts, linguistics experts, and elderly user representatives to perform five-dimensional scoring and total adaptation score annotation on a preset number of sample pairs, each consisting of an original professional text and a manually converted script, to obtain initial annotation results; and calculate the intraclass correlation coefficient of the initial annotation results, remove the corresponding initial annotation results when the intraclass correlation coefficient is less than a preset consistency threshold, and integrate the remaining initial annotation results to establish the gold standard sample library and obtain the sample data.
10. The system of claim 8, wherein the prediction module is further configured to: input the text modal features into the text encoder, and extract deep semantic representations through a multi-layer self-attention mechanism and a feedforward neural network to obtain text encoding features; input the cognitive adaptation features into the cognitive feature encoder, and extract cognitive pattern representations through the Transformer architecture to obtain cognitive encoding features; input the text encoding features and the cognitive encoding features into the feature fusion layer for cross-modal space alignment and fusion processing to obtain multimodal fusion features; input the multimodal fusion features into the score prediction head, and predict the five dimension scores and the total adaptation score by regression through a two-layer fully connected network to obtain the predicted adaptation score; and input the multimodal fusion features into the script generation decoder, and generate the age-friendly video script through a constrained search strategy, wherein the constraints of the constrained search include sentence length constraints, terminology constraints, repetition constraints, and metaphor constraints.