A practical skill evaluation method and system fusing computer vision and large language model

By integrating computer vision and large language models, an expert authoritative evaluation database is constructed, which solves the problems of rigid evaluation logic and rule dependence in existing technologies. This enables high-precision and flexible evaluation of medical practice skills, generates highly consistent evaluation results, and provides targeted feedback.

CN122243275A · Pending · Publication Date: 2026-06-19 · NANJING MEDICAL UNIV

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Applications(China)
Current Assignee / Owner
NANJING MEDICAL UNIV
Filing Date
2026-03-12
Publication Date
2026-06-19

AI Technical Summary

Technical Problem

Existing medical practice skills assessment technologies rely too heavily on rigid rules, failing to capture the examiner's assessment wisdom. This results in a disconnect between detailed calculations and overall logical evaluation, leading to high maintenance costs and low iteration efficiency.

Method used

By integrating computer vision and large language models, an authoritative expert evaluation database is constructed. Action features are extracted through video analysis and large language models, and a global logical review is performed to generate a comprehensive score, thus achieving intelligent evaluation.

Benefits of technology

It achieves high-precision and flexible evaluation results, avoids background noise interference, has strong evaluation consistency, possesses powerful feedback capabilities, can identify and correct logical defects of local excellence and overall poor performance, and generates humanized teaching comments.

✦ Generated by Eureka AI based on patent content.

Abstract

This invention discloses a practical skills evaluation method and system integrating computer vision and a large language model. The method includes: Step A, establishing an observation point set; Step B, constructing a video observation point feature vector database; Step C, extracting video segments of each observation point from the video to be evaluated, and then generating feature vectors to be evaluated; Step D, calculating the similarity between the feature vector to be evaluated and the feature vectors of the same observation point in the feature vector database, retrieving the top K feature vectors with the highest similarity, and taking the average of their corresponding scores as the score for that observation point; Step E, weighted summing of the global score and the scores of each observation point to calculate the final global comprehensive score. The evaluation score of this invention directly stems from the consensus of experts on similar operations, and also combines expert-level review of the global logic of operations using a large language model, realizing an intelligent evaluation system that highly replicates the examiner's evaluation thinking.

Description

Technical Field

[0001] This invention relates to the fields of artificial intelligence and medical education technology, and more specifically, to a practical skills assessment method that integrates computer vision and large language models.

Background Technology

[0002] Currently, medical practice skills assessment mainly relies on traditional human examiner scoring or rudimentary automated assessment models using computer vision (CV) + rule engines. With the development of artificial intelligence technology, skills assessment techniques based on motion capture and motion analysis are gradually being applied to clinical teaching.

[0003] However, existing automated evaluation schemes exhibit significant limitations when handling complex clinical procedures, with the core problem being a misinterpretation of the evaluation "gold standard":

[0004] (1) Rigidity of evaluation logic and rule dependence

[0005] Existing technical solutions are mainly based on "gap analysis," which involves pre-defining a standard operating model and calculating the geometric or spatiotemporal difference between the examinee's actions and the standard model, then deducting points according to pre-set rigid scoring rules. This approach heavily relies on the exhaustive search of rule engines and cannot recognize the "contextual" semantics of the operation. For example, in venous blood collection, if an examinee's initial needle insertion angle is too large but they can immediately correct it and successfully collect blood, the traditional system will still judge it as an error due to triggering rigid angle threshold rules, which clearly deviates from the original intention of clinical evaluation.

[0006] (2) The disconnect between evaluation criteria and examiner's intuition

[0007] The true gold standard for clinical practice evaluation should be the professional intuition and experience of senior examiners, including a comprehensive judgment of the quality of movements, the rhythm of the operation, and the consistency of aseptic awareness. Traditional "rule engines" attempt to break down complex expert experience into discrete physical indicators, resulting in evaluation results that often present as a mechanical deduction list (such as "Step 3: -2 points"), lacking an in-depth assessment of the overall logic of the operation and clinical competence.

[0008] (3) The blind spot of evaluation: “some parts are good, but the whole is bad”

[0009] Rule-driven evaluation systems often get bogged down in local, detailed calculations, lacking a grasp of the overall logic. In actual assessments, candidates may perform each isolated action reasonably well, but the overall process logic is chaotic (e.g., repeated back-and-forth operations, unreasonable pauses between operations). Existing CV evaluation methods, lacking the macro-level analytical ability of simulating an examiner, often award high scores that contradict expert intuition.

[0010] (4) High maintenance costs and low iteration efficiency

[0011] Because existing solutions rely heavily on hard-coded rules, engineers need to rewrite, test, and deploy complex logic rules whenever medical standards are updated or new assessment items are added, resulting in lengthy development cycles and limited scalability. This rule-centric architecture restricts the ability of the evaluation system to be migrated to more clinical scenarios.

[0012] In conclusion, existing skills assessment technologies, due to their over-reliance on rigid rules, fail to truly reflect the wisdom of examiners' evaluations. Therefore, developing an intelligent assessment method that can break free from rule constraints, use expert evaluation experience as the digital gold standard, and consider both micro-level detail comparisons and macro-level semantic review has become a crucial issue that urgently needs to be addressed in the field of medical education technology.

Summary of the Invention

[0013] The purpose of this invention is to provide a practical skills evaluation method that integrates computer vision and large language models. By constructing an authoritative expert evaluation database, the evaluation logic of senior examiners is digitized. The system uses a large video analysis model to extract the features of the actions to be evaluated and matches them against a large number of expert-annotated samples in the database. The evaluation score is derived directly from the experts' consensus on similar operations rather than from mechanical rule calculation. The method also incorporates an expert-level review of the global logic of the operation by a large language model, realizing an intelligent evaluation system that closely replicates the examiner's evaluation thinking.

[0014] To achieve the above objectives, the technical solution adopted by the present invention is as follows:

[0015] In a first aspect, the present invention provides a practical skills assessment method that integrates computer vision and large language models, the method comprising the following steps:

[0016] Step A: Establish a set of key observation points: For specific clinical practice skills items, deconstruct the entire operation process and define a structured set of key observation points;

[0017] Step B: Construct a video observation point feature vector database: Obtain a large number of operation videos of skill projects and their corresponding authoritative evaluation scores; use a large video analysis model to analyze the videos based on the observation point set and extract video segments corresponding to each observation point; then generate corresponding feature vectors from the extracted video segments; and finally construct a feature vector database based on video encoding, observation point encoding, feature vectors, and scores.

[0018] Step C, Video Processing: Receive the video to be evaluated, use the large video analysis model and observation point set to extract video segments of each observation point from the video to be evaluated, and then generate the feature vector to be evaluated.

[0019] Step D, Intelligent Matching and Scoring: Calculate the similarity between the feature vector to be evaluated and each feature vector under the same observation point in the feature vector database, retrieve the top K feature vectors with the highest similarity, and take the average of their corresponding scores as the score of the observation point.

[0020] Step E, Global Comprehensive Evaluation: Utilize a large language model to perform full-process semantic analysis on the evaluation points and generate a global score and its evaluation criteria; the final global comprehensive score is calculated as a weighted sum of the global score and the scores of each observation point from step D.

[0021] Furthermore, in step A, the observation point set includes action names, specific action definitions, and video analysis instructions.
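One way to picture the observation point set is as a list of small records, one per observation point. The sketch below is only an illustration: the class name `ObservationPoint`, the `code` field, and the example wording for venous blood collection are assumptions, not taken from the patent text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ObservationPoint:
    """One entry of the observation point set from step A (illustrative)."""
    code: str          # hypothetical identifier for the observation point
    action_name: str   # action name
    definition: str    # specific action definition
    instruction: str   # natural-language video analysis instruction


# Illustrative entries; the exact wording is an assumption.
VENOUS_BLOOD_COLLECTION = [
    ObservationPoint(
        code="needle_angle",
        action_name="Needle holding angle",
        definition="Angle between the needle and the skin at insertion.",
        instruction="Locate the segment where the needle is inserted.",
    ),
    ObservationPoint(
        code="aseptic_operation",
        action_name="Aseptic operation",
        definition="Disinfection of the puncture site before insertion.",
        instruction="Locate the segment where the puncture site is disinfected.",
    ),
]
```

Keeping the analysis instruction next to the definition means a new skill item could be added by writing new records rather than new rule logic.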

[0022] Furthermore, in steps B and C, the large-scale video analysis model is the VideoMind semantic parsing model, which is used to accurately locate and extract target action segments from the complete video stream based on natural language instructions.

[0023] Furthermore, in steps B and C, feature vectors are generated using the Video-Swin-Transformer model.

[0024] Furthermore, in step D, the similarity calculation uses the cosine similarity algorithm, and the calculation formula is as follows:

[0025] $$\mathrm{Sim}(V_q, V_r) = \frac{V_q \cdot V_r}{\lVert V_q \rVert \, \lVert V_r \rVert}$$

[0026] where $V_q$ is the feature vector to be evaluated and $V_r$ is a reference feature vector in the feature vector database.

[0027] Furthermore, in step D, the score $S_i$ for the $i$-th observation point is calculated as follows:

[0028] $$S_i = \frac{1}{K} \sum_{m=1}^{K} s_m$$

[0029] where $s_m$ is the authority score of the retrieved reference vector with the $m$-th highest similarity, and $K$ is a preset positive integer.
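The matching and scoring of step D can be sketched in plain Python as follows. This is a minimal sketch rather than the patented implementation: the function names are hypothetical, the vectors are toy two-dimensional lists standing in for real video features, and averaging over fewer than K references when the database is small is an assumption.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length, non-zero vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def observation_point_score(
    query: Sequence[float],
    references: List[Tuple[Sequence[float], float]],
    k: int,
) -> float:
    """Mean authority score of the k references most similar to `query`.

    `references` holds (feature_vector, authority_score) pairs that all
    belong to the same observation point.
    """
    if k <= 0:
        raise ValueError("k must be a positive integer")
    if not references:
        raise ValueError("no reference vectors for this observation point")
    ranked = sorted(
        references,
        key=lambda ref: cosine_similarity(query, ref[0]),
        reverse=True,
    )
    top = ranked[:k]  # assumption: use all references if fewer than k exist
    return sum(score for _, score in top) / len(top)


# Toy data: two close references and one dissimilar reference.
refs = [([1.0, 0.0], 90.0), ([0.9, 0.1], 86.0), ([0.0, 1.0], 40.0)]
score = observation_point_score([1.0, 0.0], refs, k=2)
```

Here the two references closest to the query carry scores 90 and 86, so the dissimilar reference with score 40 does not influence the result.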

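[0030] Furthermore, in step E, the global comprehensive score $S_{final}$ is calculated as follows:

[0031] $$S_{final} = \alpha \cdot S_{global} + (1 - \alpha) \cdot \frac{1}{N} \sum_{i=1}^{N} S_i$$

[0032] where $S_{global}$ is the global score generated by the large language model, $\alpha$ is the global scoring weight coefficient, $N$ is the total number of observation points, and $S_i$ is the feature matching score of the $i$-th observation point.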
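The weighted summation of step E can be sketched as a single function. The sketch assumes the weighted sum takes the convex form in which the global score receives weight alpha and the mean of the observation point scores receives weight 1 - alpha; this form is inferred from the venous blood collection example in the description (mean point score 88, global score 82, weight 0.3, final score 86.2). The function name `final_score` is hypothetical.

```python
from typing import Sequence


def final_score(
    global_score: float,
    point_scores: Sequence[float],
    alpha: float,
) -> float:
    """Blend the LLM global score with the mean observation point score."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    if not point_scores:
        raise ValueError("at least one observation point score is required")
    mean_points = sum(point_scores) / len(point_scores)
    return alpha * global_score + (1.0 - alpha) * mean_points


# Values from the venous blood collection example in the description.
example = final_score(global_score=82.0, point_scores=[88.0], alpha=0.3)
```

A larger alpha lets the global review pull the final score further away from the per-point average, which is how a candidate with good isolated actions but a disorganized overall process ends up with a lower result.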

[0033] In a second aspect, the present invention provides a practical skills assessment system for performing the above-described method, the system comprising:

[0034] The observation point definition module is used to set the operational dimensions and analysis instructions for clinical skills.

[0035] Semantic parsing module: Built-in large video analysis model for automated extraction of key video segments;

[0036] Feature encoding module: Built-in Video-Swin-Transformer model, used to convert video clips into high-dimensional feature vectors;

[0037] Feature vector index library: used to store multi-dimensional operational features and their associated authority scores;

[0038] Intelligent evaluation module: used to perform vector similarity retrieval and calculate scores for each observation point;

[0039] Comprehensive score calculation module: generates a global score using a large language model, and calculates the final global comprehensive score as a weighted sum of the global score and the scores of each observation point.

[0040] In a third aspect, the present invention provides a computer (electronic) device, including a memory, a processor, and computer-readable instructions stored in the memory and executable on the processor, wherein the processor, when executing the computer-readable instructions, implements the practical skills assessment method as described above.

[0041] In a fourth aspect, the present invention provides a readable storage medium storing computer-readable instructions, which, when executed by one or more processors, cause the one or more processors to perform the practical skills assessment method as described above.

[0042] The beneficial effects of this invention are as follows:

[0043] High precision and flexibility: Actions are located through Large Language Model (LLM), avoiding background noise interference; vector matching enables comprehensive logical reasoning that is closer to that of human experts than hard rules.

[0044] Consistency of evaluation: Based on mathematical distance calculation, the scoring fluctuations caused by individual differences among examiners are eliminated.

[0045] Powerful feedback capabilities: The system can generate personalized and targeted teaching comments based on vector differences and a large language model.

[0046] Overcoming the evaluation blind spot of "partial excellence, overall poor": Traditional CV evaluation methods may award high scores because a student's individual actions are performed reasonably well, even though the overall logic is chaotic (such as repeated back-and-forth operations). This invention, through weighted global scoring, can effectively identify and correct such logical flaws, making AI scoring closer to the clinical intuition of senior experts.

Description of the Drawings

[0047] Figure 1 is a flowchart of the practical skills evaluation method of the present invention;

[0048] Figure 2 is a block diagram of the practical skills evaluation system of the present invention.

Detailed Implementation

[0049] The present invention will now be described in detail with reference to the accompanying drawings and specific embodiments.

[0050] As shown in Figure 1, this invention provides a practical skills evaluation method that integrates computer vision and large language models. The method includes the following steps:

[0051] Step A: Establish a set of observation points: For specific clinical practice skills items, deconstruct the entire operation process and define a structured set of observation points; the set of observation points includes action names, specific action definitions and video analysis instructions;

[0052] Step B: Construct a video observation point feature vector database: Obtain a large number of operation videos of skill projects and their corresponding authoritative evaluation scores; use a large video analysis model to analyze the videos based on the observation point set and extract video segments corresponding to each observation point; then generate corresponding feature vectors from the extracted video segments; and finally construct a feature vector database based on video encoding, observation point encoding, feature vectors, and scores.

[0053] Step C, Video Processing: Receive the video to be evaluated, use the large video analysis model and observation point set to extract video segments of each observation point from the video to be evaluated, and then generate the feature vector to be evaluated.

[0054] Step D, Intelligent Matching and Scoring: Calculate the similarity between the feature vector to be evaluated and each feature vector under the same observation point in the feature vector database, retrieve the top K feature vectors with the highest similarity, and take the average of their corresponding scores as the score of the observation point.

[0055] Step E, Global Comprehensive Evaluation: Utilize a large language model to perform full-process semantic analysis on the evaluation points and generate a global score and its evaluation criteria; the final global comprehensive score is calculated as a weighted sum of the global score and the scores of each observation point from step D.
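The feature vector database built in step B can be pictured as a collection of records grouped by observation point, so that step D only compares a candidate vector with references for the same observation point. The following is a rough in-memory sketch; the class names, field names, and the choice of a dictionary index are assumptions, and a production system would more likely rely on a dedicated vector index.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class FeatureRecord:
    """One database row: the four fields named in step B."""
    video_code: str             # video encoding (identifier of the source video)
    point_code: str             # observation point encoding
    vector: Tuple[float, ...]   # feature vector of the extracted segment
    score: float                # authoritative evaluation score


class FeatureVectorDatabase:
    """In-memory store that groups records by observation point."""

    def __init__(self) -> None:
        self._by_point: Dict[str, List[FeatureRecord]] = defaultdict(list)

    def add(self, record: FeatureRecord) -> None:
        self._by_point[record.point_code].append(record)

    def references(self, point_code: str) -> List[FeatureRecord]:
        """All reference records for one observation point (may be empty)."""
        return list(self._by_point.get(point_code, []))


# Hypothetical record: a needle insertion segment scored 95 by an expert.
db = FeatureVectorDatabase()
db.add(FeatureRecord("video_0001", "needle_insertion", (0.12, 0.80, 0.35), 95.0))
```

Grouping by observation point keeps retrieval restricted to comparable actions, which matches the requirement that similarity is computed only against vectors under the same observation point.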

[0056] In one embodiment, in steps B and C, the large-scale video analysis model is the VideoMind semantic parsing model, which is used to accurately locate and extract target action segments from the complete video stream based on natural language instructions.

[0057] In one embodiment, in steps B and C, feature vectors are generated using the Video-Swin-Transformer model.

[0058] In one embodiment, the similarity calculation in step D uses the cosine similarity algorithm, and the calculation formula is as follows:

[0059] $$\mathrm{Sim}(V_q, V_r) = \frac{V_q \cdot V_r}{\lVert V_q \rVert \, \lVert V_r \rVert}$$

[0060] where $V_q$ is the feature vector to be evaluated and $V_r$ is a reference feature vector in the feature vector database.

[0061] The score $S_i$ for the $i$-th observation point is calculated as follows:

[0062] $$S_i = \frac{1}{K} \sum_{m=1}^{K} s_m$$

[0063] where $s_m$ is the authority score of the retrieved reference vector with the $m$-th highest similarity, and $K$ is a preset positive integer.

[0064] In one embodiment, considering the continuity of clinical procedures, this invention utilizes the long-text and long-video understanding capabilities of the LLM to scan the full duration of the video to be evaluated. The LLM not only focuses on the accuracy of individual actions but also assesses the operator's sense of rhythm, the consistency of aseptic awareness, and the logical connection between steps, generating a global score $S_{global}$ and its evaluation criteria. Based on the global weight coefficient $\alpha$ set in the system, this approach integrates the macroscopic perceptual score provided by the LLM with the microscopic detail score based on CV vector alignment, so that the final score reflects both the reference data and the overall logic of the operation. The global comprehensive score $S_{final}$ is calculated as follows:

[0065] $$S_{final} = \alpha \cdot S_{global} + (1 - \alpha) \cdot \frac{1}{N} \sum_{i=1}^{N} S_i$$

[0066] where $\alpha$ is the global scoring weight coefficient, $N$ is the total number of observation points, and $S_i$ is the feature matching score of the $i$-th observation point.

[0067] As shown in Figure 2, this embodiment of the invention also provides a practical skills assessment system capable of performing the above-described method, the system comprising:

[0068] The observation point definition module is used to set the operational dimensions and analysis instructions for clinical skills.

[0069] Semantic parsing module: Built-in large video analysis model for automated extraction of key video segments;

[0070] Feature encoding module: Built-in Video-Swin-Transformer model, used to convert video clips into high-dimensional feature vectors;

[0071] Feature vector index library: used to store multi-dimensional operational features and their associated authority scores;

[0072] Intelligent evaluation module: used to perform vector similarity retrieval and calculate scores for each observation point;

[0073] Comprehensive score calculation module: generates a global score using a large language model, and calculates the final global comprehensive score as a weighted sum of the global score and the scores of each observation point.
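The way these modules hand data to one another can be sketched as a single orchestration function that receives each module as a callable. Everything below is an assumption made for illustration: the function and parameter names are hypothetical, the lambdas are stubs standing in for the semantic parsing model, the feature encoder, the similarity retrieval, and the LLM review, and the weighted sum uses the convex form that reproduces the 86.2 of the worked example in the description.

```python
from typing import Any, Callable, Dict, List, Sequence

Vector = Sequence[float]


def evaluate_video(
    video: str,
    point_codes: List[str],
    extract_segment: Callable[[str, str], str],   # semantic parsing module
    encode_segment: Callable[[str], Vector],      # feature encoding module
    score_point: Callable[[str, Vector], float],  # intelligent evaluation module
    review_globally: Callable[[str], float],      # LLM global review
    alpha: float,
) -> Dict[str, Any]:
    """Run the modules in order and return per-point, global, and final scores."""
    if not point_codes:
        raise ValueError("at least one observation point is required")
    point_scores: Dict[str, float] = {}
    for code in point_codes:
        segment = extract_segment(video, code)   # locate the action segment
        vector = encode_segment(segment)         # turn it into a feature vector
        point_scores[code] = score_point(code, vector)
    global_score = review_globally(video)
    mean_points = sum(point_scores.values()) / len(point_scores)
    final = alpha * global_score + (1.0 - alpha) * mean_points
    return {
        "point_scores": point_scores,
        "global_score": global_score,
        "final_score": final,
    }


# Stub modules with hypothetical return values, for illustration only.
result = evaluate_video(
    video="candidate.mp4",
    point_codes=["needle_insertion"],
    extract_segment=lambda video, code: f"{video}#{code}",
    encode_segment=lambda segment: [1.0, 0.0],
    score_point=lambda code, vector: 88.0,
    review_globally=lambda video: 82.0,
    alpha=0.3,
)
```

Passing the modules in as callables keeps the orchestration independent of any particular model, so a different segment locator or encoder could be swapped in without touching the scoring logic.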

[0074] The following uses the "venipuncture" procedure as an example to illustrate the specific implementation of this invention:

[0075] Step 1: Define the key observation points;

[0076] Define the key points of "venous blood collection", such as "needle holding angle", "integrity of key steps" and "aseptic operation".

[0077] Step 2: Database construction;

[0078] The system used VideoMind to identify “needle insertion” segments in 10,000 historical videos.

[0079] Encoding: Video-Swin-Transformer converts the needle insertion segment into a feature vector.

[0080] Storage: Associate the vector with the expert's score of "95" and store it in the database.

[0081] Step 3: Intelligent evaluation execution;

[0082] Receive the new candidate video and extract its needle insertion segment vector.

[0083] Calculation: The system finds that the similarity between this vector and each of the five most similar vectors in the library (K=5), whose scores lie between 85 and 90, exceeds 0.95.

[0084] Output: The average score is 88 points.

[0085] Step 4: Overall Comprehensive Evaluation (Taking "Venous Blood Collection" as an Example);

[0086] Global Scan: After completing the segment extraction, the large model (VideoMind) performs a second semantic review of the entire video process.

[0087] Global score generation: LLM identified that although the candidate scored highly in the "holding the needle" and "inserting the needle" actions, the overall hand movements were slightly stiff, and there was unnecessary hesitation between different steps. LLM gave a global score of 82.

[0088] Weighted calculation:

[0089] Assume the average score for each key point = 88 points.

[0090] Set global weights = 0.3.

[0091] Final overall score = 0.3 × 82 + (1 − 0.3) × 88 = 86.2.

[0092] Evaluation criteria output: LLM automatically generated the following criteria: Overall score 86.2. The main reason for the deductions was the inconsistent operation rhythm (global evaluation). Although your puncture technique met the standards, the connection between locating the vein and applying the adhesive tape took too long. It is recommended to improve your operational proficiency through simulation training.

[0093] The above embodiments can be implemented, in whole or in part, by software, hardware, firmware, or any combination thereof. When implemented using software, the above embodiments can be implemented, in whole or in part, as a computer program product. The computer program product includes one or more computer instructions or computer programs. When the computer instructions or computer programs are loaded or executed on a computer, all or part of the processes or functions described in the embodiments of this application are generated. The computer can be a general-purpose computer, a special-purpose computer, a computer network, or other programmable device. The computer instructions can be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another. For example, the computer instructions can be transmitted from one website, computer, server, or data center to another website, computer, server, or data center via wired or wireless (e.g., infrared, microwave, etc.) means. The computer-readable storage medium can be any available medium that a computer can access, or a data storage device such as a server or data center that includes one or more sets of available media. The available medium can be a magnetic medium (e.g., floppy disk, hard disk, magnetic tape), an optical medium (e.g., DVD), or a semiconductor medium. A semiconductor medium can be a solid-state drive.

[0094] It should be understood that the processor in the embodiments of this application can be a central processing unit, or it can be another general-purpose processor, a digital signal processor, an application-specific integrated circuit, a field-programmable gate array or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, etc. The general-purpose processor can be a microprocessor or any conventional processor, etc.

[0095] It should also be understood that the memory in the embodiments of this application can be volatile memory or non-volatile memory, or may include both volatile and non-volatile memory. The non-volatile memory can be read-only memory, programmable read-only memory, erasable programmable read-only memory, electrically erasable programmable read-only memory, or flash memory. The volatile memory can be random access memory, which is used as an external cache.

[0096] The foregoing has shown and described the basic principles, main features, and advantages of the present invention. Those skilled in the art should understand that the above embodiments do not limit the scope of protection of the present invention in any way, and all technical solutions obtained by equivalent substitution or other means fall within the scope of protection of the present invention.

[0097] All parts not covered in this invention are the same as or can be implemented using existing technologies.

Claims

1. A practical skills assessment method integrating computer vision and large language models, characterized in that the method includes the following steps: Step A: Establish a set of key observation points: For specific clinical practice skills items, deconstruct the entire operation process and define a structured set of key observation points; Step B: Construct a video observation point feature vector database: Obtain a large number of operation videos of skill projects and their corresponding authoritative evaluation scores; use a large video analysis model to analyze the videos based on the observation point set and extract video segments corresponding to each observation point; then generate corresponding feature vectors from the extracted video segments; and finally construct a feature vector database based on video encoding, observation point encoding, feature vectors, and scores; Step C, Video Processing: Receive the video to be evaluated, use the large video analysis model and observation point set to extract video segments of each observation point from the video to be evaluated, and then generate the feature vector to be evaluated; Step D, Intelligent Matching and Scoring: Calculate the similarity between the feature vector to be evaluated and each feature vector under the same observation point in the feature vector database, retrieve the top K feature vectors with the highest similarity, and take the average of their corresponding scores as the score of the observation point; Step E, Global Comprehensive Evaluation: Utilize a large language model to perform full-process semantic analysis on the evaluation points and generate a global score and its evaluation criteria; the final global comprehensive score is calculated as a weighted sum of the global score and the scores of each observation point from step D.

2. The practical skills evaluation method integrating computer vision and large language model as described in claim 1, characterized in that, In step A, the observation point set includes action names, specific action definitions, and video analysis instructions.

3. The practical skills evaluation method integrating computer vision and large language model as described in claim 1, characterized in that, In steps B and C, the large-scale video analysis model is the VideoMind semantic parsing model, which is used to accurately locate and extract target action segments from the complete video stream based on natural language instructions.

4. The practical skills evaluation method integrating computer vision and large language model as described in claim 1, characterized in that, In steps B and C, feature vectors are generated using the Video-Swin-Transformer model.

5. The practical skills evaluation method integrating computer vision and large language model according to claim 1, characterized in that, in step D, the similarity calculation uses the cosine similarity algorithm, and the calculation formula is as follows: $\mathrm{Sim}(V_q, V_r) = \frac{V_q \cdot V_r}{\lVert V_q \rVert \, \lVert V_r \rVert}$, where $V_q$ is the feature vector to be evaluated and $V_r$ is a reference feature vector in the feature vector database.

6. The practical skills evaluation method integrating computer vision and large language model according to claim 1, characterized in that, in step D, the score $S_i$ for the $i$-th observation point is calculated as follows: $S_i = \frac{1}{K} \sum_{m=1}^{K} s_m$, where $s_m$ is the authority score of the retrieved reference vector with the $m$-th highest similarity, and $K$ is a preset positive integer.

7. The practical skills evaluation method integrating computer vision and large language model as described in claim 1, characterized in that, in step E, the global comprehensive score $S_{final}$ is calculated as follows: $S_{final} = \alpha \cdot S_{global} + (1 - \alpha) \cdot \frac{1}{N} \sum_{i=1}^{N} S_i$, where $S_{global}$ is the global score generated by the large language model, $\alpha$ is the global scoring weight coefficient, $N$ is the total number of observation points, and $S_i$ is the feature matching score of the $i$-th observation point.

8. A practical skills assessment system that performs the method as described in any one of claims 1 to 7, characterized in that the system includes: an observation point definition module, used to set the operational dimensions and analysis instructions for clinical skills; a semantic parsing module, with a built-in large video analysis model for automated extraction of key video segments; a feature encoding module, with a built-in Video-Swin-Transformer model, used to convert video clips into high-dimensional feature vectors; a feature vector index library, used to store multi-dimensional operational features and their associated authority scores; an intelligent evaluation module, used to perform vector similarity retrieval and calculate scores for each observation point; and a comprehensive score calculation module, which generates a global score using a large language model and calculates the final overall score as a weighted sum of the global score and the scores of each observation point.

9. A computer (electronic) device, comprising a memory, a processor, and computer-readable instructions stored in the memory and executable on the processor, characterized in that, When the processor executes the computer-readable instructions, it implements the practical skills assessment method as described in any one of claims 1 to 7.

10. A readable storage medium storing computer-readable instructions, characterized in that, when the computer-readable instructions are executed by one or more processors, they cause the one or more processors to perform the practical skills assessment method as described in any one of claims 1 to 7.