Hybrid-scale model coordinated medical terminology multistage standardization method, system, device and storage medium

By employing a multi-level standardization approach based on a Hybrid-Scale model, the problems of diversity, non-standardization, and adaptability in existing medical terminology standardization technologies have been resolved. This approach achieves efficient and accurate medical terminology standardization, thereby improving the efficiency and applicability of medical data interconnection.

CN122242511A · Pending · Publication Date: 2026-06-19 · 联通数智医疗科技有限公司


Patent Information

Authority / Receiving Office: CN · China
Patent Type: Applications(China)
Current Assignee / Owner: 联通数智医疗科技有限公司
Filing Date: 2026-01-26
Publication Date: 2026-06-19

Application Information

Patent Timeline

26 Jan 2026: Application
19 Jun 2026: Publication (CN122242511A)

IPC: G06F40/30; G06F40/284; G06F16/2455; G06F16/248; G06F16/28; G06F18/15; G06F18/22; G16H70/00; G06N3/045

AI Tagging

Application Domain

Semantic analysis Biological models


AI Technical Summary

Technical Problem

Existing medical terminology standardization technologies cannot simultaneously meet the requirements of semantic understanding depth, matching accuracy, processing efficiency, and multi-standard adaptation capabilities, resulting in low efficiency of medical data interconnection and interoperability, and a lack of interpretable decision-making and automated processing mechanisms.

Method used

A hybrid-scale model collaborative approach is adopted, including data cleaning, preprocessing, text classification, full matching, parallel processing (three-level matching model and deep model text mapping) and comprehensive judgment. Through a multi-level standardization process, efficient, accurate and interpretable standardization of medical terminology is achieved.

Benefits of technology

It significantly improves the efficiency of medical data interconnection, enhances the system's adaptability to multiple standard systems and its domain scalability, reduces manual maintenance costs, and ensures the interpretability and traceability of the standardization process.

✦ Generated by Eureka AI based on patent content.

Figure: CN122242511A_ABST (abstract drawing)

Abstract

This application discloses a method, system, device, and storage medium for multi-level standardization of medical terminology based on hybrid-scale model collaboration. The method includes: acquiring raw medical text data containing five categories of professional terms; after cleaning and preprocessing, determining the major term category with a text classification model; performing a full match against the standard terminology dictionary and, when the result is unique, determining the standard term node; otherwise, either determining that standardization is not required or processing the term through two parallel routes, where route one calls a three-level parallel matching model to generate a candidate word list and similarity scores, and route two outputs standard terms and confidence scores through deep-model text mapping; and, after comprehensive judgment, determining a unique standard term node and generating a standardized terminology set containing standardized names, codes, and the basis for judgment. Through multi-model collaboration and a parallel architecture, this application balances standardization accuracy and recall, improves robustness to noisy data, ensures process traceability, and significantly improves the efficiency of medical data interconnection.


Description

Technical Field

[0001] This invention relates to the field of medical terminology standardization technology, and in particular to a method, system, device and storage medium for multi-level standardization of medical terminology using a Hybrid-Scale model.

Background Technology

[0002] In the development of medical informatization, the expression of medical terms such as clinical diagnoses, surgical procedures, laboratory tests, and pharmaceuticals exhibits significant diversity, regionality, and inconsistency. Different medical institutions, departments, and information systems have vastly different naming habits for the same medical concept. For example, acute myocardial infarction may be recorded as "myocardial infarction," "heart attack," or "AMI," among other forms. The same drug, such as ibuprofen, may have multiple names, including a chemical name, generic name, brand name, and common name. Surgical procedure records often suffer from reversed word order, abbreviations, or ambiguous descriptions of surgical sites due to differing descriptive habits. This inconsistency makes it difficult to identify and share medical data during cross-system exchange, statistical analysis, and knowledge graph construction. It severely hinders the efficiency of medical data interconnection and interoperability, and consequently restricts the automation of key services such as clinical decision support, disease monitoring and early warning, and medical insurance settlement.

[0003] Existing technical approaches to standardizing medical terminology mainly include dictionary mapping based on exact string matching, fuzzy matching based on string similarity, rule-based and heuristic methods, traditional machine learning methods, and modern methods based on deep learning and natural language processing. Each of these methods has drawbacks that are hard to overcome.
Dictionary mapping based on exact string matching is simple to implement and has low computational cost, but it has extremely poor fault tolerance and cannot handle common problems such as spelling errors, abbreviation variations, synonym substitutions, and word order changes, leading to a significant decrease in matching success rate in practical applications.
Fuzzy matching based on string similarity can partially address differences in character-level expression, but it lacks a deep understanding of medical semantics. It often confuses terms that look alike at the character level but differ in meaning (e.g., "myocardial infarction" and "myocarditis"), or matches semantically unrelated terms (e.g., "common cold" and "gastrointestinal cold"). Furthermore, it has high computational complexity and relies on manually set thresholds, making it difficult to balance accuracy and recall.
Rule-based and heuristic methods achieve high accuracy in specific scenarios, but the construction and maintenance of rule bases and thesauri require the participation of many medical experts, resulting in high costs and poor maintainability. Complex rules are prone to conflicts and are difficult to extend to new medical fields or new standard systems.
Traditional machine learning methods have a certain generalization ability, but are severely limited by the quality of manual feature engineering and cannot fully capture the complex semantic relationships between medical terms.
Modern methods based on deep learning and natural language processing can understand synonyms and contextual semantics, but the model decision-making process lacks interpretability, limiting its application in high-risk medical fields. They also rely heavily on high-quality labeled data and consume large amounts of computational resources.

[0004] Furthermore, existing technologies generally suffer from several drawbacks, including difficulty in balancing accuracy and recall, insufficient robustness to noisy data in medical texts (such as typos and colloquial expressions), strong domain dependence leading to poor scalability, and difficulty in flexibly adapting to multiple classification standards such as ICD-10, ICD-9-CM3, SNOMED CT, and ATC. Particularly when matching results are not unique or fail, the lack of effective automated processing mechanisms and excessive reliance on manual review significantly increase the workload of medical managers. These shortcomings prevent existing solutions from simultaneously meeting the comprehensive requirements of medical terminology standardization for semantic understanding depth, matching accuracy, processing efficiency, and multi-standard adaptability. Therefore, there is an urgent need for an intelligent standardization framework that can deeply integrate multi-dimensional features, provide interpretable decision-making, and adapt to complex medical scenarios.

Summary of the Invention

[0005] The purpose of this application is to provide a method, system, electronic device, and non-transitory computer-readable storage medium for multi-level standardization of medical terminology using a Hybrid-Scale model, which has efficient, accurate, and interpretable medical terminology standardization capabilities and significantly improves the efficiency of medical data interconnection.

[0006] This application provides a multi-level standardization method for medical terminology based on Hybrid-Scale model collaboration, including:
Obtain raw medical text data, which includes content related to five categories of professional terms: diagnosis, surgery, testing, examination, and pharmaceuticals. Importing the data into the data element mining platform via Excel is supported.
Perform data cleaning on the raw medical text data to remove specified invalid characters, and then perform classification preprocessing.
Based on pre-defined standard datasets for the five categories of terms, train a text classification model, and use it to determine the major term category to which the term to be standardized belongs.
Based on the standard terminology dictionary corresponding to the term category (dictionary updates and maintenance are supported), perform a full match on the term to be standardized. When the match result is unique, determine and output the corresponding standard term node.
When the match result is not unique or the match is unsuccessful, determine whether the term belongs to a case where standardization is not required. If so, directly output the unstandardized identifier and a related explanation for review by the medical manager. If not, process the term through two parallel routes: Route 1 calls a three-level parallel matching model to calculate the semantic similarity between the candidate standard terms and the term to be standardized, obtains a list of candidate words and corresponding similarity scores, and selects the high-scoring candidate terms as standard terms; Route 2 processes the term through deep-model text mapping to obtain the corresponding standard term.
Comprehensively evaluate the standard terms output by the two parallel routes to determine the unique standard term node.
Map the terms to be standardized to standard term nodes, generating a standardized term set containing standardized names, standardized codes, similarity scores, and term category information. The set can be exported from the data element mining platform and imported into business systems.

[0007] Furthermore, data cleaning operations are performed on the original medical text data to remove specified invalid characters, including: Data cleaning operations include removing punctuation marks (excluding question marks and parentheses), special characters (excluding plus and minus signs), and whitespace characters from the beginning and end of strings to ensure the standardization of text data and avoid invalid characters interfering with the subsequent matching process.
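The cleaning rule in [0007] can be illustrated with a minimal sketch. It assumes that "punctuation marks" and "special characters" correspond to the Unicode punctuation (`P*`) and symbol (`S*`) categories, and that full-width question marks and parentheses are exempt alongside their ASCII forms; the function name `clean_term` and the `KEEP` set are illustrative and not taken from the application.

```python
import unicodedata

# Assumption: characters exempted from removal, following [0007]
# (question marks, parentheses, plus and minus signs), in both
# ASCII and full-width forms.
KEEP = set("?？()（）+-")


def clean_term(raw: str) -> str:
    """Remove punctuation and symbol characters (except KEEP), then trim."""
    kept = []
    for ch in raw:
        category = unicodedata.category(ch)
        # "P*" categories are punctuation, "S*" categories are symbols.
        if category[0] in "PS" and ch not in KEEP:
            continue
        kept.append(ch)
    # Whitespace is removed only at the beginning and end of the string.
    return "".join(kept).strip()


print(clean_term("  #acute MI (anterior)?; "))  # acute MI (anterior)?
print(clean_term("HBsAg(+)**"))                 # HBsAg(+)
```

Keeping the question mark matters later in the flow, because a question mark in a diagnosis is one of the signals that the term should not be standardized.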

[0008] Furthermore, the classification preprocessing operations include specific replacement rules and splitting rules for the five categories of terms (diagnosis, surgery, testing, examination, and pharmaceuticals), adapting to the expression characteristics of each type of terminology. Specifically:
For diagnostic terminology, common abbreviations and synonyms are uniformly replaced, such as replacing an abbreviated form of "myocardial infarction" with the full term. Compound diagnostic expressions are broken down along the dimensions of etiology and symptoms.
For surgical terminology, the description of surgical procedures is standardized, key elements such as surgical site, procedure, and approach are broken down, and redundant modifiers are removed.
For testing terminology, the descriptions of testing items are standardized, and information such as testing indicators and sample types is broken down.
For examination terminology, the descriptions of examination methods and body sites are standardized, and abbreviations are unified.
For pharmaceutical terminology, the expression of generic names, brand names, and chemical names of drugs is standardized, and key information such as dosage forms and dosages is broken down.
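As a rough illustration of category-specific replacement and splitting, the sketch below handles only the diagnosis category. The rule tables are hypothetical examples built from the variants mentioned in the background section ("AMI", "heart attack"); the lower-casing step, the split pattern, and the names `REPLACEMENTS`, `SPLIT_PATTERNS`, and `preprocess` are assumptions, not rules disclosed in the application.

```python
import re

# Hypothetical replacement table: variant -> unified expression.
REPLACEMENTS = {
    "diagnosis": {
        "ami": "acute myocardial infarction",
        "heart attack": "myocardial infarction",
    },
}

# Hypothetical split rule: break compound diagnoses on ";" or "with".
SPLIT_PATTERNS = {
    "diagnosis": re.compile(r"\s*(?:;|\bwith\b)\s*"),
}


def preprocess(term: str, category: str) -> list[str]:
    """Apply the category's replacement rules, then its splitting rule."""
    text = term.lower()
    for variant, unified in REPLACEMENTS.get(category, {}).items():
        text = re.sub(rf"\b{re.escape(variant)}\b", unified, text)
    pattern = SPLIT_PATTERNS.get(category)
    parts = pattern.split(text) if pattern else [text]
    return [part.strip() for part in parts if part.strip()]


print(preprocess("AMI with heart failure", "diagnosis"))
```

Categories without an entry in either table pass through unchanged, which keeps the sketch safe to call for all five categories while rules are being added.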

[0009] Furthermore, based on pre-defined standard datasets for the five term categories, a text classification model is trained. This model is used to determine the major term category to which the term to be standardized belongs. The five pre-defined categories of terms correspond to the following standard datasets:
Diagnosis: National Medical Insurance Information Business Coding Standard Database "ICD-10 Medical Insurance Edition" 20251017 (with enterprise supplements);
Surgery / Procedure: National Medical Insurance Information Business Coding Standard Database "ICD-9 Medical Insurance Edition" 20251017 (with enterprise supplements);
Testing: enterprise standard "UniMed-LAB-V1.1" 20251118 (referencing "DB33T903-2013 Classification and Coding of Clinical Laboratory Test Items");
Examination: enterprise standard "UniMed-EQI-V1.0" 20250820 (referencing ICD-11 version 2025.01);
Pharmaceuticals: enterprise standard "UniMed-DRU-V1.0" 20251204 (referencing ATC and the medical insurance drug catalog).

[0010] Five labeled datasets are constructed from the standard datasets and used to train the text classification model. The preprocessed terms to be standardized are then classified to determine their major term categories. This limits the range of the subsequent full match and improves matching efficiency.

[0011] Furthermore, based on the standard terminology dictionary corresponding to the major term categories, a complete match is performed on the terms to be standardized, including: A complete match means that the term to be standardized is completely identical to the standard term in the standard term dictionary at the character level, with no difference whatsoever. The standard terminology dictionary is stored according to major term categories, including standard term names, standardized codes and related attribute information. It supports manual or automated updates and maintenance to ensure the timeliness and accuracy of the dictionary content. When the term to be standardized completely matches a standard term in the standard term dictionary and the result is unique, the standard term is directly output as the standard term node; when there are multiple completely matching results or no matching results, the subsequent non-standardization judgment and parallel processing flow is entered.
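The full-match step amounts to an exact, character-level lookup in the dictionary of the predicted category, followed by a uniqueness check. The sketch below assumes the dictionary is held in memory as nested mappings; `StandardTerm`, `DICTIONARY`, and the codes `DX-0001` to `DX-0003` are placeholders invented for illustration, not real entries or codes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StandardTerm:
    name: str
    code: str  # placeholder codes below, not real standard codes


# Hypothetical dictionary: category -> exact term name -> matching entries.
DICTIONARY = {
    "diagnosis": {
        "acute myocardial infarction": [
            StandardTerm("acute myocardial infarction", "DX-0001"),
        ],
        "cold": [
            StandardTerm("cold", "DX-0002"),
            StandardTerm("cold", "DX-0003"),
        ],
    },
}


def full_match(term, category, dictionary=DICTIONARY):
    """Return ("unique", entry), ("multiple", None) or ("none", None)."""
    entries = dictionary.get(category, {}).get(term, [])
    if len(entries) == 1:
        return "unique", entries[0]
    return ("multiple" if entries else "none"), None


print(full_match("acute myocardial infarction", "diagnosis"))
print(full_match("AMI", "diagnosis"))
```

Only the "unique" outcome ends the flow; "multiple" and "none" both hand the term on to the no-standardization judgment and, if needed, the two parallel routes.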

[0012] Further, determining whether a situation does not require standardization includes: Situations where standardization is not required include: diagnostic texts containing question marks indicating doubtful diagnoses; descriptions with unclear surgical sites or procedures; and prescriptions for traditional Chinese medicine containing terms such as "formula granules," "one prescription," or "single herb." Non-standardized labels must clearly state the specific reasons why they do not need to be standardized, so that medical managers can quickly verify them. Examples include "Suspicious diagnosis description, not standardized for the time being" and "Surgical site unclear, not standardized for the time being".
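A rule-based sketch of this judgment follows. The first two reason strings are the examples given in [0012]; the reason string for traditional Chinese medicine prescriptions, the category keys, and the `site_unclear` flag are assumptions, since the application does not state how an unclear surgical site is detected and that decision is left to the caller here.

```python
from typing import Optional

# Marker terms listed in [0012] for traditional Chinese medicine prescriptions.
TCM_MARKERS = ("formula granules", "one prescription", "single herb")


def no_standardization_reason(term: str, category: str,
                              site_unclear: bool = False) -> Optional[str]:
    """Return a reason string when the term should not be standardized."""
    if category == "diagnosis" and ("?" in term or "？" in term):
        return "Suspicious diagnosis description, not standardized for the time being"
    if category == "surgery" and site_unclear:
        # Assumption: the caller supplies this flag from its own rules.
        return "Surgical site unclear, not standardized for the time being"
    if category == "pharmaceuticals" and any(m in term.lower() for m in TCM_MARKERS):
        # Assumption: wording of this reason is illustrative only.
        return "Traditional Chinese medicine prescription, not standardized for the time being"
    return None  # the term continues to the two parallel routes


print(no_standardization_reason("pneumonia?", "diagnosis"))
```

Returning the reason rather than a bare flag gives the reviewer the explanation that [0012] requires to accompany every non-standardized label.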

[0013] Furthermore, Route 1 invokes a three-level parallel matching model to calculate the similarity between candidate standard terms and the term to be standardized, obtaining a list of candidate words and their corresponding similarity scores, from which high-scoring candidate terms are selected as standard terms. The three-level parallel matching model includes:
The cosine similarity calculation layer, which constructs a vector space from the set of character n-grams, calculates the cosine of the angle between the vectors of the term to be standardized and each candidate standard term, and represents the overlap at the character level;
The Levenshtein distance calculation layer, which calculates the minimum number of character insertions, deletions, and substitutions needed to convert the term to be standardized into each candidate standard term, quantifies the degree of spelling difference, and tolerates spelling errors;
The paraphrase-multilingual-MiniLM-L12-v2 semantic matching layer, which uses a pre-trained lightweight cross-lingual semantic model to map terms into a dense semantic vector space and calculate deep semantic similarity, and which handles long texts, short texts, cross-lingual expressions, and synonym association at the semantic level.
The three modules work in parallel and independently, each outputting its own similarity result, ensuring comprehensive coverage of multi-dimensional features.

[0014] The candidate standard terms and the term to be standardized are input into each module of the three-level parallel matching model to obtain independent similarity scores for each layer simultaneously. A similarity threshold t is set, with 80 as the default (adjustable according to the actual application scenario), and each layer retains the candidate standard terms whose score in that layer is not lower than the threshold. The candidates retained by the layers are deduplicated and merged. A comprehensive similarity score is calculated from the per-layer scores and their weights. After sorting by comprehensive score in descending order, the high-scoring candidate terms (preferably the top 1 or top 2) are selected as the standard terms output by this route. The per-module scores and the comprehensive score of each candidate term are recorded to form a similarity score list, which serves as the basis for the subsequent comprehensive judgment.
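The Route 1 scoring can be sketched in plain Python under several assumptions: all scores are on a 0 to 100 scale so that they are comparable with the threshold of 80; the Levenshtein distance is normalized by the longer string's length; the layer weights `(0.3, 0.3, 0.4)` are arbitrary; and the semantic layer is passed in as a callable so the sketch stays self-contained. In practice that callable could wrap an embedding model such as paraphrase-multilingual-MiniLM-L12-v2, but that wiring is not shown here. The layers run sequentially in this sketch although the text describes them as parallel.

```python
import math
from collections import Counter


def char_ngrams(text, n=2):
    """Bag of character n-grams; short strings fall back to the whole string."""
    if len(text) < n:
        return Counter([text]) if text else Counter()
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def cosine_score(a, b, n=2):
    """Cosine similarity of character n-gram count vectors, scaled to 0-100."""
    va, vb = char_ngrams(a, n), char_ngrams(b, n)
    dot = sum(va[g] * vb[g] for g in va.keys() & vb.keys())
    norm = (math.sqrt(sum(c * c for c in va.values()))
            * math.sqrt(sum(c * c for c in vb.values())))
    return 100.0 * dot / norm if norm else 0.0


def levenshtein(a, b):
    """Minimum number of insertions, deletions and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def edit_score(a, b):
    """Assumption: distance normalized by the longer length, scaled to 0-100."""
    longest = max(len(a), len(b))
    return 100.0 * (1 - levenshtein(a, b) / longest) if longest else 100.0


def rank_candidates(term, candidates, semantic_score,
                    weights=(0.3, 0.3, 0.4), threshold=80.0, top_k=2):
    """Score each candidate in all three layers and return the top_k rows."""
    layers = (cosine_score, edit_score, semantic_score)
    rows = []
    for cand in dict.fromkeys(candidates):  # de-duplicate, keep input order
        scores = [layer(term, cand) for layer in layers]
        if max(scores) < threshold:  # no layer retained this candidate
            continue
        combined = sum(w * s for w, s in zip(weights, scores))
        rows.append({"candidate": cand, "scores": scores, "combined": combined})
    rows.sort(key=lambda row: row["combined"], reverse=True)
    return rows[:top_k]


# Stand-in semantic layer for demonstration only (exact equality).
stub = lambda a, b: 100.0 if a == b else 0.0
print(rank_candidates("acute myocardial infraction",
                      ["acute myocardial infarction", "acute myocarditis"], stub))
```

In this example the misspelled input keeps "acute myocardial infarction" because the edit-distance layer scores it above 80, while "acute myocarditis" is dropped because no layer retains it. The returned rows double as the similarity score list that the comprehensive judgment step consumes.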

[0015] Furthermore, Route 2 processes the text using a deep model's text mapping to obtain corresponding standard terms, including: The deep model's text mapping uses a pre-trained deep model based on the Transformer architecture, which is fine-tuned on a large amount of medical terminology annotation data and has a powerful terminology mapping capability. This deep model takes the terms to be standardized and their respective term categories as input, and uses the semantic associations and mapping rules of medical terms learned by the model to directly output the corresponding standard terms. During model training, standard datasets of various terms and terminology mapping cases from actual clinical applications are incorporated to ensure the accuracy and clinical applicability of the mapping results.

[0016] Furthermore, a comprehensive evaluation of the standard terms output by the two parallel paths is performed to ultimately determine a unique standard term node, including: When the standard terms output by the two routes are consistent, the standard term is directly used as the final standard term node. When the standard terms output by the two routes are inconsistent, a comprehensive evaluation is conducted by combining the similarity score list of route one and the confidence score of the model of route two, and the term with the higher comprehensive score and more in line with the medical terminology standards and clinical context is selected as the final standard term node. If one route has no valid output, the output of the other route will be used as the final standard term node. The comprehensive judgment process needs to record the output results and judgment basis of both routes to ensure traceability.
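The decision rules of the comprehensive judgment can be written down compactly. The sketch assumes Route 1 reports a similarity on a 0 to 100 scale and Route 2 a confidence between 0 and 1, and that rescaling the confidence to 0 to 100 makes the two comparable; the application does not specify how the scores are combined or how conformity with clinical context is assessed, so the tie-breaking rule here is purely illustrative.

```python
from typing import Optional, Tuple

RouteOutput = Optional[Tuple[str, float]]  # (standard term, score) or None


def comprehensive_judgment(route1: RouteOutput, route2: RouteOutput) -> dict:
    """Pick the final standard term and record the basis for traceability."""
    record = {"route1": route1, "route2": route2}
    if route1 is None and route2 is None:
        record.update(final=None, basis="no valid output from either route")
    elif route1 is None:
        record.update(final=route2[0], basis="route 1 had no valid output")
    elif route2 is None:
        record.update(final=route1[0], basis="route 2 had no valid output")
    elif route1[0] == route2[0]:
        record.update(final=route1[0], basis="both routes agree")
    else:
        # Assumption: compare on a shared 0-100 scale; ties go to route 1.
        s1, s2 = route1[1], route2[1] * 100.0
        winner = route1 if s1 >= s2 else route2
        record.update(
            final=winner[0],
            basis=f"routes disagree; route 1 score {s1:.1f} vs route 2 score {s2:.1f}",
        )
    return record


print(comprehensive_judgment(("acute myocardial infarction", 92.6),
                             ("acute myocarditis", 0.55)))
```

The returned dictionary keeps both route outputs next to the decision, which is one simple way to satisfy the traceability requirement in [0016].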

[0017] Furthermore, the terms to be standardized are mapped to standard term nodes, generating a standardized term set containing standardized names, standardized codes, similarity scores, and term category information, including: Establish a structured terminology mapping table. The fields of the terminology mapping table include: term to be standardized, original text source, data cleaning record, preprocessing record, term category, term category code, full match result, no-standardization result, output results of the two parallel routes, comprehensive judgment basis, final standard term node, standardized name, standardized code, similarity score (route 1), model confidence (route 2), standard dictionary version number, and standardization timestamp. The complete mapping results are written into a standardized terminology set, which supports batch export to meet the data migration needs from the data element mining platform to business systems. For mapping errors discovered after review by medical managers, the relevant data can be fed back to the labeled dataset for iterative optimization of the text classification model, the three-level parallel matching model, and the deep-model text mapping. The standardized terminology set needs to meet the unified retrieval requirements of business systems such as scientific research platforms, and ensure data consistency across systems.
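One way to picture the mapping table is as a flat record type with the fields listed in [0017], exported in batch as CSV. The attribute names, the choice of CSV, and the split of "output results of the two parallel routes" into two columns are assumptions made for this sketch; the application only names the fields.

```python
import csv
import io
from dataclasses import asdict, dataclass, fields


@dataclass
class TermMapping:
    original_term: str
    source: str
    cleaning_record: str
    preprocessing_record: str
    category: str
    category_code: str
    full_match_result: str
    no_standardization_result: str
    route1_output: str
    route2_output: str
    judgment_basis: str
    final_node: str
    standard_name: str
    standard_code: str
    similarity_score: float   # route 1
    model_confidence: float   # route 2
    dictionary_version: str
    timestamp: str


def export_csv(rows):
    """Serialize mapping records to CSV text with one header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=[f.name for f in fields(TermMapping)])
    writer.writeheader()
    for row in rows:
        writer.writerow(asdict(row))
    return buffer.getvalue()


print(export_csv([]).strip())  # header row only
```

Carrying the dictionary version and timestamp on every row means a reviewer can tell which dictionary release produced a mapping when errors are fed back for retraining.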

[0018] This application also proposes a multi-level standardization system for medical terminology based on a Hybrid-Scale model, including:
The data acquisition module, used to acquire raw medical text data (Excel import is supported). The raw medical text data includes content related to five categories of professional terms: diagnosis, surgery, testing, examination, and pharmaceuticals.
The data cleaning module, used to perform data cleaning operations on the raw medical text data, remove specified invalid characters, and output the cleaned text data.
The classification preprocessing module, used to perform category-specific preprocessing operations such as replacement and splitting on the cleaned text data according to the five categories of terms: diagnosis, surgery, testing, examination, and pharmaceuticals.
The text classification module, used to construct a labeled dataset based on the standard datasets corresponding to the five categories of terms, train the text classification model, and determine the term category and code to which the term to be standardized belongs.
The dictionary management and full match module, used to maintain and update the standard terminology dictionary. Based on the standard terminology dictionary corresponding to the term category, it performs a full match on the terms to be standardized and outputs a unique match result or a multiple match / no match indicator.
The no-standardization judgment module, used to determine whether the term to be standardized belongs to a case that does not need to be standardized when the full match result is not unique or the match fails. It outputs either the unstandardized mark with its reason or a mark indicating that further processing is needed.
The parallel processing module, which includes a three-level matching submodule and a text mapping submodule, wherein: the three-level matching submodule is used to call the three-level parallel matching model, calculate the per-layer similarity scores and the comprehensive score between the term to be standardized and the candidate standard terms, generate a list of candidate words and a score list, and output high-scoring candidate terms; the text mapping submodule is used to process the terms that need further processing through the deep-model text mapping and output the corresponding standard terms.
The comprehensive judgment module, used to comprehensively judge the standard terms output by the three-level matching submodule and the text mapping submodule, and determine the unique standard term node by combining the relevant scores and confidence information.
The terminology mapping and output module, used to establish a terminology mapping table, map the terms to be standardized to the final standard term nodes, generate a standardized terminology set containing the complete fields, and support batch export and feedback of reviewed data for model optimization.

[0019] This application also proposes an electronic device, including a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor executes the computer program to implement the above-mentioned Hybrid-Scale model-based multi-level standardization method for medical terminology.

[0020] This application also proposes a non-transitory computer-readable storage medium storing a computer program that, when executed by a processor, implements the aforementioned Hybrid-Scale model-based multi-level standardization method for medical terminology.

[0021] Compared with the prior art, this application has the following beneficial effects: This application provides a method, system, electronic device, and non-transitory computer-readable storage medium for multi-level standardization of medical terminology using a Hybrid-Scale model. It strictly follows the process logic of "cleaning-preprocessing-text classification-complete matching-no-standardization judgment-parallel processing (three-level matching + deep model text mapping)-comprehensive judgment-terminology mapping", which is highly consistent with the flowchart in the technical disclosure document. Through a two-level matching and parallel processing mechanism, the system retains the efficiency of full matching while comprehensively covering the terminology standardization needs in different scenarios through the multi-dimensional similarity calculation of the three-level parallel matching model and the direct mapping capability of the deep-model text mapping. This balances the accuracy and recall of medical terminology standardization. The dedicated standard datasets and modular design for the five categories of terms enhance the system's adaptability to multiple standard systems and its domain scalability, significantly reducing manual maintenance costs. At the same time, through complete process records and comprehensive judgment criteria, the interpretability and traceability of the standardization process are guaranteed, adapting to the needs of high-risk medical scenarios and seamlessly connecting the actual business process of "data import-standardization-review-export". This significantly improves the efficiency and practical deployment of standardization, providing high-quality standardized data support for scenarios such as cross-system medical data interconnection, clinical research, and medical insurance settlement.

Attached Figure Description

[0022] Figure 1 is a flowchart of the multi-level standardization method for medical terminology using a Hybrid-Scale model as provided in the embodiments of this application; Figure 2 is a structural diagram of the multi-level standardization system for medical terminology based on the Hybrid-Scale model provided in the embodiments of this application; Figure 3 is a schematic diagram of the structure of the electronic device provided in the embodiments of this application.

Detailed Implementation

[0023] To make the objectives, technical solutions, and advantages of this application clearer, specific embodiments of this application will be described in further detail below with reference to the accompanying drawings. It should be understood that the specific embodiments described herein are merely for explaining this application and not for limiting it. It should also be noted that, for ease of description, only the parts relevant to this application are shown in the drawings, not all of them. Before discussing exemplary embodiments in more detail, it should be mentioned that some exemplary embodiments are described as processes or methods depicted as flowcharts. Although the flowcharts describe operations (or steps) as being processed sequentially, many of these operations can be performed in parallel, concurrently, or simultaneously. Furthermore, the order of the operations can be rearranged. A process can be terminated when its operations are completed, but it may also have additional steps not included in the drawings. A process can correspond to a method, function, procedure, subroutine, subprogram, etc.

[0024] The terms "first," "second," etc., used in the specification and claims of this application are used to distinguish similar objects and not to describe a specific order or sequence. It should be understood that such use of data can be interchanged where appropriate so that embodiments of this application can be implemented in orders other than those illustrated or described herein, and the objects distinguished by "first," "second," etc., are generally of the same class and the number of objects is not limited; for example, a first object can be one or more. Furthermore, in the specification and claims, "and / or" indicates at least one of the connected objects, and the character " / " generally indicates that the preceding and following objects are in an "or" relationship.

[0025] In traditional healthcare IT infrastructure development, the diversity, regionality, and lack of standardization in medical terminology make it difficult to effectively identify and share medical data, severely impacting interoperability efficiency. Existing standardization methods generally face limitations such as difficulty in balancing accuracy and recall, insufficient robustness to noisy data, strong domain dependence, and difficulty in flexibly adapting to multiple classification standards.

[0026] To address this, this application proposes a multi-level standardization method for medical terminology based on a Hybrid-Scale model. As shown in Figure 1, the method includes:
Obtain raw medical text data, which includes content related to five categories of professional terms: diagnosis, surgery, testing, examination, and pharmaceuticals. Importing the data into the data element mining platform via Excel is supported.
Perform data cleaning on the raw medical text data to remove specified invalid characters, and then perform classification preprocessing.
Based on pre-defined standard datasets for the five categories of terms, train a text classification model, and use it to determine the major term category to which the term to be standardized belongs.
Based on the standard terminology dictionary corresponding to the term category (dictionary updates and maintenance are supported), perform a full match on the term to be standardized. When the match result is unique, determine and output the corresponding standard term node.
When the match result is not unique or the match is unsuccessful, determine whether the term belongs to a case where standardization is not required. If so, directly output the unstandardized identifier and a related explanation for review by the medical manager. If not, process the term through two parallel routes: Route 1 calls a three-level parallel matching model to calculate the semantic similarity between the candidate standard terms and the term to be standardized, obtains a list of candidate words and corresponding similarity scores, and selects the high-scoring candidate terms as standard terms; Route 2 processes the term through deep-model text mapping to obtain the corresponding standard term.
Comprehensively evaluate the standard terms output by the two parallel routes to determine the unique standard term node.
Map the terms to be standardized to standard term nodes, generating a standardized term set containing standardized names, standardized codes, similarity scores, and term category information. The set can be exported from the data element mining platform and imported into business systems.

[0027] For ease of understanding, the following explains some key terms in this embodiment: "Hybrid-Scale" refers to the integration of models and processing paths with different granularities and complexities, including full matching, three-level parallel matching models, and deep model text mapping, in order to address the challenges of the diversity of medical terminology; "multi-level standardization" means that the entire process is gradually advanced through multiple levels such as data cleaning, preprocessing, classification, matching, and judgment, ultimately achieving terminology standardization.

[0028] The data element mining platform is configured as the core environment for data aggregation and management. Its role is to receive raw medical text data and support the export of standardized data to other business systems. The platform provides a unified data interface and management interface for the entire standardized process, ensuring the smoothness and controllability of data flow.

[0029] Text classification models are used to classify preprocessed medical terms. By learning from a large amount of labeled medical terminology data, the model can identify the major professional categories to which the terms to be standardized belong, such as diagnosis, surgery, testing, examination, or pharmaceuticals. This limits the scope for subsequent full matching and parallel processing, improving processing efficiency and accuracy.

[0030] The standard terminology dictionary is configured as a knowledge base storing authoritative medical terms and their codes. This dictionary supports dynamic updates and maintenance to ensure the timeliness and accuracy of its content. During the standardization process, the terminology to be standardized is fully compared with the standard names in this dictionary to achieve an initial, rapid match.

[0031] The three-level parallel matching model is designed to perform deep semantic similarity calculations for terms that cannot be determined by exact matching. The model contains multiple independent matching layers, each evaluating the similarity between the term to be standardized and candidate standard terms from different dimensions, such as character overlap, spelling differences, or deep semantic associations, thereby generating multi-dimensional similarity scores to provide a basis for selecting high-scoring candidate terms.

[0032] The deep model's text mapping is an independent parallel processing route. It uses an advanced deep learning model to directly learn the mapping rules of medical terms, achieving a fast mapping from terms to be standardized to standard terms, which complements the three-level parallel matching model.

[0033] The integrated judgment module is used to integrate and evaluate the output results of the two parallel routes, resolve potential result conflicts, and ensure that the final output is a unique and accurate standard terminology node.

[0034] The standardized terminology set is generated as the final output, containing information such as the standardized name, code, similarity score, and term category for each term to be standardized. This terminology set aims to provide unified medical terminology data for business systems, supporting data interoperability and efficient utilization.

[0035] Raw medical text data can be obtained in various ways. For example, the text content can be manually entered into the system by a human operator; or the system can be configured to automatically extract data periodically from business databases such as Hospital Information System (HIS), Electronic Medical Record System (EMR), or Laboratory Information System (LIS). As a convenient implementation method, data can also be obtained through file import. For example, text data containing relevant content of five categories of professional terms—diagnosis, surgery, laboratory testing, examination, and pharmaceuticals—can be compiled into CSV or TXT files and then uploaded to a data element mining platform.

[0036] After obtaining the raw medical text data, the first step is to perform data cleaning. This cleaning process includes removing punctuation marks (excluding question marks and parentheses), special characters (excluding plus and minus signs), and whitespace characters from the beginning and end of the strings to ensure the text data is standardized and to avoid invalid characters interfering with subsequent processing. For example, "acute pneumonia?" is cleaned to "acute pneumonia?", and "hypertension / grade 2" is cleaned to "hypertension / grade 2", etc.

[0037] After cleaning, classification preprocessing is performed on the text data. This preprocessing applies specific replacement and splitting rules to five categories of terms: diagnosis, surgery, laboratory testing, examination, and pharmaceuticals. For example, the diagnostic term "myocardial infarction" is replaced with "myocardial infarction"; the surgical term "laparoscopic cholecystectomy" is split into "laparoscopy," "cholecystectomy," and "resection"; and the pharmaceutical term "ibuprofen sustained-release capsules (Fenbid)" is replaced with "ibuprofen sustained-release capsules," etc. These operations adapt to the expressive characteristics of various terms, laying the foundation for subsequent text classification and matching.

[0038] To accurately define the major categories of the terms to be standardized, this application uses five pre-defined standard datasets corresponding to five term categories to train a text classification model. These standard datasets include the ICD-10 medical insurance version for diagnostics, the ICD-9 medical insurance version for surgery / procedures, the UniMed-LAB-V1.1 for laboratory tests, the UniMed-EQI-V1.0 for examinations, and the UniMed-DRU-V1.0 for pharmaceuticals. Five labeled datasets are constructed using these standard datasets, and a text classification model is trained using machine learning or deep learning algorithms, such as a model based on the Transformer architecture. After training, the preprocessed terms to be standardized are input into the model, and the model outputs their major term category and code, thus defining the scope for subsequent full matching.

[0039] After determining the major category to which the term to be standardized belongs, the system will perform a full match based on the standard term dictionary corresponding to that major category. The standard term dictionary is stored by major category and contains the standard term name, standardization code, and related attribute information. A full match means that the term to be standardized is completely identical to the standard term in the standard term dictionary at the character level. When the match result is unique, the standard term is directly output as the standard term node; when there are multiple full matches or no match, the system proceeds to the step of determining if standardization is not required.

[0040] The "No Standardization Required" step identifies specific medical text fragments that do not require standardization. These include diagnostic texts containing question marks indicating doubt (e.g., "suspected lung cancer?"), descriptions with unclear surgical sites or procedures (e.g., "abdominal surgery"), and herbal medicine prescriptions containing terms like "formula granules," "one prescription," or "single herb" (e.g., "single-herb astragalus formula granules"). If determined to be non-standardized, a non-standardized identifier and the specific reason are output for the medical manager's review; if determined to require further processing, the process proceeds to two parallel processing routes.

[0041] Route 1 utilizes the three-level parallel matching model. This model comprises a cosine similarity calculation layer, a Levenshtein distance calculation layer, and a paraphrase-multilingual-MiniLM-L12-v2 semantic matching layer, with each layer operating independently in parallel. The candidate standard terms and the term to be standardized are input into each layer to obtain similarity scores. A similarity threshold t > 80 is set, and the candidate standard terms meeting the threshold are selected. After deduplication and merging, a comprehensive score is calculated for each candidate. The candidates are then sorted in descending order of comprehensive score, the high-scoring candidates are selected as the standard terms output by this route, and the score list is recorded.

[0042] Route 2 utilizes text mapping through a deep model. This deep model is a pre-trained model based on the Transformer architecture, fine-tuned on a large amount of annotated medical terminology data so that it can map terms. The term to be standardized and its corresponding terminology category are input into the model, which directly outputs the corresponding standardized term together with a confidence score.

[0043] After the two parallel routes are processed, the comprehensive judgment stage begins. If the standard terms output by the two routes are consistent, they are directly used as the final standard term node. If the outputs are inconsistent, a comprehensive evaluation is performed, combining the similarity score list from Route 1 and the confidence score from Route 2, to select the superior term as the final result. If one route has no valid output, the output of the other route takes precedence. The comprehensive judgment process must be documented to ensure traceability.

[0044] Finally, a mapping relationship is established between the terms to be standardized and the finalized standard term nodes. Information such as the terms to be standardized, original text sources, processing records from each stage, standardized names, standardized codes, similarity scores, and model confidence levels are integrated to generate a structured standardized term set. This term set supports batch export, meeting the data migration needs of the data element mining platform to other business systems and ensuring the consistency and interoperability of medical data across different systems.

[0045] By following the flowchart logic of the technical disclosure, this application streamlines the process architecture, adds deep-model text mapping as a parallel route, and tightens the logical connection between the steps. This addresses the diversity and non-standardization of terms in traditional medical terminology standardization, as well as the limitations of existing methods in accuracy, recall, robustness, and scalability, and improves the efficiency and utilization value of medical data interconnection across systems.

[0046] In some implementations, data cleaning operations are performed on the raw medical text data to remove specified invalid characters, including: Data cleaning operations include removing punctuation marks (excluding question marks and parentheses), special characters (excluding plus and minus signs), and whitespace characters from the beginning and end of strings to ensure the standardization of text data and avoid invalid characters interfering with the subsequent matching process.

[0047] Specifically, when removing punctuation marks at the beginning and end of a string, regular expressions match the punctuation marks (such as commas, periods, and semicolons) at the start and end of the string and remove them. Question marks and parentheses are placed on a whitelist and are not removed, in order to preserve symbols in medical text that indicate suspected diagnoses or contain key information.

[0048] When removing special characters, a predefined set of special characters is used. The string is traversed, and special characters not in the retention list (plus and minus signs) are removed to prevent the removal of valid elements used in dose or range descriptions. For example, the "+" sign in "blood glucose +5.6mmol / L" is retained, and only other irrelevant special characters are removed.

[0049] When removing whitespace characters, built-in string trimming methods (such as trim() or strip()) are used to remove spaces, tabs, newlines, etc. from both ends of the string. Consecutive whitespace characters in the middle of the string can be replaced with a single space to ensure consistent text formatting. For example, a string such as "blood routine examination" that carries leading, trailing, or repeated internal spaces is processed into "blood routine examination".
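
The three cleaning operations described above can be sketched as follows. This is a minimal illustration, not the implementation of this application: the constant names, the function name clean_term, and the exact character inventories (which punctuation marks count as removable and which symbols count as "special") are assumptions chosen for the example.

```python
import re
import string

# Assumed inventories: the source only names the exceptions (question marks,
# parentheses, plus and minus signs); the rest of these sets is illustrative.
KEEP = set("?()+-") | set("?()+-")
EDGE_PUNCT = (set(string.punctuation) | set(",。;:、!")) - KEEP
SPECIAL = set("#$^&*_~|\\@`")

# Regex that matches runs of non-whitelisted punctuation or whitespace
# at the start or at the end of the string.
_edge_class = "".join(re.escape(ch) for ch in sorted(EDGE_PUNCT))
EDGE_RE = re.compile(rf"^[{_edge_class}\s]+|[{_edge_class}\s]+$")


def clean_term(text: str) -> str:
    """Strip edge punctuation, drop special characters, normalize whitespace."""
    text = EDGE_RE.sub("", text)
    text = "".join(ch for ch in text if ch not in SPECIAL)
    # Collapse internal whitespace runs to one space and trim the ends.
    return re.sub(r"\s+", " ", text).strip()
```

With these assumed sets, clean_term("blood glucose +5.6mmol / L;") keeps the plus sign and the interior slash and drops only the trailing semicolon.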

[0050] In some implementations, the classification preprocessing operation includes specific replacement rules and splitting rules executed for the five categories of terms: diagnosis, surgery, laboratory, examination, and pharmaceuticals, adapting to the expression characteristics of each type of terminology. Specifically: For diagnostic terms, a dictionary of common abbreviations and synonyms is constructed, such as replacing "DM" with "diabetes" and "COPD" with "chronic obstructive pulmonary disease"; compound diagnostic expressions are broken down according to the dimensions of etiology, symptoms, and location, such as breaking down "hypertensive heart disease (NYHA Class II)" into "hypertensive heart disease" and "NYHA Class II", which facilitates subsequent classification and matching.

[0051] For surgical terms, the description of surgical procedures is standardized and the spelling of surgical operation names is unified, such as standardizing "cholecystectomy" as "cholecystectomy surgery". Key elements such as surgical site, surgical procedure, approach, and anesthesia method are broken down, such as breaking down "open cholecystectomy under general anesthesia" into "general anesthesia", "open", "cholecystectomy", and "removal". Redundant modifiers are removed and core elements are retained to improve matching accuracy.

[0052] For laboratory terms, standardized replacement rules for laboratory item names are established to unify the expression habits of different medical institutions, for example replacing "blood routine test" with "blood routine test". Information such as laboratory indicators, sample types, and detection methods is broken down, for example breaking down "serum creatinine determination (enzymatic method)" into "serum", "creatinine", "determination", and "enzymatic method", to facilitate accurate matching with entries in the standard terminology dictionary.

[0053] For examination terms, the descriptions of examination methods, locations, and purposes are standardized and abbreviations are unified, such as retaining "CT" and replacing "nuclear magnetic resonance" with "magnetic resonance imaging (MRI)". Key elements of the examination are broken down, such as breaking down "chest computed tomography (plain scan)" into "chest", "computed tomography", and "plain scan", to clarify the core information of the examination.

[0054] For pharmaceutical terms, a dictionary mapping drug names to their corresponding terms is constructed to standardize the expression of generic names, brand names, and chemical names, such as replacing "Tylenol" with "acetaminophen sustained-release tablets". Key information such as drug dosage form, dosage, and specifications is broken down, such as breaking down "amoxicillin capsules 0.5g 12 tablets" into "Amoxicillin", "Capsule", "0.5g", and "12 tablets", to ensure consistency with the drug information dimensions in the standard terminology dictionary.
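
As an illustration of the replacement and splitting rules, the following sketch covers only the two diagnostic examples given above (abbreviation expansion and splitting off a parenthetical qualifier). The dictionary contents come from those examples; the function names and the regular expressions are assumptions, and the rules for the other four categories would need their own dictionaries and patterns.

```python
import re

# Abbreviation dictionary limited to the two examples in the text.
ABBREVIATIONS = {
    "DM": "diabetes",
    "COPD": "chronic obstructive pulmonary disease",
}

_ABBR_RE = re.compile(
    r"\b(" + "|".join(re.escape(k) for k in ABBREVIATIONS) + r")\b"
)


def expand_abbreviations(term: str) -> str:
    """Replace whole-word abbreviations with their full names."""
    return _ABBR_RE.sub(lambda m: ABBREVIATIONS[m.group(1)], term)


def split_parenthetical(term: str) -> list[str]:
    """Split 'main term (qualifier)' into ['main term', 'qualifier']."""
    m = re.fullmatch(r"(.*?)\s*[((](.+?)[))]\s*", term)
    if m:
        return [m.group(1).strip(), m.group(2).strip()]
    return [term]
```

For instance, split_parenthetical("hypertensive heart disease (NYHA Class II)") separates the disease name from its grading qualifier, while a term without parentheses is returned unchanged as a one-element list.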

[0055] In some implementations, a text classification model is trained based on a pre-defined standard dataset of five terminology categories. This model is then used to define the major terminology categories to which the terminology to be standardized belongs, including: The five pre-defined standard datasets corresponding to the terms are: for diagnosis, the national medical insurance information business coding standard database "ICD-10 Medical Insurance Version" 20251017 (company-owned supplement); for surgery / operation, "ICD-9 Medical Insurance Version" 20251017 (company-owned supplement); for testing, the company-owned standard "UniMed-LAB-V1.1" 20251118 (referencing "DB33T903-2013 Classification and Coding of Clinical Laboratory Trial Items"); for examination, "UniMed-EQI-V1.0" 20250820 (referencing ICD-11-2025.01); and for drugs, "UniMed-DRU-V1.0" 20251204 (referencing ATC and medical insurance drugs).

[0056] Based on the aforementioned standard dataset, five types of labeled datasets were constructed. Each data entry in the labeled dataset contains terminology text and its corresponding terminology category label (diagnosis, surgery, testing, examination, and pharmaceuticals). To improve the model's generalization ability, data augmentation processing can be performed on the labeled data, such as synonym replacement, word order adjustment, and adding noisy data.

[0057] A pre-trained language model based on the Transformer architecture (such as BERT or RoBERTa) is selected as the base model and fine-tuned on the constructed labeled dataset. During training, the cross-entropy loss function is used to optimize the model parameters, enabling the model to learn the semantic features and expression patterns of the different categories of terms.

[0058] After training, the preprocessed terms to be standardized are input into the text classification model. The model outputs the probability distribution of the term to each category. The category with the highest probability is selected as the term category to which the term belongs, and the corresponding term category code is output. This provides a clear range limit for subsequent full matching and avoids efficiency reduction and mismatch problems caused by cross-category matching.
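
The category selection step can be sketched as a softmax over the classifier's raw outputs followed by an argmax. No model is loaded here: the logits are assumed to come from the fine-tuned classifier, and the category order and function names are hypothetical.

```python
import math

# Assumed label order of the classifier head (hypothetical).
CATEGORIES = ["diagnosis", "surgery", "testing", "examination", "pharmaceuticals"]


def softmax(logits: list[float]) -> list[float]:
    """Convert raw scores into a probability distribution."""
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def pick_category(logits: list[float]) -> tuple[str, float]:
    """Return the most probable category and its probability."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return CATEGORIES[best], probs[best]
```

The returned category then selects which standard terminology dictionary is used for the full match.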

[0059] In some implementations, a complete match is performed on the terms to be standardized based on a standard terminology dictionary corresponding to the terminology category, including: The standard terminology dictionary is stored according to major term categories. Each category contains information such as standard term names, standardized codes, and term attributes (e.g., etiology and symptom attributes for diagnostic terms, dosage form and dosage attributes for pharmaceutical terms). It supports manual addition, modification, and deletion of terms through the dictionary management interface, and also supports the synchronous updating of new terms in the standard dataset through automated scripts to ensure the timeliness and accuracy of the dictionary content.

[0060] During the full match process, the term to be standardized is compared precisely with the standard term name in the corresponding major category standard term dictionary, allowing no character differences. For example, if the term to be standardized, "acute myocardial infarction," is completely identical to the standard term "acute myocardial infarction" in the dictionary, the match is considered successful; if the term to be standardized, "acute myocardial infarction," has character differences from "acute myocardial infarction" in the dictionary, the match is considered unsuccessful.

[0061] When the term to be standardized completely matches a standard term in the standard term dictionary and the result is unique, the standardized name, standardized code, and other information of the standard term are directly extracted, the corresponding standard term node is determined and output; when the term to be standardized completely matches multiple standard terms in the dictionary (such as the case where the term names are exactly the same but the attributes are different) or no standard term completely matches it, the complete matching process is stopped, and the subsequent non-standardization judgment and parallel processing process is entered.
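
A minimal sketch of the full match and its three outcomes (unique, multiple, none) follows. The dictionary layout, the field names, and the codes in the usage note are placeholders, not entries from any real coding standard.

```python
from collections import defaultdict


def build_index(entries: list[dict]) -> dict[str, list[dict]]:
    """Index dictionary entries of one category by their exact standard name."""
    index: dict[str, list[dict]] = defaultdict(list)
    for entry in entries:
        index[entry["name"]].append(entry)
    return index


def full_match(term: str, index: dict[str, list[dict]]):
    """Exact, character-level lookup.

    Returns ("unique", entry) when exactly one entry matches; otherwise
    ("ambiguous", None) or ("none", None), in which case the term moves on
    to the no-standardization check and the parallel routes.
    """
    hits = index.get(term, [])
    if len(hits) == 1:
        return "unique", hits[0]
    return ("ambiguous" if hits else "none"), None
```

For example, with an index built from entries such as {"name": "acute myocardial infarction", "code": "CODE-001"}, only a term that is character-for-character identical to the name yields a "unique" result.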

[0062] In some implementations, determining whether a situation does not require standardization includes: Situations where standardization is not required include: question marks in diagnostic texts indicating suspected diagnoses; unclear descriptions of surgical sites or procedures; and prescriptions for traditional Chinese medicine containing terms such as "formula granules," "one prescription," or "single herb."

[0063] Statements of doubt marked by question marks in diagnostic texts are identified by detecting whether the text contains the "?" symbol. At the same time, taking the clinical context into account, question marks used in scenarios that do not indicate doubt (such as annotations within parentheses) are excluded. For example, "lung cancer?", "suspected gastric cancer?", and "gastric polyp (nature to be determined?)" are all identified as statements of doubt.

[0064] The identification of unclear surgical sites or procedures is achieved by constructing a keyword list (such as "site unknown", "procedure to be determined", "unclear", "to be determined", etc.) and performing keyword matching on the surgical terms to be standardized. If the terms contain keywords from the list, they are determined to be unclear in terms of site or procedure. For terms that do not contain keywords but are vague in expression (such as "abdominal surgery" or "orthopedic surgery"), a text classification model is used to assist in the judgment. If the model cannot determine the specific site or procedure, it is also determined to be a case that does not require standardization.

[0065] The identification of Chinese herbal medicine prescriptions containing the terms "formula granules," "prescription," or "single herb" in their names is achieved through precise matching of the keywords "formula granules," "prescription," and "single herb." If a term contains any of these keywords, it is determined to be a case where standardization is not required. For example, "Astragalus formula granules," "Chinese herbal prescription," and "single-herb Angelica sinensis" are all determined to be cases where standardization is not required.

[0066] Non-standardized labels must clearly state the specific reasons why they do not need to be standardized, and use a uniform format, such as "Suspected diagnosis: Acute appendicitis?", "Unclear surgical site: Pelvic surgery", "Traditional Chinese medicine prescription: Single-herb licorice formula granules", so that medical managers can quickly understand the reasons for non-standardization and improve review efficiency.

[0067] In some implementations, Route 1 invokes a three-level parallel matching model to calculate the semantic similarity between candidate standard terms and terms to be standardized, obtaining a list of candidate words and corresponding similarity scores. High-scoring candidate terms are then selected as standard terms, including: The three-level parallel matching model includes a cosine similarity calculation layer, a Levenshtein distance calculation layer, and a paraphrase-multilingual-MiniLM-L12-v2 semantic matching layer. The three layers work independently in parallel, each completing similarity calculation and outputting results, ensuring that the similarity relationships between terms are fully captured from the character level, spelling level, and deep semantic level.

[0068] The cosine similarity calculation layer constructs a vector space based on the character n-gram tokenization sets, converts the term to be standardized and the candidate standard term into character n-gram vectors respectively (the value of n can be adjusted according to the term length, such as n = 2 or n = 3), and calculates the cosine value of the angle between the two vectors. The closer the cosine value is to 1, the higher the character-level overlap degree of the two terms. For example, the character n-gram vectors of the term to be standardized "myocardial infarction" and the candidate standard term "acute myocardial infarction" have a high overlap degree, and the cosine similarity score is high.
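
The character n-gram cosine similarity can be sketched as follows. Scaling the result to 0-100 is an assumption made here so that the score is comparable with the threshold of 80 and with the Levenshtein-based score; the function names are hypothetical.

```python
import math
from collections import Counter


def char_ngrams(text: str, n: int = 2) -> Counter:
    """Count the overlapping character n-grams of a string."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def ngram_cosine(a: str, b: str, n: int = 2) -> float:
    """Cosine similarity of two n-gram count vectors, scaled to 0-100."""
    va, vb = char_ngrams(a, n), char_ngrams(b, n)
    dot = sum(va[g] * vb[g] for g in va.keys() & vb.keys())
    norm = math.sqrt(sum(c * c for c in va.values())) * math.sqrt(
        sum(c * c for c in vb.values())
    )
    # Strings shorter than n have no n-grams; treat that as no overlap.
    return 100.0 * dot / norm if norm else 0.0
```

With n = 2, "myocardial infarction" and "acute myocardial infarction" share every bigram of the shorter term, so the score is high but below 100 because of the extra bigrams contributed by "acute ".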

[0069] The Levenshtein distance calculation layer calculates the minimum number of single-character insertions, deletions, and substitutions needed to convert the term to be standardized into the candidate standard term. The fewer the operations, the smaller the spelling difference between the two terms and the higher the similarity. For example, the Levenshtein distance between "myocardial infarction" and "myocardial infarction" is 1, the spelling difference is small, and the similarity score is high; the Levenshtein distance between "common cold" and "gastrointestinal cold" is large, the spelling difference is large, and the similarity score is low. To facilitate comparison with the similarity scores of the other layers, the Levenshtein distance is converted into a similarity score using the formula: similarity score = 100 - (Levenshtein distance / length of the longer term) × 100, which keeps the score between 0 and 100.
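
The distance and its conversion to a 0-100 score can be sketched with the standard dynamic-programming recurrence. The function names are hypothetical, and returning 100 for two empty strings is an assumption, since the formula is undefined when the longer term has length zero.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions, substitutions."""
    if len(a) < len(b):
        a, b = b, a  # keep the working row as short as possible
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(
                prev[j] + 1,               # deletion
                curr[j - 1] + 1,           # insertion
                prev[j - 1] + (ca != cb),  # substitution (free if equal)
            ))
        prev = curr
    return prev[-1]


def levenshtein_similarity(a: str, b: str) -> float:
    """similarity = 100 - (distance / length of the longer term) * 100."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 100.0  # assumption: two empty strings count as identical
    return 100.0 - levenshtein(a, b) / longest * 100.0
```

Because the distance never exceeds the length of the longer term, the converted score stays within 0 to 100.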

[0070] The paraphrase-multilingual-MiniLM-L12-v2 semantic matching layer uses a pre-trained cross-lingual lightweight semantic model. The term to be standardized and the candidate standard term are input into the model respectively. The model outputs the vector representations of the two terms in the high-dimensional semantic space, calculates the cosine similarity between the vectors, and obtains the deep semantic similarity score. This layer can effectively capture the synonymous associations and semantically similar relationships between terms, is not affected by character-level differences, and is compatible with long text, short text, and cross-lingual expression scenarios. For example, although there are large character differences between "heart attack" and "myocardial infarction", their semantics are exactly the same, and this layer will give a high similarity score.

[0071] The candidate standard terms and the term to be standardized are input into each layer of the three-level parallel matching model, and the independent similarity score of each layer is obtained in parallel. With the similarity threshold set at t > 80, each layer retains the candidate standard terms whose score meets the threshold, ensuring that the retained candidates are highly similar to the term to be standardized.

[0072] The candidate standard terms selected at each level are deduplicated and merged to remove duplicate candidate terms. A comprehensive similarity score is calculated based on the weighted scores of each level. The weight allocation can be empirically set according to the performance of each level in matching different categories of terms (e.g., 0.3 for character level, 0.2 for spelling level, and 0.5 for semantic level), or the weights can be optimized by learning from a labeled dataset using a machine learning model. The comprehensive score calculation formula is: Comprehensive Score = Cosine Similarity Score × w1 + Levenshtein Distance Similarity Score × w2 + Semantic Matching Score × w3 (where w1 + w2 + w3 = 1).
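
Threshold filtering, merging, and the weighted comprehensive score can be sketched as below. Several points are assumptions: a candidate is kept if at least one layer's score meets the threshold (one reading of "selected at each level, then merged"), the boundary comparison uses >= although the text states both "t > 80" and "not lower than 80", and the weights are the empirical example values 0.3, 0.2, and 0.5.

```python
WEIGHTS = {"cosine": 0.3, "levenshtein": 0.2, "semantic": 0.5}  # w1 + w2 + w3 = 1
THRESHOLD = 80.0


def rank_candidates(
    layer_scores: dict[str, dict[str, float]],
    weights: dict[str, float] = WEIGHTS,
    threshold: float = THRESHOLD,
) -> list[tuple[str, float]]:
    """Filter by threshold, merge, and sort by weighted comprehensive score.

    layer_scores maps each candidate term to its per-layer scores (0-100).
    Keying by candidate term deduplicates candidates picked by several layers.
    """
    kept: dict[str, float] = {}
    for candidate, scores in layer_scores.items():
        if any(scores[layer] >= threshold for layer in weights):
            kept[candidate] = sum(scores[layer] * w for layer, w in weights.items())
    return sorted(kept.items(), key=lambda item: item[1], reverse=True)
```

A candidate whose scores are 90, 85, and 95 receives 90 × 0.3 + 85 × 0.2 + 95 × 0.5 = 91.5, while a candidate with no layer at or above the threshold is dropped before scoring.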

[0073] After sorting the candidate terms in descending order of their overall scores, the highest-scoring candidate terms are selected as the standard terms for the output of that route, with the top-scoring candidate term being given priority. If the difference in overall scores between the top-1 and top-2 candidate terms is less than a preset threshold (e.g., 5 points), both candidate terms are retained for further evaluation in subsequent comprehensive judgment stages. The individual scores and overall scores of each candidate term in each module are recorded to form a similarity score list, which serves as an important basis for subsequent comprehensive judgment.
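
The top-1 / top-2 retention rule can be expressed compactly. The sketch assumes the input is a list of (candidate term, comprehensive score) pairs already sorted in descending order, and uses the example gap of 5 points; the function name is hypothetical.

```python
def select_route1_output(
    ranked: list[tuple[str, float]], gap: float = 5.0
) -> list[tuple[str, float]]:
    """Keep the top candidate, plus the runner-up when the two are close."""
    if not ranked:
        return []  # Route 1 has no valid output
    if len(ranked) > 1 and ranked[0][1] - ranked[1][1] < gap:
        return ranked[:2]
    return ranked[:1]
```

An empty result corresponds to the case where Route 1 produced no candidate that met the threshold.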

[0074] In some implementations, Route 2 is processed through text mapping of a deep model to obtain corresponding standard terms, including: The deep model's text mapping employs a pre-trained deep model based on the Transformer architecture (such as GPT-3.5, LLaMA2, etc.). Through fine-tuning on a large amount of medical terminology annotation data, it acquires the capability to map medical terms. The annotation data includes samples of terms to be standardized, their respective terminology categories, and corresponding standard terms, covering various expression scenarios and variations of terms in five categories: diagnosis, surgery, laboratory testing, examination, and pharmaceuticals.

[0075] During model fine-tuning, the terms to be standardized and their respective term categories are used as input sequences, and the corresponding standard terms are used as output sequences. An autoregressive language modeling approach is employed to train the model, enabling it to learn the mapping rules and semantic relationships from non-standard terms to standard terms. During training, a cross-entropy loss function, combined with a learning rate scheduling strategy, is used to optimize model parameters and improve the accuracy of the model's mapping.

[0076] To enhance the clinical applicability of the model, terminology mapping cases from real-world clinical applications are incorporated into the training data, including common abbreviation mappings, synonym mappings, and mappings of non-standard terminology, enabling the model to adapt to the characteristics of terminology in real clinical environments.

[0077] After the model is trained, the term to be standardized and the term categories determined by the text classification model are taken as input. The model outputs the corresponding standard term and provides a confidence score (0-100) for the mapping result. The higher the confidence score, the stronger the reliability of the mapping result. For example, if the term to be standardized is "myocardial infarction" and the term category is "diagnosis", the model outputs the standard term "acute myocardial infarction" with a confidence score of 95.

[0078] In some implementations, the standard terms output by the two parallel paths are comprehensively evaluated to ultimately determine a unique standard term node, including: The comprehensive judgment process first obtains the standard terms (or candidate term list) and similarity score list output by Route 1, and the standard terms and confidence scores output by Route 2.

[0079] When the standard terms output by the two routes are consistent, the standard term is directly used as the final standard term node, and the score information of the two routes is recorded to enhance the credibility of the results.

[0080] When the standard terms output by the two routes are inconsistent, the evaluation is based on a combination of the overall similarity score of Route 1 and the confidence score of Route 2. Score weights are assigned, such as a weight of 0.6 for the overall similarity score of Route 1 and a weight of 0.4 for the confidence score of Route 2. The overall score of the two routes is calculated as follows: Overall Score = Overall Similarity Score of Route 1 × 0.6 + Confidence Score of Route 2 × 0.4. The standard terminology output by the route with the higher overall score is selected as the final standard terminology node. If the difference in overall scores is less than a preset threshold (e.g., 3 points), a manual assisted judgment is made, combining medical terminology standards and clinical context, and the medical manager selects the most suitable standard terminology.

[0081] If one of the routes has no valid output (e.g., Route 1 did not yield any candidate term that meets the threshold requirement, or Route 2 did not output a reasonable standard term), the output of the other route is used as the final standard term node, and the route with no valid output and the reason are noted in the judgment record.
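
The comprehensive judgment logic can be sketched as follows, under a loudly stated assumption. The text gives a single weighted sum but then selects "the route with the higher overall score"; this sketch takes one possible reading, comparing each route's weighted contribution (Route 1 score × 0.6 against Route 2 confidence × 0.4). Other readings are possible. The handling of the case where neither route has output is also an assumption, as are all names.

```python
def judge(route1, route2, w1: float = 0.6, w2: float = 0.4, gap: float = 3.0):
    """Combine the two routes into (final_term, reason).

    route1: (term, comprehensive similarity score) or None
    route2: (term, confidence score) or None
    final_term is None when the decision is handed to the medical manager.
    """
    if route1 is None and route2 is None:
        return None, "no valid output from either route"  # assumed handling
    if route1 is None:
        return route2[0], "Route 1 had no valid output"
    if route2 is None:
        return route1[0], "Route 2 had no valid output"
    if route1[0] == route2[0]:
        return route1[0], "routes agree"
    # Assumed interpretation: compare the weighted contribution of each route.
    s1, s2 = route1[1] * w1, route2[1] * w2
    if abs(s1 - s2) < gap:
        return None, "scores too close; manual assisted judgment"
    if s1 > s2:
        return route1[0], "Route 1 weighted score higher"
    return route2[0], "Route 2 weighted score higher"
```

The returned reason string is the kind of entry that would go into the judgment record to keep the decision traceable.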

[0082] The comprehensive judgment process requires detailed recording of the output results of the two routes, scores for each item, weight allocation, and the final judgment reason, forming a comprehensive judgment report to ensure the traceability of the standardized process and facilitate subsequent problem investigation and model optimization.

[0083] In some implementations, mapping the terms to be standardized to standard term nodes to generate a standardized term set containing standardized names, standardized codes, similarity scores, and term category information includes the following. A structured terminology mapping table is established using a relational database table structure. Its fields include: the term to be standardized; the original text source (such as electronic medical record system or laboratory information system); data cleaning records (such as removed invalid characters); preprocessing records (such as replacement and splitting operations); term category; term category code; complete match result (match success / failure, matched standard term name); no-standardization-required result (yes / no, reason for judgment); Route 1 output result (standard term name, similarity score at each level, comprehensive score); Route 2 output result (standard term name, confidence score); comprehensive judgment basis (score calculation, manual judgment explanation); final standard term node; standardized name; standardized code; similarity score (Route 1 comprehensive score); model confidence (Route 2 score); standard dictionary version number; and standardized timestamp.
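As an illustration of such a relational mapping table, the following sketch creates a reduced version of it in an in-memory SQLite database. The table name, column names, and the helper `insert_mapping` are hypothetical, only a subset of the listed fields is shown, and the inserted values (including the placeholder code `EXAMPLE-CODE`) are illustrative rather than taken from any real dictionary.

```python
import json
import sqlite3

# Hypothetical column names mirroring a subset of the fields listed above.
SCHEMA = """
CREATE TABLE term_mapping (
    id                   INTEGER PRIMARY KEY,
    source_term          TEXT NOT NULL,   -- term to be standardized
    source_system        TEXT,            -- e.g. electronic medical record system
    term_category        TEXT,
    term_category_code   TEXT,
    route1_term          TEXT,
    route1_layer_scores  TEXT,            -- per-level similarity scores as JSON
    route1_overall_score REAL,
    route2_term          TEXT,
    route2_confidence    REAL,
    judgment_basis       TEXT,
    standard_name        TEXT,
    standard_code        TEXT,
    dictionary_version   TEXT,
    standardized_at      TEXT             -- ISO 8601 timestamp
)
"""


def insert_mapping(conn: sqlite3.Connection, record: dict) -> None:
    """Insert one mapping record; keys must be column names of term_mapping."""
    columns = ", ".join(record)
    placeholders = ", ".join(f":{name}" for name in record)
    conn.execute(
        f"INSERT INTO term_mapping ({columns}) VALUES ({placeholders})", record
    )


conn = sqlite3.connect(":memory:")
conn.execute(SCHEMA)
insert_mapping(conn, {
    "source_term": "myocardial infarction",
    "term_category": "diagnosis",
    "route2_term": "acute myocardial infarction",
    "route2_confidence": 95.0,
    "judgment_basis": "illustrative record",
    "standard_name": "acute myocardial infarction",
    "standard_code": "EXAMPLE-CODE",  # placeholder, not a real code
    "route1_layer_scores": json.dumps({}),
})
row = conn.execute(
    "SELECT standard_name, route2_confidence FROM term_mapping"
).fetchone()
```

Storing the per-level scores as a JSON string keeps the table flat; a separate score table keyed by mapping id would be an equally valid design.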

[0084] The processing results from each of the above steps are entered into the terminology mapping table according to the field requirements to form a complete mapping record. All mapping records are then summarized and written into a standardized terminology set. The standardized terminology set supports batch export in multiple formats, such as Excel, CSV, and JSON. Users can select the required format through the export function of the data element mining platform to meet the data migration needs of different business systems (such as clinical research platforms and medical insurance settlement systems).
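The CSV and JSON export paths can be sketched with the Python standard library alone. The function names and the record keys below are hypothetical; Excel export is omitted because it would require a third-party library.

```python
import csv
import io
import json


def export_csv(records: list[dict], fieldnames: list[str]) -> str:
    """Serialize mapping records to CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()


def export_json(records: list[dict]) -> str:
    """Serialize mapping records to JSON, keeping non-ASCII terms readable."""
    return json.dumps(records, ensure_ascii=False, indent=2)


records = [
    {"source_term": "myocardial infarction",
     "standard_name": "acute myocardial infarction"},
]
csv_text = export_csv(records, ["source_term", "standard_name"])
json_text = export_json(records)
```

Setting `ensure_ascii=False` matters here because the source terms are typically Chinese and would otherwise be written as escape sequences.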

[0085] For mapping errors discovered during review by the medical manager, the erroneous mapping records are marked through the review interface, and the correct standard terminology information is filled in. The error data and correction information can be fed back into the labeled dataset for iterative optimization of the text classification model, the three-level parallel matching model, and the deep-model text mapping. During optimization, the erroneous cases are used as key training samples to fine-tune model parameters and improve the model's accuracy on terms from similar scenarios.
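One way to picture this feedback step is as a merge of reviewer corrections into the labeled dataset, with corrected cases given a higher sample weight. This is a sketch under assumptions: `LabeledSample`, `apply_corrections`, and the `boost` weight are hypothetical, and the source does not state how "key training samples" are emphasized during fine-tuning.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LabeledSample:
    term: str
    category: str
    standard_term: str
    weight: float = 1.0  # assumed mechanism for emphasizing key samples


def apply_corrections(
    dataset: list[LabeledSample],
    corrections: list[tuple[str, str, str]],
    boost: float = 3.0,
) -> list[LabeledSample]:
    """Merge (term, category, correct_standard_term) corrections into the dataset.

    A correction replaces any existing label for the same (term, category)
    and is stored with a boosted weight.
    """
    by_key = {(s.term, s.category): s for s in dataset}
    for term, category, correct_term in corrections:
        by_key[(term, category)] = LabeledSample(term, category, correct_term, boost)
    return list(by_key.values())
```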

[0086] The standardized terminology set needs to meet the unified retrieval requirements of business systems such as scientific research platforms, ensuring that different expressions of the same medical concept can be mapped to the same standard terminology node when searching across systems, realizing consistent querying and statistical analysis of cross-system data, breaking down data silos, and enhancing the utilization value of medical data.

[0087] As shown in Figure 2, this application also proposes a multi-level standardization system for medical terminology based on a Hybrid-Scale model. The system achieves efficient and accurate terminology mapping through multi-module collaboration, strictly following the flowchart logic. The system includes:
The data acquisition module, used to acquire raw medical text data (Excel import is supported). The raw medical text data includes content related to five categories of professional terms: diagnosis, surgery, testing, examination, and pharmaceuticals.
The data cleaning module, used to perform data cleaning operations on the raw medical text data, remove specified invalid characters, and output the cleaned text data.
The classification preprocessing module, used to perform category-specific preprocessing operations such as replacement and splitting on the cleaned text data according to the five categories of terms: diagnosis, surgery, testing, examination, and pharmaceuticals.
The text classification module, used to construct a labeled dataset based on the standard datasets corresponding to the five categories of terms, train the text classification model, and define the term category and code to which the term to be standardized belongs.
The dictionary management and full match module, used to maintain and update the standard terminology dictionary. Based on the standard terminology dictionary corresponding to the term category, it performs a full match on the terms to be standardized and outputs either a unique match result or a multiple-match / no-match indicator.
The unstandardized judgment module, used to determine, when the full match result is not unique or the match fails, whether the term to be standardized is a case that does not need to be standardized. It outputs either the unstandardized mark with the reason, or a mark indicating that further processing is needed.
The parallel processing module, which includes a three-level matching submodule and a text mapping submodule. The three-level matching submodule is used to call the three-level parallel matching model, calculate the individual similarity scores and the comprehensive score between the term to be standardized and the candidate standard terms, generate a candidate word list and a score list, and output high-scoring candidate terms. The text mapping submodule is used to process, through the text mapping of the deep model, the terms to be standardized that need further processing, and to output the corresponding standard terms and confidence scores.
The comprehensive judgment module, used to comprehensively judge the standard terms output by the three-level matching submodule and the text mapping submodule, and to determine the unique standard term node by combining the relevant scores and confidence information.
The terminology mapping and output module, used to establish a terminology mapping relationship table, map the terms to be standardized to the final standard term nodes, generate a standardized terminology set containing the complete fields, and support batch export and data backflow optimization.

[0088] As shown in Figure 3, this application also proposes an electronic device, including a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the processor executes the computer program to implement the above-described Hybrid-Scale model-based multi-level standardization method for medical terminology.

[0089] Specifically, the electronic device may be a hardware device with data processing capabilities, such as a server, workstation, or personal computer, serving as a hardware platform for carrying and running the Hybrid-Scale model collaborative multi-level standardization method for medical terminology.

[0090] The memory is used to store various types of data required during the execution of the computer program and the method, including raw medical text data, cleaned text data, standard terminology dictionary data, labeled datasets, model parameters, intermediate calculation results, terminology mapping tables, and standardized terminology sets. The memory can be random access memory (RAM), read-only memory (ROM), solid-state drive (SSD), hard disk drive (HDD), or a hybrid storage device to ensure efficient data access and persistent storage.

[0091] The processor is the core computing unit of the electronic device, responsible for parsing and executing computer program instructions stored in the memory. The processor can be a central processing unit (CPU), a graphics processing unit (GPU), a dedicated AI acceleration chip, or a multi-core processor, etc., possessing powerful parallel computing and data processing capabilities to meet the needs of computationally intensive tasks such as text classification, similarity calculation, and deep model inference.

[0092] The computer program is a set of instructions for implementing the multi-level standardization method for medical terminology using the Hybrid-Scale model. It encodes the complete logical flow and algorithm of the method, including operational instructions for each stage such as data acquisition, cleaning, preprocessing, classification, matching, judgment, mapping, and output. When the processor executes this program, it can automatically complete the standardization process of medical terminology without manual intervention, improving standardization efficiency and accuracy.

[0093] This application also proposes a non-transitory computer-readable storage medium storing a computer program thereon, which, when executed by a processor, implements the above-described Hybrid-Scale model-based multi-level standardization method for medical terminology.

[0094] Specifically, the non-transitory computer-readable storage medium refers to a physical storage device capable of storing data and programs for a long period of time, retaining the data even after power is lost. Examples include hard disk drives (HDDs), solid-state drives (SSDs), read-only memory (ROMs), flash memory, optical discs (CD-ROMs, DVD-ROMs), Blu-ray discs, or USB flash drives. This medium ensures the persistence and stability of computer programs, facilitating long-term storage, distribution, and cross-platform deployment.

[0095] The computer program is a set of instructions that directs the computer to perform a specific task. This computer program can exist in the form of compiled binary code, interpreted scripting language code (such as Python code), or bytecode, and it encodes the complete logic of a multi-level standardization method for medical terminology using a Hybrid-Scale model. When the computer program is executed by a processor, it can automatically implement all steps of the above method, including data cleaning, preprocessing, text classification, exact matching, non-standardization decision-making, parallel processing, comprehensive decision-making, and terminology mapping, outputting a standardized terminology set to meet the practical needs of medical data standardization.

[0096] The above description is merely a preferred embodiment and the technical principles employed in this application. This application is not limited to the specific embodiments described herein, and various obvious changes, readjustments, and substitutions that can be made by those skilled in the art will not depart from the scope of protection of this application. Therefore, although this application has been described in detail through the above embodiments, this application is not limited to the above embodiments, and may include many other equivalent embodiments without departing from the concept of this application. The scope of this application is determined by the scope of the claims.

Claims

1. A multi-level standardization method for medical terminology based on Hybrid-Scale model collaboration, characterized in that it comprises at least the following steps:
S1. Obtain raw medical text data, which includes content related to five categories of professional terms: diagnosis, surgery, testing, examination, and pharmaceuticals;
S2. Perform data cleaning on the raw medical text data to remove specified invalid characters, and then perform classification preprocessing according to the five categories of terms respectively;
S3. Based on the preset five-category standard dataset of terms, train a text classification model, and use the text classification model to define the term category to which the term to be standardized belongs;
S4. Based on the standard term dictionary corresponding to the term category, perform a full match on the term to be standardized. When the matching result is unique, determine the corresponding standard term node. When the matching result is not unique or the match is unsuccessful, determine whether it belongs to the case where standardization is not required. If so, output the unstandardized identifier directly. If not, proceed to the next step;
S5. Process through two parallel routes: Route 1 calls a three-level parallel matching model to calculate the semantic similarity between the candidate standard terms and the terms to be standardized, and obtains a list of candidate words and corresponding similarity scores; Route 2 processes the text through the text mapping of a deep model to obtain the corresponding standard terms and confidence scores;
S6. Make a comprehensive judgment on the output results of the two parallel routes, and determine the unique standard term node by combining the similarity score and the confidence score;
S7. Map the terms to be standardized to the standard term nodes to generate a standardized term set containing standardized names, standardized codes, similarity scores, confidence scores, and comprehensive judgment criteria.

2. The multi-level standardization method for medical terminology according to claim 1, characterized in that, in step S2, performing data cleaning on the raw medical text data to remove specified invalid characters and then performing classification preprocessing according to the five categories of terms includes: the data cleaning operation includes removing punctuation marks, special characters, and whitespace characters from the beginning and end of strings; the classification preprocessing includes replacement and splitting rule operations performed on the five categories of terms: diagnosis, surgery, testing, examination, and pharmaceuticals; the situations where standardization is not required include: diagnostic texts with question marks indicating doubtful diagnoses, descriptions with unclear surgical sites or procedures, and medical orders containing formula granules, one prescription, or a single herb.
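By way of illustration only (not part of the claims), the edge-character cleaning and the doubtful-diagnosis check could look like the sketch below. The set `INVALID_EDGE_CHARS` is hypothetical, since the specified invalid characters are not enumerated here; it deliberately excludes the question mark so that a doubtful diagnosis remains detectable after cleaning. The function names are likewise assumptions.

```python
import string

# Hypothetical set of "specified invalid characters" to strip from both ends.
# The question mark is intentionally absent so doubtful diagnoses stay visible.
INVALID_EDGE_CHARS = string.whitespace + "\u3000" + ",.;:!*#~-_/\\|，。；：！、"


def clean_term(raw: str) -> str:
    """Remove invalid characters from the beginning and end of the string only."""
    return raw.strip(INVALID_EDGE_CHARS)


def is_doubtful_diagnosis(term: str, category: str) -> bool:
    """True for diagnostic text carrying an ASCII or full-width question mark."""
    return category == "diagnosis" and ("?" in term or "？" in term)
```

For example, `clean_term("  ;myocardial infarction。 ")` yields `"myocardial infarction"`, while `"pneumonia?"` passes through unchanged and is then flagged by `is_doubtful_diagnosis`.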

3. The multi-level standardization method for medical terminology according to claim 1, characterized in that, in step S3, training a text classification model based on the preset five-category standard dataset of terms and using the text classification model to define the term category to which the term to be standardized belongs includes: the five categories of terms correspond to the following standard datasets: diagnosis corresponds to "ICD-10 Medical Insurance Version", surgery / operation corresponds to "ICD-9 Medical Insurance Version", testing corresponds to "UniMed-LAB-V1.1", examination corresponds to "UniMed-EQI-V1.0", and pharmaceuticals correspond to "UniMed-DRU-V1.0"; the text classification model trained on the standard datasets is used to classify the preprocessed terms to be standardized and define their major term categories.

4. The multi-level standardization method for medical terminology according to claim 1, characterized in that the three-level parallel matching model includes: a cosine similarity calculation layer, used to calculate the character-level similarity between the term to be standardized and the candidate standard terms based on character n-gram vectorization; a Levenshtein distance calculation layer, used to calculate the character edit distance similarity between the term to be standardized and the candidate standard terms, representing the degree of spelling difference; and a paraphrase-multilingual-MiniLM-L12-v2 semantic matching layer, used to calculate the deep semantic similarity between the term to be standardized and the candidate standard terms using a pre-trained cross-language semantic model, compatible with long texts, short texts, and cross-language scenarios; the three layers work independently in parallel and output similarity results separately; the text mapping uses a pre-trained deep model based on the Transformer architecture, which is fine-tuned using medical terminology annotation data to directly output the corresponding standard terms and confidence scores.

5. The multi-level standardization method for medical terminology according to claim 4, characterized in that, in step S5, calling the three-level parallel matching model to obtain a list of candidate words and corresponding similarity scores includes: inputting the candidate standard terms and the terms to be standardized into the modules of the three-level parallel matching model to obtain the similarity score of each layer; setting a similarity threshold t > 80 and retaining the candidate standard terms whose scores at each layer are not lower than the threshold; deduplicating and merging the candidate standard terms retained after each layer of screening, sorting them by comprehensive score, and selecting the top N candidate terms to generate a candidate word list, where N is a positive integer; and recording the scores of each candidate term in each module and the overall score to form a similarity score list.
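By way of illustration only (not part of the claims), the two character-level layers and the screening step could be sketched as follows. Everything here is an assumption layered on the claim wording: scores are scaled to 0-100, bigrams are used for the n-gram layer, per-layer screening is read as "a candidate is retained if it passes the threshold in at least one layer, then merged", and the comprehensive score is taken as the mean of the layer scores. The semantic layer is not implemented; any function with the same `(term, candidate) -> score` signature could be passed in for it.

```python
import math
from collections import Counter
from typing import Callable


def char_ngrams(text: str, n: int = 2) -> Counter:
    """Character n-gram counts; strings shorter than n count as one gram."""
    if len(text) < n:
        return Counter([text]) if text else Counter()
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def cosine_score(a: str, b: str, n: int = 2) -> float:
    """Cosine similarity of character n-gram count vectors, scaled to 0-100."""
    va, vb = char_ngrams(a, n), char_ngrams(b, n)
    dot = sum(va[g] * vb[g] for g in va.keys() & vb.keys())
    norm = math.sqrt(sum(c * c for c in va.values())) * \
        math.sqrt(sum(c * c for c in vb.values()))
    return 100.0 * dot / norm if norm else 0.0


def levenshtein(a: str, b: str) -> int:
    """Edit distance via the standard two-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def levenshtein_score(a: str, b: str) -> float:
    """Edit distance normalized by the longer length, scaled to 0-100."""
    longest = max(len(a), len(b))
    return 100.0 * (1 - levenshtein(a, b) / longest) if longest else 100.0


def rank_candidates(term: str, candidates: list[str],
                    layers: dict[str, Callable[[str, str], float]],
                    threshold: float, top_n: int):
    """Return up to top_n (candidate, layer_scores, overall) tuples."""
    rows = []
    for cand in dict.fromkeys(candidates):  # deduplicate, keep order
        scores = {name: fn(term, cand) for name, fn in layers.items()}
        if any(s >= threshold for s in scores.values()):  # passed some layer
            rows.append((cand, scores, sum(scores.values()) / len(scores)))
    rows.sort(key=lambda row: row[2], reverse=True)
    return rows[:top_n]
```

A call such as `rank_candidates(term, candidates, {"cosine": cosine_score, "levenshtein": levenshtein_score}, threshold=85.0, top_n=5)` returns the candidate word list together with the per-layer and overall scores that make up the similarity score list.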

6. The multi-level standardization method for medical terminology according to claim 1, characterized in that, in step S6, making a comprehensive judgment on the output results of the two parallel routes includes: when the standard terms output by the two routes are consistent, the standard term is directly used as the target standard term node; when the standard terms output by the two routes are inconsistent, the comprehensive similarity score of Route 1 and the confidence score of Route 2 are combined, a comprehensive score is calculated according to preset weights, and the standard term with the highest comprehensive score is selected as the target standard term node; if one of the routes has no valid output, the output of the other route is used as the target standard term node; the Qwen3-Next-80B-A3B-Instruct model is selected to assist in the judgment, and the judgment criteria are output based on medical terminology standards and clinical context matching principles.

7. The multi-level standardization method for medical terminology according to claim 1, characterized in that, in step S7, mapping the terms to be standardized to the standard term nodes to generate a standardized term set includes: establishing a terminology mapping table, which includes the term to be standardized, term category, candidate word list, similarity score of each candidate word, text mapping result and confidence score, comprehensive judgment criteria, target standard term node, standardized code, and standardized timestamp; writing the mapping results into a standardized term set, which also includes term category codes and standard dictionary version numbers; when multiple candidate standard terms exist, the comprehensive judgment result is used as the final basis, and the accuracy of the mapping is verified by combining the similarity score and the confidence score.

8. A multi-level standardization system for medical terminology based on a Hybrid-Scale model, characterized in that it includes:
a data acquisition module, used to acquire raw medical text data, which includes content related to five categories of professional terms: diagnosis, surgery, testing, examination, and pharmaceuticals;
a data cleaning and preprocessing module, used to perform data cleaning operations on the raw medical text data to remove specified invalid characters, and to perform classification preprocessing according to the five categories of terms;
a text classification module, used to define the term category to which the term to be standardized belongs by using a text classification model trained on a pre-defined standard dataset;
a complete matching and determination module, used to perform a complete match based on the standard term dictionary corresponding to the term category, and to determine the unique standard term node or determine whether the term belongs to the case where standardization is not required;
a parallel processing module, which includes a three-level matching submodule and a text mapping submodule, wherein the three-level matching submodule is used to call the three-level parallel matching model to generate a candidate word list and a similarity score list, and the text mapping submodule is used to output standard terms and confidence scores through the text mapping of the deep model;
a comprehensive judgment module, used to comprehensively judge the output results of the parallel processing module and determine the unique standard term node;
a terminology mapping module, used to establish a terminology mapping relationship table, map the terms to be standardized to standard term nodes, and generate a standardized term set.

9. An electronic device comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, characterized in that, when the processor executes the computer program, it implements the multi-level standardization method for medical terminology based on Hybrid-Scale model collaboration as described in any one of claims 1 to 7.

10. A non-transitory computer-readable storage medium having a computer program stored thereon, characterized in that, when the computer program is executed by a processor, it implements the multi-level standardization method for medical terminology based on Hybrid-Scale model collaboration as described in any one of claims 1 to 7.