A DRG analysis system based on medical data

The invention constructs a DRG analysis system that addresses the data deviation and nonlinear feature capture problems of existing DRG grouping methods. It provides efficient medical insurance settlement and data cleaning, improves grouping accuracy and data consistency, and adjusts adaptively to meet the precision requirements of medical insurance management.

CN122245593A · Pending · Publication Date: 2026-06-19 · THE FIRST AFFILIATED HOSPITAL OF XIAN MEDICAL UNIV

Patent Information

Authority / Receiving Office: CN · China
Patent Type: Applications(China)
Current Assignee / Owner: THE FIRST AFFILIATED HOSPITAL OF XIAN MEDICAL UNIV
Filing Date: 2026-04-17
Publication Date: 2026-06-19

AI Technical Summary

Technical Problem

Existing DRG grouping methods rely on manual review or simple static rule matching, which makes it difficult to handle large-scale, multi-dimensional medical record data, resulting in biased grouping results. They also fail to deeply capture non-linear data characteristics, lack efficient correlation and standardized cleaning of multi-source heterogeneous data, and cannot meet the needs of precise medical insurance management.

Method used

The DRG analysis system uses an automated data interface to extract medical data and normalize its format, builds an associated data model, processes the weight matrix and maps cost standards, predicts surplus risk, generates a complication mapping table, and applies machine learning classification, thereby achieving deep cleaning and adaptive adjustment of heterogeneous data.

Benefits of technology

It significantly improves grouping accuracy and data consistency, enhances medical insurance settlement efficiency and risk control capabilities, deeply optimizes medical data quality, possesses excellent intelligent adaptability and scalability, and meets the real-time and scalability requirements of modern digital medical management.



Abstract

This invention relates to the fields of medical IoT and medical data analysis technology, and discloses a DRG analysis system based on medical data, aiming to solve the problems of encoding ambiguity, format conflicts, and grouping bias in heterogeneous medical data sources. The system includes: extracting raw medical data and performing format normalization; associating preprocessed grouped data with medical record front page data through right outer join logic to construct a data model; performing structured transformation on the weight matrix and establishing a cost standard mapping function; performing difference calculation based on the cost base to predict surplus risk and label basic disease groups; generating a complication mapping table and performing one-dimensional deep data cleaning; and using clustering algorithms to implement machine learning intelligent classification to construct an adaptive indicator system. Through the above solutions, this application improves grouping accuracy and consistency, achieves real-time quantitative risk prediction, effectively identifies abnormal settlements, and corrects logical conflicts.


Description

Technical Field

[0001] This invention relates to the fields of medical Internet of Things and medical data analysis technology, specifically a DRG analysis system based on medical data.

Background Technology

[0002] With the deepening of healthcare system reform, Diagnosis Related Groups (DRG) payment has become a core tool for medical insurance fund management and hospital performance evaluation. DRG payment determines payment standards by comprehensively considering factors such as patient diagnostic information, surgical procedures, and individual characteristics, aiming to standardize clinical behavior, control medical costs, and improve the efficiency of medical resource utilization. Against this backdrop, the efficient extraction and refined analysis of medical big data have become a key foundation for achieving scientific grouping, and are of great significance for ensuring the fairness and accuracy of medical insurance settlement.

[0003] Among them, DRG analysis systems based on medical data integrate medical record information with medical insurance settlement data to achieve accurate modeling of medical behavior and cost prediction. These systems typically involve preprocessing massive amounts of heterogeneous medical data, standardizing and converting encoding, and automating the execution of grouping logic, aiming to optimize the medical insurance settlement process through digital means. With the evolution of information technology, how to utilize advanced data mining techniques and automated processing mechanisms to conduct correlation analysis on complex diagnostic and treatment data has become an important research direction for improving the management level of medical institutions and the effectiveness of medical insurance governance.

[0004] However, existing technologies still face numerous challenges in processing DRG grouping. Traditional DRG grouping methods often rely on manual review or simple static rule matching. When faced with large-scale, multi-dimensional medical record data, errors in diagnostic coding, missing surgical procedure information, or logical conflicts frequently lead to biased grouping results, directly affecting the reasonable disbursement of medical insurance funds. Simultaneously, existing systems lack efficient correlation and standardized cleaning mechanisms for multi-source heterogeneous data, making it difficult to address issues such as inconsistent data formats and low correlation of key fields. Furthermore, conventional analysis schemes often employ linear logic analysis, failing to deeply capture the non-linear data characteristics under complex clinical pathways and lacking dynamic optimization capabilities for severe complications and comorbidities. This results in a significant gap between predicted results and actual settlement needs, making it difficult to meet the requirements of precise medical insurance management.

[0005] Therefore, a DRG analysis scheme based on medical data is desired.

Summary of the Invention

[0006] The purpose of this invention is to provide a DRG analysis system based on medical data, which can effectively solve the problems mentioned in the background art.

[0007] To achieve the above objectives, the technical solution adopted by the present invention is as follows:

[0008] A DRG analysis system based on medical data includes the following specific steps:

Step 1: Extracting raw medical data and performing format normalization: Exporting settlement-level group data from the Disease Diagnosis Related Groups (DRG) payment management platform through an automated data interface; simultaneously extracting medical record homepage data from the medical institution information system according to the standards of the medical quality management and control information platform; performing character cleaning and structured format unification on the extracted medical record number field and discharge time field to eliminate expression differences between heterogeneous data sources;

Step 2: Constructing a relational data model: Loading the preprocessed group data and the medical record homepage data into a structured query language database; using the medical record number and discharge time as common mapping primary keys; performing multi-table association using a right outer join logic to generate a merged data table with complete diagnosis and treatment information and settlement information;

Step 3: Processing the weight matrix and performing cost standard mapping: Performing a structured transformation on the annual DRG weight table; establishing the mapping function between DRG codes and medical insurance payment standards; and retrieving the corresponding cost base in the weight matrix by logically combining the insurance type and DRG codes in the merged data table;

Step 4: ... Predicting Surplus Risk and Labeling Basic Disease Groups: Based on the obtained cost base and the actual medical expenses incurred by medical institutions, a difference calculation is performed to generate a quantitative result of surplus prediction. The basic disease groups are then automatically labeled according to preset logical rules to identify abnormal settlement cases that deviate from the normal payment range.

Step 5: Generating a Complication Mapping Table and Performing Deep Data Cleaning: The quantitative result of surplus prediction is merged with the officially released list of serious complications or comorbidities and the list of complications or comorbidities to construct a medical insurance data analysis model. By converting the data from a multidimensional matrix to a one-dimensional linear format, null values, duplicate records, and diagnostic combinations that do not conform to medical logic are successively removed, retaining a high-purity valid diagnostic dataset.

Step 6: Implementing Machine Learning Intelligent Classification: A clustering algorithm is used to perform multi-stage feature clustering and classification processing on the valid diagnostic dataset, constructing a disease diagnosis-related group classification index system with adaptive adjustment capabilities to improve the system's sensitivity to complex clinical scenarios.

[0009] Preferably, in step 1, the process of extracting raw medical data includes establishing an extraction, transformation and loading mechanism for heterogeneous data sources, and realizing high-speed collection of medical data from medical institutions at different administrative levels through a preset data interface protocol, wherein the data interface protocol is configured to support streaming data transmission and batch timed task scheduling.

[0010] Preferably, the format normalization process includes uniformly converting the discharge time field into a preset time series format, performing noise reduction processing on non-character interference items in the medical record number, and uniformly converting all medical diagnosis codes into a predetermined version of the International Classification of Diseases standard.

[0011] Preferably, in step 2, when loading data into the structured query language database, the system automatically detects data integrity constraints. Abnormal rows with missing medical record numbers or incomplete key treatment records are intercepted and stored in an error log database to ensure data quality for subsequent analysis. Preferably, the right outer join logic uses the standard medical record homepage data from the medical quality management and control information platform as the baseline table and matches it against the disease diagnosis-related grouping data, so that all in-hospital treatment records are included in the analysis. For record rows that have no grouping result on the settlement platform, the system automatically fills in null placeholders to support subsequent missing-data audits.

[0012] Preferably, in step 3, the structured conversion of the annual disease diagnosis-related group weight table involves parsing the original file in portable document or spreadsheet format into a database-readable relational table structure and establishing a weight history database based on version numbers to support settlement traceability across different time spans. Preferably, the mapping function is configured to perform weighted calculations based on different medical insurance type parameters, where the insurance type parameters include basic medical insurance for urban employees and basic medical insurance for urban and rural residents. The system automatically matches the corresponding payment ratio factor by identifying the insurance type identification code in the merged data table.

[0013] Preferably, in step 4, the calculation formula for the surplus prediction quantification result is defined as the payment standard minus the actual total cost, wherein the payment standard is determined by multiplying the disease diagnosis-related group weight by the rate point value. Preferably, the logical rules for labeling the basic disease group include searching a preset basic disease dictionary. When the patient's primary diagnosis code or primary surgical procedure code matches an entry in the dictionary, the system writes a first preset value into the corresponding marker column of the merged data table; otherwise, it writes a second preset value, thereby achieving automated screening of the basic disease group.

[0014] Preferably, in step 5, the process of generating a list of serious complications or comorbidities includes periodically synchronizing the latest diagnostic catalog from the official server of the medical insurance management department and converting it into a logical judgment matrix within the system. Preferably, the one-dimensional linear format conversion process involves reconstructing the subordinate diagnostic information originally stored in multiple diagnostic columns into multiple rows of records, each row containing only a specific number of primary diagnostic codes and a specific number of subordinate diagnostic codes. This flattened data structure reduces the complexity of the algorithm processing.

[0015] Preferably, the null value removal operation in the deep data cleaning process involves performing non-null checks on core fields such as the main diagnostic code, medical payment method, and length of hospital stay in the merged data table. Once a null value is detected, the record is deleted or marked as abnormal. Preferably, the exclusion of diagnostic combinations that do not conform to medical logic refers to identifying and removing records with logical conflicts such as gender conflicting with diagnosis or age conflicting with diagnosis by establishing a medical logic rule base.

[0016] Preferably, in step 6, the clustering algorithm employs an improved mean clustering logic, using the patient's age, gender, length of hospital stay, surgical grade, and number of comorbidities as input feature vectors. Through iterative computation, cases with similar medical resource consumption characteristics are grouped into the same cluster center. Preferably, the multi-stage feature clustering includes a first stage of coarse-grained domain partitioning and a second stage of fine-grained disease subdivision. Through a progressively converging classification strategy, it ensures that the final generated disease diagnosis-related grouping classification index system has statistically significant inter-group differences.

[0017] Preferably, the adaptive adjustment capability refers to the system dynamically adjusting the parameter settings of the cluster center by monitoring the newly generated medical data stream in real time, in order to cope with the resource consumption distribution shift caused by the application of new diagnostic and treatment technologies or adjustments to medical insurance policies. Preferably, the system also includes a monitoring module, configured to perform full lifecycle status monitoring on all the above data processing steps. Once the time consumption of any step exceeds a preset time threshold, the system automatically triggers a performance warning.

[0018] Preferably, the system also includes an export module, which can output the final generated diagnostic analysis results as a structured medical insurance analysis report. The report content covers the enrollment rate, surplus rate, cost deviation, and coding compliance score of each disease diagnosis-related group. Preferably, the system runs on a high-performance server cluster and uses a distributed storage architecture to perform parallel processing on medical records reaching a preset scale. The execution time of a single full data cleaning and group prediction task is within a preset allowable time.

[0019] Preferably, all data transmission paths within the system are encrypted to ensure that patient privacy information is strictly protected during extraction, association, cleaning, and classification, and all operations are recorded in a complete audit trail. Preferably, the machine learning intelligent classification also incorporates an ensemble learning mechanism, combining multiple weak classifiers to improve the accuracy of grouping low-frequency rare disease groups, keeping the identification error rate within a preset range.

[0020] Preferably, the system also includes a feedback loop that pushes the marked abnormal settlement cases to the manual review terminal, collects the reviewers' correction opinions and feeds them into the machine learning model as training samples to achieve continuous evolution of the system.

[0021] Compared with the prior art, the beneficial effects achieved by the present invention are:

[0022] 1. Significantly improves the accuracy of grouping and data consistency.

[0023] This invention completely resolves the encoding ambiguities and format conflicts between heterogeneous medical data sources by constructing a rigorous data extraction and format normalization process. Deep association achieved through right outer join logic ensures complete alignment between medical record data and settlement data, reducing grouping bias caused by missing information or incorrect association at the source. This system significantly improves the accuracy of grouping in complex diagnostic and treatment scenarios, greatly reducing the workload of manual review.

[0024] 2. Enhances the efficiency of medical insurance settlement and the ability to prevent and control risks.

[0025] The system's built-in surplus prediction and basic disease group annotation functions enable real-time quantitative analysis of medical expense payment standards. By automatically identifying abnormal settlement cases that deviate from expectations, this invention can provide medical institutions with accurate financial early warnings, avoiding unreasonable expenditures. Simultaneously, the efficient identification mechanism for serious complications or comorbidities ensures that hospitals receive reasonable compensation when treating critically ill patients, safeguarding the legitimate economic rights of medical institutions.

[0026] 3. Deeply optimizes the quality of medical data.

[0027] Through one-dimensional transformation and multi-dimensional deep cleaning rules, this system can effectively identify and correct logical conflicts and coding irregularities in the medical record filling process. In particular, the automated verification of medical logic between primary and secondary diagnoses assists medical institutions in completing self-correction before data reporting, thereby improving the overall compliance of reported data and providing a solid data foundation for subsequent medical insurance governance and performance evaluation.

[0028] 4. Provides excellent intelligent adaptability and scalability.

[0029] The introduction of machine learning clustering algorithms allows this system to move beyond rigid static rules and adaptively adjust its classification system based on the evolution of actual medical practices. This dynamic optimization capability enables it to quickly adapt to the personalized needs of institutions in different regions and with varying levels of medical expertise, giving it significant value for cross-regional deployment. Furthermore, the distributed architecture design ensures that the system maintains extremely high processing efficiency when dealing with massive amounts of medical data, meeting the stringent requirements of modern digital healthcare management for real-time performance and scalability.

Attached Figure Description

[0030] Figure 1 is a schematic diagram of the overall technical solution of the present invention;

[0031] Figure 2 is a schematic diagram illustrating the core principle of machine learning intelligent classification in this invention;

[0032] Figure 3 is a schematic diagram illustrating the logical flow of medical raw data extraction, normalization, and correlation model construction in this invention;

[0033] Figure 4 is a schematic diagram of the multi-level interaction relationship and data flow process of the medical insurance data analysis model based on surplus risk prediction and complication mapping in this invention.

Detailed Implementation

[0034] The technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some embodiments of the present invention, and not all embodiments. Based on the embodiments of the present invention, all other embodiments obtained by those skilled in the art without creative effort are within the scope of protection of the present invention.

[0035] Example 1

[0036] In the Diagnosis Related Groups (DRG) analysis system based on medical data, the core operating logic of the system lies in achieving accurate modeling of medical insurance payments through structured governance of high-dimensional data and an intelligent classification engine. Specifically, the method and system provided in this embodiment of the invention involve the following detailed technical aspects.

[0037] For step 1, raw medical data is extracted and format normalization is performed. During this process, the system establishes an Extract, Transform, and Load (ETL) mechanism for heterogeneous data sources. The automated data interface achieves high-speed acquisition of medical data from medical institutions at different administrative levels through a preset interface protocol. The data interface protocol is configured to support streaming data transmission and batch scheduled tasks. In actual execution, the streaming data transmission mode is used to acquire settlement-level grouping data from the DRG payment management platform in real time, keeping latency below one second; the batch scheduled tasks are used to extract large amounts of medical record homepage data from the medical institution's internal Hospital Information System (HIS) or the Medical Quality Management and Control Information Platform (HQMS).

[0038] Specifically, during the extraction process, the system first establishes an encrypted channel with the DRG payment management platform through a secure authentication mechanism, exporting settlement-level data containing fields such as settlement list number, grouping result, actual payment amount, and settlement time. Simultaneously, based on the HQMS standard, it retrieves the original medical record front page from the hospital's local database, covering dimensions such as patient basic information, diagnostic information (primary diagnosis and other diagnoses), surgical procedure information, total inpatient costs and their details, admission route, and discharge destination. For the extracted medical record number and discharge time fields, the system performs refined character cleaning.

[0039] The format normalization process includes converting the discharge time field into a preset time series format, such as the "YYYY-MM-DD HH:MM:SS" standard. For the medical record number field, the system automatically identifies and removes spaces, tabs, special non-character interference items, and leading zeros, thereby eliminating matching obstacles caused by different input habits. Furthermore, the normalization process also involves the standardized conversion of medical diagnostic codes, that is, using a built-in mapping dictionary to uniformly convert diagnostic codes from all sources into a predetermined version of the International Classification of Diseases (ICD-10 medical insurance version or a specific version of ICD-11), ensuring that subsequent analysis is conducted within a unified semantic space.
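The cleaning rules in this paragraph can be sketched in a few lines of Python. This is a minimal illustration rather than the patented implementation: the function names and the list of accepted input layouts are assumptions, and the diagnosis-code mapping dictionary is left out because its contents are not given.

```python
import re
from datetime import datetime

# Hypothetical list of input layouts seen across source systems (an assumption).
_TIME_FORMATS = (
    "%Y-%m-%d %H:%M:%S",
    "%Y/%m/%d %H:%M:%S",
    "%Y/%m/%d %H:%M",
    "%Y%m%d%H%M%S",
    "%Y-%m-%d",
)


def normalize_record_number(raw: str) -> str:
    """Remove spaces, tabs and other non-alphanumeric characters, then leading zeros."""
    cleaned = re.sub(r"[^0-9A-Za-z]", "", raw)
    return cleaned.lstrip("0")


def normalize_discharge_time(raw: str) -> str:
    """Return the timestamp as 'YYYY-MM-DD HH:MM:SS', or raise ValueError."""
    text = raw.strip()
    for fmt in _TIME_FORMATS:
        try:
            return datetime.strptime(text, fmt).strftime("%Y-%m-%d %H:%M:%S")
        except ValueError:
            continue
    raise ValueError(f"unrecognized discharge time: {raw!r}")
```

Raising on an unrecognized timestamp, instead of guessing, keeps bad rows visible to the later integrity checks.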

[0040] For step 2, a relational data model is constructed. The preprocessed grouped data and medical record front page data are loaded into a Structured Query Language (SQL) database. During the loading process, the system automatically triggers an integrity constraint detection algorithm. The integrity constraint detection includes non-null checks on key fields and data type validity verification. For abnormal rows with missing medical record numbers, incomplete core treatment records (such as missing primary diagnosis codes), or discharge dates earlier than admission dates, the system executes interception logic, storing such abnormal data in an error log database in real time and generating an abnormal data audit report.
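The interception logic can be sketched as a row-level check that routes failing rows to an error log. The field names and the timestamp layout are assumptions, and the date conflict is taken to mean a discharge time that precedes the admission time.

```python
from datetime import datetime

TIME_FORMAT = "%Y-%m-%d %H:%M:%S"
# Hypothetical field names; the source does not specify column names.
REQUIRED_FIELDS = ("record_no", "primary_diagnosis", "admission_time", "discharge_time")


def check_row(row: dict) -> list:
    """Return the list of constraint violations for one row (empty if clean)."""
    problems = [f"missing {name}" for name in REQUIRED_FIELDS if not row.get(name)]
    if problems:
        return problems
    try:
        admitted = datetime.strptime(row["admission_time"], TIME_FORMAT)
        discharged = datetime.strptime(row["discharge_time"], TIME_FORMAT)
    except ValueError:
        return ["invalid timestamp format"]
    if discharged < admitted:
        problems.append("discharge earlier than admission")
    return problems


def partition_rows(rows):
    """Split rows into (accepted, error_log), mimicking the interception step."""
    accepted, error_log = [], []
    for row in rows:
        problems = check_row(row)
        if problems:
            error_log.append({"row": row, "problems": problems})
        else:
            accepted.append(row)
    return accepted, error_log
```

Here the error log is an in-memory list; in the described system it would be a separate database table.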

[0041] During the generation of the merged data table, the system uses medical record number and discharge time as common mapping primary keys. A right outer join is used to perform the multi-table association. Specifically, the HQMS standard medical record front page data is used as the base table (right table) and matched with the DRG grouping data. The core engineering consideration of this join method is to ensure that all inpatient treatment records (i.e., medical record front page records) are included in the analysis, even if some records have not yet generated final grouping results on the DRG payment platform. For record rows that do not match grouping results, the system automatically fills in preset null placeholders (such as "NULL" or "PENDING") to support subsequent missing-data audits. The merged data table after association achieves deep integration of clinical treatment dimensions and medical insurance settlement dimensions.
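The join can be illustrated with Python's built-in `sqlite3` module. Table names, column names and sample values (including the DRG code) are made up for the example. Because older SQLite versions lack `RIGHT JOIN`, the sketch writes the equivalent `LEFT JOIN` with the front page table on the left; either form keeps every front page row.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE front_page (record_no TEXT, discharge_time TEXT, primary_diagnosis TEXT);
CREATE TABLE drg_groups (record_no TEXT, discharge_time TEXT, drg_code TEXT);
INSERT INTO front_page VALUES
  ('12345', '2026-04-17 08:30:00', 'J18.9'),
  ('67890', '2026-04-18 10:00:00', 'I21.0');
INSERT INTO drg_groups VALUES
  ('12345', '2026-04-17 08:30:00', 'ES21');
""")

# front_page LEFT JOIN drg_groups is equivalent to
# drg_groups RIGHT OUTER JOIN front_page: every front page row is kept.
merged = conn.execute("""
SELECT f.record_no, f.discharge_time, f.primary_diagnosis,
       COALESCE(g.drg_code, 'PENDING') AS drg_code
FROM front_page AS f
LEFT JOIN drg_groups AS g
  ON g.record_no = f.record_no AND g.discharge_time = f.discharge_time
ORDER BY f.record_no
""").fetchall()

for row in merged:
    print(row)
```

The second front page record has no grouping result, so its `drg_code` column carries the `PENDING` placeholder and remains available for auditing.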

[0042] For step 3, the weight matrix is processed and cost standard mapping is performed. The system performs a structured transformation on the annual disease diagnosis-related group weight table. This structured transformation involves parsing the static weight table, originally existing in portable document format (PDF) or spreadsheet format (Excel), into a database-readable dynamic relational table structure. The system establishes a version number-based weight history database, supporting high-speed retrieval of weights for different settlement years and regions through an indexing mechanism.

[0043] Based on this, a mapping function is established between Diagnosis Related Groups (DRG) codes and medical insurance payment standards. This mapping function is configured to perform weighted calculations based on different medical insurance type parameters. Specifically, the system automatically matches the corresponding payment ratio factor and pooled fund sharing ratio by identifying the insurance type identification codes (such as urban employee basic medical insurance, urban and rural resident basic medical insurance, and cross-regional medical treatment settlement codes) in the merged data table. The mapping logic retrieves the baseline weight for that group by searching the corresponding DRG code in the weight matrix.
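A minimal sketch of such a mapping function is shown below. The versioned weight table, the insurance type codes, the factor values, and the choice to multiply weight, rate point value and factor together are all assumptions made for illustration; the text does not give the exact calculation.

```python
# Hypothetical weight history keyed by (version, DRG code); values are made up.
WEIGHTS = {("2026", "ES21"): 1.20, ("2026", "FR11"): 0.85}
# Hypothetical payment ratio factors per insurance type identification code.
INSURANCE_FACTORS = {"EMPLOYEE": 1.00, "RESIDENT": 0.90}


def payment_standard(version, drg_code, insurance_type, rate_point_value):
    """Return weight x rate point value x insurance factor, or None if unmapped."""
    weight = WEIGHTS.get((version, drg_code))
    factor = INSURANCE_FACTORS.get(insurance_type)
    if weight is None or factor is None:
        return None  # no cost base can be retrieved for this row
    return weight * rate_point_value * factor
```

Returning `None` for an unmapped row (for example one still carrying a `PENDING` DRG code) lets the caller skip the surplus calculation instead of failing.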

[0044] For step 4, the system predicts the surplus risk and labels the basic disease groups. The system performs a difference calculation based on the obtained cost base and the actual medical expenses incurred by the medical institutions. The difference calculation defines the quantitative result of the surplus prediction using the following formula:

[0045] $$S = W \times R \times F - C$$

[0046] where $S$ is the quantitative result of the surplus prediction; $W$ is the weight value of the corresponding disease diagnosis-related group; $R$ is the preset rate point value for the pooling area (i.e., the standard amount corresponding to each weight point); $F$ is the adjustment factor set for the insurance type or hospital level; and $C$ is the total actual expenses incurred by the patient as recorded on the medical record front page.

[0047] The system generates a surplus prediction quantification result table based on the calculated surplus value. A positive surplus value indicates that the case has a surplus under the current DRG standard; a negative surplus value is marked as a loss warning. Simultaneously, the system automatically labels the basic disease groups according to preset logical rules. These logical rules include searching a preset basic disease dictionary: when a patient's primary diagnosis code or primary surgical procedure code matches a dictionary entry, the system writes the first preset value (e.g., the Arabic numeral 1) into a specific marker column of the merged data table (e.g., "..."); if no match is found, it writes the second preset value (e.g., the Arabic numeral 0). This labeling mechanism enables automated screening of basic and complex disease groups, making it easier for medical institutions to identify abnormal settlement cases that deviate from the normal payment range, especially for risk screening of low-standard cases or high-coded cases.
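The surplus calculation and the dictionary-based labeling can be sketched together. The sketch assumes the surplus is the payment standard (weight times rate point value times adjustment factor) minus the actual total cost; the dictionary entries, field names and the output column name `basic_group_flag` are hypothetical.

```python
# Hypothetical basic disease dictionary (diagnosis or procedure codes).
BASIC_DISEASE_DICTIONARY = {"J18.9", "K35.8"}


def surplus(weight, rate_point_value, adjustment_factor, actual_cost):
    """Payment standard minus actual total cost."""
    return weight * rate_point_value * adjustment_factor - actual_cost


def label_case(case: dict) -> dict:
    """Add the surplus, a loss warning and the basic-group marker to one case."""
    value = surplus(case["weight"], case["rate_point_value"],
                    case["adjustment_factor"], case["actual_cost"])
    is_basic = (case["primary_diagnosis"] in BASIC_DISEASE_DICTIONARY
                or case.get("primary_procedure") in BASIC_DISEASE_DICTIONARY)
    return {
        **case,
        "surplus": value,
        "loss_warning": value < 0,             # negative surplus is flagged
        "basic_group_flag": 1 if is_basic else 0,
    }
```

For a case with weight 1.0, rate point value 10000.0, factor 1.0 and actual cost 12500.0, the surplus is -2500.0 and the loss warning is set.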

[0048] For step 5, a complication mapping table is generated and deep data cleaning is performed. The system logically merges the surplus prediction quantification results with the officially released lists of severe complications or comorbidities (MCC) and complications or comorbidities (CC). The process of building the medical insurance data analysis model involves periodically synchronizing the latest diagnostic catalog and mapping rules from the official server of the medical insurance management department and transforming them into a logical judgment matrix within the system.

[0049] During the data cleaning phase, the system performs a one-dimensional linear format transformation. This transformation involves reconstructing the flat data, originally stored in multiple subordinate diagnostic columns, into multi-row records with parent-child relationship characteristics. Each row contains only a unique settlement serial number, a primary diagnostic code, and a specific subordinate diagnostic code. This structured reconstruction reduces the computational complexity of subsequent association rule mining algorithms. Deep data cleaning also includes removing null values from fields. The system performs strict non-null checks on core fields in the merged data table (such as primary diagnostic codes, medical payment methods, and length of hospital stay). If a core field is detected as missing, the record is removed or marked as requiring correction.
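The wide-to-long reshaping can be sketched as follows, with null and duplicate removal applied along the way. Field names such as `settlement_no` and `other_dx_1` are assumptions.

```python
def flatten_diagnoses(record: dict, secondary_fields) -> list:
    """Turn one wide record into one row per distinct, non-empty secondary diagnosis."""
    rows = []
    seen = set()
    for field in secondary_fields:
        code = (record.get(field) or "").strip()
        if not code or code in seen:
            continue  # skip null values and duplicate codes
        seen.add(code)
        rows.append({
            "settlement_no": record["settlement_no"],
            "primary_diagnosis": record["primary_diagnosis"],
            "secondary_diagnosis": code,
        })
    return rows
```

Each output row pairs the settlement serial number and primary diagnosis with exactly one secondary diagnosis, which is the shape a CC/MCC lookup can consume row by row.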

[0050] Furthermore, the system establishes a medical logic rule base to exclude diagnostic combinations that do not conform to medical logic. This medical logic rule base includes checks for conflicts between gender and diagnosis (e.g., males receiving obstetric-related diagnoses) and between age and diagnosis (e.g., newborns receiving diagnoses of chronic diseases), as well as conflicts involving the site of illness. Through this deep, automated verification, the system can effectively retain a high-purity, valid diagnostic dataset for subsequent analysis and modeling.
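A rule base of this kind can be sketched as a list of named predicates. The two rules below are illustrative assumptions only: ICD-10 chapter O covers pregnancy, childbirth and the puerperium, while the set of "adult chronic" code prefixes and the 28-day newborn cutoff are hypothetical choices, not rules taken from the source.

```python
# Hypothetical prefixes standing in for chronic diseases not expected in newborns.
ADULT_CHRONIC_PREFIXES = {"I10", "E11"}

# Each rule is (name, predicate); a predicate returns True when the case conflicts.
RULES = [
    ("sex_vs_diagnosis",
     lambda c: c["sex"] == "M" and c["diagnosis"].startswith("O")),
    ("age_vs_diagnosis",
     lambda c: c["age_days"] < 28 and c["diagnosis"][:3] in ADULT_CHRONIC_PREFIXES),
]


def violated_rules(case: dict) -> list:
    """Return the names of all rules the case violates (empty if consistent)."""
    return [name for name, predicate in RULES if predicate(case)]


def keep_consistent(cases):
    """Retain only cases that pass every rule."""
    return [case for case in cases if not violated_rules(case)]
```

Keeping rules as data makes it straightforward to add site-of-illness checks or to log which rule rejected a record.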

[0051] For step 6, machine learning intelligent classification is implemented. The system uses a clustering algorithm to perform multi-stage feature clustering and classification processing on the effective diagnostic dataset. The clustering algorithm uses improved mean clustering logic, taking the patient's age, gender, length of hospital stay, surgical grade, number of comorbidities, and resource consumption level as input feature vectors.

[0052] In practice, the clustering process is divided into two stages. The first stage performs coarse-grained domain segmentation, initially classifying cases according to anatomical systems (such as the respiratory and circulatory systems). The second stage performs fine-grained disease sub-segmentation, identifying potential subgroups within the same anatomical system based on the similarity of resource consumption characteristics using clustering algorithms. The distance calculation in the clustering process uses the following Euclidean distance formula:

[0053] $$d(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}$$

[0054] where $x$ and $y$ represent the feature vectors of two different cases, and $n$ is the feature dimension. Through iterative calculations, the system groups cases with similar medical resource consumption characteristics around the same cluster center, and constructs a DRG classification index system with adaptive adjustment capabilities accordingly.
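The two-stage procedure can be sketched with a plain k-means loop over Euclidean distance. This is a simplified stand-in: it is ordinary k-means rather than the "improved" variant the specification refers to, the initialisation (first k points) is chosen only to keep the example deterministic, and features are assumed to be numeric and already scaled to comparable ranges.

```python
import math
from collections import defaultdict


def euclidean(x, y):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def nearest(point, centers):
    """Index of the cluster center closest to the point."""
    return min(range(len(centers)), key=lambda j: euclidean(point, centers[j]))


def kmeans(points, k, iters=20):
    """Plain k-means; returns (centers, labels)."""
    centers = [list(p) for p in points[:k]]  # deterministic init for the sketch
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            clusters[nearest(p, centers)].append(p)
        for j, members in enumerate(clusters):
            if members:  # keep the old center if a cluster is empty
                centers[j] = [sum(col) / len(members) for col in zip(*members)]
    return centers, [nearest(p, centers) for p in points]


def two_stage(cases, k):
    """Stage 1: split by anatomical system. Stage 2: cluster within each system.

    `cases` is a list of (system_name, feature_vector) pairs.
    """
    by_system = defaultdict(list)
    for system, features in cases:
        by_system[system].append(features)
    return {s: kmeans(pts, min(k, len(pts))) for s, pts in by_system.items()}
```

Because Euclidean distance is scale-sensitive, a feature such as total cost would dominate age or length of stay unless the vectors are normalised first.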

[0055] The adaptive adjustment capability refers to the system dynamically adjusting the parameter settings of the cluster centers by monitoring newly generated medical data streams in real time. When the distribution of resource consumption shifts due to the widespread application of new diagnostic and treatment technologies (such as minimally invasive surgery replacing traditional open surgery) or significant adjustments to medical insurance policies, the system can automatically sense changes in data distribution characteristics and recalculate the classification boundaries, thereby improving the system's sensitivity to complex clinical scenarios.
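The specification does not say how a shift in the data distribution is detected. As one possible reading, the sketch below watches a sliding window of a single resource-consumption value (for example, total cost) and signals that the cluster boundaries should be recalculated when the window mean departs from a baseline by more than a relative tolerance. The class name, window size, and tolerance are all assumptions.

```python
from collections import deque
from statistics import fmean


class DriftMonitor:
    """Signals re-clustering when the recent mean drifts away from a baseline."""

    def __init__(self, baseline_mean: float, window: int = 100, tolerance: float = 0.2):
        self.baseline_mean = baseline_mean
        self.window = deque(maxlen=window)  # oldest values fall out automatically
        self.tolerance = tolerance

    def observe(self, value: float) -> bool:
        """Add one new case; return True when a recalculation should be triggered."""
        self.window.append(value)
        if len(self.window) < self.window.maxlen:
            return False  # not enough recent data yet
        shift = abs(fmean(self.window) - self.baseline_mean) / self.baseline_mean
        return shift > self.tolerance
```

A production system would more likely compare full distributions per feature, but the control flow (monitor the stream, detect the shift, trigger recalculation) is the same.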

[0056] To ensure efficient system operation, a series of supporting modules are integrated. The monitoring module is configured to perform full lifecycle status monitoring for all data processing steps (from data extraction to intelligent classification). This module records the time consumption, data throughput, and memory usage of each step using data tracking technology. Once the time consumption of any step exceeds a preset time threshold (e.g., the data extraction stage lasts for more than 120 seconds), the system automatically triggers a performance alert and pushes alarm information to the management terminal via an asynchronous communication protocol.
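The time-threshold part of this monitoring can be sketched with a context manager that times each step and records an alert when a preset limit is exceeded. The 120-second extraction limit comes from the example above; the class, step names, and alert structure are hypothetical, and throughput tracking, memory tracking, and the asynchronous push to the management terminal are left out.

```python
import time
from contextlib import contextmanager


class StepMonitor:
    """Records per-step durations and collects alerts for steps over their limit."""

    def __init__(self, thresholds: dict, clock=time.perf_counter):
        self.thresholds = thresholds  # step name -> limit in seconds
        self.clock = clock            # injectable so the monitor can be tested
        self.timings = {}
        self.alerts = []

    @contextmanager
    def step(self, name: str):
        start = self.clock()
        try:
            yield
        finally:
            elapsed = self.clock() - start
            self.timings[name] = elapsed
            limit = self.thresholds.get(name)
            if limit is not None and elapsed > limit:
                # A real system would push this to the management terminal.
                self.alerts.append({"step": name, "elapsed_s": elapsed, "limit_s": limit})
```

Usage would look like `with monitor.step("extract"): run_extraction()`, where `run_extraction` stands for whatever the step actually does.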

[0057] The export module can output the final diagnostic analysis results as a structured medical insurance analysis report. The report content covers the enrollment rate, surplus rate, cost deviation, and coding compliance score for each disease diagnosis-related group. The report supports export in multiple formats (such as CSV, XML, or structured JSON) to facilitate data integration with other hospital management systems (such as the HRP system).
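As a rough illustration of multi-format export, the sketch below serialises the same report rows to CSV and to JSON using only the standard library. The column names are assumptions derived from the metrics listed above; the specification does not fix a report schema.

```python
import csv
import io
import json

# Hypothetical column names for the four metrics named in the report description.
FIELDS = ["drg_code", "enrollment_rate", "surplus_rate",
          "cost_deviation", "coding_compliance_score"]


def to_csv(rows: list[dict]) -> str:
    """Render report rows as CSV text with a header line."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=FIELDS, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()


def to_json(rows: list[dict]) -> str:
    """Render report rows as a structured JSON document."""
    return json.dumps({"report": rows}, ensure_ascii=False, indent=2)
```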

[0058] At the architectural level, the system runs on a high-performance server cluster and employs a distributed storage architecture (such as HDFS or a distributed database cluster). For medical records reaching a preset scale (e.g., tens of millions of records), the system executes parallel processing logic, breaking down the full data cleaning and grouping prediction tasks into multiple sub-tasks for parallel execution, so that the execution time of a single full task stays within a preset allowable time. All data transmission paths within the system are encrypted using the Advanced Encryption Standard (AES-256) to ensure that patient privacy information is protected during extraction, association, cleaning, and classification. All create, read, update, and delete (CRUD) operations on the data retain complete audit trail logs, ensuring data traceability and immutability.

[0059] The machine learning intelligent classification process further incorporates an ensemble learning mechanism. By combining multiple weak classifiers (such as a combination of random forest and gradient boosting tree), the accuracy of grouping low-frequency rare diseases is improved. The ensemble learning module determines the final inclusion suggestions through a weighted voting mechanism, strictly controlling its identification error rate to within a preset range of 0.05%.
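The weighted voting step can be sketched as follows. The model names, weights, and group labels are placeholders for illustration; how the weights are obtained (for example, from validation accuracy) is not stated in the specification.

```python
from collections import defaultdict


def weighted_vote(predictions: dict, weights: dict) -> str:
    """Pick the group label with the highest total classifier weight.

    `predictions` maps classifier name -> predicted label;
    `weights` maps classifier name -> voting weight (missing names count as 0).
    """
    scores = defaultdict(float)
    for model, label in predictions.items():
        scores[label] += weights.get(model, 0.0)
    # Iterating over sorted labels makes tie-breaking deterministic.
    return max(sorted(scores), key=scores.get)
```

For example, if a random forest votes for one group with weight 0.5 and two other models vote for a second group with weights 0.3 and 0.1, the first group is returned because 0.5 exceeds 0.4.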

[0060] Furthermore, the system establishes a closed-loop feedback mechanism. Abnormal settlement cases are automatically pushed to the human review terminal. Reviewers provide corrections or suggestions based on the reasons for the anomalies (such as coding errors, omissions in expense recording, or inappropriate rules). The system collects these corrective opinions and feeds them back into the machine learning model as high-quality incremental training samples. Through this reinforcement learning-style feedback mechanism, the system can achieve continuous algorithm evolution and improved classification accuracy.

[0061] In a specific operational scenario, when the system accesses quarterly data from a large medical institution, it first extracts 50,000 settlement records from the institution's settlement database and the corresponding 60,000 original medical records from its medical record database via the automated interface in step 1. In step 2, through an SQL right outer join, the system discovers that 10,000 medical record homepage records have no corresponding grouping results on the settlement platform, triggering a data missingness audit that identifies these records as mostly belonging to patients hospitalized across fiscal years. In steps 3 and 4, the system calculates the overall profit and loss expectation for the quarter using the mapping function and identifies 300 basic disease group cases under loss warning status. In step 5, through one-dimensional cleaning, the system finds that 50 of these loss cases are due to the omission of serious complication codes listed in the MCC list. Finally, in step 6, the machine learning model, through cluster analysis of the institution's data, suggests that management add specific subgroup classifications for a certain new technology to more accurately reflect resource consumption levels.
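The association in step 2 of this scenario can be sketched with an in-memory SQLite database. Table and column names are hypothetical. Because `RIGHT JOIN` is only available in newer SQLite releases, the sketch writes the equivalent `LEFT JOIN` with the medical record table placed first; the result is the same set of rows a right outer join would keep when the medical record table is the right table.

```python
import sqlite3


def build_demo_db() -> sqlite3.Connection:
    """Create empty settlement and medical-record tables (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE settlement (record_no TEXT, discharge_date TEXT, drg_code TEXT);
        CREATE TABLE medical_record (record_no TEXT, discharge_date TEXT, primary_dx TEXT);
    """)
    return conn


def unmatched_records(conn: sqlite3.Connection) -> list[tuple]:
    """Medical records with no grouping result, keyed on record number and discharge date."""
    return conn.execute("""
        SELECT m.record_no, m.discharge_date
        FROM medical_record AS m
        LEFT JOIN settlement AS s
          ON s.record_no = m.record_no
         AND s.discharge_date = m.discharge_date
        WHERE s.record_no IS NULL
        ORDER BY m.record_no
    """).fetchall()
```

The rows returned by `unmatched_records` correspond to the records that would enter the data missingness audit.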

[0062] Example 2

[0063] Based on Example 1, this embodiment further details the in-depth processing and correlation mechanism for unstructured medical data (such as chief complaints and present medical history in electronic medical records) to enhance the dimensions of DRG analysis.

[0064] In this embodiment, a Natural Language Processing (NLP) module is added to the extraction process in step 1. The NLP module is configured to perform semantic recognition on electronic medical record text other than the medical record homepage. Specifically, the system uses a pre-trained medical language model (such as Med-BERT) to extract entities from the doctor's descriptive text. The extracted medical entities include symptoms, signs, non-surgical procedures, and examination and test results. After format normalization, these entity data are integrated as supplementary dimensions into the associated data model in step 2.

[0065] When constructing the associated data model, the system not only uses the medical record number and discharge time as common mapping primary keys, but also introduces a diagnostic logic verification primary key. This diagnostic logic verification primary key is used to compare whether the coded diagnosis on the medical record's cover page is consistent with the descriptive diagnosis in the electronic medical record. By performing cross-table logical verification in the SQL database, the system can identify records where the diagnostic code and the description of the condition are significantly inconsistent. For example, when the electronic medical record frequently contains descriptions of symptoms related to "acute myocardial infarction," but the medical record's cover page only codes it as "stable angina," the system automatically marks it with a "code underestimates risk" label in the merged data table.

[0066] For the deep data cleaning in step 5, this embodiment introduces knowledge graph-based logical constraints. The medical logic rule base is constructed as a multi-relational knowledge graph, containing strong association rules between diseases and drugs, diseases and surgeries, and diseases and examinations. During the cleaning process, the system traverses the knowledge graph and performs a rationality score on each diagnostic combination. If the association strength between the primary diagnosis and its corresponding primary surgical procedure in a record is lower than a preset threshold in the knowledge graph, the record is marked as a "logical suspect" and proceeds to the intelligent classification stage in step 6 for special processing.
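As a simplified picture of the threshold check, the sketch below stands in for the knowledge graph with a dictionary of (diagnosis, procedure) edges and their association strengths. The edge values, the 0.3 threshold, and the field names are invented for illustration; a real graph would also hold the disease-drug and disease-examination relations mentioned above.

```python
def review_status(record: dict, graph: dict, threshold: float = 0.3) -> str:
    """Mark a record as a logical suspect when diagnosis and procedure are weakly linked.

    `graph` maps (primary_dx, primary_procedure) -> association strength in [0, 1].
    Pairs that are absent from the graph are treated as having strength 0.
    """
    edge = (record["primary_dx"], record["primary_procedure"])
    strength = graph.get(edge, 0.0)
    return "logical suspect" if strength < threshold else "ok"
```

Records marked as logical suspects are not deleted in this embodiment; they continue to step 6 for special processing.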

[0067] In step 6, the input feature vector of the clustering algorithm is further expanded. In addition to the features described in Example 1, a disease severity score extracted based on NLP and a diagnosis-treatment consistency score based on a knowledge graph are also added. The system uses a multi-stream convolutional neural network (Multi-stream CNN) or a graph convolutional network (GCN) to fuse these heterogeneous features. The multi-stream network processes the structured data stream and the text feature stream respectively, and finally performs feature fusion in a fully connected layer to output a more accurate DRG classification scheme.

[0068] Furthermore, the feedback loop in this embodiment incorporates an automatic suggestion generation function. When the system identifies an abnormal settlement case, it not only pushes the information to the reviewers but also automatically generates a correction suggestion report using a generative model (such as an LLM in the healthcare vertical). The report details the recommended addition of MCC / CC codes and their supporting evidence in the electronic medical record. This automated suggestion generation mechanism significantly improves the efficiency of coders and shortens the data governance cycle.

[0069] In terms of distributed architecture, this embodiment adopts a containerized deployment solution (such as a Kubernetes cluster). Each data processing step is encapsulated as an independent microservice component. Highly reliable data transmission between microservices is achieved through message queues (such as Kafka). This architectural design gives the system strong horizontal scalability, enabling it to cope with sudden surges in traffic generated by medical institutions during peak application periods by dynamically adding computing nodes.

[0070] This embodiment also particularly strengthens the security audit mechanism. The system introduces a blockchain-based evidence storage module, which writes the hash value of each batch of processed medical insurance data analysis tables into a private or consortium blockchain in real time. This ensures the authority of the analysis results, prevents any unauthorized tampering with the generated analysis reports, and meets the stringent requirements of medical insurance regulatory authorities for data authenticity.
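The evidence-storage idea can be illustrated with a local hash chain. This is an in-memory stand-in, not a client for any private or consortium blockchain: it only shows how a batch digest is computed, linked to the previous entry, and later used to detect tampering. The canonical JSON serialisation is an assumption.

```python
import hashlib
import json


def batch_digest(rows: list[dict]) -> str:
    """SHA-256 over a canonical JSON serialisation of one analysis-table batch."""
    payload = json.dumps(rows, sort_keys=True, separators=(",", ":")).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


class HashChain:
    """Append-only list of blocks, each linked to the previous block's hash."""

    def __init__(self):
        self.blocks = []

    def append(self, rows: list[dict]) -> dict:
        prev_hash = self.blocks[-1]["block_hash"] if self.blocks else "0" * 64
        digest = batch_digest(rows)
        block_hash = hashlib.sha256((prev_hash + digest).encode("utf-8")).hexdigest()
        block = {"prev_hash": prev_hash, "batch_digest": digest, "block_hash": block_hash}
        self.blocks.append(block)
        return block

    def verify(self, index: int, rows: list[dict]) -> bool:
        """True when the rows still match the digest stored at the given block."""
        return self.blocks[index]["batch_digest"] == batch_digest(rows)
```

Only the digest is stored, so the analysis tables themselves (and any patient information in them) never need to be written to the ledger.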

[0071] Example 3

[0072] This embodiment focuses on describing the application of the present invention in a cross-regional medical consortium (medical alliance) environment, and how to improve the versatility of the DRG analysis model through multi-center collaborative learning.

[0073] In cross-regional application scenarios, the automated data interface in step 1 is configured with a multi-tenant architecture. Medical institutions at different levels (such as primary healthcare centers, secondary hospitals, and tertiary hospitals) upload data to the central node through their respective independent access nodes. The system pre-sets multiple normalization templates to address the data heterogeneity across different institutions. During normalization, the system automatically identifies the institution identifier of the access node and calls the corresponding transformation rule set, thereby achieving efficient normalization of heterogeneous data within the region.

[0074] In step 2, the associated data model supports a federated query mode. For some raw data involving high privacy or that cannot be directly aggregated due to policy restrictions, the system distributes query commands to local nodes in each institution to perform local associations, ensuring that the raw data does not leave the hospital. The intermediate statistics generated by the local associations (excluding patient identity information) are then aggregated to the central server to execute a global right outer join logic. This federated data association mode solves the data silo problem and meets data privacy compliance requirements.

[0075] For step 3, the system establishes a regional weight standard library. Due to differences in medical insurance policies and premium rate values across different regions, the mapping function is configured to support geofencing logic. The system automatically switches the corresponding weight table and premium rate calculation parameters based on the administrative division code of the medical institution. For cases involving cross-regional medical treatment, the system automatically triggers a multi-regional strategy matching engine to perform composite mapping calculations based on the settlement agreements of the insured's place of residence and the place of medical treatment.
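The region-dependent lookup can be sketched as a parameter table keyed by administrative division code. The payment standard follows the calculation stated in claim 5 (group weight multiplied by the regional rate point value and by an adjustment factor), and the surplus is that standard minus the actual cost. Region codes, group codes, and numeric values are placeholders, and the composite mapping for cross-regional treatment is not covered.

```python
def payment_standard(region_params: dict, region_code: str, drg_code: str,
                     adjustment: float = 1.0) -> float:
    """weight x regional rate point value x adjustment factor."""
    params = region_params[region_code]  # selects the weight table and rate by region
    return params["weights"][drg_code] * params["rate"] * adjustment


def surplus(region_params: dict, region_code: str, drg_code: str,
            actual_cost: float, adjustment: float = 1.0) -> float:
    """Positive: surplus space; negative: the case would carry a loss warning."""
    standard = payment_standard(region_params, region_code, drg_code, adjustment)
    return standard - actual_cost
```

With a hypothetical weight of 1.5 and a rate of 10,000, the payment standard is 15,000; an actual cost of 16,000 then gives a surplus of -1,000, that is, a loss warning.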

[0076] In step 6, the machine learning intelligent classification, this embodiment introduces a federated learning framework. Each medical institution uses local data to train a local clustering model and classifier. Through a secure aggregation protocol, the central server aggregates the gradient information or cluster center parameters of each node and updates the global classification model. This collaborative learning model allows the system to incorporate typical case characteristics from different medical institutions, solving the problem of low model recognition rates in individual hospitals due to insufficient rare disease sample sizes. The global DRG classification index system optimized by federated learning has better statistical representativeness within the region.
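For the case where cluster center parameters are aggregated, the central update can be sketched as a sample-count-weighted average of each node's centers. The sketch assumes that cluster index j refers to the same group at every node and that each node reports how many cases it assigned to each cluster; the secure aggregation protocol itself is not modelled.

```python
def aggregate_centers(updates: list[tuple]) -> list[list[float]]:
    """Combine per-node cluster centers into global centers.

    `updates` is a list of (centers, counts) pairs, one per node, where
    centers[j] is the node's vector for cluster j and counts[j] is the
    number of local cases assigned to that cluster.
    """
    k = len(updates[0][0])
    dim = len(updates[0][0][0])
    merged = []
    for j in range(k):
        total = sum(counts[j] for _, counts in updates)
        if total == 0:
            merged.append([0.0] * dim)  # no node has data for this cluster
            continue
        merged.append([
            sum(centers[j][d] * counts[j] for centers, counts in updates) / total
            for d in range(dim)
        ])
    return merged
```

Weighting by sample count lets a hospital with only a handful of rare-disease cases still contribute, without letting it outweigh nodes with far more data.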

[0077] At the application output layer, this embodiment adds a regional medical resource scheduling suggestion module. By analyzing the DRG enrollment status, surplus distribution, and medical resource consumption intensity across the entire region, the system can provide resource allocation decision support for health administrative departments. For example, when the system detects that the actual cost of a certain type of DRG group in a certain region is generally much higher than the payment standard and the surplus rate is negative, the system automatically issues a warning of abnormal treatment costs for that group and analyzes whether it is caused by the overuse of specific high-value consumables.

[0078] The monitoring module in this embodiment also integrates network topology monitoring functionality. On cross-regional transmission links, the system monitors bandwidth utilization and link stability in real time. For remote institutions with poor network environments, the system automatically switches to resumable (breakpoint resume) transfer mode and data compression mode (such as using the compact Protobuf serialization format) to ensure data transmission reliability. In addition, the feedback loop is extended to a cross-institutional online expert collaborative review platform. For complex, hard-to-resolve disputes over disease group inclusion, multi-center online expert consultation reviews can be initiated, and the review results are ultimately compiled into a regional standard coding knowledge base.

[0079] In performing deep data cleaning, this embodiment also introduces anomaly detection algorithms (such as isolation forests or autoencoder structures). The system not only removes logically conflicting records, but also uses unsupervised learning to identify "outliers" that, while logically consistent, deviate significantly from the population in statistical distribution. These outliers often indicate undiscovered new medical practices or potential medical insurance violations. The system marks them separately to guide managers in conducting targeted in-depth audits.
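The algorithms named in this paragraph need a machine learning library. As a dependency-free stand-in, the sketch below flags univariate outliers with the modified z-score based on the median and the median absolute deviation (MAD). It is a swapped-in technique used only to illustrate marking statistically unusual but logically valid cases; the 3.5 cut-off is a common convention, not a value from the specification.

```python
from statistics import median


def mad_outliers(values: list[float], cutoff: float = 3.5) -> list[int]:
    """Return the indexes of values whose modified z-score exceeds the cut-off."""
    med = median(values)
    mad = median(abs(v - med) for v in values)
    if mad == 0:
        return []  # no spread around the median; the score is undefined
    # 0.6745 rescales the MAD so the score is comparable to a standard z-score.
    return [i for i, v in enumerate(values)
            if 0.6745 * abs(v - med) / mad > cutoff]
```

Flagged cases are marked for audit rather than deleted, which matches the intent described above.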

[0080] The embodiments of this invention integrate multiple technical means to achieve fully automated processing from underlying raw medical data to top-level decision support reports. Its core lies not only in the faithful execution of existing DRG grouping logic, but also in constructing a high-precision digital foundation for medical insurance governance with self-evolving capabilities through machine learning, big data correlation, and deep cleaning technologies. This foundation can adapt to constantly evolving medical technologies and medical insurance policies, providing medical institutions with stable, transparent, and scientific means of financial forecasting and quality monitoring.

[0081] Those skilled in the art should understand that the embodiments described above are merely typical implementations of the present invention. In actual engineering deployments, the parameter configurations, algorithm selections, and storage architectures of each module can be flexibly adjusted according to different server environments, data scales, and business complexity. For example, in ultra-large-scale data scenarios, the SQL database can be replaced with a distributed columnar storage system to improve query performance; in scenarios with extremely high security requirements, fully homomorphic encryption technology can be introduced to perform data processing. These subtle evolutions or equivalent substitutions based on the technical concept of the present invention should all be covered within the protection scope of the present invention.

[0082] Although embodiments of the invention have been shown and described, those skilled in the art will understand that various changes, modifications, substitutions, and alterations can be made to these embodiments without departing from the principles and spirit of the invention, the scope of which is defined by the appended claims and their equivalents.

Claims

1. A DRG analysis system based on medical data, characterized in that it includes the following steps:
Step 1: Establish an extraction, transformation and loading mechanism for heterogeneous data sources; export settlement-level grouping data from the disease diagnosis-related group payment management platform through an automated data interface; at the same time, extract medical record homepage data from the medical institution information system according to the medical quality management and control information platform standard; perform character cleaning and structured format unification on the extracted medical record number field and discharge time field to eliminate expression differences between heterogeneous data sources;
Step 2: Load the preprocessed grouping data and the medical record homepage data into a structured query language database, use the medical record number and discharge time as the common mapping primary key, and perform multi-table association using right outer join logic to generate a merged data table with complete diagnosis and treatment information and settlement information;
Step 3: Perform a structured transformation on the annual disease diagnosis-related group weight table, establish the mapping function between disease diagnosis-related group codes and medical insurance payment standards, and retrieve the corresponding cost base in the weight matrix by logically combining the insurance types and disease diagnosis-related group codes in the merged data table;
Step 4: Perform a difference calculation based on the obtained cost base and the actual medical expenses incurred by the medical institution to generate a surplus prediction quantification result, and automatically label the basic disease groups according to preset logical rules to identify abnormal settlement cases that deviate from the normal payment range;
Step 5: Combine the surplus prediction quantification result with the list of serious complications or comorbidities and the list of complications or comorbidities to construct a medical insurance data analysis model; by converting the data from a multidimensional matrix to a one-dimensional linear format, remove in turn null field values, duplicate records, and diagnostic combinations that do not conform to medical logic, so as to retain a high-purity effective diagnostic dataset;
Step 6: Perform multi-stage feature clustering and classification processing on the effective diagnostic dataset using a clustering algorithm to construct a disease diagnosis-related grouping classification index system with adaptive adjustment capabilities.

2. The DRG analysis system based on medical data according to claim 1, characterized in that, in step 1, the specific process of establishing an extraction, transformation, and loading mechanism for heterogeneous data sources is as follows: high-speed collection of medical data from medical institutions at different administrative levels is achieved through a preset data interface protocol; the data interface protocol supports streaming data transmission and batch scheduled tasks; specifically, the settlement-level grouping data of the disease diagnosis-related group (DRG) payment management platform is obtained using the streaming data transmission mode to ensure the real-time nature of the settlement list number, grouping results, actual payment amount, and settlement time fields; a large amount of medical record homepage data is extracted from the medical record information system or medical quality management and control information platform within the medical institution using the batch scheduled task mode; the medical record homepage data includes basic patient information, primary diagnosis code, subordinate diagnosis code, surgical operation information, total inpatient costs and their details, admission route, and discharge destination; character cleaning is performed on the extracted medical record number field, including automatically identifying and removing spaces, tabs, special non-alphanumeric interference characters, and leading zeros within the field to eliminate matching obstacles caused by different input habits.

3. The DRG analysis system based on medical data according to claim 1, characterized in that, in step 1, the specific process of unifying the structured format is as follows: the discharge time field is uniformly converted into a preset time series format; the built-in mapping dictionary is used to uniformly convert all source diagnostic codes into a predetermined version of the International Classification of Diseases standard, so that all medical diagnostic data are processed in a unified semantic space; in step 2, when the data is loaded into the structured query language database, the integrity constraint detection algorithm is triggered; the integrity constraint detection algorithm performs non-empty checks and data type validity checks on key fields; when abnormal row records with missing medical record numbers, incomplete primary diagnostic codes, or discharge dates earlier than admission dates are identified, the interception logic is executed and the abnormal row records are stored in the error log database in real time, generating a corresponding abnormal data audit report to ensure that the data entering the subsequent analysis process meets the preset quality requirements.

4. The DRG analysis system based on medical data according to claim 1, characterized in that, in step 2, the specific process of performing multi-table association using the right outer join logic is as follows: using the medical record homepage data of the medical quality management and control information platform standard as the right table benchmark, and matching it with the disease diagnosis-related group data, all in-hospital treatment records are included in the analysis scope; for record rows that do not generate grouping results on the settlement platform, preset null placeholders are automatically filled, and a data missingness audit process is triggered to identify the record status of patients hospitalized across years; in step 3, the specific process of the structured transformation is as follows: the original weight file in portable document format or spreadsheet format is parsed into a database-readable relational table structure using a document parsing engine, and a weight history database indexed by version number is established; the indexing mechanism of the weight history database supports data retrieval for different settlement years and different regional version weights.

5. The DRG analysis system based on medical data according to claim 1, characterized in that, in step 3, the mapping function is configured to perform weighted calculations based on different medical insurance type parameters; the mapping function automatically matches the corresponding payment ratio factor and pooled fund sharing ratio by identifying the insurance type identification code in the merged data table; wherein the insurance type identification code covers urban employee basic medical insurance, urban and rural resident basic medical insurance, and cross-regional medical treatment settlement codes; in step 4, the difference calculation logic for generating the surplus prediction quantification result is as follows: multiply the weight value of the corresponding disease diagnosis-related group by the preset rate point value of the pooled area, and then multiply by the adjustment factor set for the insurance type or hospital level to obtain the payment standard; subtract the total actual cost incurred by the patient, as recorded on the medical record homepage, from the payment standard to obtain the surplus prediction quantification result; if the surplus prediction quantification result is positive, it is determined that the case has surplus space; if the surplus prediction quantification result is negative, a loss warning label is marked in the merged data table.

6. The DRG analysis system based on medical data according to claim 1, characterized in that, in step 4, the logical rules for automatically labeling the basic disease group are as follows: a preset basic disease dictionary is retrieved; when the patient's primary diagnosis code or primary surgical operation code matches a dictionary entry, a first preset value is written into a specific marker column of the merged data table; if no dictionary entry is matched, a second preset value is written into the marker column, thus achieving automated screening of the basic and complex disease groups; in step 5, the process of constructing the medical insurance data analysis model includes periodically synchronizing the latest diagnostic catalog and mapping rules from the official server of the medical insurance management department and converting them into a logical judgment matrix within the system; through logical merging operations, the surplus prediction quantification results are associated with the list of serious complications or comorbidities and the list of complications or comorbidities, establishing the mapping relationship between clinical diagnosis and medical insurance payment compensation.

7. The DRG analysis system based on medical data according to claim 1, characterized in that, in step 5, the specific process of converting the data from a multidimensional matrix to a one-dimensional linear format is as follows: the flat data originally stored in multiple subordinate diagnosis columns is reconstructed into multiple rows of records with parent-child relationship characteristics, ensuring that each row of records contains only a unique settlement serial number, a primary diagnosis code, and a specific subordinate diagnosis code, reducing the computational complexity of subsequent association rule mining; the null value removal operation in the deep data cleaning specifically involves: performing non-null checks on the core fields of primary diagnosis code, medical payment method, and length of hospital stay; once a core field is detected to be missing, the record is removed or marked as pending correction; the specific process of removing diagnosis combinations that do not conform to medical logic is as follows: a medical logic rule base is established, which includes rules for verifying conflicts between gender and diagnosis, age and diagnosis, and anatomical location and diagnosis; by traversing the medical logic rule base, a rationality score is computed for each diagnosis combination, and conflicting records with scores below a preset threshold are removed.

8. The DRG analysis system based on medical data according to claim 1, characterized in that, in step 6, the clustering algorithm employs an improved mean clustering logic, using the patient's age, gender, length of hospital stay, surgical grade, number of comorbidities, and resource consumption level as input feature vectors; the multi-stage feature clustering process specifically includes: a first stage performing coarse-grained domain partitioning, initially classifying cases according to anatomical systems; a second stage performing fine-grained disease subdivision, identifying potential subdivision groups within the same anatomical system based on the similarity of resource consumption characteristics using the clustering algorithm; during the iterative clustering operation, the distance between feature vectors of different cases is calculated using the Euclidean distance formula, grouping cases with similar medical resource consumption characteristics around the same cluster center, and constructing a disease diagnosis-related grouping classification index system accordingly; the adaptive adjustment capability refers to dynamically adjusting the parameter settings of the cluster centers by real-time monitoring of newly generated medical data streams, automatically recalculating the classification boundary when resource consumption distribution characteristics shift.

9. The DRG analysis system based on medical data according to claim 1, characterized in that it further includes the following engineering implementation: the monitoring module performs full lifecycle status monitoring of the entire data processing process; by using data tracking technology, it records the time consumption, data throughput, and memory usage of each step; once the time consumption of any step exceeds the preset time threshold, a performance warning is automatically triggered and an alarm message is pushed to the management terminal; the export module outputs the final diagnostic analysis results as a structured medical insurance analysis report; the content of the medical insurance analysis report covers the enrollment rate, surplus rate, cost deviation, and coding compliance score of each disease diagnosis-related group; the system runs on a high-performance server cluster and uses a distributed storage architecture to perform parallel processing on massive medical records; the full data cleaning and grouping prediction tasks are split into multiple sub-tasks and executed in parallel through a distributed computing framework; all data transmission paths within the system are encrypted using the Advanced Encryption Standard to ensure that patient privacy information is protected during extraction, association, cleaning, and classification, and all data operations are recorded in complete audit trail logs.

10. The DRG analysis system based on medical data according to claim 1, characterized in that, in step 6, an ensemble learning mechanism is introduced to improve the accuracy of grouping low-frequency rare diseases; a weighted voting mechanism is used to combine the outputs of multiple weak classifiers to determine the final inclusion recommendation; the system also includes a closed-loop feedback mechanism, which pushes the labeled abnormal settlement cases to the manual review terminal and collects the reviewers' correction opinions on coding errors, omissions in fee recording, or improper rules; the correction opinions are fed back into the machine learning model as high-quality incremental training samples to achieve continuous evolution of classification accuracy and algorithm logic; in the application scenario of cross-regional medical consortia, the automated data interface in step 1 is configured as a multi-tenant architecture, automatically switching the corresponding normalization template according to the administrative division code of the medical institution; in step 2, a federated query mode is adopted: under the premise of ensuring that the original data does not leave the hospital, query instructions are distributed to the local nodes of each institution to perform local associations, and the generated intermediate statistics are aggregated to the central server to perform global logical operations.