A large-model-based system for identifying the evolution path of multiple comorbidities in the elderly
By constructing a system for identifying the evolution path of multiple diseases in the elderly based on a large model, the problems of inconsistent formats and non-standard recording of multi-source heterogeneous medical data have been solved. This has enabled highly accurate diagnosis and treatment of multiple diseases in the elderly and intelligent auxiliary decision-making, thereby improving clinical diagnosis and treatment efficiency and public health prevention and control capabilities.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Applications(China)
- Current Assignee / Owner
- THE UNIVERSITY-TOWN HOSPITAL AFFILIATED TO CHONGQING MEDICAL UNIVERSITY
- Filing Date
- 2026-03-04
- Publication Date
- 2026-06-12
AI Technical Summary
Existing technologies lack a unified data platform and dedicated models in the field of multiple diseases in the elderly, resulting in inconsistent data formats, scattered storage, and poor security. This makes it difficult to accurately identify disease evolution paths and risk factors, and fails to meet the needs of clinical diagnosis and treatment optimization.
A system for identifying the evolution path of multiple diseases in the elderly based on a large model is constructed. Through data cleaning and aggregation modules, a big data platform for multiple diseases, and an artificial intelligence diagnosis and treatment application module, the system achieves standardized processing and storage of multi-source heterogeneous medical data. Combined with AI technology, it accurately identifies the evolution path and risk factors of comorbidities and recommends intelligent diagnosis and treatment plans.
It enables efficient storage and secure sharing of data on multiple comorbidities in the elderly, accurately identifies disease evolution paths and risk factors, improves the accuracy and efficiency of clinical diagnosis and treatment, provides data and technical support for proactive prevention and control of multiple comorbidities, reduces the risk of misdiagnosis and missed diagnosis, and supports scientific intelligent auxiliary decision-making.
Figure CN122201759A_ABST
Abstract
Description
Technical Field
[0001] This invention relates to the field of medical data processing technology, and more specifically, to a system for identifying the evolution path of multiple comorbidities in the elderly based on a large model.
Background Technology
[0002] China has a large and rapidly growing elderly population that lives with illness for long periods, making the prevention and control of comorbidity (the coexistence of two or more chronic diseases) a serious challenge. At the end of 2018, China's population aged 65 and above reached 166.58 million, accounting for 11.9% of the total population. 75% of the elderly suffer from at least one chronic disease, and the proportion of comorbidities among discharged patients aged 65 and above is as high as 88%. Comorbidities in the elderly involve multiple systems and disciplines and are characterized by large sample sizes, complex etiologies, long evolution periods, and diverse phenotypes. Studying their current status, influencing factors, and evolutionary pathways, and optimizing clinical treatment plans, is crucial for ensuring the proactive health of the elderly. There is an urgent need to leverage big data and artificial intelligence technologies to overcome the limitations of traditional treatment models.
[0003] First, although medical statistical modeling technology in China is nearing maturity, specialized models for chronic diseases in the elderly are scarce and data resources are insufficient, and large-capacity data storage remains inconvenient, with poor security and limited sharing and scalability. More in-depth model designs exist abroad, but related research and applications are still relatively limited. Second, China has established several national-level disease-specific database platforms (such as the "one database, one network" for malignant tumors and the rare disease registration system), but disease-specific databases for chronic diseases in the elderly are still lacking. Internationally, although there are the pan-European biobank infrastructure (BBMRI-ERIC) and Korean OMOP / OHDSI omics and clinical databases, research on disease-specific databases for chronic diseases in the elderly is likewise lacking. Third, research on comorbidities in China started late and lacks large-scale epidemiological studies; data is mostly collected piecemeal through follow-up, so it is disorganized and difficult to integrate. International research started earlier, and relevant nursing guidelines and treatment strategy reviews have been published, but no unified platform that gathers comprehensive data on chronic diseases in the elderly has been established, making it difficult to systematically study evolution paths and development trends through statistical models.
[0004] The application of existing technologies in this field still has the following problems:
[0005] At the data level: Multi-source heterogeneous medical data (from multiple hospital information systems) has inconsistent formats, non-standard records, and contains junk and redundant data. It lacks a standardized terminology system, making data cleaning and integration difficult, and its storage, sharing, and security are hard to guarantee. At the platform level: There is a lack of a dedicated big data sharing platform for comorbidities in the elderly, which cannot effectively aggregate multi-dimensional data such as clinical, imaging, and multi-omics data, making it difficult to support large-scale research and intelligent applications. At the model and application level: There is a lack of dedicated statistical models for comorbidities in the elderly. Traditional research methods are limited to simple data analysis and manual follow-up, which cannot accurately explore disease evolution paths and risk factors. Furthermore, there is a lack of intelligent diagnostic and treatment decision-making tools, making it difficult to meet the needs of clinical diagnosis and treatment optimization.
[0006] Therefore, the present invention aims to provide a system for identifying the evolution path of multiple comorbidities in the elderly based on a large model, in order to solve the above-mentioned problems.
Summary of the Invention
[0007] The purpose of this invention is to provide a system for identifying the evolution path of multiple comorbidities in the elderly based on a large model. This invention constructs a high-quality database of chronic diseases in the elderly through standardized processing and distributed storage, filling the gap in domestic databases of chronic diseases in the elderly. At the same time, relying on a terminology system and AI technology, it accurately identifies the evolution path and risk factors of comorbidities, intelligently recommends treatment plans, significantly improves the accuracy and efficiency of clinical diagnosis and treatment, provides data and technical support for the proactive prevention and control of multiple comorbidities, and has broad clinical application value.
[0008] The above-mentioned technical objective of the present invention is achieved through the following technical solution: a system for identifying the evolution path of multiple diseases in the elderly based on a large model, including a data cleaning and aggregation module, a big data platform for multiple diseases in the elderly, a data model construction module, and an artificial intelligence diagnosis and treatment application module. Each module achieves data interoperability and logical linkage through standardized interfaces.
[0009] The data cleaning and aggregation module is used to connect with multiple information systems in the hospital to collect, clean, structure, and integrate multi-source heterogeneous medical data, and output standardized datasets.
[0010] The big data platform for multiple diseases among the elderly is built on the Hadoop distributed architecture and is used to provide data storage, real-time synchronization, efficient query and computing resource scheduling services, and to provide data support for model building and large model training.
[0011] The data model building module is used to establish a standardized terminology system, a clinical knowledge rule engine library, and a statistical model, providing medical logical constraints and a data foundation for large-scale model training.
[0012] The AI-powered diagnosis and treatment application module is based on a large model to identify the evolution path of multiple comorbidities in the elderly, analyze risk factors, and intelligently recommend treatment plans. It also receives clinical feedback data for system iteration and optimization.
[0013] The present invention is further configured such that: the data cleaning and aggregation module includes a data acquisition unit, a data cleaning unit, and a data integration unit;
[0014] The data acquisition unit uses OGG and CDC incremental extraction technology to connect with the hospital's HIS, EMRS, LIS, and RIS information systems to acquire basic patient information, medical records, test results, and imaging reports for elderly patients aged 65 and above.
[0015] The data cleaning unit integrates natural language processing tools and data verification algorithms to complete the removal of junk data, deduplication of redundant data, completion and marking of missing data, and structured transformation of unstructured data.
[0016] The data integration unit, based on preset data standards, uses ETL tools to classify and map structured data, match field labels, and fuse information to form a complete patient comorbidity data archive.
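The cleaning operations named for the data cleaning unit (junk removal, deduplication, and marking of missing data) can be illustrated with a minimal sketch. The field names and the rule that a row without a patient identifier counts as junk are assumptions made for illustration; the text does not specify a record schema or the actual verification algorithms.

```python
# Hypothetical required fields; the text does not define a record schema.
REQUIRED_FIELDS = ("patient_id", "visit_date", "diagnosis")


def clean_records(records: list) -> list:
    """Drop junk rows, deduplicate, and mark missing required fields."""
    seen = set()
    cleaned = []
    for rec in records:
        # Junk data (assumed definition): rows without a patient identifier
        # cannot be linked to any patient archive.
        if not rec.get("patient_id"):
            continue
        # Redundant data: identical (patient, date, diagnosis) triples are kept once.
        key = tuple(rec.get(f) for f in REQUIRED_FIELDS)
        if key in seen:
            continue
        seen.add(key)
        # Missing data: record which fields are absent rather than guessing values.
        missing = [f for f in REQUIRED_FIELDS if rec.get(f) in (None, "")]
        cleaned.append({**rec, "missing_fields": missing})
    return cleaned
```

Marking missing fields instead of imputing them keeps the later completion step auditable, which is one plausible reading of "completion and marking of missing data".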
[0017] The present invention is further configured such that the big data platform for multiple diseases in the elderly includes a data storage unit, a data synchronization unit, a data query unit, and a resource scheduling unit;
[0018] The data storage unit adopts a hybrid storage mode of HDFS and HBase, with HDFS storing massive historical data and HBase storing frequently queried data.
[0019] The data synchronization unit achieves real-time incremental synchronization with the hospital's Oracle, SQL Server, and DB2 production databases through OGG non-invasive acquisition and CDC native functions, without affecting the operation of the production databases.
[0020] The data query unit is based on Presto to realize second-level interactive query of multiple data sources, and supports multi-dimensional retrieval by patient ID, disease type, and time range;
[0021] The resource scheduling unit dynamically allocates computing resources based on Yarn, supporting large-scale data parallel processing and large model training tasks.
[0022] The present invention is further configured such that: the data model construction module includes a terminology system construction unit, a data quality control unit, a rule engine library unit, and a statistical model unit;
[0023] The terminology system construction unit establishes a terminology model and terminology set for comorbidities in the elderly through research on domestic and international medical terminology standards, expert consultation, and literature analysis, clarifying the semantic relationships between diseases, symptoms, and test indicators;
[0024] The data quality control unit formulates standardized data acquisition procedures, standardized data processing rules, and quality evaluation indicators to construct a data quality control framework.
[0025] The rule engine library unit integrates guidelines for the diagnosis and treatment of comorbidities in the elderly, expert experience, and epidemiological research results, extracts disease association rules and risk factor rules, and encodes and stores them using a production rule representation method.
[0026] The statistical model unit employs association rule mining, logistic regression, and cluster analysis algorithms, combined with clinical data and multi-omics data, to construct statistical models for comorbidity risk assessment and disease association analysis.
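The production rule representation mentioned for the rule engine library can be sketched as IF-THEN rules evaluated by forward chaining. The rules below are placeholders invented for illustration and carry no clinical meaning; the class and function names are likewise hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    """IF every item in `conditions` holds THEN assert `conclusion`."""
    conditions: frozenset
    conclusion: str


# Illustrative placeholder rules only; they are not clinical guidance.
RULES = [
    Rule(frozenset({"hypertension", "diabetes"}), "elevated_cardiovascular_risk"),
    Rule(frozenset({"elevated_cardiovascular_risk", "smoking"}), "priority_follow_up"),
]


def forward_chain(facts: set, rules: list) -> set:
    """Fire rules repeatedly until no new conclusion can be derived."""
    derived = set(facts)
    changed = True
    while changed:
        changed = False
        for rule in rules:
            if rule.conditions <= derived and rule.conclusion not in derived:
                derived.add(rule.conclusion)
                changed = True
    return derived
```

Because a conclusion can satisfy the conditions of another rule, the loop continues until a full pass adds nothing new.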
[0027] The present invention is further configured such that: the artificial intelligence diagnosis and treatment application module includes a model training unit, an evolution path recognition unit, a diagnosis and treatment plan recommendation unit, and a feedback iteration unit;
[0028] The model training unit selects training samples based on an active learning strategy, and trains a large model by combining a terminology set, a rule engine library, and a statistical model through deep learning algorithms.
[0029] The evolution path identification unit receives patient data and outputs the dynamic evolution path and key risk factors of multiple comorbidities through large model inference.
[0030] The treatment plan recommendation unit, based on the recognition results, outputs the optimal treatment plan from the matching rule engine library, including drug treatment suggestions, lifestyle intervention measures, and follow-up period suggestions.
[0031] The feedback iteration unit collects clinical treatment effects and suggestions for adjusting the treatment plan, and transmits them to the big data platform for model optimization after desensitization.
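The text does not say which active learning strategy the model training unit uses. One common choice is uncertainty sampling, sketched below under that assumption: samples whose predicted class distribution has the highest entropy are sent for expert labeling first. The function names and the input layout (sample ID mapped to a probability list) are hypothetical.

```python
import math


def entropy(probs) -> float:
    """Shannon entropy (natural log) of a discrete probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def select_for_labeling(predictions: dict, budget: int) -> list:
    """Return the `budget` sample IDs with the most uncertain predictions."""
    ranked = sorted(predictions, key=lambda sid: entropy(predictions[sid]), reverse=True)
    return ranked[:budget]
```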
[0032] This invention also provides a method for identifying the evolution path of multiple comorbidities in the elderly based on a large model, comprising the following steps:
[0033] S1. Establish multi-source heterogeneous data collection standards, with geriatric clinical experts and information technology experts jointly defining the data collection scope, format, field attributes, and verification rules;
[0034] S2. Use OGG and CDC incremental extraction techniques to collect data related to chronic diseases in the elderly from multiple systems in the hospital and perform data integrity verification.
[0035] S3. Clean and structure the collected data, including removing junk data, deduplicating redundant data, handling missing data, and performing word segmentation, named entity recognition, and semantic analysis on unstructured data.
[0036] S4. Use ETL tools to classify, map, associate, integrate, and aggregate data to form a standardized dataset;
[0037] S5. Build a distributed big data platform based on Hadoop, configure storage, computing, and query components, and achieve secure data storage and efficient access.
[0038] S6. Construct a data model for multiple comorbidities in the elderly, including the construction of a terminology model and terminology set, the formulation of data quality control standards, the construction of a rule engine library, and the training of statistical models;
[0039] S7. Train a large model based on training samples and data models, and improve model performance through hyperparameter tuning and sample expansion to ensure that the accuracy of evolution path identification is ≥90%.
[0040] S8. Integrate the large model into the clinical diagnosis and treatment system to realize the identification of evolution path and the recommendation of diagnosis and treatment plan, and collect clinical feedback data;
[0041] S9. Update the big data platform data based on clinical feedback data, iteratively optimize the data model and the large model, and improve the clinical applicability of the system.
[0042] The present invention is further configured such that: in step S3, the unstructured data structuring process specifically includes: splitting the text into medical terms using multi-granularity medical word segmentation, extracting disease, symptom, and test indicator entities through medical named entity recognition, establishing the relationship between entities based on syntactic and semantic analysis, and completing data normalization processing according to preset standards.
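As a simplified stand-in for the segmentation and entity recognition steps, the sketch below uses dictionary-based forward maximum matching, a classic technique for Chinese word segmentation, and tags each matched term with its entity type. This is an assumption for illustration: the text does not disclose the actual segmentation or recognition models, and the five-entry lexicon is a toy example rather than the terminology set.

```python
# Toy lexicon mapping terms to entity types (illustrative assumption).
LEXICON = {
    "高血压": "disease",           # hypertension
    "糖尿病": "disease",           # diabetes
    "2型糖尿病": "disease",        # type 2 diabetes
    "头晕": "symptom",             # dizziness
    "空腹血糖": "test_indicator",  # fasting blood glucose
}
MAX_LEN = max(len(term) for term in LEXICON)


def extract_entities(text: str) -> list:
    """Scan left to right, taking the longest lexicon term at each position."""
    entities = []
    i = 0
    while i < len(text):
        for size in range(min(MAX_LEN, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if candidate in LEXICON:
                entities.append((candidate, LEXICON[candidate]))
                i += size
                break
        else:
            i += 1  # no lexicon term starts here; skip one character
    return entities
```

Trying the longest candidate first is what makes "2型糖尿病" match as one term instead of the shorter "糖尿病" inside it, which corresponds to the multi-granularity concern raised in the text.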
[0043] The present invention is further configured such that: in step S6, the terminology set covers diseases, symptoms, test indicators, and risk factor categories, and clarifies the term definitions, codes, and semantic relationships; the data quality evaluation indicators include completeness ≥95%, accuracy ≥98%, and consistency ≥99%.
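The three quality thresholds can be checked mechanically once each indicator is expressed as a ratio. The thresholds below come from the text; how each ratio is measured (filled cells, values matching a trusted reference, records free of cross-field contradictions) is an assumption, as are all names in the sketch.

```python
from dataclasses import dataclass

# Thresholds from the text; the measurement definitions are assumptions.
THRESHOLDS = {"completeness": 0.95, "accuracy": 0.98, "consistency": 0.99}


@dataclass
class QualityCounts:
    total_cells: int         # required cells expected across all records
    filled_cells: int        # required cells that are non-empty
    checked_values: int      # values compared against a trusted reference
    correct_values: int      # values that matched the reference
    total_records: int       # records examined for cross-field contradictions
    consistent_records: int  # records with no contradiction


def quality_report(c: QualityCounts) -> dict:
    """Compute each indicator and whether it meets its threshold."""
    ratios = {
        "completeness": c.filled_cells / c.total_cells,
        "accuracy": c.correct_values / c.checked_values,
        "consistency": c.consistent_records / c.total_records,
    }
    return {
        name: {"value": ratios[name], "passed": ratios[name] >= THRESHOLDS[name]}
        for name in THRESHOLDS
    }
```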
[0044] The present invention is further configured such that: in step S7, the large model adopts a hybrid architecture of Transformer-BERT and deep convolutional neural network, embeds the term set into the input layer, embeds the statistical model output as intermediate features into the hidden layer, and uses the accuracy of evolution path identification and the recall rate of risk factor identification as loss functions, combined with the rule engine library to constrain model inference.
[0045] The present invention also provides a device for identifying the evolution path of multiple comorbidities in the elderly based on a large model, comprising at least one processor; and a memory communicatively connected to at least one of the processors; wherein the memory stores instructions executable by the processor, the instructions being executed by the processor to implement a method for identifying the evolution path of multiple comorbidities in the elderly based on a large model.
[0046] In summary, the present invention has the following beneficial effects:
[0047] 1. This invention achieves the structuring transformation of unstructured data through technologies such as natural language processing and data verification algorithms. It combines OGG and CDC incremental extraction technologies to collect multi-system data without interference. After cleaning, integration, and normalization, a standardized dataset is formed, which solves the pain points of inconsistent data formats, scattered storage, and the existence of garbage and redundant data in multiple information systems in hospitals. At the same time, a big data platform is built based on the Hadoop distributed architecture, and a hybrid storage mode of HDFS and HBase is adopted. This not only ensures the secure storage of massive amounts of elderly chronic disease data, but also improves data access efficiency through second-level query capabilities. It solves the problems of limited storage capacity, poor sharing, and insufficient security of traditional relational databases, and provides high-quality and reusable data support for model building and clinical research.
[0048] 2. Based on the current situation of the lack of specialized disease databases and data models for chronic diseases among the elderly in China, this invention constructs a complete data model system that includes a terminology model, terminology set, clinical knowledge rule engine library, and statistical model. By using a standardized terminology set, the problem of inconsistent semantics in clinical data is solved, providing a semantic foundation for data interoperability and AI applications. The rule engine library integrates treatment guidelines and expert experience. The statistical model combines algorithms such as association rule mining and cluster analysis to achieve quantitative analysis of disease associations and risk assessment. This system fills the research gap in big data platforms and specialized models for multiple diseases among the elderly in China, breaks the limitations of the traditional single-disease specialty treatment model, and promotes the cross-border integration of geriatric medicine with big data and AI technologies.
[0049] 3. This invention leverages the deep learning capabilities of large-scale models, combined with medical statistical models and rule engine library constraints, to accurately deduce the dynamic evolution path and key risk factors of multiple comorbidities in the elderly, and automatically recommends priority-based treatment plans, including drug therapy, lifestyle interventions, and follow-up plans. This function can free doctors from tedious data organization and analysis, providing scientific and efficient intelligent auxiliary decision support for clinical diagnosis and treatment, effectively reducing the risk of misdiagnosis and missed diagnosis, improving the pertinence and effectiveness of treatment plans, and helping medical professionals deepen their understanding of the patterns of multiple comorbidities in the elderly.
[0050] 4. This invention utilizes large-scale clinical data accumulation and analysis to accurately uncover the epidemiological characteristics, risk factors, and evolution patterns of multiple comorbidities, providing data support and scientific basis for public health departments to formulate targeted prevention and control strategies. It addresses the severe public health problem of preventing and controlling multiple comorbidities among the elderly. By identifying high-risk groups in advance and predicting disease evolution trends, it achieves a shift from passive diagnosis and treatment to proactive prevention and control, which helps reduce the incidence of multiple comorbidities and adverse health outcomes among the elderly, alleviates the social medical burden, and meets the core needs of the elderly for self-directed health promotion and health management.
[0051] 5. This invention utilizes a clinical feedback iteration mechanism to send feedback data, such as treatment effects and suggestions for plan adjustments, back to the big data platform after anonymization. This data is used for continuous optimization of the data model and the large model, enabling the system to dynamically adapt to changes in clinical needs and advances in medical research. Furthermore, the big data governance, distributed platform construction, and AI-based diagnostic applications employed by the system are all mature and reliable mainstream technologies. Validated by real clinical data from top-tier hospitals, the system demonstrates strong feasibility and high stability. This allows the system to not only be promoted and applied in demonstration hospitals but also to be adapted to the information systems of different medical institutions through standardized interfaces. It provides auxiliary tools for the diagnosis and treatment of comorbidities in the elderly for hospitals at all levels, especially primary healthcare institutions, and has broad clinical application value.
Attached Figure Description
[0052] Figure 1 is a flowchart of the process interaction of a system for identifying the evolution path of multiple diseases in the elderly based on a large model, according to Embodiment 1 of the present invention.
[0053] Figure 2 is a schematic diagram of the data cleaning and aggregation process in Embodiment 1 of the present invention.
[0054] Figure 3 is a schematic diagram of the data model architecture for multiple comorbidities in the elderly in Embodiment 1 of the present invention.
[0055] Figure 4 is a schematic diagram of the data architecture of the big data platform in Embodiment 1 of the present invention.
[0056] Figure 5 is a schematic diagram of the steps in a method for identifying the evolution path of multiple comorbidities in the elderly based on a large model in Embodiment 2 of the present invention.
Detailed Implementation
[0057] The present invention is described in further detail below with reference to Figures 1-5.
[0058] Embodiment 1: A system for identifying the evolution path of multiple diseases in the elderly based on a large model, comprising a data cleaning and aggregation module, a big data platform for multiple diseases in the elderly, a data model construction module, and an artificial intelligence diagnosis and treatment application module. Each module contains functional units, and the units achieve data communication and logical linkage through standardized interfaces.
[0059] In this embodiment, the data cleaning and aggregation module includes a data acquisition unit, a data cleaning unit, and a data integration unit. The data acquisition unit deploys OGG software and CDC services to non-invasively collect multi-dimensional data from multiple systems in the hospital, including demographics, clinical diagnosis and treatment, laboratory tests, and imaging data of elderly patients aged 65 and above. After integrity verification, the data is transmitted to the data cleaning unit. The data cleaning unit integrates NLP tools and data verification algorithms to remove junk data, deduplicate redundant data, process missing data, and perform word segmentation, named entity recognition, and semantic analysis on unstructured data, outputting structured data to the data integration unit. The data integration unit uses ETL tools to classify and map the data according to preset standards, match field labels, and associate multi-source data using the patient's unique ID as the core. After information fusion, the standardized dataset is transmitted to the big data platform.
[0060] In this embodiment, the big data platform for multiple diseases in the elderly includes a data storage unit, a data synchronization unit, a data query unit, and a resource scheduling unit. The data storage unit employs a hybrid HDFS+HBase storage mode, receiving and storing standardized datasets transmitted by the data integration unit. HDFS stores historical full data, while HBase stores frequently queried data from the past three years. The data synchronization unit uses OGG and CDC technologies to achieve real-time incremental synchronization with the hospital's production database, monitoring the synchronization status and handling anomalies to ensure consistency between platform data and production database data. The data query unit uses Presto to build a multi-data source query interface, supporting second-level retrieval by patient ID, disease type, time range, and other dimensions, providing data access services for the data model building module and the artificial intelligence diagnosis and treatment application module. The resource scheduling unit dynamically allocates CPU, memory, and other computing resources based on Yarn to support large-scale parallel data processing and large model training tasks, ensuring efficient system operation.
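The split between full history in HDFS and recent data in HBase can be expressed as a routing decision on the record date. The sketch below assumes that recent records are written to both stores and that "past three years" is measured from the current date; neither detail is stated in the text, and the function only returns target names without calling any Hadoop API.

```python
from datetime import date

HOT_WINDOW_YEARS = 3  # from the text: HBase holds data from the past three years


def storage_targets(record_date: date, today: date) -> list:
    """HDFS keeps the full history; recent records are also routed to HBase."""
    targets = ["hdfs"]
    try:
        cutoff = today.replace(year=today.year - HOT_WINDOW_YEARS)
    except ValueError:
        # today is 29 February and the cutoff year is not a leap year
        cutoff = today.replace(year=today.year - HOT_WINDOW_YEARS, day=28)
    if record_date >= cutoff:
        targets.append("hbase")
    return targets
```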
[0061] In this embodiment, the data model construction module includes a terminology system construction unit, a data quality control unit, a rule engine library unit, and a statistical model unit. The terminology system construction unit establishes a terminology model and terminology set covering diseases, symptoms, test indicators, and risk factors through literature review and expert consultation, clarifying semantic relationships and providing standardized terminology support for data processing and model training. The data quality control unit formulates standardized data collection processes, standardizes processing rules and quality evaluation indicators (completeness ≥95%, accuracy ≥98%, consistency ≥99%), constructs a quality control framework, and performs quality verification on platform data. The rule engine library unit integrates treatment guidelines, expert experience, and epidemiological data, extracts rules for disease association, risk factors, and treatment matching, encodes them, and stores them in a Redis cache, providing logical constraints for model inference. The statistical model unit extracts high-quality data from the platform and uses algorithms such as Apriori algorithm, logistic regression, and K-Means clustering to construct comorbidity risk assessment and disease association analysis models, which are then verified and transmitted to the artificial intelligence diagnosis and treatment application module.
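The support and confidence measures that underlie Apriori-style association rule mining can be shown on disease pairs. The sketch below covers only two-item rules and is not a full Apriori implementation with candidate generation and pruning; the function name, thresholds, and sample data are illustrative assumptions.

```python
from collections import Counter
from itertools import combinations


def pair_rules(transactions: list, min_support: float, min_confidence: float) -> list:
    """Return (lhs, rhs, support, confidence) for disease pairs meeting both thresholds."""
    n = len(transactions)
    item_counts = Counter()
    pair_counts = Counter()
    for t in transactions:
        items = sorted(set(t))  # sorted so each pair has one canonical order
        item_counts.update(items)
        pair_counts.update(combinations(items, 2))
    rules = []
    for (a, b), count in pair_counts.items():
        support = count / n  # share of patients having both diseases
        if support < min_support:
            continue
        for lhs, rhs in ((a, b), (b, a)):
            confidence = count / item_counts[lhs]  # P(rhs | lhs)
            if confidence >= min_confidence:
                rules.append((lhs, rhs, support, confidence))
    return rules
```

Here each transaction is the set of diagnoses for one patient, so a rule such as ckd → diabetes reads as "among patients with ckd, the given share also have diabetes".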
[0062] In this embodiment, the AI-powered diagnostic and treatment application module includes a model training unit, an evolution path identification unit, a treatment plan recommendation unit, and a feedback iteration unit. The model training unit uses platform data selected as training samples based on an active learning strategy. It combines a terminology set, a rule engine library, and a statistical model to train a large model using deep learning algorithms, optimizing model parameters to meet clinical accuracy requirements. The evolution path identification unit receives patient data input from the clinic, calls the trained large model for inference, outputs the dynamic evolution path of multiple comorbidities (including the probability of disease occurrence at each stage) and key risk factors, and transmits this information to the treatment plan recommendation unit. The treatment plan recommendation unit matches the results from the rule engine library and the statistical model, generates three priority-ranked treatment plans (including drug therapy, lifestyle intervention, and follow-up plans), and feeds this information back to the clinical system. The feedback iteration unit collects clinical treatment effects and suggestions for plan adjustments, de-identifies them, and transmits them to the big data platform, providing input for data updates and model iterations.
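The de-identification step applied before feedback data reaches the big data platform is not detailed in the text. A minimal sketch, assuming keyed pseudonymization of the patient ID and removal of direct identifier fields, is given below; the field names and the 16-character truncation are hypothetical choices, and a production system would also need key management and handling of quasi-identifiers.

```python
import hashlib
import hmac

# Assumed names of direct identifier fields; not taken from the text.
DIRECT_IDENTIFIERS = {"name", "phone", "id_card", "address"}


def deidentify(record: dict, secret_key: bytes) -> dict:
    """Drop direct identifiers and replace patient_id with a keyed pseudonym."""
    out = {k: v for k, v in record.items() if k not in DIRECT_IDENTIFIERS}
    digest = hmac.new(secret_key, str(record["patient_id"]).encode("utf-8"), hashlib.sha256)
    out["patient_id"] = digest.hexdigest()[:16]
    return out
```

Using a keyed hash keeps the pseudonym stable for the same patient, so feedback can still be linked across visits without exposing the original identifier.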
[0063] For the cleaning and integration of chronic disease data in the elderly, this embodiment takes into account that hospital information systems are highly diverse and that the complex sources and types of data pose significant challenges to data aggregation and utilization. Because elderly patients' medical data originates from multiple systems, comes in varying formats, and is often non-standard, incomplete, and disorganized, geriatric and IT experts jointly developed a standard for collecting multi-source heterogeneous data on multiple chronic diseases in the elderly. Natural language processing technology is used to transform medical text into structured medical data and to remove redundant and irrelevant data. The data is then aggregated and integrated to improve data quality. The data cleaning and aggregation process is shown in Figure 2.
[0064] In constructing a data model for multiple comorbidities in the elderly, this embodiment follows relevant clinical research guidelines, relies on the big data platform for multiple comorbidities in the elderly, and takes into account the characteristics of research on multiple comorbidities in the elderly to determine the core elements of the data model's management and data layers. Through a combination of extracting medical data from information systems, collecting literature, consulting experts, and conducting empirical research, data mining methods are used to establish a terminology corpus. From this corpus, terminology entries are selected to create a terminology set containing clinical information, imaging, and multi-omics content, forming a chronic disease knowledge rule engine library. Medical statistical models are embedded in the data analysis to predict risk factors and evolution paths of multiple comorbidities in the elderly, enabling disease prediction and early warning and optimizing clinical treatment plans. The overall data model architecture is shown in Figure 3.
[0065] In this embodiment, to construct a big data platform based on a Hadoop distributed architecture, two methods were selected after analysis and comparison to incrementally extract data from the databases of the various information systems: OGG (Oracle GoldenGate) software and CDC (Change Data Capture) technology. OGG is an Oracle product that operates outside the database in a non-intrusive manner, with almost no impact on the database. CDC is a built-in feature of SQL Server 2008 and later versions, offering high security and stability. Neither incremental extraction method causes downtime in the hospital's production database, and both can, to a certain extent, keep the big data platform data synchronized with the hospital's HIS database data.
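The idea shared by both incremental extraction methods is that the platform replays an ordered stream of change events instead of re-reading full tables. The sketch below illustrates that replay logic in plain Python with a hypothetical event layout (seq, op, key, row); it does not use or imitate the actual OGG or SQL Server CDC interfaces.

```python
def apply_changes(replica: dict, changes: list, last_applied: int) -> int:
    """Apply change events newer than `last_applied` in sequence order.

    Each change is a dict with hypothetical keys: seq, op, key, row.
    Returns the new high-water mark so the next batch can resume from it.
    """
    for change in sorted(changes, key=lambda c: c["seq"]):
        if change["seq"] <= last_applied:
            continue  # already applied; this makes replay idempotent
        if change["op"] in ("insert", "update"):
            replica[change["key"]] = change["row"]
        elif change["op"] == "delete":
            replica.pop(change["key"], None)
        else:
            raise ValueError(f"unknown operation: {change['op']}")
        last_applied = change["seq"]
    return last_applied
```

Tracking a high-water mark is what lets synchronization resume after an interruption without touching the production database again for rows that were already copied.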
[0066] After the data extracted from the various information systems is processed, the hospital's clinical data accumulated over many years is aggregated to build a big data sharing platform based on a Hadoop distributed architecture. The data undergoes post-hoc structuring, word segmentation, and normalization. Based on distributed storage and computing, the platform supports sub-second queries and result display. The big data platform is an effective mechanism for storing and using research data: it not only aggregates and stores massive amounts of data but also improves the efficiency of data use and analysis. The data architecture of the entire big data platform is shown in Figure 4.
[0067] For the AI-based diagnosis and treatment application, this embodiment applies rapidly developing AI technologies such as deep learning and natural language processing to medical texts, clinical symptoms, lesion identification, and treatment methods for various chronic diseases in the elderly. It comprehensively analyzes and structures the features that affect symptoms in the elderly, infers the evolution path and development trend of various chronic diseases in the elderly, and intelligently recommends the best treatment plan for their diagnosis and treatment.
[0068] Example 2: A method for identifying the evolution path of multiple comorbidities in the elderly based on a large model, comprising the following steps:
[0069] S1. Establish multi-source heterogeneous data collection standards, with geriatric clinical experts and information technology experts jointly defining the data collection scope, format, field attributes, and verification rules.
[0070] A joint team of geriatric clinical experts and information technology experts was established, with clinical experts responsible for defining the scope of medical data and information technology experts responsible for data format and technology adaptation solutions. A comprehensive survey of the hospital's existing information systems, including HIS, EMRS, LIS, and RIS, was conducted to identify data types (structured, semi-structured, and unstructured), data structures, and storage methods, resulting in a "Data Source Survey Report." Referring to domestic and international medical terminology standards such as BBMRI-ERIC and OMOP / OHDSI, as well as guidelines for the diagnosis and treatment of chronic diseases in the elderly, the scope of data collection was defined: demographic data, clinical diagnosis and treatment data, laboratory data, imaging data, and health function data (ADL, IADL, depression level). Data field attributes were defined, including entity-time attributes, entity-quantity attributes, entity-logical judgments, and entity-entity relationships, with standardized field naming and data types. Data verification rules were developed, covering numerical range verification, format verification, and integrity verification, resulting in a standardized "Handbook for Data Collection of Multiple Comorbidities in the Elderly."
[0071] S2. Use OGG and CDC incremental extraction techniques to collect data related to chronic diseases in the elderly from multiple systems in the hospital and perform data integrity verification.
[0072] Technical Deployment: OGG software and CDC service are deployed on the hospital's production database side. OGG's non-intrusive incremental extraction process is configured for Oracle databases, and CDC's native functionality is enabled for SQL Server databases to ensure that data collection does not affect the production database's operation. Data Collection Task Configuration: Target fields are selected according to the "Data Collection Standard Manual," and filter conditions are set (age ≥ 65 years, data generation time ≥ 2010). Real-time incremental synchronization strategies (delay ≤ 10 minutes) and daily full-data verification tasks are configured. Data Reception and Caching: Data from various systems is received through a dedicated interface and temporarily stored in a distributed Redis cache queue. Data sharding is used to avoid single points of failure. Integrity Verification: Key field integrity checks are performed on the collected data. Missing data is temporarily stored in an exception queue and an alarm is triggered. After manual verification, the missing data is either supplemented or removed.
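The filtering and integrity verification described in step S2 can be illustrated with a minimal Python sketch. The key field names, the record layout, and the choice to compute age at the record date are assumptions made for illustration; the specification does not enumerate the key fields or state the reference date for the age filter. The in-memory list stands in for the exception queue.

```python
from dataclasses import dataclass, field
from datetime import date

# Hypothetical key fields; the specification does not list them.
KEY_FIELDS = ("patient_id", "birth_date", "record_date", "source_system")


@dataclass
class CollectionResult:
    accepted: list = field(default_factory=list)
    exceptions: list = field(default_factory=list)  # stands in for the exception queue


def age_on(birth: date, on: date) -> int:
    """Age in whole years on a given date."""
    return on.year - birth.year - ((on.month, on.day) < (birth.month, birth.day))


def collect(records, min_age=65, min_year=2010) -> CollectionResult:
    """Apply the S2 filter conditions and key-field integrity check."""
    result = CollectionResult()
    for rec in records:
        # Integrity check: records with missing key fields go to the exception queue.
        missing = [k for k in KEY_FIELDS if rec.get(k) in (None, "")]
        if missing:
            result.exceptions.append({"record": rec, "missing": missing})
            continue
        # Filter conditions: data generated in 2010 or later, age >= 65.
        # Assumption: age is evaluated at the record date.
        if rec["record_date"].year < min_year:
            continue
        if age_on(rec["birth_date"], rec["record_date"]) < min_age:
            continue
        result.accepted.append(rec)
    return result
```

Records placed in `exceptions` would then await the manual verification described above before being supplemented or removed.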
[0073] S3. Clean and structure the collected data, including removing junk data, deduplicating redundant data, handling missing data, and performing word segmentation, named entity recognition, and semantic analysis on unstructured data.
[0074] Junk data removal: Based on preset verification rules, data verification algorithms identify and remove invalid data with abnormal values, incorrect formats, or logical conflicts. Redundant data removal: A hash deduplication algorithm computes hash values to detect duplicate medical records for the same patient, retains the latest data, and merges duplicate fields. Missing data handling: When key fields are missing, the clinical knowledge rule engine library is called to complete them based on the patient's other data; when non-key fields are missing, they are marked as "NA" and the reason for the missing value is recorded. Structuring of unstructured data proceeds in four steps. (1) Multi-granularity medical word segmentation: A word segmentation tool combining a dictionary with deep learning splits medical record text and imaging reports into standardized medical terms. (2) Medical named entity recognition: A BERT-based pre-trained model extracts entities such as diseases, symptoms, and test indicators and labels them with their types. (3) Syntactic and semantic analysis: Based on the medical semantic network, associations between entities, such as symptom-disease and drug-indication, are established. (4) Data normalization: Terminology names, units, and diagnostic codes (using ICD-11 coding) are unified according to the collection standards.
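The hash deduplication and missing-data marking described above can be sketched as follows. This is a minimal illustration, not the implementation of the embodiment: the fields that define "the same record" (`IDENTITY_FIELDS`), the `updated_at` timestamp, and the example non-key field are all hypothetical names.

```python
import hashlib

# Hypothetical fields that identify duplicates of the same record.
IDENTITY_FIELDS = ("patient_id", "visit_id", "record_type")
# Hypothetical non-key field used to illustrate "NA" marking.
NON_KEY_FIELDS = ("smoking_status",)


def record_hash(rec: dict) -> str:
    """Hash the identity fields so duplicates map to the same key."""
    raw = "|".join(str(rec.get(f, "")) for f in IDENTITY_FIELDS)
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()


def deduplicate(records):
    """Keep the latest version of each record and merge duplicate fields."""
    latest = {}
    for rec in records:
        h = record_hash(rec)
        kept = latest.get(h)
        if kept is None:
            latest[h] = dict(rec)
            continue
        if rec["updated_at"] >= kept["updated_at"]:
            newer, older = rec, kept
        else:
            newer, older = kept, rec
        # Newer values win; older values fill fields the newer copy lacks.
        merged = dict(older)
        merged.update({k: v for k, v in newer.items() if v is not None})
        latest[h] = merged
    return list(latest.values())


def mark_missing(rec: dict, reason: str = "not recorded at source") -> dict:
    """Mark missing non-key fields as "NA" and record the reason."""
    out = dict(rec)
    for f in NON_KEY_FIELDS:
        if out.get(f) is None:
            out[f] = "NA"
            out[f + "_missing_reason"] = reason
    return out
```

Completion of missing key fields through the rule engine library is not shown, since it depends on clinical rules that the specification does not spell out.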
[0075] S4. Use ETL tools to classify, map, associate, integrate, and aggregate data to form a standardized dataset.
[0076] Data classification and mapping: The cleaned data is classified into five categories—"patient basic information, medical history, medical records, test results, and health functions"—using ETL tools and assigned uniform field labels; Patient unique identifier association: Using the patient ID as the primary key, cross-system data is linked to achieve binding of HIS medical records, LIS test results, and RIS imaging reports; Information fusion: Based on entity-entity relationship rules, the associated data is merged to form a complete patient comorbidity profile; Data aggregation: The integrated standardized dataset is loaded into the preprocessing storage area of the big data platform using ETL tools, and data lineage information is recorded for traceability.
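The patient-identifier association in step S4 amounts to grouping rows from each source system under a shared primary key. The sketch below shows one way this could look in Python; the row layout and the profile section names are assumptions, and rows without a patient ID are set aside rather than guessed at.

```python
from collections import defaultdict

# Hypothetical mapping from profile section to source system.
SECTIONS = ("medical_records", "test_results", "imaging_reports")


def build_profiles(his_rows, lis_rows, ris_rows):
    """Bind HIS, LIS, and RIS rows to one profile per patient ID.

    Returns (profiles, unlinked), where unlinked holds rows that carry
    no patient ID and therefore cannot be associated automatically.
    """
    profiles = defaultdict(lambda: {name: [] for name in SECTIONS})
    unlinked = []
    for name, rows in zip(SECTIONS, (his_rows, lis_rows, ris_rows)):
        for row in rows:
            pid = row.get("patient_id")
            if not pid:
                unlinked.append(row)
                continue
            profiles[pid][name].append(row)
    return dict(profiles), unlinked
```

In a production ETL tool the same association would typically be expressed as joins on the patient ID; the dictionary form is used here only to keep the example self-contained.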
[0077] S5. Build a distributed big data platform based on Hadoop, configure storage, computing, and query components, and achieve secure data storage and efficient access.
[0078] Cluster Deployment: A Hadoop cluster consisting of 3 NameNodes and 12 DataNodes is deployed, along with a 3-node Zookeeper cluster for high availability. The operating system is CentOS 7.9, and the Hadoop version is 3.3.4. Core Component Configuration: Storage Component: HDFS block size is set to 128MB with 3 replicas; HBase has 4 column families and uses a pre-partitioning strategy. Computation Component: Spark 3.2.4 is deployed, a Yarn capacity scheduler is configured, and 30% of emergency resources are reserved. Query Component: Presto 0.280 is deployed, multiple data source connections are configured, and a composite index is established. Data Storage and Loading: Preprocessed data is loaded into HDFS (historical full data) and HBase (high-frequency data from the past 3 years), and Snappy / LZO compression is enabled to optimize storage. Data Synchronization Configuration: A synchronization monitoring job is started to monitor synchronization latency, data consistency, and other metrics. Automatic alerts are issued and data is resent when anomalies occur. Security Configuration: Access permissions are assigned using the RBAC permission model. Sensitive data is stored with AES-256 encryption, HTTPS protocol is used for transmission, and audit logs are enabled (retained for 3 years).
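To make the storage settings above concrete, the following sketch estimates the HDFS footprint of a single file under the stated block size (128 MB) and replication factor (3). It ignores compression and NameNode metadata, and the function name is illustrative only.

```python
BLOCK_SIZE = 128 * 1024 ** 2  # 128 MB block size, as configured above
REPLICATION = 3               # 3 replicas, as configured above


def hdfs_footprint(file_bytes: int, block_size: int = BLOCK_SIZE,
                   replication: int = REPLICATION) -> dict:
    """Estimate block count and raw storage for one uncompressed file."""
    blocks = (file_bytes + block_size - 1) // block_size  # ceiling division
    return {
        "blocks": blocks,
        "block_replicas": blocks * replication,
        # The final block occupies only its actual size, so raw usage
        # is the file size times the replication factor.
        "raw_bytes": file_bytes * replication,
    }
```

For example, a 1 GiB file occupies 8 blocks, 24 block replicas, and 3 GiB of raw capacity across the DataNodes before compression.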
[0079] S6. Construct a data model of multiple diseases in the elderly, including the construction of a terminology model and terminology set, the formulation of data quality control standards, the construction of a rule engine library, and the training of statistical models.
[0080] Terminology System Construction: Literature Review and Expert Consultation: Retrieved core literature and international standards, and invited 12 clinical experts to participate in terminology screening; Terminology Set Construction: Screened 12,000+ terms, clarifying their definitions, codes, and semantic relationships; Terminology Model Construction: Modeled using Protégé based on ontology, achieving synonym mapping and hierarchical association; Rule Engine Library Construction: Rule Extraction: Integrated treatment guidelines, expert consensus, and epidemiological data, extracting 800+ disease association, risk factor, and treatment matching rules; Rule Encoding: Encoded using the IF-THEN production rule representation and stored in Redis cache; Statistical Model Construction: Data Extraction: Extracted 100,000 training data points and 20,000 validation data points (including follow-up records of 5 years or more); Feature Engineering: Screened feature variables and eliminated redundancy; Model Training: Used Apriori algorithm, logistic regression, and K-Means clustering algorithm to construct association analysis, risk prediction, and subtype classification models; Model Validation: Deployed after ensuring association rule accuracy ≥85%, risk prediction AUC ≥0.88, and cluster silhouette coefficient ≥0.75.
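The IF-THEN production rule encoding mentioned above can be illustrated with a small, serializable rule format. The JSON layout, the operator names, and the two sample rules are assumptions made for illustration; the sample rules are placeholders and are not taken from the rule library of the embodiment. Rules are kept as JSON text because that form could be stored in a cache such as Redis.

```python
import json
import operator

# Supported comparison operators for rule conditions (hypothetical set).
OPS = {
    ">=": operator.ge,
    "<=": operator.le,
    "==": operator.eq,
    "contains": lambda collection, item: item in collection,
}

# Placeholder rules for illustration only.
RULES_JSON = json.dumps([
    {
        "id": "R001",
        "if": [
            {"field": "diagnoses", "op": "contains", "value": "hypertension"},
            {"field": "diagnoses", "op": "contains", "value": "type 2 diabetes"},
        ],
        "then": "flag elevated cardiovascular risk",
    },
    {
        "id": "R002",
        "if": [{"field": "age", "op": ">=", "value": 80}],
        "then": "schedule medication review",
    },
])


def matches(rule: dict, patient: dict) -> bool:
    """A rule fires only if every IF condition holds for the patient."""
    return all(
        OPS[cond["op"]](patient[cond["field"]], cond["value"])
        for cond in rule["if"]
    )


def fire(rules_json: str, patient: dict) -> list:
    """Return (rule id, THEN conclusion) for every rule that fires."""
    return [(r["id"], r["then"]) for r in json.loads(rules_json) if matches(r, patient)]
```

Keeping conditions as data rather than code allows clinical experts to review and revise rules without redeploying the system, which fits the semiannual rule library updates described later.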
[0081] S7. Train a large model based on training samples and data models, and improve model performance through hyperparameter tuning and sample expansion to ensure that the accuracy of evolution path identification is ≥90%.
[0082] Training sample selection: 150,000 high-quality samples were selected using an active learning strategy and annotated by 5 experts (evolution path, risk factors, treatment plan). After inter-annotator consistency of ≥90% was confirmed through cross-validation, the samples were divided into a training set, validation set, and test set in a 7:2:1 ratio. Large model architecture design: A hybrid architecture of Transformer-BERT and a deep convolutional neural network was adopted, with text and numerical branches in the input layer and with terminology set and statistical model features embedded. Model training: The learning rate was set to 0.001, the number of iterations to 100, and the batch size to 32, using the AdamW optimizer. The joint loss function combined evolution path identification accuracy and risk factor recall, and the rule engine was introduced to constrain inference. Model optimization: Hyperparameters were tuned through grid search, 50,000 rare comorbidity samples were added for incremental training, and structured pruning was used to improve inference speed by 30%. Model testing: The model was considered to meet clinical requirements if it achieved an evolution path identification accuracy of ≥92% and a risk factor recall of ≥90% on the test set.
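The 7:2:1 division described above works out to 105,000 training, 30,000 validation, and 15,000 test samples for the 150,000 annotated samples. A minimal Python sketch of such a split follows; the shuffling with a fixed seed is an assumption, since the specification does not state how samples are assigned to each set.

```python
import random


def split_7_2_1(samples, seed=0):
    """Shuffle and split samples into train/validation/test at 7:2:1.

    Any remainder from integer division falls into the test set.
    """
    items = list(samples)
    random.Random(seed).shuffle(items)  # fixed seed for reproducibility
    n = len(items)
    n_train = n * 7 // 10
    n_val = n * 2 // 10
    train = items[:n_train]
    val = items[n_train:n_train + n_val]
    test = items[n_train + n_val:]
    return train, val, test
```

In practice the split would more likely be made per patient rather than per sample, so that records from one patient do not appear in both the training and test sets; that refinement is not described in the specification and is omitted here.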
[0083] S8. Integrate the large model into the clinical diagnosis and treatment system to realize the identification of evolution path and the recommendation of diagnosis and treatment plan, and collect clinical feedback data.
[0084] Interface Development: Generate RESTful API interfaces, define input (patient ID / dataset) and output (evolution path, risk factors, treatment plan) parameters, with a response time ≤10 seconds and a concurrency ≥1000 QPS; Integration Testing: Conduct functional testing (1000 consecutive calls without errors), compatibility testing (adaptation to mainstream systems and browsers), and exception handling testing; Clinical Deployment: Add functional modules to the doctor's workstation, supporting two data input methods: patient ID query and manual entry, authorized by department, and implementing data permission isolation; Identification and Recommendation Process: After the doctor submits the data, the large model outputs a visualized evolution path (including disease occurrence probability) and ranked risk factors within 10 seconds, and the matching rule engine outputs 3 priority treatment plans (including evidence-based basis).
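The output parameters of the interface (evolution path with occurrence probabilities, ranked risk factors, and 3 priority treatment plans with evidence) can be sketched as a response-building function. All class names, field names, and scoring attributes below are hypothetical; the specification defines the categories of output but not their schema.

```python
from dataclasses import dataclass, asdict


@dataclass
class PathStep:
    disease: str
    probability: float  # occurrence probability in [0, 1]


@dataclass
class RiskFactor:
    name: str
    weight: float  # hypothetical importance score used for ranking


@dataclass
class TreatmentPlan:
    title: str
    priority_score: float  # hypothetical score from rule engine matching
    evidence: str          # evidence-based basis for the plan


def build_response(patient_id, path, risk_factors, plans, top_k=3):
    """Assemble the JSON-serializable body returned by the API."""
    ranked_risks = sorted(risk_factors, key=lambda r: r.weight, reverse=True)
    top_plans = sorted(plans, key=lambda p: p.priority_score, reverse=True)[:top_k]
    return {
        "patient_id": patient_id,
        "evolution_path": [asdict(step) for step in path],
        "risk_factors": [asdict(r) for r in ranked_risks],
        "treatment_plans": [asdict(p) for p in top_plans],
    }
```

The path is returned in the order supplied by the model, while risk factors and plans are sorted so that the doctor's workstation can display them without further processing.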
[0085] S9. Update the big data platform data based on clinical feedback data, iteratively optimize the data model and the big model, and improve the clinical applicability of the system.
[0086] Feedback data collection: Doctors enter treatment effects, treatment plan adjustment opinions, and system usage feedback to the feedback module; Feedback data processing: After anonymization, the data is transmitted to the big data platform's feedback storage area; Data updates: Newly collected data and feedback data are cleaned and integrated every quarter to update the platform's dataset; Model iteration: Every six months, the terminology set, rule engine library, and model parameters are optimized based on updated data; Function optimization: Operation processes are simplified, visualization effects are optimized, and personalized functions are added based on feedback.
[0087] Example 3: A device for identifying the evolution path of multiple comorbidities in the elderly based on a large model, comprising at least one processor; and a memory communicatively connected to the at least one processor; wherein the memory stores instructions executable by the processor, which are used to implement a method for identifying the evolution path of multiple comorbidities in the elderly based on a large model, the method comprising: establishing multi-source heterogeneous data collection standards, with geriatric clinical experts and information technology experts jointly defining the data collection scope, format, field attributes, and verification rules; using OGG and CDC incremental extraction techniques to collect geriatric chronic disease-related data from multiple systems in the hospital, and performing data integrity verification; cleaning and structuring the collected data, including removing junk data, deduplicating redundant data, handling missing data, and performing word segmentation, named entity recognition, and semantic analysis on unstructured data; using ETL tools to classify, map, associate, integrate, and aggregate data to form a standardized dataset; building a Hadoop-based distributed big data platform and configuring storage, computing, and query components to achieve secure data storage and efficient access; constructing a data model of multiple comorbidities in the elderly, including terminology model and terminology set construction, data quality control standard formulation, rule engine library construction, and statistical model training; training a large model based on training samples and data models, and improving model performance through hyperparameter tuning and sample expansion to ensure an evolution path identification accuracy of ≥90%; integrating the large model into the clinical diagnosis and treatment system to achieve evolution path identification and treatment plan recommendation, and collecting clinical feedback data; and updating the big data platform data based on clinical feedback data and iteratively optimizing the data model and the large model to improve the system's clinical applicability.
[0088] This specific embodiment is merely an explanation of the present invention and is not intended to limit the invention. After reading this specification, those skilled in the art can make modifications to this embodiment without contributing any inventive step, but such modifications are protected by patent law as long as they are within the scope of the claims of the present invention.
Claims
1. A system for identifying the evolution path of multiple comorbidities in the elderly based on a large model, characterized in that: it includes a data cleaning and aggregation module, a big data platform for multiple diseases in the elderly, a data model building module, and an artificial intelligence diagnosis and treatment application module. Each module achieves data interoperability and logical linkage through standardized interfaces. The data cleaning and aggregation module is used to connect with multiple information systems in the hospital to collect, clean, structure, and integrate multi-source heterogeneous medical data, and output standardized datasets. The big data platform for multiple diseases in the elderly is built on the Hadoop distributed architecture and is used to provide data storage, real-time synchronization, efficient query, and computing resource scheduling services, and to provide data support for model building and large model training. The data model building module is used to establish a standardized terminology system, a clinical knowledge rule engine library, and a statistical model, providing medical logical constraints and a data foundation for large model training. The artificial intelligence diagnosis and treatment application module uses a large model to identify the evolution path of multiple comorbidities in the elderly, analyze risk factors, and intelligently recommend treatment plans. It also receives clinical feedback data for system iteration and optimization.
2. The system for identifying the evolution path of multiple comorbidities in the elderly based on a large model according to claim 1, characterized in that: The data cleaning and aggregation module includes a data acquisition unit, a data cleaning unit, and a data integration unit; The data acquisition unit uses OGG and CDC incremental extraction technology to connect with the hospital's HIS, EMRS, LIS, and RIS information systems to acquire basic patient information, medical records, test results, and imaging reports for elderly patients aged 65 and above. The data cleaning unit integrates natural language processing tools and data verification algorithms to complete the removal of junk data, deduplication of redundant data, completion and marking of missing data, and structured transformation of unstructured data. The data integration unit, based on preset data standards, uses ETL tools to classify and map structured data, match field labels, and fuse information to form a complete patient comorbidity data archive.
3. The system for identifying the evolution path of multiple comorbidities in the elderly based on a large model according to claim 1, characterized in that: The big data platform for multiple diseases among the elderly includes a data storage unit, a data synchronization unit, a data query unit, and a resource scheduling unit. The data storage unit adopts a hybrid storage mode of HDFS and HBase, with HDFS storing massive historical data and HBase storing frequently queried data. The data synchronization unit achieves real-time incremental synchronization with the hospital's Oracle, SQL Server, and DB2 production databases through OGG non-invasive acquisition and CDC native functions, without affecting the operation of the production databases. The data query unit is based on Presto to realize second-level interactive query of multiple data sources, and supports multi-dimensional retrieval by patient ID, disease type, and time range; The resource scheduling unit dynamically allocates computing resources based on Yarn, supporting large-scale data parallel processing and large model training tasks.
4. The system for identifying the evolution path of multiple comorbidities in the elderly based on a large model according to claim 1, characterized in that: the data model building module includes a terminology system construction unit, a data quality control unit, a rule engine library unit, and a statistical model unit. The terminology system construction unit establishes a terminology model and terminology set for comorbidities in the elderly through research on domestic and international medical terminology standards, expert consultation, and literature analysis, clarifying the semantic relationships between diseases, symptoms, and test indicators. The data quality control unit formulates standardized data acquisition procedures, standardized data processing rules, and quality evaluation indicators to construct a data quality control framework. The rule engine library unit integrates guidelines for the diagnosis and treatment of comorbidities in the elderly, expert experience, and epidemiological research results, extracts disease association rules and risk factor rules, and encodes and stores them using a production rule representation method. The statistical model unit employs association rule mining, logistic regression, and cluster analysis algorithms, combined with clinical data and multi-omics data, to construct statistical models for comorbidity risk assessment and disease association analysis.
5. The system for identifying the evolution path of multiple comorbidities in the elderly based on a large model according to claim 1, characterized in that: the artificial intelligence diagnosis and treatment application module includes a model training unit, an evolution path identification unit, a treatment plan recommendation unit, and a feedback iteration unit. The model training unit selects training samples based on an active learning strategy, and trains a large model by combining a terminology set, a rule engine library, and a statistical model through deep learning algorithms. The evolution path identification unit receives patient data and outputs the dynamic evolution path and key risk factors of multiple comorbidities through large model inference. The treatment plan recommendation unit, based on the identification results, matches against the rule engine library and outputs the optimal treatment plan, including drug treatment suggestions, lifestyle intervention measures, and follow-up period suggestions. The feedback iteration unit collects clinical treatment effects and suggestions for adjusting the treatment plan, and transmits them to the big data platform for model optimization after anonymization.
6. A method for identifying the evolution path of multiple comorbidities in the elderly based on a large model, applied to a system for identifying the evolution path of multiple comorbidities in the elderly based on a large model as described in any one of claims 1-5, characterized in that it includes the following steps: S1. Establish multi-source heterogeneous data collection standards, with geriatric clinical experts and information technology experts jointly defining the data collection scope, format, field attributes, and verification rules; S2. Use OGG and CDC incremental extraction techniques to collect data related to chronic diseases in the elderly from multiple systems in the hospital and perform data integrity verification; S3. Clean and structure the collected data, including removing junk data, deduplicating redundant data, handling missing data, and performing word segmentation, named entity recognition, and semantic analysis on unstructured data; S4. Use ETL tools to classify, map, associate, integrate, and aggregate data to form a standardized dataset; S5. Build a distributed big data platform based on Hadoop, configure storage, computing, and query components, and achieve secure data storage and efficient access; S6. Construct a data model for multiple comorbidities in the elderly, including the construction of a terminology model and terminology set, the formulation of data quality control standards, the construction of a rule engine library, and the training of statistical models; S7. Train a large model based on training samples and data models, and improve model performance through hyperparameter tuning and sample expansion to ensure that the accuracy of evolution path identification is ≥90%; S8. Integrate the large model into the clinical diagnosis and treatment system to realize the identification of the evolution path and the recommendation of diagnosis and treatment plans, and collect clinical feedback data; S9. Update the big data platform data based on clinical feedback data, iteratively optimize the data model and the large model, and improve the clinical applicability of the system.
7. The method for identifying the evolution path of multiple comorbidities in the elderly based on a large model according to claim 6, characterized in that: In step S3, the unstructured data structuring process specifically includes: splitting the text into medical terms using multi-granularity medical word segmentation, extracting disease, symptom, and test indicator entities through medical named entity recognition, establishing relationships between entities based on syntactic and semantic analysis, and completing data normalization processing according to preset standards.
8. The method for identifying the evolution path of multiple comorbidities in the elderly based on a large model according to claim 6, characterized in that: In step S6, the terminology set covers diseases, symptoms, test indicators, and risk factor categories, and clarifies the term definitions, codes, and semantic relationships. The data quality evaluation indicators include completeness ≥95%, accuracy ≥98%, and consistency ≥99%.
9. The method for identifying the evolution path of multiple comorbidities in the elderly based on a large model according to claim 6, characterized in that: In step S7, the large model adopts a hybrid architecture of Transformer-BERT and deep convolutional neural network, embeds the term set into the input layer, embeds the statistical model output as intermediate features into the hidden layer, and uses the accuracy of evolution path identification and the recall of risk factor identification as loss functions, combined with the rule engine library to constrain model inference.
10. A device for identifying the evolution path of multiple comorbidities in the elderly based on a large model, characterized in that: It includes at least one processor; and a memory communicatively connected to at least one of the processors; wherein the memory stores instructions executable by the processor to implement a method for identifying the evolution path of multiple diseases in the elderly based on a large model, as described in any one of claims 6-9.