A medical service cost accounting data cleaning and standardization method
By constructing multi-source data interfaces and hybrid machine learning models, and combining medical business semantic features for data cleaning and standardization, the problems of inaccurate data cleaning and low automation in medical service cost accounting are solved, achieving efficient and accurate data processing and consistency of accounting results.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Applications(China)
- Current Assignee / Owner
- SHAN DONG MSUN HEALTH TECH GRP CO LTD
- Filing Date
- 2026-03-03
- Publication Date
- 2026-06-12
AI Technical Summary
Existing technologies lack specific data cleaning and standardization methods for medical scenarios in medical service cost accounting, resulting in inaccurate data cleaning, low automation, and a high proportion of manual intervention, which affects the accuracy and consistency of accounting results.
Build multi-source data interfaces, use preset rules and hybrid machine learning models to filter and process data, and combine medical business semantic features and standardized mapping to achieve fully automated data cleaning and standardization.
It improves the accuracy and automation of data cleaning, reduces human intervention, ensures data quality and consistency of accounting results, and enhances accounting accuracy and efficiency.
Figure CN122196362A_ABST
Abstract
Description
Technical Field
[0001] This invention relates to a method for cleaning and standardizing medical service cost accounting data, belonging to the field of medical service item cost accounting technology.
Background Technology
[0002] In the field of cost accounting for medical services, the accuracy of cost data directly determines the reliability and precision of the accounting results, and the collection and preprocessing of raw data is the primary step in cost accounting. Currently, the data for medical service cost accounting mainly comes from multiple business systems in hospitals, such as Hospital Information System (HIS), Laboratory Information Management System (LIS), and Materials Management System. The inconsistent construction time and data standards of these systems lead to numerous problems with the raw data.
[0003] In existing technologies, data cleaning and standardization often rely on general-purpose data processing methods, such as outlier filtering based on fixed thresholds, manual completion of missing data, and simple field format standardization. These methods show significant shortcomings in the context of medical service cost accounting, as detailed below:
Lack of data cleaning rules specific to the medical scenario
General cleaning methods cannot adapt to the specific characteristics of medical cost data. For example, medical consumables exhibit the phenomenon of "one item, multiple codes" (the same consumable has different codes in different departments and in different inbound warehouse batches). General methods can only identify format errors and cannot achieve standardized mapping of codes. Human resource cost data includes medical-specific work hour types such as "on-call hours" and "overtime hours." General methods cannot distinguish the contribution weight of different work hour types to costs, leading to subsequent accounting errors.
[0004] Low precision in handling missing values and outliers
Missing or anomalous data in medical costs is context-dependent. For example, missing "equipment depreciation data" for a department might be due to newly purchased equipment not yet having completed depreciation registration, rather than random omission; abnormal peaks in "single-disease cost data" might be due to the admission of complex or severe cases of that disease, rather than data errors. Existing general methods often employ simple techniques such as mean filling or direct deletion of outliers, ignoring the context-dependent nature of medical data. This distorts the data after cleaning and affects the accuracy of cost accounting.
[0005] Insufficient data standardization and poor compatibility with subsequent accounting processes
Existing standardization methods stop at unifying field formats and do not establish a dedicated data standard system for medical cost accounting. For example, the names of "medical service items" in different departments are often a mix of colloquial and standardized names (such as "electrocardiogram" and "routine electrocardiogram examination"). General standardization methods cannot accurately map these names to national medical service item codes. Furthermore, the classification standards for cost data do not match the needs of subsequent equivalent fitting and itemized weighting calculations, so secondary manual processing is required during data transfer, which reduces the automation level of the accounting process.
[0006] The high rate of manual intervention results in low efficiency and a high susceptibility to human error. Because automated data cleaning tools adapted to medical scenarios are lacking, over 90% of the work in existing data processing workflows relies on manual labor, including manually identifying abnormal data, manually matching codes, and completing missing information. This not only consumes a significant amount of manpower but also introduces errors due to the subjectivity of human judgment. As a result, data quality differs significantly between batches and between the personnel who process them, and the accounting results lack consistency.
[0007] In summary, existing general data cleaning and standardization methods cannot meet the high-precision and high-automation requirements of medical service cost accounting. There is an urgent need for a dedicated data cleaning and standardization method adapted to medical scenarios to solve the problems of dirty, messy, and complex raw data, and to provide a high-quality data foundation for subsequent equivalent calibration, itemized weight calculation, and end-to-end architecture integration.
Summary of the Invention
[0008] The purpose of this invention is to provide a method for cleaning and standardizing medical service cost accounting data. Based on the scenario characteristics of medical cost data and combined with a medical-specific standardized mapping system, this method achieves fully automated processing from raw data input to standardized data output, solving the problems of poor adaptability, low data quality, and insufficient automation of existing technologies in medical scenarios.
[0009] To achieve the above objective, the present invention adopts the following technical solution:
A method for cleaning and standardizing medical service cost accounting data, comprising the following steps:
S1. Construct a multi-source data interface to read raw data from the hospital's business systems, extract core fields, and unify them into a standard field set based on a preset field mapping table to form a structured raw dataset;
S2. Filter the structured raw dataset using preset format validation rules, business validation rules, and integrity validation rules, dividing the data into valid data, abnormal data to be processed, high-priority missing data, and low-priority missing data;
S3. For the abnormal data to be processed and the high-priority missing data, construct a feature vector containing medical business semantic features and input it into a hybrid machine learning model for processing; wherein, for the abnormal data to be processed, an ensemble model identifies the anomaly probability and, combined with business logic classification, quantile constraint correction is performed or reasonable-anomaly labels are added; and the high-priority missing data is completed using a weighted adaptive KNN model or a time series model;
S4. Standardize the processed data according to national medical service standards for item names and codes, cost classifications, and units, and output a standardized cost dataset;
S5. Perform quality indicator verification on the standardized cost dataset; if a quality indicator fails to meet the standard, adjust the rule engine parameters and supplement the training set with corrected data that has been manually reviewed and confirmed, so as to dynamically optimize the hybrid machine learning model.
[0010] Preferably, in step S1, the business system includes a hospital information system (HIS), a materials management system, and a financial system; the format of the raw data includes Oracle database tables, Excel files, and XML messages; the core fields include medical service item code, medical service item name, department code, department name, consumable usage, consumable unit price, staff wages, equipment depreciation duration, equipment depreciation rate, and revenue and expenditure amount.
[0011] Preferably, the process of constructing the feature vector in step S3 includes: Extract basic field features and medical business semantic features, including business correlation coefficients, time dimension features, and anomaly cause features; Numerical features are subjected to Min-Max normalization, and categorical features are subjected to one-hot encoding. L1 regularization is introduced for feature selection. Redundant features are removed and the core effective features are retained to form the final feature vector.
[0012] Preferably, the processing of the abnormal data to be processed in step S3 specifically includes:
The random forest-AdaBoost ensemble model is used to calculate the anomaly probability of the data. When the anomaly probability is higher than a dynamic threshold, the data is identified as abnormal data that needs to be corrected. The anomaly probability is calculated as
$$P(x) = \sum_{t=1}^{T} \alpha_t \, p_t(x),$$
where $\alpha_t$ is the weight of the $t$-th decision tree, $T$ is the total number of decision trees, and $p_t(x)$ is the anomaly probability predicted by the $t$-th decision tree for sample $x$;
The dynamic threshold $\theta$ is calculated from $\mu$ and $\sigma$, where $\mu$ is the mean probability of historical outlier samples and $\sigma$ is the standard deviation of the probability of historical outlier samples;
Based on business logic, the abnormal data to be corrected is divided into data error type anomalies and business reasonable type anomalies;
For data error type anomalies, a reasonable quantile interval is calculated by linear interpolation based on historical data from the same department, service item, and time period, in which $Q_p$ denotes the value at quantile $p$, $x_i$ is the value of the $i$-th sample, and $n$ is the historical sample size. Outliers are corrected to the value within the interval that is closest to the historical median, and are marked for manual review if the correction magnitude exceeds a preset threshold. The correction magnitude is calculated as
$$\Delta = \frac{\lvert x' - x \rvert}{\lvert x \rvert} \times 100\%,$$
where $x'$ is the corrected value and $x$ is the original outlier;
For business reasonable type anomalies, retain the original values and add reasonable-anomaly markers and explanations of the causes.
[0013] Preferably, the processing of high-priority missing data in step S3 specifically includes: High-priority missing data is categorized into random missing data and non-random missing data. For randomly missing data, a weighted adaptive KNN model is used for completion. When calculating the similarity between neighboring samples, strong correlation fields are given high weights and weak correlation fields are given low weights. The completion is performed by a weighted average formula, and the weights are obtained by weighting the business correlation similarity and numerical feature cosine similarity between the randomly missing data and neighboring samples. For non-random missing data, if historical time-series data exists, a lightweight LSTM model is used for prediction, and threshold constraints are applied in conjunction with medical business rules. If no time-series data exists, the latest data from similar products is used to fill in the missing data.
[0014] Preferably, the threshold constraints of the medical business rules include: the completed value of the equipment depreciation rate is limited to within ±10% of the initial depreciation rate of the same type of equipment.
[0015] Preferably, the high-priority missing data is further validated after completion, including: Verify whether the completed data meets the format verification rules, business verification rules, and integrity verification rules. At the same time, calculate the deviation rate between the completed data and the data of the same department, the same medical service project, and the same time period. If the deviation rate exceeds 20%, trigger a second completion. Add business rule threshold constraints to the historical time series data of non-random missing data.
[0016] Preferably, the quality indicators in step S5 include outlier rate, missing value rate, and standardized matching rate; the dynamic optimization of the hybrid machine learning model specifically includes: supplementing the training set with corrected data that has been manually reviewed and confirmed every quarter, removing old samples that have exceeded the preset age, and automatically adjusting the outlier probability threshold of the ensemble model and the number of neighbors of the KNN model based on the accounting deviation.
[0017] The advantages of this invention are as follows: It introduces a medical-specific business rule engine and a scenario-based machine learning model. Through feature engineering to enhance medical semantic association, integrated models to accurately identify anomaly types, and scenario-specific correction strategies, it can accurately distinguish between data errors and reasonable business fluctuations. Specifically, compared to traditional single algorithms, the machine learning module improves anomaly identification accuracy by 8-10 percentage points and missing value completion accuracy by 12-15 percentage points. Practical application verification shows that the anomaly rate of cleaned data has decreased from 15%-20% using traditional methods to below 0.5%, and the data matching accuracy has increased to over 95%, providing a high-quality data foundation for subsequent equivalent fitting calibration and multi-dimensional weighted item calculation.
[0018] This invention automates the entire process from multi-source data access, cleaning, standardization to verification, requiring minimal manual intervention. Compared to traditional manual processing methods, it improves efficiency by over 80%, while eliminating subjective errors from human judgment and ensuring consistency in processing results across different batches of data.
[0019] The standardized system of this invention is fully compatible with the downstream needs of medical service cost accounting. The standardized dataset can be directly connected to the equivalent fitting model and the multi-dimensional weighted sub-item accounting model without secondary manual conversion, providing key support for the automated integration of the whole process cost accounting architecture.
[0020] The parameters of the rule engine and the training set of the machine learning model support dynamic optimization, which can be adapted to the cost data characteristics of hospitals at different levels. At the same time, the data access interface supports the expansion of new business systems. New data sources can be accessed simply by adding the corresponding parsing rules, which has broad application value.
Attached Figure Description
[0021] The accompanying drawings are provided to further illustrate the invention and form part of the specification. They are used together with the embodiments of the invention to explain the invention and do not constitute a limitation thereof.
[0022] Figure 1 is a schematic diagram of the method flow of the present invention.
Detailed Implementation
[0023] The technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some embodiments of the present invention, and not all embodiments. Based on the embodiments of the present invention, all other embodiments obtained by those skilled in the art without creative effort are within the scope of protection of the present invention.
[0024] Example 1
As shown in Figure 1, a method for cleaning and standardizing medical service cost accounting data includes the following steps:
S1. Construct a multi-source data interface to read raw data from the hospital's business systems, extract core fields, and unify them into a standard field set based on a preset field mapping table to form a structured raw dataset;
S2. Filter the structured raw dataset using preset format validation rules, business validation rules, and integrity validation rules, dividing the data into valid data, abnormal data to be processed, high-priority missing data, and low-priority missing data;
S3. For the abnormal data to be processed and the high-priority missing data, construct a feature vector containing medical business semantic features and input it into a hybrid machine learning model for processing; wherein, for the abnormal data to be processed, an ensemble model identifies the anomaly probability and, combined with business logic classification, quantile constraint correction is performed or reasonable-anomaly labels are added; and the high-priority missing data is completed using a weighted adaptive KNN model or a time series model;
S4. Standardize the processed data according to national medical service standards for item names and codes, cost classifications, and units, and output a standardized cost dataset;
S5. Perform quality indicator verification on the standardized cost dataset; if a quality indicator fails to meet the standard, adjust the rule engine parameters and supplement the training set with corrected data that has been manually reviewed and confirmed, so as to dynamically optimize the hybrid machine learning model.
[0025] As a refinement of the above embodiments, in step S1, the business system includes a hospital information system (HIS), a materials management system, and a financial system, and the format of the raw data includes Oracle database tables, Excel files, and XML messages. Automated reading is achieved through a data access module written in Python.
[0026] The original data is structured and parsed to extract the core fields for medical cost accounting, including: medical service item code, medical service item name, department code, department name, consumable usage, consumable unit price, personnel wages, equipment depreciation duration, equipment depreciation rate, and revenue and expenditure amount.
[0027] Establish a data field mapping table to uniformly map non-standard fields from different systems (such as the "drug name" field in one system being named "free drugs" in another system) to a preset core field set, forming a structured raw dataset.
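The field mapping table described above can be illustrated with a minimal Python sketch. The source system labels, source field names, and standard field names below are all hypothetical placeholders; the patent text does not list them.

```python
# Hypothetical mapping table: (source system, source field) -> standard field.
FIELD_MAP = {
    ("HIS", "item_cd"): "service_item_code",
    ("HIS", "item_nm"): "service_item_name",
    ("materials", "dept_no"): "department_code",
    ("materials", "qty_used"): "consumable_usage",
    ("materials", "unit_cost"): "consumable_unit_price",
}


def to_standard_record(system, raw):
    """Rename known source fields to the standard field set; drop unmapped ones."""
    record = {}
    for field, value in raw.items():
        standard = FIELD_MAP.get((system, field))
        if standard is not None:
            record[standard] = value
    return record


# One raw row from the (hypothetical) materials system.
example = to_standard_record(
    "materials", {"dept_no": "D012", "qty_used": 4, "memo": "n/a"}
)
```

Keeping the table as data rather than code matches the stated goal that a new source can be connected by adding parsing rules only.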
[0028] As a refinement of the above embodiments, step S2 designs a rule engine specifically for medical cost data, which filters out obviously erroneous data through preset rules. These rules fall into three categories:
Format validation rules: Validate the format of core fields. For example, the medical service item code must conform to the 10-digit coding rule of the national "Specification for Medical Service Price Items"; if it does not, the record is marked as "abnormal format data". Working hour data must be non-negative; negative values are marked as "logical error data".
[0029] Business validation rules: Set validation thresholds based on medical business logic. For example, the quantity in a single consumable usage record should not exceed 50% of the department's total inventory for the current month (to avoid duplicate entries), and logistics and medical support departments should not have any invoice data.
[0030] Integrity validation rules: Identify missing core fields and mark their priority according to the type of missing field. For example, a missing medical service item code is classified as "high priority missing" (must be filled in), while missing remarks are classified as "low priority missing" (does not affect accounting and can be ignored).
[0031] The rule engine module, written in Python, executes the above rules one by one on the structured raw dataset to filter out valid data (correct data, no processing required), high-priority missing data (data missing in important fields), abnormal data to be processed (data is not missing, but may be incorrect), and low-priority missing data (data missing in unimportant fields). Valid data and low-priority missing data directly enter the next stage, while high-priority missing data and abnormal data to be processed enter the next step of machine learning correction.
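As a rough illustration of how such a rule engine might sort records into the four categories, the sketch below applies one or two rules from each category. The field names, the digits-only reading of the 10-digit code rule, and the rule ordering are assumptions made for the example, not details given in the text.

```python
import re

# Assumed field names; the text names only the rule categories and examples.
HIGH_PRIORITY_FIELDS = ("service_item_code", "department_code")
LOW_PRIORITY_FIELDS = ("remarks",)


def is_missing(record, field):
    return record.get(field) in (None, "")


def classify(record, dept_monthly_stock):
    """Assign a record to one of the four S2 categories."""
    # Integrity rule: core fields must be present.
    if any(is_missing(record, f) for f in HIGH_PRIORITY_FIELDS):
        return "high_priority_missing"
    # Format rule: item code must have 10 digits (digits-only is an assumption).
    if not re.fullmatch(r"\d{10}", str(record["service_item_code"])):
        return "abnormal"
    # Format rule: working hours must be non-negative.
    if (record.get("working_hours") or 0) < 0:
        return "abnormal"
    # Business rule: one usage record may not exceed 50% of monthly stock.
    if (record.get("consumable_usage") or 0) > 0.5 * dept_monthly_stock:
        return "abnormal"
    # Integrity rule: non-core fields may be missing without blocking accounting.
    if any(is_missing(record, f) for f in LOW_PRIORITY_FIELDS):
        return "low_priority_missing"
    return "valid"


good = {"service_item_code": "1102020010", "department_code": "D01",
        "remarks": "ok", "working_hours": 8, "consumable_usage": 2}
```

Records returned as `"abnormal"` or `"high_priority_missing"` would go on to the machine learning stage, while the other two categories pass straight through.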
[0032] As a refinement of the above embodiments, step S3 applies a machine learning scheme of "scenario feature enhancement + hybrid model integration" to the abnormal data to be processed and the high-priority missing data selected by the rule engine. This overcomes the limitations of traditional general-purpose algorithms and achieves accurate correction of medical cost data. The core innovation lies in introducing medical business semantic features and a dynamic adaptive mechanism. The specific process and algorithm design are as follows:
S301: Before model input, apply a dedicated feature engineering process that reflects the scenario-specific correlations of medical cost data, to avoid model bias caused by generic features, including:
(1) Extract basic field features and medical business semantic features. Basic field features include department type, type of medical service, seasonal factors, and patient flow. Medical business semantic features include: ① business correlation coefficients (such as the binding relationship between surgical items and consumable types, and the saturation of departmental business volume); ② time dimension features (equipment service life, consumable warehousing cycle, and monthly / quarterly accounting cycle weight); ③ anomaly cause features (difficult-case markers, equipment maintenance status, and temporary policy adjustment indicators). Together these form a feature vector of 12-18 dimensions that covers the core medical-scenario causes of data anomalies and missing data.
[0033] (2) Perform Min-Max normalization on numerical features (such as patient flow and equipment service life) and one-hot encoding on categorical features. The normalization formula is
$$x' = \frac{x - x_{\min}}{x_{\max} - x_{\min}},$$
where $x'$ is the normalized value, $x$ is the original numerical feature, and $x_{\min}$ and $x_{\max}$ are the minimum and maximum values of the feature in the training set, respectively. After normalization, the feature values are mapped to the [0,1] interval to meet the ensemble model's range requirements for input features.
[0034] (3) L1 regularization is introduced for feature selection. Redundant features (such as feature pairs with correlation > 0.85) are removed, and the core effective features are retained to form the final feature vector. The L1-regularized loss function is
$$L = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2 + \lambda \sum_{j=1}^{m} \lvert w_j \rvert,$$
where $L$ is the loss function after regularization, $n$ is the sample size, $y_i$ is the true label, $\hat{y}_i$ is the model's predicted value, $\lambda$ is the regularization coefficient (in this invention, 0.01 to 0.05, adapted to the medical feature dimensions), $m$ is the number of features, and $w_j$ is the weight of the $j$-th feature. Minimizing the loss function drives the weights of redundant features toward 0, ultimately retaining 8-12 core effective features and thereby improving the model's training efficiency and generalization ability.
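The normalization and encoding in step (2) of S301 can be sketched in plain Python as follows. The function names and sample values are hypothetical; the minimum and maximum are passed in explicitly because the text takes them from the training set.

```python
def min_max(values, lo, hi):
    """Map values to [0, 1] using the training-set minimum and maximum."""
    if hi == lo:
        # Constant feature: there is no spread to scale, so map to 0.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def one_hot(values):
    """Encode categorical values as 0/1 vectors over the sorted category set."""
    categories = sorted(set(values))
    encoded = [[1 if v == c else 0 for c in categories] for v in values]
    return categories, encoded


# Hypothetical patient-flow values and department types.
patient_flow = [120, 300, 480]
scaled = min_max(patient_flow, lo=min(patient_flow), hi=max(patient_flow))
categories, encoded = one_hot(["surgery", "cardiology", "surgery"])
```

Values seen later that fall outside the training-set range would map outside [0,1] under this sketch; whether to clip them is a design choice the text does not address.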
[0035] High-quality medical cost data verified manually over the past three years were selected as the basic training set (sample size ≥ 100,000 records). The data were split into a training set, validation set, and test set in a 7:2:1 ratio. To address the low proportion of abnormal / missing samples in medical data (usually < 5%), the SMOTE algorithm was used to balance the training set by generating synthetic minority-class samples, thus avoiding the missed detections and false positives caused by the model biasing towards the majority class.
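A minimal sketch of the 7:2:1 split is shown below, assuming a simple seeded shuffle; the text does not say whether the split is random, stratified, or time-ordered. SMOTE balancing is not shown here.

```python
import random


def split_7_2_1(records, seed=0):
    """Shuffle records reproducibly and split them 70% / 20% / 10%."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    n = len(shuffled)
    # Integer arithmetic avoids floating-point rounding at the boundaries.
    n_train = n * 7 // 10
    n_val = n * 2 // 10
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]
    return train, val, test


train, val, test = split_7_2_1(range(100))
```

In practice the oversampling step would be applied to the training portion only, so that the validation and test sets keep the original class distribution.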
[0036] S302: For handling abnormal data, instead of the traditional single random forest model, a hybrid approach of "random forest-AdaBoost ensemble model + secondary verification by medical business rules" is adopted to achieve accurate identification and differentiated correction of anomalies. Specifically, this includes:
(1) The random forest-AdaBoost ensemble model (an ensemble built with the AdaBoost algorithm on top of random forest) is used to calculate the anomaly probability of the data. By adaptively adjusting the sample weights, samples with large historical correction errors are given high weights, improving the model's ability to identify "low probability, high impact" anomalies in medical scenarios (such as cost peaks caused by difficult cases). When the anomaly probability is higher than a dynamic threshold, the data is determined to be abnormal data that needs to be corrected. The anomaly probability $P(x)$ is calculated as
$$P(x) = \sum_{t=1}^{T} \alpha_t \, p_t(x),$$
where $\alpha_t$ is the weight of the $t$-th decision tree, dynamically calculated by the AdaBoost algorithm from the tree's classification error (the smaller the error, the greater the weight); $T$ is the total number of decision trees; and $p_t(x)$ is the anomaly probability predicted by the $t$-th decision tree for sample $x$. The model input is the feature vector produced by S301, and the output is the anomaly probability of the data. The dynamic threshold $\theta$ (initial value 0.75) is calculated from $\mu$ and $\sigma$, where $\mu$ is the mean probability of historical outlier samples and $\sigma$ is the standard deviation of the probability of historical outlier samples; it is used to balance the accuracy and recall of anomaly identification.
[0037] (2) Based on business logic, the abnormal data to be corrected is divided into data error type abnormalities and business rationality type abnormalities; data error type abnormalities (such as input errors, system synchronization errors) are characterized by large deviations from historical data of the same department and the same project and no reasonable business cause; business rationality type abnormalities (such as difficult cases, temporary use of large equipment) are characterized by clear business cause markers (such as difficult case markers) and deviations within the business explainable range.
[0038] (3) For data error anomalies, the "quantile constraint correction method" is adopted. Based on historical data from the same department, the same service item, and the same period, the reasonable quantile interval is calculated by linear interpolation, in which $Q_p$ denotes the value at quantile $p$, $x_i$ is the value of the $i$-th sample, and $n$ is the historical sample size. This embodiment calculates the 90th percentile $Q_{0.9}$ and the 10th percentile $Q_{0.1}$ to form the reasonable interval $[Q_{0.1}, Q_{0.9}]$.
[0039] (4) Correct outliers to the value within the interval that is closest to the historical median, and mark them for manual review if the correction magnitude exceeds a preset threshold (30%). The correction magnitude is calculated as
$$\Delta = \frac{\lvert x' - x \rvert}{\lvert x \rvert} \times 100\%,$$
where $x'$ is the corrected value and $x$ is the original outlier.
[0040] (5) For reasonable business anomalies, retain the original values and add reasonable-anomaly markers and explanations of causes (e.g., "Difficult cases lead to a 25% increase in costs"). The original data is retained so that subsequent accounting can assign weights separately, which avoids losing effective business information through excessive correction.
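The weighted vote in step (1) of S302 can be sketched as below. The per-tree probabilities and weights are made-up inputs, and dividing by the weight sum so the result stays in [0,1] is an assumption of this sketch. The fixed initial threshold of 0.75 is used because the text gives only that value and says the dynamic threshold is derived from the mean and standard deviation of historical outlier probabilities.

```python
def anomaly_probability(tree_probs, tree_weights):
    """Weighted vote over per-tree anomaly probabilities.

    Weights are normalised to sum to 1 (an assumption of this sketch).
    """
    total = sum(tree_weights)
    if total <= 0:
        raise ValueError("tree weights must sum to a positive value")
    return sum(w * p for w, p in zip(tree_weights, tree_probs)) / total


def needs_correction(probability, threshold=0.75):
    """Flag a record when its anomaly probability exceeds the threshold."""
    return probability > threshold


# Hypothetical outputs of three trees and their AdaBoost-style weights.
prob = anomaly_probability([0.9, 0.6, 0.8], [2.0, 1.0, 1.0])
flagged = needs_correction(prob)
```

A record flagged here would then be classified by business logic as either a data error or a reasonable business anomaly before any correction is applied.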
[0041] The model's anomaly identification accuracy is ≥96% and error correction accuracy is ≥94%, which is 8-10 percentage points higher than the single random forest model. The misjudgment rate for business-related anomalies is ≤2%.
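The quantile constraint correction in steps (3) and (4) of S302 might look like the following sketch. Two points are assumptions: the interpolation convention (fractional rank $(n-1)p$ over the sorted samples, which is one common choice) and the reading of the correction rule as clamping the outlier to the nearest bound of the 10th to 90th percentile interval. The historical values are invented.

```python
def quantile(sorted_vals, p):
    """Linear-interpolated quantile; rank convention (n - 1) * p is assumed."""
    n = len(sorted_vals)
    h = (n - 1) * p          # fractional zero-based rank
    k = int(h)
    if k + 1 >= n:
        return float(sorted_vals[-1])
    return sorted_vals[k] + (h - k) * (sorted_vals[k + 1] - sorted_vals[k])


def correct_outlier(x, history, review_threshold=0.30):
    """Clamp x into [Q10, Q90] and report the relative correction magnitude."""
    vals = sorted(history)   # history is assumed to be non-empty
    low, high = quantile(vals, 0.10), quantile(vals, 0.90)
    corrected = min(max(x, low), high)
    if x == 0:
        magnitude = 0.0 if corrected == 0 else float("inf")
    else:
        magnitude = abs(corrected - x) / abs(x)
    # Corrections larger than the threshold are flagged for manual review.
    return corrected, magnitude, magnitude > review_threshold


history = list(range(100, 201, 10))  # 11 hypothetical historical cost values
corrected, magnitude, needs_review = correct_outlier(400, history)
```

Values already inside the interval pass through unchanged, so the function can be applied to every record flagged as a data error without a separate range check.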
[0042] S303: For high-priority missing data (such as equipment depreciation rate, consumable unit price), the traditional single KNN algorithm is abandoned, and a scenario-specific solution of "weighted adaptive KNN + temporal completion model" is adopted. This solution combines the correlation and temporal characteristics of medical data to improve completion accuracy, specifically including: (1) High-priority missing data is divided into random missing data (such as accidental omissions, with no obvious business correlation) and non-random missing data (such as newly purchased equipment without depreciation filing, and temporary consumables without synchronized unit price, with clear business scenario causes).
[0043] (2) For randomly missing data, a weighted adaptive KNN model is used for completion. When calculating the similarity of neighboring samples, strong correlation fields (department code, service item code) are given high weights with a weight ratio of ≥60%, while other weak correlation fields (such as seasonal factors and patient flow) are given low weights. The completion is performed by a weighted average, and the weights are obtained by weighting the business correlation similarity and the numerical feature cosine similarity between the randomly missing data and the neighboring samples. The weighted average formula is
$$\hat{x} = \frac{\sum_{i=1}^{K} w_i \, x_i}{\sum_{i=1}^{K} w_i},$$
where $\hat{x}$ is the completed value, $x_i$ is the value of the $i$-th neighboring sample, $K$ is the number of neighboring samples, and $w_i$ is the weight of the $i$-th neighboring sample, determined by both business relevance and feature similarity. The weight is calculated as
$$w_i = \alpha \, S_{b,i} + \beta \, S_{c,i},$$
where $\alpha = 0.7$ is the business association weight coefficient, $\beta = 0.3$ is the feature similarity weight coefficient, $S_{b,i}$ is the business relevance similarity (1 for a complete match of the strongly related fields, 0.5 for a partial match, and 0 for no match), and $S_{c,i}$ is the cosine similarity of the numerical features. This formula achieves a weighted fusion of business logic and feature similarity, improving completion accuracy by 12-15 percentage points compared to traditional KNN.
[0044] (3) For non-random missing data, the "time series model + business rule constraints" are used to complete the missing data. For fields with historical time series data (such as equipment depreciation rate), a lightweight LSTM model is introduced to predict missing values. At the same time, threshold constraints are imposed in conjunction with medical business rules (such as the depreciation rate of newly added equipment shall not exceed ±10% of the initial depreciation rate of the same type of equipment) to ensure that the completed values conform to the business logic. For fields without time series data (such as the unit price of temporary consumables), the latest inbound unit price of consumables of the same brand and specification is matched and the missing values are completed after correction with the price adjustment coefficient.
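For the randomly missing case in step (2) of S303, the weighted completion can be sketched as follows. The 0.7 / 0.3 coefficients and the 1 / 0.5 / 0 business similarity scores come from the text; the record layout, field names, and sample values are hypothetical, and numerical features are assumed to be already normalized to [0,1] so that cosine similarity is non-negative.

```python
import math

ALPHA, BETA = 0.7, 0.3  # weight coefficients stated in the text
STRONG_FIELDS = ("department_code", "service_item_code")  # assumed names


def business_similarity(a, b):
    """1.0 for a full match on the strong fields, 0.5 partial, 0.0 none."""
    matches = sum(1 for f in STRONG_FIELDS if a.get(f) == b.get(f))
    if matches == len(STRONG_FIELDS):
        return 1.0
    return 0.5 if matches > 0 else 0.0


def cosine_similarity(u, v):
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0


def impute(target, neighbours, field, k=5):
    """Weighted average of `field` over the k highest-weight neighbours."""
    scored = []
    for nb in neighbours:
        w = (ALPHA * business_similarity(target, nb)
             + BETA * cosine_similarity(target["features"], nb["features"]))
        scored.append((w, nb[field]))
    top = sorted(scored, key=lambda pair: pair[0], reverse=True)[:k]
    total = sum(w for w, _ in top)
    if total == 0:
        raise ValueError("no neighbour has positive weight")
    return sum(w * value for w, value in top) / total


target = {"department_code": "D01", "service_item_code": "1102020010",
          "features": [1.0, 0.0]}
neighbours = [
    {"department_code": "D01", "service_item_code": "1102020010",
     "features": [1.0, 0.0], "unit_price": 10.0},
    {"department_code": "D01", "service_item_code": "9999999999",
     "features": [0.0, 1.0], "unit_price": 20.0},
]
filled = impute(target, neighbours, "unit_price")
```

In this example the first neighbour gets weight 1.0 and the second 0.35, so the completed unit price is pulled strongly toward the neighbour that matches on both department and item.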
[0045] The completed data must meet the format and business validation rules of the rule engine, and the deviation rate between the completed value and the data in the same scenario must be calculated. When the deviation rate exceeds 20%, a secondary completion is triggered, in which the completion model is replaced and the value is recalculated. The deviation rate is calculated as
$$D = \frac{\lvert \hat{x} - \bar{x} \rvert}{\bar{x}} \times 100\%,$$
where $\hat{x}$ is the completed value and $\bar{x}$ is the average of historical data from the same department, service item, and time period. For non-randomly missing time-series fields, additional business rule threshold constraints are added. Taking the equipment depreciation rate as an example, the constraint is
$$\lvert r - r_0 \rvert \le 10\% \times r_0,$$
where $r$ is the completed depreciation rate and $r_0$ is the initial depreciation rate of equipment of the same type. This ensures that the completed data conforms to both time-series forecasting patterns and the business standards for medical fixed asset depreciation.
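The post-completion checks described above can be illustrated with a short sketch. The 20% deviation limit and the ±10% depreciation bound come from the text; enforcing the bound by clamping the predicted value is an assumption, since the text states only that the constraint is imposed.

```python
def deviation_rate(completed, historical_mean):
    """Relative deviation of a completed value from the same-scenario mean."""
    return abs(completed - historical_mean) / abs(historical_mean)


def needs_second_completion(completed, historical_mean, limit=0.20):
    """True when the deviation rate exceeds the 20% limit."""
    return deviation_rate(completed, historical_mean) > limit


def clamp_depreciation_rate(predicted, initial_rate, tolerance=0.10):
    """Keep a predicted rate within +/-10% of the initial rate (by clamping)."""
    low = initial_rate * (1 - tolerance)
    high = initial_rate * (1 + tolerance)
    return min(max(predicted, low), high)


# Hypothetical values: a completed cost of 130 against a historical mean of 100.
retry = needs_second_completion(130, 100)
bounded = clamp_depreciation_rate(0.12, 0.10)
```

When `needs_second_completion` returns true, the text calls for switching to a different completion model and recalculating rather than simply adjusting the value.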
[0046] To adapt to the cost data characteristics of different hospitals and departments, a dynamic optimization mechanism for the machine learning model is established and linked to the subsequent S5 feedback optimization process: (1) each quarter, manually verified correction data (including both correctly and incorrectly corrected examples) are added to the training set, and old samples (more than 5 years old) are removed so that the model adapts to changes in data distribution; (2) based on the accounting deviation data from the feedback loop, the anomaly probability threshold of the ensemble model, the number of neighbors of the KNN model (dynamically adjusted from Top-5 to Top-8), and the time window length of the LSTM model are adjusted automatically to improve the model's adaptability in different scenarios; (3) new medical business features (such as medical insurance policy adjustment identifiers and department level coefficients) can be added by supplying feature parsing rules in configuration files, without modifying the core model code, which gives the mechanism good scalability.
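The quarterly refresh and the neighbor-count adjustment can be sketched as below. The record layout (`recorded_on`), the function names, and in particular the adjustment policy (one step up when the accounting deviation exceeds a target, one step down otherwise) are hypothetical; the source states only that the neighbor count is adjusted within Top-5 to Top-8 based on accounting deviation.

```python
from datetime import date

MAX_AGE_YEARS = 5  # samples older than this are dropped, per [0046]

def years_ago(d, years):
    """Same calendar day `years` earlier; Feb 29 falls back to Feb 28."""
    try:
        return d.replace(year=d.year - years)
    except ValueError:
        return d.replace(year=d.year - years, day=28)

def refresh_training_set(training_set, verified_corrections, today):
    """Drop samples older than MAX_AGE_YEARS, then append manually verified corrections."""
    cutoff = years_ago(today, MAX_AGE_YEARS)
    kept = [s for s in training_set if s["recorded_on"] >= cutoff]
    return kept + list(verified_corrections)

def adjust_neighbors(current_k, accounting_deviation, target=0.05, k_min=5, k_max=8):
    """Hypothetical policy: widen the neighborhood when deviation is above target,
    narrow it otherwise, always staying within [k_min, k_max]."""
    step = 1 if accounting_deviation > target else -1
    return max(k_min, min(k_max, current_k + step))
```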
[0047] As a refinement of the above embodiments, step S4, based on national medical service-related standards and cost accounting requirements, constructs a standardized medical cost data system to achieve deep data standardization, specifically including three core mappings: Standardization of project names and codes: Establish a mapping dictionary between the "National Medical Service Project Code" and the hospital's internal colloquial names, unify all project names to standardized names, and unify codes to national standard codes (for example, map "electrocardiogram" to "routine electrocardiogram examination", and map the code to "1102020010").
[0048] Cost classification standardization: The cost items in the original data are classified and mapped into four categories: "human resource costs, fixed asset depreciation costs, non-chargeable consumable costs, and costs allocated to logistics and medical support departments", which solves the problem of inconsistent cost classification standards in different departments.
[0049] Unit standardization: unify the units of similar data from different units (for example, convert the units such as "package" and "box" of "consumable usage" into "pieces" according to specifications to facilitate subsequent accounting).
[0050] A standardized mapping module written in Python is used to match the cleaned valid data with the above-mentioned standardization system and output a standardized cost dataset.
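A minimal sketch of such a mapping module is shown below, covering the three mappings (item name and code, cost classification, unit). Only the electrocardiogram entry and the four cost categories come from the text; the raw cost labels, the consumable name, the pieces-per-box figure, and all identifiers are hypothetical.

```python
# Item dictionary: colloquial name -> (standard name, national code); entry from [0047].
ITEM_DICTIONARY = {
    "electrocardiogram": ("routine electrocardiogram examination", "1102020010"),
}
# Hypothetical raw cost labels mapped to the four categories named in [0048].
COST_CATEGORIES = {
    "salary": "human resource costs",
    "equipment depreciation": "fixed asset depreciation costs",
    "gauze": "non-chargeable consumable costs",
    "logistics allocation": "costs allocated to logistics and medical support departments",
}
# Hypothetical packaging specifications: (consumable, unit) -> pieces per unit.
PIECES_PER_UNIT = {("syringe", "box"): 100}

def standardize(record):
    """Return a copy of the record with standardized name, code, category, and unit.

    `matched` is False when any of the three mappings has no entry.
    """
    out = dict(record)
    entry = ITEM_DICTIONARY.get(record["item_name"].strip().lower())
    category = COST_CATEGORIES.get(record["cost_type"])
    if record["unit"] == "pieces":
        factor = 1
    else:
        factor = PIECES_PER_UNIT.get((record["consumable"], record["unit"]))
    if entry is not None:
        out["item_name"], out["item_code"] = entry
    if category is not None:
        out["cost_category"] = category
    if factor is not None:
        out["usage"] = record["usage"] * factor
        out["unit"] = "pieces"
    out["matched"] = entry is not None and category is not None and factor is not None
    return out
```

Keeping unmatched records with a `matched` flag, rather than dropping them, leaves enough information to compute a standardized matching rate afterwards.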
[0051] As a refinement of the above embodiment, in step S5, the standardized cost dataset undergoes final verification, and the data quality indicators (including outlier rate, missing value rate, and standardized matching rate) before and after cleaning are compared. If the indicators do not reach the preset thresholds (e.g., outlier rate < 0.5%), the process returns to step S2 to readjust the rule engine parameters.
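The quality check can be sketched as follows. The outlier-rate threshold of 0.5% is taken from the text; the thresholds for the missing value rate and the standardized matching rate, the record layout, and the function names are hypothetical placeholders.

```python
OUTLIER_RATE_MAX = 0.005   # from [0051]: outlier rate < 0.5%
MISSING_RATE_MAX = 0.01    # hypothetical threshold
MATCH_RATE_MIN = 0.99      # hypothetical threshold

def quality_indicators(records):
    """Compute the three quality indicators over a list of cleaned records."""
    n = len(records)
    if n == 0:
        raise ValueError("empty dataset")
    return {
        "outlier_rate": sum(1 for r in records if r["is_outlier"]) / n,
        "missing_rate": sum(1 for r in records if r["value"] is None) / n,
        "match_rate": sum(1 for r in records if r["matched"]) / n,
    }

def passes_quality_check(indicators):
    """False means the flow should go back to rule screening and retune the rule engine."""
    return (indicators["outlier_rate"] < OUTLIER_RATE_MAX
            and indicators["missing_rate"] < MISSING_RATE_MAX
            and indicators["match_rate"] >= MATCH_RATE_MIN)
```

Running `quality_indicators` on the dataset before and after cleaning gives the before/after comparison the paragraph describes.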
[0052] The dynamic optimization of the hybrid machine learning model specifically includes: adding corrected data that has been manually reviewed and confirmed to the training set every quarter, removing old samples that have exceeded the preset age, and automatically adjusting the anomaly probability threshold of the ensemble model and the number of neighbors of the KNN model based on the accounting deviation.
[0053] Example 2: This disclosure also provides a medical service cost accounting data cleaning and standardization apparatus, including a processor and a memory. Optionally, the apparatus may further include a communication interface and a bus. The processor, communication interface, and memory can communicate with each other via the bus. The communication interface can be used for information transmission. The processor can invoke logical instructions in the memory to execute the medical service cost accounting data cleaning and standardization method of the above embodiments.
[0054] Furthermore, the logical instructions in the aforementioned memory can be implemented as software functional units and, when sold or used as independent products, can be stored in a computer-readable storage medium.
[0055] Memory, as a computer-readable storage medium, can be used to store software programs and computer-executable programs, such as program instructions / modules corresponding to the methods in the embodiments of this disclosure. The processor executes the program instructions / modules stored in the memory to perform functional applications and data processing, thereby implementing the medical service cost accounting data cleaning and standardization method described in the above embodiments.
[0056] The memory may include a program storage area and a data storage area. The program storage area may store the operating system and applications required for at least one function; the data storage area may store data created based on the use of the terminal device. Furthermore, the memory may include high-speed random access memory and may also include non-volatile memory.
[0057] This disclosure provides a computer-readable storage medium storing computer-executable instructions configured to perform the aforementioned medical service cost accounting data cleaning and standardization method.
[0058] The aforementioned computer-readable storage medium may be a transitory computer-readable storage medium or a non-transitory computer-readable storage medium.
[0059] The technical solutions of this disclosure can be embodied in the form of a software product. This computer software product is stored in a storage medium and includes one or more instructions to cause a computer device (which may be a personal computer, server, network device, etc.) to execute all or part of the steps of the method described in this disclosure. The aforementioned storage medium can be a non-transitory storage medium, including a USB flash drive, a portable hard drive, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, an optical disk, or other media capable of storing program code. It can also be a transitory storage medium.
[0060] Finally, it should be noted that the above descriptions are merely preferred embodiments of the present invention and are not intended to limit the present invention. Although the present invention has been described in detail with reference to the foregoing embodiments, those skilled in the art can still modify the technical solutions described in the foregoing embodiments or make equivalent substitutions for some of the technical features. Any modifications, equivalent substitutions, improvements, etc., made within the spirit and principles of the present invention should be included within the protection scope of the present invention.
Claims
1. A method for cleaning and standardizing medical service cost accounting data, characterized in that it includes the following steps:
S1. Construct a multi-source data interface to read raw data from the hospital's business systems, extract core fields, and unify them into a standard field set based on a preset field mapping table to form a structured raw dataset;
S2. Filter the structured raw dataset using preset format validation rules, business validation rules, and integrity validation rules, dividing the data into valid data, abnormal data to be processed, high-priority missing data, and low-priority missing data;
S3. For the abnormal data to be processed and the high-priority missing data, construct a feature vector containing medical business semantic features and input it into a hybrid machine learning model for processing; wherein, for the abnormal data to be processed, an ensemble model identifies the anomaly probability and, combined with business logic classification, quantile constraint correction is performed or a reasonable-anomaly label is added, and a weighted adaptive KNN model or a time series model is used to complete the high-priority missing data;
S4. Standardize the processed data according to national medical service standards for item names and codes, cost classification, and units, and output a standardized cost dataset;
S5. Perform quality indicator verification on the standardized cost dataset; if the quality indicators fail to meet the standard, adjust the rule engine parameters and supplement the training set with corrected data that has been manually reviewed and confirmed, so as to dynamically optimize the hybrid machine learning model.
2. The method for cleaning and standardizing medical service cost accounting data according to claim 1, characterized in that, in step S1, the business system includes a hospital information system (HIS), a materials management system, and a financial system; the format of the raw data includes Oracle database tables, Excel files, and XML messages; and the core fields include medical service item code, medical service item name, department code, department name, consumable usage, consumable unit price, staff wages, equipment depreciation duration, equipment depreciation rate, and revenue and expenditure amount.
3. The method for cleaning and standardizing medical service cost accounting data according to claim 2, characterized in that the process of constructing the feature vector in step S3 includes: extracting basic field features and medical business semantic features, including business correlation coefficients, time dimension features, and anomaly cause features; applying Min-Max normalization to numerical features and one-hot encoding to categorical features; and introducing L1 regularization for feature selection, removing redundant features and retaining the core effective features to form the final feature vector.
4. The method for cleaning and standardizing medical service cost accounting data according to claim 3, characterized in that the processing of the abnormal data to be processed in step S3 specifically includes: using the random forest-AdaBoost ensemble model to calculate the anomaly probability of the data and, when the anomaly probability is higher than a dynamic threshold, identifying the data as anomalous data to be corrected, the calculation formula being as follows: $P(x) = \sum_{t=1}^{T} \alpha_t \, p_t(x)$, where $\alpha_t$ is the weight of the $t$-th decision tree, $T$ is the total number of decision trees, and $p_t(x)$ is the anomaly probability predicted by the $t$-th decision tree for sample $x$; the dynamic threshold $\theta$ is calculated from $\mu$ and $\sigma$, where $\mu$ is the mean anomaly probability of historical outlier samples and $\sigma$ is the standard deviation of the anomaly probability of historical outlier samples; based on business logic, dividing the anomalous data to be corrected into data-error anomalies and business-reasonable anomalies; for data-error anomalies, calculating a reasonable quantile interval by linear interpolation over historical data from the same department, service item, and time period, in which the $p$-quantile $Q_p$ is interpolated linearly between adjacent sorted sample values $x_i$ of the $n$ historical samples; correcting each outlier to the value within that interval closest to the historical median, and marking it for manual review if the correction magnitude exceeds a preset threshold, the correction magnitude being calculated from $x'$ and $x$, where $x'$ is the corrected value and $x$ is the original outlier; and, for business-reasonable anomalies, retaining the original values and adding reasonable-anomaly markers and explanations of the causes.
5. The method for cleaning and standardizing medical service cost accounting data according to claim 2, characterized in that the processing of high-priority missing data in step S3 specifically includes: categorizing the high-priority missing data into randomly missing data and non-randomly missing data; for randomly missing data, using a weighted adaptive KNN model for completion, wherein, when calculating the similarity between neighboring samples, strongly correlated fields are given high weights and weakly correlated fields are given low weights, completion is performed by a weighted average formula, and the weights are obtained by weighting the business correlation similarity and the numerical feature cosine similarity between the randomly missing data and the neighboring samples; and, for non-randomly missing data, if historical time-series data exists, using a lightweight LSTM model for prediction and applying threshold constraints in conjunction with medical business rules, and, if no time-series data exists, using the latest data from similar products to fill in the missing data.
6. The method for cleaning and standardizing medical service cost accounting data according to claim 5, characterized in that the threshold constraints of the medical business rules include: the completed value of the equipment depreciation rate is limited to deviating by no more than 10% from the initial depreciation rate of the same type of equipment.
7. The method for cleaning and standardizing medical service cost accounting data according to claim 6, characterized in that the high-priority missing data is validated after completion, including: verifying whether the completed data meets the format validation rules, business validation rules, and integrity validation rules; calculating the deviation rate between the completed data and the data of the same department, the same medical service item, and the same time period, and triggering a secondary completion if the deviation rate exceeds 20%; and applying additional business rule threshold constraints to non-randomly missing fields that have historical time-series data.
8. The method for cleaning and standardizing medical service cost accounting data according to claim 2, characterized in that the quality indicators in step S5 include outlier rate, missing value rate, and standardized matching rate; the dynamic optimization of the hybrid machine learning model specifically includes: adding corrected data that has been manually reviewed and confirmed to the training set every quarter, removing old samples that have exceeded the preset age, and automatically adjusting the anomaly probability threshold of the ensemble model and the number of neighbors of the KNN model based on the accounting deviation.
9. A medical service cost accounting data cleaning and standardization device, comprising a processor and a memory storing program instructions, characterized in that the processor is configured to perform the medical service cost accounting data cleaning and standardization method as described in any one of claims 1-8 when running the program instructions.
10. A computer-readable storage medium, characterized in that it stores a computer program that, when executed by a processor, implements the medical service cost accounting data cleaning and standardization method as described in any one of claims 1-8.