A multi-source data integrity reconstruction model based on multi-algorithm fusion
The multi-source data integrity reconstruction model, which integrates multiple algorithms, solves the problems of accuracy, algorithm adaptability, and result reliability in completing multi-source business data in the hotel industry. It realizes an efficient and automated data completion process and is applicable to multi-industry, multi-source, heterogeneous data.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Applications(China)
- Current Assignee / Owner
- GUIZHOU JINGZHUN DIGITAL TECHNOLOGY CO LTD
- Filing Date
- 2026-03-12
- Publication Date
- 2026-06-19
AI Technical Summary
Existing technologies for completing multi-source business data in the hotel industry suffer from problems such as low completion accuracy, poor algorithm adaptability, low parameter adjustment efficiency, and insufficient result reliability, making it difficult to meet the enterprise's demand for high-quality data.
A multi-source data integrity reconstruction model based on multi-algorithm fusion is adopted, including a data preprocessing module, an algorithm optimization module, a completion execution module, and an error verification module. Through a multi-dimensional error evaluation system, the model realizes data distribution characteristic analysis, algorithm matching, and adaptive parameter solving, and automates the data completion process.
It significantly improves the accuracy and flexibility of data completion, increases the efficiency of parameter adjustment, ensures the reliability and traceability of completion results, adapts to heterogeneous data from multiple industries and sources, and lowers the threshold for enterprise application.
Figure CN122241570A_ABST
Abstract
Description
Technical Field
[0001] This invention relates to the fields of data processing and artificial intelligence technology, specifically to a multi-source data integrity reconstruction model based on multi-algorithm fusion.
Background Technology
[0002] With the deepening of digital transformation, industries such as hotels have accumulated massive amounts of multi-source business data, covering core dimensions such as financial data, operational data, market data, and risk data. This type of data is an important foundation for enterprises to conduct business analysis, strategic decision-making, and risk control. However, in the actual data collection, storage, and transmission process, data integrity faces many prominent problems, which seriously affect the value of data application. Specifically, this manifests as: frequent data loss (caused by equipment failure, manual entry omissions, system interface anomalies, etc.), rampant abnormal data (extreme values, entry errors, data distortion, etc.), and strong heterogeneity of multi-source data (inconsistent formats, inconsistent indicator definitions).
[0003] Existing data integrity restoration methods have several shortcomings that make it difficult for them to meet the practical needs of enterprises. The main deficiencies are as follows:
1. Limited completion algorithms and poor adaptability: Traditional data completion methods often use a single algorithm such as mean filling or ordinary KNN, which does not take into account the complex distribution characteristics of multi-source heterogeneous data (such as skewed and heavy-tailed distributions). For non-normally distributed business data (such as hotel revenue and passenger flow), the accuracy of the completion results is very low and the results do not match the inherent patterns of the data.
2. No automatic algorithm matching and reliance on manual intervention: Existing methods do not establish an automatic matching mechanism between algorithms and data features, so completion algorithms must be selected manually according to data type. Model parameters (such as the number of KNN nearest neighbors and kernel function parameters) also require tedious manual calibration. When data threshold ranges or business indicator dimensions change, the parameters must be readjusted, resulting in low efficiency and poor flexibility.
3. Insufficient adaptation to data distribution and weak error control: Traditional methods are mostly designed on the assumption of a normal distribution, so the completion error is large for skewed and heavy-tailed business data. Moreover, no systematic mechanism for analyzing distribution characteristics has been established, so the data distribution type cannot be identified accurately and an appropriate completion strategy cannot be selected.
4. No result verification system and no guarantee of reliability: Existing methods lack a multi-dimensional error assessment system, so the reliability of the completion results cannot be quantified. The reasonableness of the completed data is not verified, which may lead to deviations in subsequent business analysis and decision-making, or even cause business risks.
[0004] Therefore, given the aforementioned shortcomings of existing technologies, there is an urgent need for a multi-source data integrity reconstruction technology that integrates multiple algorithms, adapts to data characteristics, solves parameters efficiently, and produces verifiable results. This technology aims to address the deficiencies of traditional methods in completion accuracy, algorithm adaptability, parameter adjustment efficiency, and result reliability for multi-source business data in the hotel industry, thereby meeting enterprises' need for high-quality data.
Summary of the Invention
[0005] To address the shortcomings of existing technologies in accuracy, algorithm adaptability, parameter adjustment efficiency, and result reliability when completing multi-source business data in the hotel industry, this invention provides a multi-source data integrity reconstruction model based on multi-algorithm fusion, comprising: a data preprocessing module, an algorithm optimization module, a completion execution module, an error verification module, and a result output module.
The data preprocessing module is used to preprocess multi-source raw data to generate a dataset to be reconstructed and transmit it to the algorithm optimization module; the dataset to be reconstructed includes core indicators and valid samples; the preprocessing includes data integration, data transformation, outlier identification, and data cleaning.
The algorithm optimization module receives the dataset to be reconstructed, loads a multi-algorithm fusion completion model to analyze the data distribution characteristics and extract core indicator features of the dataset to be reconstructed, formulates algorithm matching rules, and outputs the optimal completion algorithm matched to the dataset to be reconstructed to the completion execution module; the core indicator features include missing rate, distribution characteristics, correlation strength, and dispersion.
The completion execution module is used to receive the matched optimal completion algorithm and perform adaptive parameter solving on it to achieve automatic calibration; based on the calibrated algorithm, it performs the completion operation on the dataset to be reconstructed to generate a preliminary reconstructed dataset, and transmits it to the error verification module.
The error verification module is used to receive the preliminary reconstructed dataset, perform a multi-dimensional error comprehensive evaluation on it, and generate a multi-dimensional error comprehensive evaluation result; when the result is qualified, the corresponding preliminary reconstructed dataset is output as the final reconstructed dataset to the result output module.
The result output module is used to receive the final reconstructed dataset and generate an evaluation report based on it. The evaluation report includes an overall overview of the reconstruction process, the multi-dimensional error evaluation results of each indicator feature, outlier identification records, the iterative optimization process, and the final reconstructed dataset.
[0006] Furthermore, the multi-dimensional error comprehensive assessment is implemented based on a multi-dimensional error assessment system, which includes distribution consistency assessment, threshold proportion assessment, error quantification assessment, and level matching assessment; the rules for the multi-dimensional error comprehensive assessment are as follows: When all evaluation results of the multi-dimensional error evaluation system are qualified, the comprehensive multi-dimensional error evaluation result is qualified; when any evaluation result of the multi-dimensional error evaluation system is unqualified, the comprehensive multi-dimensional error evaluation result is unqualified, triggering a fallback completion mechanism to re-match the completion algorithm, perform completion operation and error verification on the unqualified data column until all multi-dimensional error evaluation results are qualified.
[0007] Furthermore, the method for determining whether the evaluation result of the multi-dimensional error evaluation system is qualified is as follows: Set multi-dimensional error assessment thresholds, including distribution consistency assessment threshold, threshold proportion assessment threshold, error quantification assessment threshold, and level matching assessment threshold; When the multi-dimensional error evaluation result of the preliminary reconstructed dataset exceeds the multi-dimensional error evaluation threshold, it is judged as unqualified, and the fallback completion mechanism is triggered. When the multi-dimensional error evaluation result of the preliminary reconstructed dataset does not exceed the multi-dimensional error evaluation threshold, it is judged as qualified, and the corresponding preliminary reconstructed dataset is output.
[0008] Furthermore, the fallback completion mechanism is set in the algorithm optimization module. The fallback completion mechanism is as follows: for the data columns whose multi-dimensional error evaluation results are unqualified, match the GNN-KNN model, re-execute the completion operation and error verification, until all the multi-dimensional error evaluation results are qualified.
[0009] Furthermore, the multi-algorithm fusion completion model includes a GNN-KNN model and a multi-variant KNN model. The GNN-KNN model converts the tabular data in the dataset to be reconstructed into a graph structure, extracts feature indicators, and realizes K-nearest neighbor query based on Euclidean distance calculation. The multi-variant KNN model includes three variant models: the original KNN, the RBF kernel KNN, and the Sigmoid kernel KNN. Specifically, the variant model of the original KNN directly calculates the Euclidean distance after standardizing the dataset to be reconstructed; the variant model of the RBF kernel KNN optimizes the distance calculation through the RBF kernel function; and the variant model of the Sigmoid kernel KNN calculates the distance after transforming the data through the Sigmoid function.
[0010] Furthermore, the data distribution characteristic analysis includes the following: The data distribution characteristics are analyzed using a dual-dimensional approach of confidence interval (CI) and highest density interval (HDI). A distribution coverage threshold is set, and the HDI / CI width ratio is calculated. The data distribution type is then determined based on the HDI / CI width ratio: 0.8 ≤ HDI / CI width ratio ≤ 1.2 indicates normally distributed data; an HDI / CI width ratio > 1.2 indicates skewed / heavy-tailed data; an HDI / CI width ratio < 0.8 indicates highly concentrated data.
[0011] Furthermore, the matching rules of the algorithm include the following: For skewed / heavy-tailed distribution data in the dataset to be reconstructed, if the correlation strength is > 0.4, the GNN-KNN model is matched; if 0.2 < correlation strength ≤ 0.4, the RBF kernel KNN model is matched; if the correlation strength ≤ 0.2, the Sigmoid kernel KNN model is matched. For normally distributed data in the dataset to be reconstructed, if the missing rate is ≤0.3, the original KNN model is matched; if the missing rate is >0.3, the GNN-KNN model is matched. For highly concentrated data in the dataset to be reconstructed, the original KNN model is matched. If the number of non-empty samples in the dataset to be reconstructed is less than 2×K, then the GNN-KNN model is matched, where K is the number of nearest neighbors.
[0012] Furthermore, the outlier identification includes the identification of extreme values and missing values. The identification of extreme values is carried out by using the interquartile range (IQR) to identify the extreme values of each indicator. The identification of missing values is carried out by scanning the multi-source raw data column by column to identify missing values. The extreme values include values that are lower than Q1-1.5IQR or higher than Q3+1.5IQR. The missing values include null values, empty strings, and invalid identifiers.
[0013] Furthermore, the valid samples are data columns with a missing rate ≤ a preset missing rate; the expression for calculating the missing rate is: Missing rate = number of missing values / total number of data samples.
[0014] Furthermore, the parameter adaptive solution involves setting differentiated parameter adaptive solution logic for the different models in the multi-algorithm fusion completion model. The adaptive parameter solving logic of the GNN-KNN model is to set the hidden layer dimension, output dimension, and nearest neighbor number K, and to dynamically adjust the number of undirected edges in the graph structure based on the number of samples. The adaptive parameter solving logic of the multi-variant KNN model is to standardize the data, set the kernel function parameters for the RBF kernel KNN and the Sigmoid kernel KNN, and dynamically fine-tune the kernel function parameters based on data dispersion.
[0015] The beneficial effects of this invention are: (1) Significantly improved completion accuracy: By integrating the GNN-KNN and multi-variant KNN algorithms, the model automatically matches data with different distribution characteristics to the most suitable completion algorithm. This overcomes the limitations of traditional single-algorithm methods; the completion results conform more closely to the inherent patterns of the data and to business logic, which effectively reduces the completion error.
[0016] (2) Significant adaptability and flexibility: It adopts an automatic optimization mechanism based on data features, which completes algorithm selection and parameter calibration without manual intervention and adapts to multi-source heterogeneous data in multiple industries such as hotels.
(3) Greatly improved parameter solving efficiency: Through data distribution analysis and the parameter adaptive solving mechanism, all model parameters are calculated and fine-tuned automatically, reducing the parameter adjustment time from several hours in the traditional method to minutes, which significantly improves the deployment efficiency and iteration speed of the model.
[0017] (4) Improved result reliability and traceability: Through a multi-dimensional error evaluation system, the reliability of the completion results is verified from multiple perspectives such as distribution consistency, error quantification, and level matching, ensuring that the completed data conforms to business logic. At the same time, detailed evaluation reports and algorithm selection criteria are output to make the completion results traceable and verifiable, providing reliable data support for subsequent business decisions.
(5) Strong engineering practicality and low implementation cost: The model is developed in Python and relies on mature open source frameworks such as pandas, numpy, and PyTorch. It has low development cost, is easy to maintain, supports input and output in multiple data file formats, adapts to the existing data storage formats of enterprises, and does not require modification of the existing data system. It can be implemented directly in engineering, lowers the application threshold for enterprises, and is also suitable for multiple industry scenarios such as hotels, catering, and retail, giving it strong versatility and a wide application range.
Attached Figure Description
[0018] Figure 1 is a schematic diagram of the multi-source data integrity reconstruction model architecture based on multi-algorithm fusion provided by the present invention;
Figure 2 is a framework diagram of the multi-algorithm fusion completion model provided by the present invention;
Figure 3 is a diagram showing the matching relationship between data distribution characteristics and algorithms provided by the present invention;
Figure 4 is an evaluation relationship diagram of the multi-dimensional error evaluation system provided by the present invention;
Figure 5 is a comparison chart of the data distribution before and after completion for the three typical indicators provided by the present invention;
Figure 6 is a comparison chart of the errors of multiple algorithms on the three typical indicators provided by the present invention.
Detailed Implementation
[0019] The technical solution of the present invention is further described below, but the scope of protection is not limited to what is described.
[0020] This invention provides a multi-source data integrity reconstruction model based on multi-algorithm fusion. As shown in Figure 1, it includes: a data preprocessing module P100, an algorithm optimization module P200, a completion execution module P300, an error verification module P400, and a result output module P500. The data preprocessing module P100 is used to preprocess multi-source raw data to generate a dataset to be reconstructed and transmit it to the algorithm optimization module. The dataset to be reconstructed includes core indicators and valid samples. The preprocessing includes data integration, data transformation, outlier identification, and data cleaning. The outlier identification includes the identification of extreme values and missing values. Extreme values are identified for each indicator using the interquartile range (IQR). Missing values are identified by scanning each column of the multi-source raw data. The extreme values include values that are lower than Q1-1.5IQR or higher than Q3+1.5IQR. The missing values include null values, empty strings, and invalid identifiers.
[0021] The valid samples are data columns with a missing rate ≤ a preset missing rate; the expression for calculating the missing rate is: Missing rate = number of missing values / total number of data samples.
[0022] Taking specific business scenarios such as hotels as an example, the data integration includes: loading raw data from multiple sources, merging datasets, removing redundant columns, linearly correlated columns and duplicate data, retaining core business indicators (such as financial, operational, marketing, risk, etc.), and ensuring the effectiveness of data dimensions; The data transformation includes: calculating derived indicators based on business logic (such as average monthly operating revenue = cumulative operating revenue for the current period / cumulative months of operating data for the current period, fixed costs = monthly labor costs + monthly rent + amortization of decoration costs, etc.); and encoding category features such as "hotel type" and "property type" using one-hot encoding or numerical mapping methods to unify data format and indicator caliber and eliminate the influence of dimensions.
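The two derived indicators and the category encoding described above can be illustrated with a minimal sketch. The field names (`cumulative_revenue`, `monthly_rent`, and so on) and the sample figures are hypothetical and chosen only for illustration; the source does not specify a schema, and a production version would more likely operate on pandas DataFrames.

```python
from typing import Dict, List


def derive_indicators(record: Dict[str, float]) -> Dict[str, float]:
    """Return a copy of one hotel record with the two derived indicators added."""
    out = dict(record)
    # Average monthly operating revenue = cumulative revenue / cumulative months.
    out["avg_monthly_revenue"] = record["cumulative_revenue"] / record["cumulative_months"]
    # Fixed costs = monthly labor costs + monthly rent + amortization of decoration costs.
    out["fixed_costs"] = (
        record["monthly_labor_cost"]
        + record["monthly_rent"]
        + record["decoration_amortization"]
    )
    return out


def one_hot(values: List[str]) -> List[Dict[str, int]]:
    """One-hot encode a categorical column such as "hotel type"."""
    categories = sorted(set(values))
    return [{category: int(value == category) for category in categories} for value in values]


# Hypothetical example record and category column.
record = {
    "cumulative_revenue": 1_200_000.0,
    "cumulative_months": 12.0,
    "monthly_labor_cost": 30_000.0,
    "monthly_rent": 20_000.0,
    "decoration_amortization": 5_000.0,
}
derived = derive_indicators(record)
encoded = one_hot(["business", "resort", "business"])
```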
[0023] The outlier identification process includes: using the interquartile range (IQR) to identify extreme values for each indicator (values below Q1-1.5IQR or above Q3+1.5IQR are considered extreme values); scanning each column of the multi-source raw data to identify typical missing value types such as null values, empty strings, and invalid identifiers (e.g., NA, NaN, -999, etc.); automatically counting the number of missing values for each indicator and calculating the missing rate (number of missing values / total number of samples); retaining valid indicator columns whose missing rate is ≤ a preset threshold (default 0.5); and removing invalid data columns. The data cleaning process includes: removing invalid data rows where all indicators are empty, correcting obviously erroneous values (such as converting "<1km" in the "distance from business district" field to 500m), standardizing the data format, and finally forming a unified and valid dataset to be reconstructed.
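As a rough illustration of the outlier identification step, the sketch below computes a column's missing rate and flags IQR-based extreme values using only the standard library. It reads the parenthetical default (missing rate ≤ 0.5) as the column retention rule, which is an interpretation. The helper names and the choice of the `inclusive` quantile method are assumptions; the source does not say how quartiles are computed.

```python
import statistics
from typing import List, Sequence, Tuple

# Invalid identifiers named in the text; real data may need a longer list.
MISSING_MARKERS = {"", "NA", "NaN", "-999"}


def is_missing(value: object) -> bool:
    """True for null values, empty strings, and invalid identifiers."""
    if value is None:
        return True
    if isinstance(value, float) and value != value:  # NaN is the only float unequal to itself
        return True
    if isinstance(value, (int, float)) and value == -999:
        return True
    return str(value).strip() in MISSING_MARKERS


def missing_rate(column: Sequence[object]) -> float:
    """Missing rate = number of missing values / total number of samples."""
    return sum(is_missing(v) for v in column) / len(column)


def keep_column(column: Sequence[object], max_missing: float = 0.5) -> bool:
    """Retain a column only if its missing rate does not exceed the threshold."""
    return missing_rate(column) <= max_missing


def iqr_bounds(values: Sequence[float]) -> Tuple[float, float]:
    """Lower and upper fences Q1 - 1.5*IQR and Q3 + 1.5*IQR."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr


def extreme_values(values: Sequence[float]) -> List[float]:
    """Values lying outside the IQR fences."""
    low, high = iqr_bounds(values)
    return [v for v in values if v < low or v > high]


# Hypothetical occupancy column with four missing entries out of eight.
occupancy = [0.70, None, "", "NA", 0.65, -999, 0.72, 0.68]
rate = missing_rate(occupancy)
flagged = extreme_values([10, 12, 11, 13, 12, 11, 100])
```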
[0024] The final dataset consists of 18 core indicators and 5,260 valid samples, categorized by business dimension as follows and covering core data throughout the entire hotel operation process:
1. Operational metrics (6): average monthly occupancy rate, average room rate, number of room nights, store traffic, number of transactions, and operating costs;
2. Financial indicators (6): debt-to-equity ratio, sales profit margin, net profit, revenue growth rate, return on net assets, and gross profit margin;
3. Market-related indicators (3): OTA platform rating, surrounding amenities, and regional competitiveness ranking;
4. Risk prevention indicators (3): litigation status, credit default status, and financial and tax compliance.
[0025] The algorithm optimization module P200 is used to receive the dataset to be reconstructed, load a multi-algorithm fusion completion model to analyze the data distribution characteristics and extract core indicator features of the dataset, formulate algorithm matching rules, and output the optimal completion algorithm matched to the dataset to be reconstructed to the completion execution module. As shown in Figure 2, the multi-algorithm fusion completion model includes a GNN-KNN model and a multi-variant KNN model; the multi-variant KNN model includes three variants: original KNN, RBF kernel KNN, and Sigmoid kernel KNN. The GNN-KNN model converts the tabular data of the dataset to be reconstructed into a graph structure, where each row serves as a node and the standardized / normalized feature vector of the row serves as the feature of that node. Using KNN (k=5) on the Euclidean distance between feature vectors, 5 nearest-neighbor undirected edges are constructed for each node to explicitly capture the feature similarity between samples. A two-layer graph convolutional network (GCN) with normalization aggregates the topological information of the graph and extracts low-dimensional embedding features of the nodes. The K-nearest neighbor (KNN) query is then performed using Euclidean distance in the embedding space, and the Top-K nearest neighbor indices and corresponding distances of each sample are output for subsequent missing value completion. The original KNN variant standardizes the dataset to be reconstructed and then directly calculates the Euclidean distance; it is suitable for both normally distributed data and highly concentrated data.
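The graph construction step of the GNN-KNN model can be sketched as follows: each row is a node, and every node is linked to its k nearest rows by Euclidean distance. This is a minimal pure-Python illustration, not the patented implementation; the function name `knn_edges` is hypothetical, the input is assumed to be already standardized, and the GCN aggregation that follows in the text is not shown.

```python
import math
from typing import List, Set, Tuple


def knn_edges(features: List[List[float]], k: int = 5) -> Set[Tuple[int, int]]:
    """Build undirected k-nearest-neighbor edges between rows (nodes).

    Each edge is stored once as (smaller_index, larger_index), so a pair of
    mutual neighbors contributes a single undirected edge.
    """
    edges: Set[Tuple[int, int]] = set()
    for i, xi in enumerate(features):
        # Distances from node i to every other node, nearest first.
        neighbors = sorted(
            (math.dist(xi, xj), j) for j, xj in enumerate(features) if j != i
        )
        for _, j in neighbors[:k]:
            edges.add((min(i, j), max(i, j)))
    return edges


# Four one-dimensional nodes; with k=1 each node links to its closest neighbor.
example_edges = knn_edges([[0.0], [1.0], [2.0], [10.0]], k=1)
```

Because non-mutual neighbor pairs are not deduplicated, the number of undirected edges produced this way lies between k×N/2 and k×N for N nodes.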
The RBF kernel KNN variant optimizes the distance calculation through the RBF kernel function and is suited to skewed / heavy-tailed data with medium-to-weak correlation. The expression is:
$$K(x_i, x_j) = \exp\left(-\gamma \lVert x_i - x_j \rVert^2\right) \quad (1)$$
where $K(x_i, x_j)$ is the RBF kernel function, $\gamma$ is the kernel function parameter, $\lVert x_i - x_j \rVert^2$ is the square of the Euclidean distance between two samples, and $x_i$ and $x_j$ are two sample data points selected from the dataset to be reconstructed.
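A minimal sketch of an RBF-based distance is given below, assuming the standard RBF kernel form exp(-γ‖x_i - x_j‖²) with γ as the kernel function parameter. The source does not state how the kernel value is turned into a distance for the neighbor search; the kernel-induced distance sqrt(2 - 2K) used here is one common choice and is an assumption of this example.

```python
import math
from typing import Sequence


def squared_euclidean(a: Sequence[float], b: Sequence[float]) -> float:
    """Square of the Euclidean distance between two samples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def rbf_kernel(a: Sequence[float], b: Sequence[float], gamma: float = 0.1) -> float:
    """Standard RBF kernel; gamma=0.1 mirrors the default kernel parameter in the text."""
    return math.exp(-gamma * squared_euclidean(a, b))


def rbf_distance(a: Sequence[float], b: Sequence[float], gamma: float = 0.1) -> float:
    """Kernel-induced distance: sqrt(K(a,a) + K(b,b) - 2K(a,b)) = sqrt(2 - 2K(a,b)).

    Assumption: the source does not specify this conversion.
    """
    return math.sqrt(max(0.0, 2.0 - 2.0 * rbf_kernel(a, b, gamma)))


near = rbf_distance([0.0, 0.0], [1.0, 0.0])
far = rbf_distance([0.0, 0.0], [30.0, 40.0])
```

One property of this distance is that it saturates at sqrt(2) for far-apart samples, so a single extreme value cannot dominate the neighbor ranking the way it can with raw Euclidean distance.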
[0026] The Sigmoid kernel KNN variant calculates the distance after transforming the data with the Sigmoid function, and is suited to weakly correlated data; the expression for the Sigmoid function is:
$$\sigma(x) = \frac{1}{1 + e^{-x}} \quad (2)$$
where $x$ is the sample data of the dataset to be reconstructed.
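The Sigmoid variant can be sketched in the same style: each feature is passed through the standard logistic function 1 / (1 + e^(-x)) and the Euclidean distance is taken on the transformed values. The function names are hypothetical, and applying the transform element-wise to standardized features is an assumption, since the source only says the data is transformed before the distance is calculated.

```python
import math
from typing import Sequence


def sigmoid(x: float) -> float:
    """Logistic function, written to avoid overflow for large negative x."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def sigmoid_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Euclidean distance between two samples after an element-wise Sigmoid transform."""
    return math.dist([sigmoid(v) for v in a], [sigmoid(v) for v in b])


bounded = sigmoid_distance([0.0], [1000.0])
```

Since every transformed feature lies in [0, 1], the contribution of each feature to the distance is bounded, which limits the influence of extreme raw values.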
[0027] The data distribution characteristic analysis includes the following: The data distribution characteristics are analyzed using a dual-dimensional approach: confidence interval (CI) and highest density interval (HDI). A distribution coverage threshold is set, and the HDI / CI width ratio is calculated. The data distribution type is determined by the HDI / CI width ratio. In the hotel business scenario, the default distribution coverage threshold is 90%, which can be adjusted to 80% or 95% according to business needs. Where 0.8 ≤ HDI / CI width ratio ≤ 1.2 indicates normally distributed data; The HDI / CI width ratio > 1.2 indicates skewed / heavy-tailed distribution data; The HDI / CI width ratio <0.8 indicates highly concentrated data distribution.
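The distribution analysis can be approximated with a short sketch. The HDI is taken as the narrowest window containing the chosen share of the sorted samples. The source does not state how the CI is computed; this example assumes a normal-approximation interval (mean ± z·s), and a different CI definition would shift the ratio. All function names are hypothetical.

```python
import math
import random
import statistics
from typing import Sequence


def hdi_width(samples: Sequence[float], coverage: float = 0.90) -> float:
    """Width of the narrowest interval that contains `coverage` of the samples."""
    xs = sorted(samples)
    window = max(1, math.ceil(coverage * len(xs)))
    return min(xs[i + window - 1] - xs[i] for i in range(len(xs) - window + 1))


def ci_width(samples: Sequence[float], coverage: float = 0.90) -> float:
    """Width of mean ± z*s (assumed normal-approximation interval)."""
    z = statistics.NormalDist().inv_cdf(0.5 + coverage / 2.0)
    return 2.0 * z * statistics.stdev(samples)


def classify(ratio: float) -> str:
    """Map the HDI / CI width ratio to the distribution types used in the text."""
    if ratio > 1.2:
        return "skewed/heavy-tailed"
    if ratio < 0.8:
        return "highly concentrated"
    return "normal"


# Synthetic normally distributed sample, used only to exercise the functions.
random.seed(0)
normal_sample = [random.gauss(0.0, 1.0) for _ in range(5000)]
normal_ratio = hdi_width(normal_sample) / ci_width(normal_sample)
```

The `coverage` argument corresponds to the distribution coverage threshold, so passing 0.80 or 0.95 reproduces the alternative settings mentioned above.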
[0028] The core indicator features include missing rate (the proportion of missing values to the total number of samples), distribution characteristics (determined by the HDI / CI width ratio, where HDI is the highest density interval and CI is the confidence interval), correlation strength (the maximum absolute correlation coefficient between the target indicator and all other indicators), and dispersion (coefficient of variation = standard deviation / mean, which eliminates the influence of dimensions). The algorithm matching rules are formulated based on the data distribution characteristics. As shown in Figure 3, they include the following: For skewed / heavy-tailed data (HDI / CI width ratio > 1.2) in the dataset to be reconstructed, if the correlation strength > 0.4 (strong correlation), the GNN-KNN model is matched (mining nonlinear associations in the data); if 0.2 < correlation strength ≤ 0.4 (medium to weak correlation), the RBF kernel KNN model is matched; if the correlation strength ≤ 0.2 (weak correlation), the Sigmoid kernel KNN model is matched. For normally distributed data in the dataset to be reconstructed (0.8 ≤ HDI / CI width ratio ≤ 1.2), if the missing rate is ≤ 0.3, the original KNN model is matched (high efficiency and stable performance); if the missing rate is > 0.3, the GNN-KNN model is matched (higher completion accuracy). For highly concentrated data (HDI / CI width ratio < 0.8) in the dataset to be reconstructed, the original KNN model is matched (simple, efficient, and sufficient for the completion requirements). If the number of non-empty samples in the dataset to be reconstructed is less than 2×K, where K is the number of nearest neighbors, the GNN-KNN model is matched (it is robust and adapts to data with insufficient sample size).
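The matching rules above amount to a small decision function. The sketch below encodes them directly; the string labels and the function name are hypothetical, and checking the small-sample rule first is an assumption, since the source does not say which rule takes precedence when several apply.

```python
def match_algorithm(
    distribution: str,
    correlation: float,
    missing_rate: float,
    n_nonempty: int,
    k: int = 5,
) -> str:
    """Select a completion algorithm from the indicator features of one column."""
    # Small-sample rule (assumed to take precedence over the other rules).
    if n_nonempty < 2 * k:
        return "GNN-KNN"
    if distribution == "skewed/heavy-tailed":
        if correlation > 0.4:
            return "GNN-KNN"
        if correlation > 0.2:
            return "RBF-KNN"
        return "Sigmoid-KNN"
    if distribution == "normal":
        return "KNN" if missing_rate <= 0.3 else "GNN-KNN"
    if distribution == "highly concentrated":
        return "KNN"
    raise ValueError(f"unknown distribution type: {distribution!r}")


# Illustrative feature values; they are not figures from the source.
choice = match_algorithm("skewed/heavy-tailed", correlation=0.35, missing_rate=0.1, n_nonempty=5260)
```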
[0029] The algorithm optimization module includes a fallback completion mechanism. This mechanism works as follows: for data columns with unqualified multi-dimensional error evaluation results, a GNN-KNN model is matched, and the completion operation and error verification are re-executed until all multi-dimensional error evaluation results are qualified. (For example, if the mean relative error (MRE) in the multi-dimensional error evaluation exceeds a preset threshold of 0.15, a GNN-KNN model is matched and the data is re-completed to ensure the validity of the completion result.)
[0030] Typical indicator processing examples: To more clearly illustrate the differences in processing different feature indicators, the following table details the data distribution characteristics, automatic algorithm optimization process, and preprocessing details for three typical indicators, as shown in Table 1 below: Table 1
[0031] The completion execution module P300 is used to receive the matched optimal completion algorithm and perform adaptive parameter solving on it to achieve automated calibration, which eliminates the cumbersome process of traditional manual parameter adjustment and improves model execution efficiency; based on the calibrated algorithm, it performs the completion operation on the dataset to be reconstructed to generate a preliminary reconstructed dataset, and transmits it to the error verification module. The parameter adaptive solution sets differentiated parameter adaptive solution logic for the different models in the multi-algorithm fusion completion model. The adaptive parameter solving logic of the GNN-KNN model is to set the hidden layer dimension (default 64), output dimension (default 32), and nearest neighbor number K (default 5), and to dynamically adjust the number of undirected edges in the graph structure based on the number of samples (the more samples, the more edges, to keep the graph structure reasonable). The adaptive parameter solving logic of the multi-variant KNN model is to standardize the data (using StandardScaler), set the kernel function parameter of the RBF kernel KNN and the Sigmoid kernel KNN (default value 0.1), and dynamically fine-tune the kernel function parameter based on data dispersion (coefficient of variation): the larger the coefficient of variation, the smaller the parameter value, so that the distance calculation is more accurate.
[0032] Error assessment parameters: The default setting is a distribution range with 90% coverage, which can be flexibly adjusted according to business needs to adapt to error assessment requirements in different scenarios.
[0033] Typical indicator processing examples: The following details the adaptive parameter calibration, based on the distribution characteristics of the three typical indicators, to ensure that the parameters match the data characteristics of each indicator:
(1) Monthly average occupancy rate (matched to the GNN-KNN model): The number of nearest neighbors K=5 is set automatically, the hidden layer dimension of the GCN is 64, and the output dimension is 32. Based on the 5260 samples, the number of undirected edges is calculated to be 13150 (k×N / 2). No additional fine-tuning is required (the heavy-tailed distribution with strong correlation suits the default parameters, and the robustness of the model can cover the dispersion of the data).
(2) Average monthly operating revenue (matched to the RBF kernel KNN model): The kernel function parameter is set to 0.1 by default. Based on its coefficient of variation of 0.41 (medium to high dispersion), the value is fine-tuned to 0.09 to optimize the distance calculation accuracy for skewed data and reduce the impact of extreme values on the parameter.
(3) OTA platform rating (matched to the original KNN model): StandardScaler is used for standardization, with K=5 nearest neighbors. No additional parameter adjustment is required (the data is close to a normal distribution with low dispersion, so the default parameters meet the completion accuracy requirements).
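The arithmetic behind these calibration examples can be checked with a short sketch. The edge count k×N / 2 and the coefficient of variation follow the text. The source does not give the formula that turns a coefficient of variation of 0.41 into a kernel parameter of 0.09, so `tune_gamma` below is a purely hypothetical linear shrink rule whose `shrink` coefficient was chosen only so that this one example is reproduced; it should not be read as the patented calibration logic.

```python
import statistics
from typing import Sequence


def undirected_edge_count(k: int, n_samples: int) -> int:
    """Number of undirected edges as computed in the text: k * N / 2."""
    return k * n_samples // 2


def coefficient_of_variation(values: Sequence[float]) -> float:
    """Dispersion measure: standard deviation / mean."""
    return statistics.stdev(values) / statistics.mean(values)


def tune_gamma(cv: float, base_gamma: float = 0.1, shrink: float = 0.25) -> float:
    """Hypothetical rule: the larger the coefficient of variation, the smaller gamma.

    `shrink` is an assumed coefficient, not a value from the source.
    """
    return max(base_gamma * (1.0 - shrink * cv), 1e-6)


edges = undirected_edge_count(k=5, n_samples=5260)
gamma = round(tune_gamma(0.41), 2)
```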
[0034] The error verification module P400 is used to receive the preliminary reconstructed dataset, perform a multi-dimensional error comprehensive evaluation on the preliminary reconstructed dataset, and generate a multi-dimensional error comprehensive evaluation result; when the multi-dimensional error comprehensive evaluation result is qualified, the corresponding preliminary reconstructed dataset is output as the final reconstructed dataset to the result output module. The multi-dimensional error comprehensive assessment is based on a multi-dimensional error assessment system which, as shown in Figure 4, includes distribution consistency assessment, threshold proportion assessment, error quantification assessment, and level matching assessment.
[0035] The distribution consistency assessment involves calculating the HDI / CI width ratio of the completed data and comparing it with the HDI / CI width ratio of the original data to verify the consistency of the distribution between the completed data and the original data, ensuring that the completion result conforms to the inherent pattern of the data (the ratio between the two is between 0.9 and 1.1, which is considered to be consistent distribution). The threshold proportion assessment involves setting key thresholds for each indicator based on business needs (e.g., the threshold for the hotel's "average monthly occupancy rate" is [0.6, 0.8]), statistically analyzing the proportion of samples within the key threshold range in the supplementary data, verifying the rationality of the distribution of the supplementary data under the key thresholds, and ensuring that it meets business expectations. The error quantification assessment involves calculating the mean absolute error (MAE) and mean relative error (MRE) of the preliminary reconstructed dataset, setting an error threshold (default MRE≤0.15), filtering the preliminary reconstructed datasets whose errors meet the requirements, removing data columns whose errors do not meet the requirements, and triggering a fallback completion mechanism. The grade matching assessment: For rating-based indicators (such as hotel financial health check scores, OTA ratings), preset grade classification standards (excellent, normal, warning, risk) are set, and the grade matching degree between the supplemented score and the original score is verified to ensure that the supplemented result conforms to the business grade judgment logic (grade matching degree ≥90%, judged as qualified).
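The error quantification assessment can be sketched as follows. This is a minimal illustration that assumes the true values of the completed entries are available for comparison (for example, values hidden on purpose before completion); the function names are hypothetical and the sample values are made up for the demonstration.

```python
def mean_absolute_error(actual, completed):
    """MAE: average of |actual - completed|."""
    return sum(abs(a - c) for a, c in zip(actual, completed)) / len(actual)


def mean_relative_error(actual, completed):
    """MRE: average of |actual - completed| / |actual|.

    Entries whose actual value is 0 are skipped (an assumption made here to
    avoid division by zero).
    """
    terms = [abs(a - c) / abs(a) for a, c in zip(actual, completed) if a != 0]
    return sum(terms) / len(terms)


def error_qualified(actual, completed, mre_limit=0.15):
    """Apply the default MRE <= 0.15 criterion from the text."""
    return mean_relative_error(actual, completed) <= mre_limit


# Made-up occupancy values, for illustration only.
actual = [0.70, 0.80, 0.60, 0.75]
completed = [0.72, 0.76, 0.66, 0.75]
print(round(mean_absolute_error(actual, completed), 4))
print(round(mean_relative_error(actual, completed), 4))
print(error_qualified(actual, completed))
```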
[0036] The rules for the comprehensive evaluation of multi-dimensional errors are as follows: When all evaluation results of the multi-dimensional error evaluation system are qualified, the comprehensive multi-dimensional error evaluation result is qualified; when any evaluation result of the multi-dimensional error evaluation system is unqualified, the comprehensive multi-dimensional error evaluation result is unqualified, triggering a fallback completion mechanism to re-match the completion algorithm, perform completion operation and error verification on the unqualified data column until all multi-dimensional error evaluation results are qualified.
[0037] The method for determining whether the evaluation results of the multi-dimensional error evaluation system are qualified is as follows: Set multi-dimensional error assessment thresholds, including distribution consistency assessment threshold, threshold proportion assessment threshold, error quantification assessment threshold, and level matching assessment threshold; When the multi-dimensional error evaluation result of the preliminary reconstructed dataset exceeds the multi-dimensional error evaluation threshold, it is judged as unqualified, and the fallback completion mechanism is triggered. When the multi-dimensional error evaluation result of the preliminary reconstructed dataset does not exceed the multi-dimensional error evaluation threshold, it is judged as qualified, and the corresponding preliminary reconstructed dataset is output.
[0038] Typical indicator processing examples: The completion results of the three typical indicators were evaluated from multiple dimensions. The specific results are shown in Table 2, and all of them meet the evaluation criteria. Table 2
[0039] The result output module P500 receives the final reconstructed dataset and generates an evaluation report based on it. The evaluation report includes an overall overview of the reconstruction process, multi-dimensional error evaluation results for each indicator, outlier identification records, iterative optimization process, and the final reconstructed dataset. The evaluation report supports traceability and verification of the completion process. The reconstructed dataset and evaluation report after missing data completion can support subsequent data analysis systems and business decision-making processes.
[0040] For multi-source operational data of hotels, data integrity reconstruction is performed according to the multi-source data integrity reconstruction model based on multi-algorithm fusion provided by this invention. A specific example is as follows: In the data preprocessing module, multi-source hotel data is integrated, preprocessing is completed, and 18 core indicators are finally determined to form a dataset to be reconstructed.
[0041] Data integration: Load the multi-source data files, including hotel basic data (hotel name, property type, years of operation, etc.), input data items (financial data, operational data, market capability data, risk prevention data), and reference data (operating data of similar hotels in the same area). After merging the datasets, remove redundant columns such as "health check time" and "health check number", delete the derived column "cumulative operating revenue", which is linearly related to "monthly operating revenue", and retain 20 core business indicators. Data conversion: Calculate derived indicators, including "Average Monthly Revenue for the Current Period = Cumulative Revenue for the Current Period / Cumulative Months of Revenue for the Current Period", "Fixed Costs = Monthly Labor Costs + Monthly Rent + Amortization of Renovation Costs + Monthly Energy Costs × 0.4 + Monthly Maintenance Costs", and "Variable Costs = Monthly Energy Costs × 0.6 + Monthly Material Costs"; apply one-hot encoding to "Hotel Type" (Business Hotel, Resort Hotel, Budget Hotel) and numerical mapping to "Property Type" (Owned Property = 0, Leased Property = 1), and unify the data format. Outlier identification: Using the IQR method, extreme values of indicators such as "monthly sales revenue" and "total assets" are removed. For example, for the "monthly sales revenue" indicator with Q1 = 500,000 yuan, Q3 = 1,200,000 yuan, and IQR = 700,000 yuan, values below 500,000 - 1.5 × 700,000 = -550,000 yuan or above 1,200,000 + 1.5 × 700,000 = 2,250,000 yuan are identified as extreme values and marked. Simultaneously, an automated script scans the preprocessed dataset column by column to identify typical missing value types such as null values, empty strings, and industry-standard invalid identifiers (such as NA, NaN, -999, NULL).
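The IQR bounds and the missing-value scan described above can be sketched as follows. The worked numbers are the ones given in the text; the helper names and the exact treatment of each marker (for instance, treating the numeric value -999 as missing) are assumptions.

```python
import math

# Invalid identifiers listed in the text, in their string form.
INVALID_MARKERS = {"", "NA", "NaN", "NULL", "-999"}


def iqr_bounds(q1: float, q3: float, factor: float = 1.5):
    """Lower and upper fences: Q1 - 1.5 * IQR and Q3 + 1.5 * IQR."""
    iqr = q3 - q1
    return q1 - factor * iqr, q3 + factor * iqr


def is_missing(value) -> bool:
    """True for None, NaN, the numeric sentinel -999, or a listed marker."""
    if value is None:
        return True
    if isinstance(value, float) and math.isnan(value):
        return True
    if isinstance(value, (int, float)):
        return value == -999
    return str(value).strip() in INVALID_MARKERS


low, high = iqr_bounds(500_000, 1_200_000)
print(low, high)  # -550000.0 2250000.0
print([is_missing(v) for v in [0.75, "NULL", -999, None, " "]])
```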
Missing values for each indicator are batch-marked, the number of missing values for each indicator is automatically counted, and the missing rate is calculated (missing rate = number of missing values for a single indicator / total number of samples in the dataset). Valid indicator columns with a missing rate ≤ 0.5 are selected. Data cleaning: 12 invalid data rows in which all key indicators such as "monthly operating revenue" and "fixed costs" were empty were removed. The categories "<1km", "1-2km", and ">2km" of the "distance from business district" field were converted to 500 m, 1500 m, and 2500 m, respectively. All numerical indicators were standardized, resulting in 5260 valid samples containing 18 core indicators. The preprocessing details for typical indicators are as follows: (1) Monthly average occupancy rate: MinMaxScaler was used to normalize to the [0,1] interval, and extreme invalid values <0.2 (such as 0 occupancy rate caused by temporary business closure during the epidemic) were removed, while retaining the heavy-tailed distribution characteristics; (2) Average monthly revenue: The impact of extreme revenue values is compressed by a logarithmic transformation (ln(x+1)), and StandardScaler is then used to standardize the data and eliminate differences in scale, to suit the distance calculation of the subsequent RBF kernel KNN model; (3) OTA platform rating: StandardScaler standardization is adopted directly without additional distribution correction, and the original distribution trend of 1-5 points is retained to ensure that the rating completion results are consistent with the actual service quality.
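A minimal sketch of three of the cleaning steps above: the missing-rate filter, the distance-category mapping, and the ln(x + 1) compression with its inverse. Missing entries are assumed to have already been marked as `None`; the column names and sample values are made up.

```python
import math

# Category-to-meters mapping as stated in the text.
DISTANCE_METERS = {"<1km": 500, "1-2km": 1500, ">2km": 2500}


def missing_rate(column) -> float:
    """Number of missing values / total number of samples."""
    return sum(v is None for v in column) / len(column)


def select_valid_columns(table: dict, max_rate: float = 0.5):
    """Keep indicator columns whose missing rate is <= max_rate."""
    return [name for name, col in table.items()
            if missing_rate(col) <= max_rate]


def log_compress(x: float) -> float:
    """ln(x + 1), used to damp extreme revenue values."""
    return math.log1p(x)


def log_restore(y: float) -> float:
    """Inverse of log_compress: exp(y) - 1."""
    return math.expm1(y)


table = {
    "occupancy": [0.70, None, 0.80, 0.75],   # missing rate 0.25
    "sparse_metric": [None, None, None, 1.0],  # missing rate 0.75
}
print(select_valid_columns(table))  # ['occupancy']
print([DISTANCE_METERS[c] for c in ["<1km", ">2km"]])  # [500, 2500]
```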
[0042] In the algorithm optimization module, the multi-algorithm fusion completion model is loaded, the data features of the 18 core indicators are extracted, and automatic algorithm optimization is completed based on preset rules. A fallback verification mechanism is triggered to ensure that each indicator is matched with the optimal completion algorithm, as detailed below. Multi-algorithm fusion completion model: (1) GNN-KNN model: The graph structure is constructed using PyTorch Geometric. The 5260 samples of the dataset to be reconstructed are used as 5260 nodes, and the 18 indicator features of each sample are used as the feature vector of the corresponding node. 13150 undirected edges are randomly generated to construct the graph network. A two-layer GCN model is constructed: the first layer has an input dimension of 18 (the number of indicators) and a hidden layer dimension of 64; the second layer has an input dimension of 64 and an output dimension of 32. The GCN model is trained to extract the low-dimensional embedding features of the nodes. Based on the Euclidean distance in the embedding space, the Top-5 nearest neighbors of each sample and their corresponding distances are calculated for missing value completion.
(2) Multi-variant KNN model: StandardScaler is used to standardize the data. The original KNN directly calculates the Euclidean distance on the standardized data; the RBF kernel KNN sets the kernel function parameter to 0.1 and calculates the nearest neighbors after the distance is transformed by the RBF kernel function; the Sigmoid kernel KNN clips the data to the range [-5, 5] and calculates the Euclidean distance after transformation by the Sigmoid function. Automatic algorithm selection: Distribution analysis was performed on each of the 18 core indicators, the HDI / CI width ratio was calculated, and the distribution type of each indicator was determined to provide a basis for algorithm matching and parameter adjustment. The operational indicators were mostly skewed or heavy-tailed, the financial indicators were mainly skewed, and the market and risk prevention indicators were mostly close to normal. For each indicator, the missing rate, HDI / CI width ratio, correlation strength, and coefficient of variation were extracted, the optimal algorithm was matched against the selection rules, and the fallback verification was completed to determine the final completion algorithm.
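The three KNN variants can be sketched with plain Python as below. Several choices here are assumptions, not details from the text: the RBF variant uses the dissimilarity 1 - exp(-γ d²) (half the squared kernel-induced distance), the Sigmoid variant squashes each clipped coordinate before taking the Euclidean distance, and the completed value is an inverse-distance weighted mean of the K nearest donors.

```python
import math


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def rbf_distance(a, b, gamma=0.1):
    """Assumed form: 1 - exp(-gamma * d^2), which lies in [0, 1)."""
    return 1.0 - math.exp(-gamma * euclidean(a, b) ** 2)


def sigmoid_distance(a, b, clip=5.0):
    """Clip each coordinate to [-clip, clip], apply the sigmoid, then
    take the Euclidean distance."""
    def squash(v):
        v = max(-clip, min(clip, v))
        return 1.0 / (1.0 + math.exp(-v))
    return euclidean([squash(x) for x in a], [squash(y) for y in b])


def knn_impute(query, donors, k=5, distance=euclidean, eps=1e-9):
    """Fill one missing value from the k nearest donors.

    donors: list of (feature_vector, known_value) pairs.
    Weights are 1 / (distance + eps), an assumed weighting scheme.
    """
    ranked = sorted(donors, key=lambda d: distance(query, d[0]))[:k]
    weights = [1.0 / (distance(query, f) + eps) for f, _ in ranked]
    total = sum(w * v for w, (_, v) in zip(weights, ranked))
    return total / sum(weights)


# Made-up standardized feature vectors and known values.
donors = [((0.0, 0.0), 10.0), ((1.0, 0.0), 20.0), ((5.0, 5.0), 100.0)]
print(knn_impute((0.5, 0.0), donors, k=2))
print(knn_impute((0.2, 0.0), donors, k=2, distance=rbf_distance))
```

Because the assumed RBF transform is monotone in the Euclidean distance, it changes the neighbor weights rather than the neighbor order in this sketch.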
[0043] Details of typical indicator algorithm optimization: (1) Monthly average occupancy rate: Core features were extracted (missing rate 0.12, HDI / CI width ratio 1.3, correlation strength 0.45, coefficient of variation 0.23), and the indicator was judged to be "heavy-tailed distribution + strong correlation". The GNN-KNN model was automatically selected. After completion, MRE = 0.12 < 0.15, so it was determined to be the optimal algorithm. (2) Average monthly operating revenue: Core features were extracted (missing rate 0.21, HDI / CI width ratio 1.15, correlation strength 0.33, coefficient of variation 0.41), and the indicator was judged to be "skewed distribution + moderate correlation". The RBF kernel KNN model was automatically selected. After completion, MRE = 0.13 < 0.15, so it was determined to be the optimal algorithm. (3) OTA platform rating: Core features were extracted (missing rate 0.09, HDI / CI width ratio 0.95, correlation strength 0.28, coefficient of variation 0.18), and the indicator was judged to be "close to normal distribution + low missing rate". The original KNN model was automatically selected. After completion, MRE = 0.08 < 0.15, so it was determined to be the optimal algorithm.
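The feature-to-algorithm matching for the three typical indicators can be illustrated with a small rule function. The cut-off values (1.2 and 1.05 for the HDI / CI width ratio, 0.4 for correlation strength) are hypothetical and chosen only so that the three feature sets quoted above map to the algorithms the text reports; the selection rules actually used are not restated in this section, and the Sigmoid kernel KNN branch is omitted.

```python
from dataclasses import dataclass


@dataclass
class IndicatorFeatures:
    missing_rate: float
    hdi_ci_ratio: float
    correlation: float
    cv: float  # coefficient of variation


def select_algorithm(f: IndicatorFeatures) -> str:
    """Hypothetical matching rules (cut-offs are assumptions)."""
    if f.hdi_ci_ratio >= 1.2 and f.correlation >= 0.4:
        return "GNN-KNN"          # heavy-tailed + strong correlation
    if f.hdi_ci_ratio >= 1.05:
        return "RBF kernel KNN"   # skewed distribution
    return "original KNN"         # close to normal distribution


indicators = {
    "monthly average occupancy rate": IndicatorFeatures(0.12, 1.30, 0.45, 0.23),
    "average monthly operating revenue": IndicatorFeatures(0.21, 1.15, 0.33, 0.41),
    "OTA platform rating": IndicatorFeatures(0.09, 0.95, 0.28, 0.18),
}
for name, features in indicators.items():
    print(name, "->", select_algorithm(features))
```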
[0044] In the completion execution module, based on the distribution characteristics of each indicator and the adaptation algorithm, the model parameter calibration and fine-tuning are automatically completed without manual intervention, as detailed below: Automatic parameter calibration: Based on the adaptation algorithm of each indicator, the parameters of the GNN-KNN model and the multi-variant KNN model are adaptively adjusted to ensure that the parameters are accurately matched with the data features; Typical indicator parameter calibration details: (1) Monthly average occupancy rate (adapted to GNN-KNN model): The number of nearest neighbors K is automatically set to 5, the hidden layer dimension of GCN is 64, the output dimension is 32, and the number of undirected edges is calculated to be 13150 based on the number of 5260 samples. No additional fine-tuning is required. (2) Average monthly operating revenue (adapted to RBF kernel KNN model): The kernel function parameter is set to 0.1 by default. Based on its coefficient of variation of 0.41 (medium to high dispersion), the value is finely adjusted to 0.09 to optimize the distance calculation accuracy of skewed data. (3) OTA platform rating (adapted to the original KNN model): StandardScaler is used for standardization, with K=5 nearest neighbors. No additional parameter adjustment is required, and the default parameters can meet the completion accuracy requirements.
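The calibrated GNN-KNN configuration (18 input features, hidden dimension 64, output dimension 32, K = 5, K × N / 2 random undirected edges) can be sketched in NumPy as below. This is a simplified stand-in, not the PyTorch Geometric implementation named in the text: training is omitted and the weights are random, a dense adjacency matrix is used, and the demo runs on 40 synthetic nodes rather than 5260 samples.

```python
import numpy as np


def random_undirected_edges(n_nodes, n_edges, rng):
    """Sample distinct undirected edges (i < j) uniformly at random."""
    edges = set()
    while len(edges) < n_edges:
        i, j = (int(v) for v in rng.integers(0, n_nodes, size=2))
        if i != j:
            edges.add((min(i, j), max(i, j)))
    return sorted(edges)


def normalized_adjacency(n_nodes, edges):
    """Symmetric GCN normalization D^-1/2 (A + I) D^-1/2."""
    a = np.eye(n_nodes)  # self-loops
    for i, j in edges:
        a[i, j] = a[j, i] = 1.0
    d_inv_sqrt = 1.0 / np.sqrt(a.sum(axis=1))
    return a * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]


def gcn_embed(x, a_hat, w1, w2):
    """Two-layer GCN forward pass: ReLU after layer 1, linear layer 2."""
    h = np.maximum(a_hat @ x @ w1, 0.0)
    return a_hat @ h @ w2


def top_k_neighbors(z, k=5):
    """Indices and Euclidean distances of each node's k nearest nodes."""
    d = np.linalg.norm(z[:, None, :] - z[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)  # exclude the node itself
    idx = np.argsort(d, axis=1)[:, :k]
    return idx, np.take_along_axis(d, idx, axis=1)


rng = np.random.default_rng(0)
n, k = 40, 5                                  # small synthetic demo
x = rng.normal(size=(n, 18))                  # 18 indicator features
edges = random_undirected_edges(n, k * n // 2, rng)
a_hat = normalized_adjacency(n, edges)
w1 = rng.normal(scale=0.1, size=(18, 64))     # untrained weights
w2 = rng.normal(scale=0.1, size=(64, 32))
z = gcn_embed(x, a_hat, w1, w2)
idx, dist = top_k_neighbors(z, k)
print(z.shape, idx.shape)  # (40, 32) (40, 5)
```

The neighbor indices and distances would then feed a KNN completion step, for example a distance-weighted mean of the neighbors' known values.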
[0045] In the error verification module, a multi-dimensional error evaluation system is adopted to evaluate the completion results of 18 core indicators one by one, including distribution consistency, threshold proportion, error quantification, and level matching. Data columns with excessive errors are screened and optimization is triggered. The entire process is automated using Python open-source tools, without manual intervention, as detailed below: 1. Pre-assessment preparation: Based on hotel industry operating standards and historical data patterns, differentiated key thresholds and grading standards were set for each of the 18 core indicators, forming a standardized assessment template (adaptable to various industry scenarios); an automated assessment script was built, relying on SciPy and NumPy tools to achieve batch calculation of parameters such as HDI / CI width ratio, MAE, and MRE, improving assessment efficiency and avoiding errors from manual calculation. 2. Implementation of itemized assessments: (1) Distribution consistency assessment: For each core indicator, calculate HDI and CI at 90% coverage of the original data and the supplementary data respectively. The distribution consistency is determined by the width ratio of the two. A ratio between 0.9 and 1.1 is considered qualified, indicating that the supplementary data retains the distribution characteristics of the original data. If the ratio exceeds this range, it is marked as a distribution anomaly, triggering the subsequent optimization process.
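The distribution consistency check can be sketched as follows. The HDI is computed as the narrowest window holding the requested share of the sorted sample. The CI definition is an assumption: it is taken here as the normal-approximation interval mean ± z·std, because this section does not restate how the CI is derived. The final check compares the width ratio of the completed data with that of the original data against the 0.9 to 1.1 range.

```python
import math
import statistics


def hdi_width(values, coverage=0.90):
    """Width of the narrowest window containing `coverage` of the sample."""
    data = sorted(values)
    n = len(data)
    m = max(1, math.ceil(coverage * n))  # number of points in the window
    return min(data[i + m - 1] - data[i] for i in range(n - m + 1))


def ci_width(values, coverage=0.90):
    """Assumed CI: width of mean +/- z * std under a normal approximation."""
    z = statistics.NormalDist().inv_cdf(0.5 + coverage / 2.0)
    return 2.0 * z * statistics.stdev(values)


def width_ratio(values, coverage=0.90):
    """HDI / CI width ratio of one data column."""
    return hdi_width(values, coverage) / ci_width(values, coverage)


def distribution_consistent(ratio_completed, ratio_original,
                            low=0.9, high=1.1):
    """Qualified if completed / original width ratio lies in [low, high]."""
    return low <= ratio_completed / ratio_original <= high


# Width ratios quoted in the text for the occupancy-rate indicator.
print(round(1.28 / 1.3, 2))                 # 0.98
print(distribution_consistent(1.28, 1.3))   # True
```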
[0046] Figure 5 compares the data distributions of the three typical indicators before and after completion, showing both the differences and the consistency between them. Before completion, the monthly average occupancy rate data exhibited a normalized heavy-tailed distribution, with the majority concentrated in the 0.6-0.85 range and 5% of extreme values distributed in the 0.05-0.15 range. After completion, the heavy-tailed characteristic was retained and the extreme value bias was corrected. Before completion, the monthly average operating revenue data exhibited a skewed distribution, with the majority concentrated in the 500,000-1,500,000 RMB range and 15% of high-end hotels having extreme revenue values in the 2,000,000-3,500,000 RMB range. After completion, the skewed distribution was maintained while the interference from extreme values was reduced. Before completion, the OTA platform rating data was close to a normal distribution, concentrated in the 4-4.5 score range. After completion, the normal distribution trend did not deviate, which is consistent with the actual service quality rating patterns of hotels. The ratios between the HDI / CI width ratio of the completed data and that of the original data for the three indicators were 0.98, 0.97, and 0.99, respectively, all within the acceptable range of 0.9 to 1.1. This verifies the distribution consistency of the completion results and confirms the rationality of the automatic algorithm optimization: the completion algorithm matched to each indicator follows the data distribution characteristics and avoids distribution distortion during completion.
[0047] (2) Threshold proportion assessment: Based on industry experience, key thresholds are set for each indicator (for example, a reasonable operating range for operational indicators and industry profitability levels for financial indicators), and the percentage of samples in the completed data that fall within the key threshold range is calculated. A percentage of ≥60% is considered qualified (adjustable as needed), ensuring that the completed data aligns with actual business requirements. (3) Error quantification assessment: Two error indices are judged jointly. The mean absolute error (MAE) measures absolute deviation, and the mean relative error (MRE) measures relative deviation. MRE ≤ 0.15 is the core qualification standard, and the MAE limit must match the numerical magnitude of each indicator (e.g., MAE ≤ 50,000 RMB for revenue-related indicators, MAE ≤ 0.2 points for rating-related indicators). If either error index fails to meet its standard, the error is judged as excessive, triggering the fallback completion mechanism. A comparison of the MRE of the indicators under different algorithms, shown in Figure 6, also supports the rationality of the optimal algorithm matching.
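The threshold proportion check and the joint MAE/MRE judgement can be sketched as below. The function names and the sample occupancy values are illustrative assumptions; the limits (60%, MRE ≤ 0.15, MAE ≤ 50,000 RMB for revenue) and the revenue figures in the last call are the ones quoted in the text.

```python
def threshold_proportion(values, low, high):
    """Share of values that fall inside the key threshold range [low, high]."""
    return sum(low <= v <= high for v in values) / len(values)


def threshold_qualified(values, low, high, min_share=0.60):
    """Qualified if at least `min_share` of the values are in range."""
    return threshold_proportion(values, low, high) >= min_share


def error_within_limits(mae, mre, mae_limit, mre_limit=0.15):
    """Joint judgement: both error indices must meet their limits."""
    return mae <= mae_limit and mre <= mre_limit


# Made-up completed occupancy rates checked against the [0.6, 0.8] range.
completed = [0.62, 0.71, 0.79, 0.85, 0.66]
print(threshold_proportion(completed, 0.6, 0.8))   # 0.8
print(threshold_qualified(completed, 0.6, 0.8))    # True

# Revenue figures quoted in the text: MAE = 42,000 RMB, MRE = 0.13.
print(error_within_limits(42_000, 0.13, mae_limit=50_000))  # True
```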
[0048] (4) Level matching assessment: For rating-type and interval-type indicators, four levels of classification standards are preset: excellent, normal, warning, and risk. The supplementary data and the original data are classified into the corresponding levels respectively. The level matching degree (number of matching samples / total number of samples) is calculated. The level matching degree is ≥90% as qualified, ensuring that the supplementary results conform to the business level judgment logic and support subsequent business decisions. 3. Evaluation Result Processing: Summarize the four evaluation results of 18 core indicators, generate an evaluation summary table, and clearly mark whether each indicator is qualified, abnormal items and the reasons for the abnormality; for unqualified indicators, record the error data and distribution deviation details, and simultaneously trigger the fallback completion mechanism to re-complete the indicators using the GNN-KNN model until all indicators meet the qualification standards. Typical indicator error assessment details: (1) Monthly average occupancy rate: In the distribution consistency assessment, the original data HDI / CI width ratio was 1.3, the completed data was 1.28, the ratio was 0.98 (in the qualified range of 0.9~1.1), and the distribution characteristics were completely preserved; the threshold was set to [0.6, 0.8] (the reasonable occupancy rate range for hotels), and the proportion of samples in the completed data that were in this range was 88% (≥60%); in the error quantification, MAE=0.03 (fitting the occupancy rate level) and MRE=0.12 (≤0.15); the level matching degree was 92% (≥90%), all four assessments were qualified, and the completion results were in line with the hotel's peak and off-peak season operation patterns. 
(2) Average monthly operating revenue: The distribution consistency ratio is 0.97 (the HDI / CI width ratio is 1.12 for the completed data and 1.15 for the original data), and the distribution deviation meets the requirements; the key threshold is set at [800,000, 1,500,000] yuan (the reasonable revenue range for most hotels), and the proportion of samples within the threshold is 65%; in the error quantification, MAE = 42,000 yuan (≤ 50,000 yuan) and MRE = 0.13 (≤ 0.15); the level matching degree is 89%, which is close to the 90% qualification standard and, combined with the characteristics of financial data, is judged to be qualified. The completed data effectively avoids the interference of extreme revenue values of high-end hotels and conforms to the industry revenue distribution pattern. (3) OTA platform rating: The distribution consistency ratio is 0.99 (the HDI / CI width ratio is 0.94 for the completed data and 0.95 for the original data), and the distribution characteristics show almost no deviation; the key threshold is set at [4.0, 4.5] points (the rating range for excellent hotel service), and the proportion of samples within the threshold is 76%; in the error quantification, MAE = 0.15 points (≤ 0.2 points) and MRE = 0.08 (≤ 0.15); the level matching degree is 94%. All four assessments are qualified, the completed ratings match the actual service quality of the hotel well, and there are no abnormal deviation values.
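The level matching degree (number of matching samples / total number of samples) can be sketched as follows. The four level names come from the text; the cut points used to classify a 1-5 rating and the sample ratings are hypothetical.

```python
def grade(score, cuts=(4.5, 4.0, 3.0)):
    """Map a 1-5 rating to a level (cut points are assumptions)."""
    if score >= cuts[0]:
        return "excellent"
    if score >= cuts[1]:
        return "normal"
    if score >= cuts[2]:
        return "warning"
    return "risk"


def level_matching_degree(original, completed):
    """Share of samples whose level is the same before and after completion."""
    matches = sum(grade(o) == grade(c) for o, c in zip(original, completed))
    return matches / len(original)


def level_qualified(original, completed, min_degree=0.90):
    return level_matching_degree(original, completed) >= min_degree


# Made-up ratings: the fourth sample drops from "normal" to "warning".
original = [4.6, 4.2, 3.5, 4.4, 2.8]
completed = [4.5, 4.1, 3.9, 3.9, 2.9]
print(level_matching_degree(original, completed))  # 0.8
print(level_qualified(original, completed))        # False
```

Applied strictly, the 0.90 limit in this sketch would not pass a matching degree of 0.89; the text treats that revenue case as qualified by judgement rather than by the numeric rule alone.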
[0049] After the aforementioned modules complete the reconstruction of the 18 core indicators across 5260 samples, the result output module outputs a standardized dataset and a traceable evaluation report, balancing data usability and business adaptability. The dataset can be directly integrated into the enterprise's subsequent data analysis and decision-making systems, as detailed below: (1) Dataset output: The reconstructed dataset, with missing data filled in, is saved. It contains 5260 valid samples and 18 core indicators. A data dictionary is generated at the same time to record the definition, calculation logic, completion algorithm, and data source of each indicator, which facilitates subsequent data traceability and reuse. The dataset can be consumed by data analysis tools such as pandas and Tableau and is compatible with the enterprise's existing data processing workflow. (2) Evaluation report output: A detailed integrity reconstruction evaluation report is generated, including an overall execution overview (sample size, number of indicators, completion rate, number of outliers corrected), detailed evaluation results for each indicator (optimal algorithm, number of missing values, MAE, MRE, distribution consistency ratio, level matching degree), outlier handling records, the iterative optimization process, and final conclusions, providing quantitative support for data reliability and assisting enterprise business decision-making, as shown in the example in Table 3:
[0050] Table 3 (3) Archiving and retention: The reconstructed dataset, evaluation report, algorithm parameter configuration file, and automation script are archived together to form a complete implementation record, which facilitates subsequent model iteration, parameter adjustment, and reuse in multiple industry scenarios, and reduces the cost of secondary development and deployment. Details of typical indicator results: (1) Monthly average occupancy rate: The GNN-KNN model was used to complete the data and correct outliers. The data passed the S40 multi-dimensional evaluation and no secondary optimization was required. The final completed data retained the heavy-tailed distribution characteristics, with the peak concentrated between 70% and 85%. The proportion of extremely low occupancy rates was kept within 5%, which is consistent with the hotel's peak and off-peak season operation patterns, and there were no abnormal deviations. After the normalization is reversed, the data return to the original value range (0-100%) and can be directly used for hotel operation scheduling, marketing strategy formulation, and other scenarios. (2) Average monthly operating revenue: The RBF kernel KNN model was used to complete the data, and the evaluation result was qualified. After the inverse logarithmic transformation, the completed data were restored to the original revenue level. Most samples were concentrated in a reasonable range of RMB 500,000 to RMB 1,500,000, with the peak revenue of high-end hotels kept at around RMB 3,000,000, effectively avoiding the interference of extreme revenue values with the overall data distribution. The completed data can be directly used for business scenarios such as financial accounting, cost control, and revenue forecasting, supporting financial decision-making. (3) OTA platform rating: The original KNN model was used to complete the data.
The completed rating was restored to the original range of 1-5 points, which is concentrated in the high-quality service range of 4-4.5 points. There were no abnormal ratings below 3 points. The rating is highly consistent with the actual service quality of the hotel. The completed data can be used for service quality assessment and customer satisfaction analysis, providing data support for service optimization. Validation of closed-loop effectiveness: Through this closed-loop reconstruction, the missing rate of 18 core indicators of 5260 samples was reduced to 0, the outlier correction qualification rate was 100%, the MRE after all indicators were completed was ≤0.15, the distribution consistency ratio was in the range of 0.9~1.1, and the grade matching degree was ≥89%, which fully verified the effectiveness and stability of the data integrity closed-loop reconstruction process of the present invention.
[0051] This invention, based on hotel operating data, describes the implementation process of the proposed multi-source data integrity reconstruction model. It selects three typical indicators: average monthly occupancy rate, average monthly revenue, and OTA platform ratings, covering three core data distribution types (heavy-tailed, skewed, and near-normal) and three business dimensions (operations, finance, and marketing). The algorithm optimization, adaptive parameter solving, multi-dimensional error assessment, and closed-loop reconstruction processing logic strictly follow the overall technical solution of this invention, completely replicating the entire indicator processing flow. For the remaining 15 core indicators, accurate reconstruction can be achieved simply by matching the optimal algorithm based on their own distribution characteristics, missing rate, and other core features, referring to the typical indicator process and completing calibration verification. This fully demonstrates the universal applicability of the method. Furthermore, the multi-source data integrity reconstruction model based on multi-algorithm fusion provided by this invention has strong cross-industry adaptability. It requires no reconstruction of the core framework; only adjustments to details such as derived indicator calculation and threshold setting are needed. It can be quickly applied to data integrity reconstruction in multiple industries such as catering and transportation, significantly improving the efficiency of model engineering applications and scenario adaptability, demonstrating outstanding practicality and versatility.
[0052] The multi-source data integrity reconstruction model based on multi-algorithm fusion provided by this invention integrates GNN-KNN and multi-variant KNN algorithms. Through an automatic algorithm optimization mechanism, it achieves accurate matching between data with different distribution characteristics (normal, skewed, heavy-tailed, highly concentrated) and the optimal completion algorithm, breaking through the limitations of traditional single algorithms. The completion results are more in line with the inherent laws of the data and business logic. Actual testing shows that the completion error is reduced by an average of more than 30%. The test data consists of 5260 hotel operation samples and 18 core indicators in the example, with a uniform missing rate of 0.1~0.5 (simulating the scenario of missing business data). The error calculation is based on the deviation between the completion result and the actual data. The error comparison between the multi-algorithm fusion completion model used in this invention and traditional completion methods (mean imputation, single KNN) is shown in Table 4, demonstrating significant verification results.
[0053] Table 4
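The comparison protocol described in paragraph [0052] (hide known values, complete them, and measure the deviation from the actual data) can be sketched as below. Only the mean-imputation baseline is implemented; the masking helper, the seed, and the two error values passed to `relative_reduction` are illustrative assumptions and are not results from the patent.

```python
import random


def mask_values(values, rate, seed=0):
    """Hide a fraction `rate` of entries; return (masked copy, hidden indices)."""
    rng = random.Random(seed)
    hidden = sorted(rng.sample(range(len(values)), round(rate * len(values))))
    masked = list(values)
    for i in hidden:
        masked[i] = None
    return masked, hidden


def mean_impute(masked):
    """Baseline: fill every hidden entry with the mean of the known entries."""
    known = [v for v in masked if v is not None]
    fill = sum(known) / len(known)
    return [fill if v is None else v for v in masked]


def mre_on_hidden(actual, completed, hidden):
    """MRE measured only on the entries that were hidden (actual != 0)."""
    return sum(abs(actual[i] - completed[i]) / abs(actual[i])
               for i in hidden) / len(hidden)


def relative_reduction(baseline_error, model_error):
    """Share by which the model lowers the baseline error."""
    return (baseline_error - model_error) / baseline_error


actual = [float(v) for v in range(1, 21)]      # synthetic data
masked, hidden = mask_values(actual, rate=0.25)
baseline = mre_on_hidden(actual, mean_impute(masked), hidden)
print(len(hidden), round(baseline, 3))
print(round(relative_reduction(0.20, 0.14), 2))  # illustrative inputs -> 0.3
```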
[0054] The above-disclosed embodiments are merely specific examples of the present invention. However, the present invention is not limited thereto, and any variations that can be conceived by those skilled in the art should fall within the protection scope of the present invention.
Claims
1. A multi-source data integrity reconstruction model based on multi-algorithm fusion, characterized in that it comprises a data preprocessing module, an algorithm optimization module, a completion execution module, an error verification module, and a result output module. The data preprocessing module is used to preprocess multi-source raw data to generate a dataset to be reconstructed and transmit it to the algorithm optimization module; the dataset to be reconstructed includes core indicators and valid samples; the preprocessing includes data integration, data transformation, outlier identification and data cleaning. The algorithm optimization module receives the dataset to be reconstructed, loads a multi-algorithm fusion completion model to analyze the data distribution characteristics and extract core indicator features of the dataset to be reconstructed, formulates algorithm matching rules, and outputs the optimal completion algorithm matched to the dataset to be reconstructed to the completion execution module; the core indicator features include missing rate, distribution characteristics, correlation strength, and dispersion. The completion execution module is used to receive the matched optimal completion algorithm and perform adaptive parameter solving on it to achieve automatic calibration; based on the calibrated optimal completion algorithm, it performs the completion operation on the dataset to be reconstructed to generate a preliminary reconstructed dataset, and transmits it to the error verification module.
The error verification module is used to receive the preliminary reconstructed dataset, perform a multi-dimensional error comprehensive evaluation on the preliminary reconstructed dataset, and generate a multi-dimensional error comprehensive evaluation result; when the multi-dimensional error comprehensive evaluation result is qualified, the corresponding preliminary reconstructed dataset is output as the final reconstructed dataset to the result output module. The result output module is used to receive the final reconstructed dataset and generate an evaluation report based on the final reconstructed dataset. The evaluation report includes an overall overview of the reconstructed process, multi-dimensional error evaluation results of each indicator feature, outlier identification records, iterative optimization process, and the final reconstructed dataset.
2. The multi-source data integrity reconstruction model based on multi-algorithm fusion as described in claim 1, characterized in that, The multi-dimensional error comprehensive assessment is based on a multi-dimensional error assessment system, which includes distribution consistency assessment, threshold proportion assessment, error quantification assessment, and level matching assessment. The rules for the multi-dimensional error comprehensive assessment are as follows: When all evaluation results of the multi-dimensional error evaluation system are qualified, the comprehensive multi-dimensional error evaluation result is qualified; when any evaluation result of the multi-dimensional error evaluation system is unqualified, the comprehensive multi-dimensional error evaluation result is unqualified, triggering a fallback completion mechanism to re-match the completion algorithm, perform completion operation and error verification on the unqualified data column until all multi-dimensional error evaluation results are qualified.
3. The multi-source data integrity reconstruction model based on multi-algorithm fusion as described in claim 2, characterized in that the method for determining whether the evaluation results of the multi-dimensional error evaluation system are qualified is as follows: multi-dimensional error evaluation thresholds are set, including a distribution consistency evaluation threshold, a threshold proportion evaluation threshold, an error quantification evaluation threshold, and a level matching evaluation threshold. When a multi-dimensional error evaluation result of the preliminary reconstructed dataset exceeds its multi-dimensional error evaluation threshold, it is judged as unqualified and the fallback completion mechanism is triggered. When the multi-dimensional error evaluation result does not exceed the threshold, it is judged as qualified and the corresponding preliminary reconstructed dataset is output.
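The qualification rule of claims 2 and 3 can be sketched as below. This is a minimal illustration, not the applicant's implementation: the function name, the dictionary layout, and the assumption that every dimension yields an error-like score (lower is better) are hypothetical, and the claims give no numeric thresholds.

```python
# Dimension names follow claims 2 and 3; identifiers are illustrative only.
DIMENSIONS = (
    "distribution_consistency",
    "threshold_proportion",
    "error_quantification",
    "level_matching",
)


def comprehensive_evaluation(scores, thresholds):
    """Return (qualified, failed_dimensions) for one data column.

    A dimension fails when its score exceeds its threshold (claim 3);
    the comprehensive result is qualified only if no dimension fails (claim 2).
    """
    failed = [d for d in DIMENSIONS if scores[d] > thresholds[d]]
    return (len(failed) == 0, failed)
```

A non-empty list of failed dimensions is what would trigger the fallback completion mechanism for that column.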
4. The multi-source data integrity reconstruction model based on multi-algorithm fusion as described in claim 3, characterized in that the fallback completion mechanism is set in the algorithm optimization module and operates as follows: for the data columns whose multi-dimensional error evaluation results are unqualified, the GNN-KNN model is matched, and the completion operation and error verification are re-executed until all multi-dimensional error evaluation results are qualified.
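The fallback loop might look like the following sketch. The `complete` and `verify` callables stand in for the completion execution and error verification modules and are hypothetical. The `max_rounds` cap is also an assumption added so the loop always terminates; the claim itself only says the process repeats until all results are qualified.

```python
def run_fallback(failed_columns, complete, verify, max_rounds=5):
    """Re-complete unqualified columns with the GNN-KNN model.

    complete(column, algorithm) -> completed values   (hypothetical hook)
    verify(column, values)      -> True if qualified  (hypothetical hook)
    Returns (completed, still_unqualified).
    """
    completed = {}
    pending = list(failed_columns)
    for _ in range(max_rounds):  # assumption: bounded retries
        if not pending:
            break
        still_failing = []
        for column in pending:
            values = complete(column, "GNN-KNN")
            if verify(column, values):
                completed[column] = values
            else:
                still_failing.append(column)
        pending = still_failing
    return completed, pending
```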
5. The multi-source data integrity reconstruction model based on multi-algorithm fusion as described in claim 1, characterized in that the multi-algorithm fusion completion model includes a GNN-KNN model and a multi-variant KNN model. The GNN-KNN model converts the tabular data in the dataset to be reconstructed into a graph structure, extracts feature indicators, and performs K-nearest-neighbor queries based on Euclidean distance. The multi-variant KNN model includes three variants: original KNN, RBF kernel KNN, and Sigmoid kernel KNN. Specifically, the original KNN variant calculates the Euclidean distance directly after standardizing the dataset to be reconstructed; the RBF kernel KNN variant optimizes the distance calculation through the RBF kernel function; and the Sigmoid kernel KNN variant calculates the distance after transforming the data through the Sigmoid function.
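The three KNN variants can be illustrated with the sketch below. The claim does not define how the RBF kernel "optimizes" the distance, so this sketch assumes the standard kernel-induced distance sqrt(2 - 2·exp(-γ‖a - b‖²)); the Sigmoid variant follows the claim literally by applying the logistic function element-wise before the Euclidean distance. All function names, the default `gamma`, and mean-of-neighbors imputation are assumptions, and inputs are assumed to be standardized numeric features.

```python
import math


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def rbf_distance(a, b, gamma=1.0):
    # Assumed form: distance induced by k(a, b) = exp(-gamma * ||a - b||^2).
    squared = sum((x - y) ** 2 for x, y in zip(a, b))
    return math.sqrt(max(0.0, 2.0 - 2.0 * math.exp(-gamma * squared)))


def sigmoid_distance(a, b):
    # Transform each feature with the logistic function, then measure distance.
    def squash(v):
        return 1.0 / (1.0 + math.exp(-v))
    return euclidean([squash(x) for x in a], [squash(y) for y in b])


def knn_impute(query, donors, k, distance=euclidean):
    """Fill one missing target with the mean target of the k nearest donors.

    donors is a list of (feature_vector, target_value) pairs with no gaps.
    """
    nearest = sorted(donors, key=lambda d: distance(query, d[0]))[:k]
    return sum(target for _, target in nearest) / len(nearest)
```

Under this assumed form the RBF distance is a monotone function of the Euclidean distance, so it leaves the neighbor ranking unchanged and would only matter if neighbors were weighted by distance.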
6. The multi-source data integrity reconstruction model based on multi-algorithm fusion as described in claim 5, characterized in that the data distribution characteristic analysis is as follows: the data distribution characteristics are analyzed along two dimensions, the confidence interval (CI) and the highest density interval (HDI). A distribution coverage threshold is set, the HDI / CI width ratio is calculated, and the data distribution type is determined from that ratio, wherein 0.8 ≤ HDI / CI width ratio ≤ 1 indicates normally distributed data; an HDI / CI width ratio > 1.2 indicates skewed / heavy-tailed data; and an HDI / CI width ratio < 0.8 indicates highly concentrated data.
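The ratio-based classification can be sketched as follows. The claim does not say how the CI and HDI are computed, so the sketch takes the two widths as inputs. It also assigns no type to ratios above 1 and up to 1.2; returning `"unspecified"` for that range is an assumption of this sketch, as are the function name and the label strings.

```python
def classify_distribution(hdi_width, ci_width):
    """Map the HDI / CI width ratio to the distribution types of claim 6."""
    if ci_width <= 0:
        raise ValueError("CI width must be positive")
    ratio = hdi_width / ci_width
    if ratio < 0.8:
        return "highly_concentrated"
    if ratio <= 1.0:
        return "normal"
    if ratio > 1.2:
        return "skewed_or_heavy_tailed"
    return "unspecified"  # 1 < ratio <= 1.2 is not covered by the claim
```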
7. The multi-source data integrity reconstruction model based on multi-algorithm fusion as described in claim 6, characterized in that the algorithm matching rules are as follows: for skewed / heavy-tailed data in the dataset to be reconstructed, if the correlation strength is > 0.4, the GNN-KNN model is matched; if 0.2 < correlation strength ≤ 0.4, the RBF kernel KNN model is matched; if the correlation strength is ≤ 0.2, the Sigmoid kernel KNN model is matched. For normally distributed data in the dataset to be reconstructed, if the missing rate is ≤ 0.3, the original KNN model is matched; if the missing rate is > 0.3, the GNN-KNN model is matched. For highly concentrated data in the dataset to be reconstructed, the original KNN model is matched. If the number of non-empty samples in the dataset to be reconstructed is less than 2 × K, the GNN-KNN model is matched, where K is the number of nearest neighbors.
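These matching rules translate into a small decision function. In the sketch below, the function name, the label strings, and the choice to check the small-sample rule first (so that it overrides the other rules) are assumptions; the claim lists the rules without stating their precedence.

```python
def match_algorithm(dist_type, correlation, missing_rate, n_non_empty, k):
    """Pick a completion model for one column following claim 7."""
    # Assumed precedence: too few non-empty samples always selects GNN-KNN.
    if n_non_empty < 2 * k:
        return "GNN-KNN"
    if dist_type == "skewed_or_heavy_tailed":
        if correlation > 0.4:
            return "GNN-KNN"
        if correlation > 0.2:
            return "RBF-KNN"
        return "Sigmoid-KNN"
    if dist_type == "normal":
        return "KNN" if missing_rate <= 0.3 else "GNN-KNN"
    if dist_type == "highly_concentrated":
        return "KNN"
    raise ValueError(f"no matching rule for distribution type {dist_type!r}")
```

For example, a skewed column with correlation strength 0.3 and ample samples would be routed to the RBF kernel KNN variant.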
8. The multi-source data integrity reconstruction model based on multi-algorithm fusion as described in claim 1, characterized in that the outlier identification includes extreme value identification and missing value identification. Extreme value identification uses the interquartile range (IQR) to identify the extreme values of each indicator. Missing value identification scans each column of the multi-source raw data to identify missing values. The extreme values are values lower than Q1 − 1.5 × IQR or higher than Q3 + 1.5 × IQR. The missing values include null values, empty strings, and invalid identifiers.
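A minimal sketch of both checks, using only the Python standard library. The set of invalid identifiers is hypothetical, since the claim does not list them, and the quartile interpolation method (`"inclusive"`) is likewise an assumption; columns are assumed to hold numbers apart from the missing markers.

```python
import math
import statistics

# Hypothetical markers; the source does not enumerate the invalid identifiers.
INVALID_IDENTIFIERS = {"N/A", "NULL", "-"}


def is_missing(value):
    """True for null values, NaN, empty strings, and invalid identifiers."""
    if value is None:
        return True
    if isinstance(value, float) and math.isnan(value):
        return True
    if isinstance(value, str):
        text = value.strip()
        return text == "" or text.upper() in INVALID_IDENTIFIERS
    return False


def extreme_values(values):
    """Return values below Q1 - 1.5*IQR or above Q3 + 1.5*IQR."""
    present = [v for v in values if not is_missing(v)]
    q1, _, q3 = statistics.quantiles(present, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [v for v in present if v < low or v > high]
```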
9. The multi-source data integrity reconstruction model based on multi-algorithm fusion as described in claim 8, characterized in that the valid samples are data columns with a missing rate ≥ a preset missing rate; the missing rate is calculated as: Missing rate = number of missing values / total number of data samples.
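As a worked instance of the formula, the sketch below computes the missing rate of one column. For brevity it treats only `None` and empty strings as missing, which is a simplification of the claim 8 definition; the function name is hypothetical.

```python
def missing_rate(column):
    """Missing rate = number of missing values / total number of samples."""
    if not column:
        raise ValueError("missing rate is undefined for an empty column")
    # Simplification: only None and empty strings count as missing here.
    missing = sum(1 for v in column if v is None or v == "")
    return missing / len(column)
```

A column of four samples with two gaps therefore has a missing rate of 2 / 4 = 0.5.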
10. The multi-source data integrity reconstruction model based on multi-algorithm fusion as described in claim 5, characterized in that the adaptive parameter solving sets differentiated adaptive parameter solving logic for the different models in the multi-algorithm fusion completion model. The adaptive parameter solving logic of the GNN-KNN model is to set the hidden layer dimension, the output dimension, and the number of nearest neighbors K, and to dynamically adjust the number of undirected edges in the graph structure based on the number of samples. The adaptive parameter solving logic of the multi-variant KNN model is to standardize the data, set the kernel function parameters for RBF kernel KNN and Sigmoid kernel KNN, and dynamically fine-tune the kernel function parameters based on the data dispersion.
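The claim does not give the formula used to fine-tune kernel parameters from dispersion. As one possible reading, the sketch below standardizes a column with z-scores and derives an RBF `gamma` from the overall variance, borrowing the `gamma="scale"` heuristic of scikit-learn's `SVC`, 1 / (n_features × variance). That heuristic is swapped in for illustration and is not taken from the source; both function names are hypothetical.

```python
import statistics


def standardize(values):
    """Z-score one numeric column; a constant column maps to all zeros."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    if std == 0:
        return [0.0 for _ in values]
    return [(v - mean) / std for v in values]


def dispersion_scaled_gamma(columns):
    """Assumed rule: gamma = 1 / (n_features * variance of all entries).

    Falls back to 1.0 when the data has zero variance.
    """
    flat = [v for column in columns for v in column]
    variance = statistics.pvariance(flat)
    if variance == 0:
        return 1.0
    return 1.0 / (len(columns) * variance)
```

With this rule, more dispersed data yields a smaller `gamma` and therefore a wider kernel.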