A power grid heterogeneous data fusion processing method and component
By using automatic identification and dynamic mapping mechanisms to process heterogeneous power grid data, this approach addresses the issues of low intelligence in data source identification, insufficient conversion flexibility, and weak fusion processing capabilities. It enables rapid access, flexible conversion, and high-quality fusion, making it suitable for the data processing needs of power grids and other industries.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- HUBEI CENT CHINA TECH DEV OF ELECTRIC POWER
- Filing Date
- 2026-03-09
- Publication Date
- 2026-06-19
AI Technical Summary
Existing heterogeneous data processing solutions for power grids lack intelligent data source identification capabilities, have insufficient data conversion flexibility, weak data fusion processing capabilities, and poor component reusability, resulting in poor system adaptability, high maintenance costs, slow scalability and response speed, and difficulty in reusing them across different projects.
The system automatically identifies the type and format of heterogeneous power grid data sources, and achieves rapid access to data sources through feature extraction and similarity matching; it establishes a dynamic mapping mechanism based on a rule engine to support flexible configuration of conversion rules; it uses a multi-dimensional data fusion algorithm to handle data conflicts and generate high-quality fused data; and it designs a microservice and component-based architecture to achieve low coupling and high cohesion between components.
It enables rapid access and automatic adaptation of heterogeneous power grid data, reduces system maintenance costs, improves system flexibility and maintainability, generates unified, high-quality fused data, supports power grid business applications, and has good versatility and scalability.
Figure CN121808705B_ABST
Description
Technical Field
[0001] This invention relates to the field of power information technology, specifically to a method and components for the fusion processing of heterogeneous power grid data.
Background Technology
[0002] As a critical national infrastructure, the power system involves multiple links such as power generation, transmission, transformation, distribution, and consumption. Each link generates a large amount of data. With the advancement of smart grid construction, grid data is showing characteristics of massive volume, multi-source, and heterogeneity.
[0003] Power grid data mainly includes: real-time operational data (measurements of voltage, current, power, and frequency from SCADA systems, with high sampling frequency and large data volume); equipment ledger data (static information recording the model, parameters, location, and status of power grid equipment, usually stored in relational databases); geographic information data (power grid topology and equipment spatial location data from GIS systems, mostly in vector form); meteorological data (environmental data such as temperature, humidity, wind speed, and rainfall, which affect power grid load and equipment status); marketing data (business data such as user electricity consumption information, electricity bill data, and service work orders); and video surveillance data (video and image data from substations and transmission lines).
[0004] These data are scattered across different systems, and the data formats include relational database tables, time-series databases, XML files, JSON files, CSV files, and image files.
[0005] Currently, the following solutions are mainly used for processing heterogeneous data from power grids:
[0006] Existing technical solution 1: A data integration solution based on ETL tools. This solution uses traditional ETL (Extract-Transform-Load) tools (such as Informatica, Kettle, and other data integration software) to extract, transform, and load heterogeneous data. The implementation steps are as follows: configure data source connections, connect to each heterogeneous data source respectively, define data extraction rules, extract the required data from the source system, perform data transformation, including format conversion, field mapping, and data cleaning, load the transformed data into the target data warehouse, and use scheduling tools to periodically execute ETL tasks to keep the data synchronized.
[0007] Existing technical solution 2: Data fusion solution based on data middle platform. This solution builds an enterprise-level data middle platform, establishes a unified data model and data standards, and is implemented by: formulating enterprise data standard specifications, including data dictionary, coding standards, and naming specifications; establishing a data access layer to provide standardized access interfaces for various heterogeneous data sources; using data governance tools to monitor data quality and manage data lineage; and providing unified data access services at the data service layer. The system architecture of this solution includes: data source layer, data access layer, data storage layer, data processing layer, data service layer, and data application layer.
[0008] Existing technical solution 3: Data exchange solution based on data bus. This solution establishes an enterprise service bus and realizes data exchange between heterogeneous systems through message queues and service interfaces. Its structure includes: adapter components, which develop dedicated adapters for different data sources to realize protocol conversion and format conversion; message queues, such as Kafka and RabbitMQ, for asynchronous data transmission; routing engine, which routes messages according to data type and target system; conversion engine, which performs data format conversion and content mapping; and monitoring components, which monitor the data exchange status and abnormal conditions.
[0009] However, the above-mentioned existing technical solutions have the following drawbacks:
[0010] Disadvantage 1: Lack of intelligent data source identification capability. Existing ETL solutions and data middle platform solutions require manual configuration of data source type, data format, and field mapping relationship information. When the power grid adds a new data source or the data source structure changes, it is necessary to reconfigure and develop, resulting in poor system adaptability and high maintenance costs. This is because the existing solutions use a static configuration method and lack the ability to automatically identify data source characteristics and extract patterns.
[0011] Disadvantage 2: Insufficient flexibility in data standardization and transformation. The data transformation rules of existing solutions are usually hard-coded in ETL scripts or adapters. When the data standard changes or new data formats need to be supported, the code needs to be modified and redeployed. This is because the transformation logic is tightly coupled with the processing flow and lacks dynamic mapping mechanisms and rule engine support, resulting in poor system scalability and slow response to changes.
[0012] Disadvantage 3: Weak heterogeneous data fusion processing capability. Existing solutions mainly focus on data format conversion and loading, and are insufficient in the fusion processing capability for semantic association, temporal alignment and conflict resolution between heterogeneous data. When the data of the same entity is scattered in multiple data sources and there are inconsistencies, there is a lack of effective fusion strategies to generate a unified, high-quality data view. This is because existing solutions lack multi-dimensional data fusion algorithms and intelligent conflict resolution mechanisms.
[0013] Disadvantage 4: Poor component reusability and versatility. Existing data processing solutions are mostly customized developments for specific business scenarios. The functional modules of data access, transformation, and fusion are highly coupled, making it difficult to reuse them across different projects. When similar data processing capabilities are needed in other power grid companies or other industries, they need to be redeveloped, resulting in serious duplication of construction and long development cycles.
Summary of the Invention
[0014] The purpose of this invention is to provide a method and components for the fusion processing of heterogeneous power grid data, so as to solve the technical problems of low intelligence in data source identification, insufficient flexibility in data conversion, weak data fusion processing capability, and poor reusability of components in the prior art.
[0015] To achieve the above objectives, the present invention provides the following technical solution:
[0016] A method for fusing heterogeneous power grid data includes the following steps:
[0017] S1. Receive raw data from multiple heterogeneous data sources, including SCADA system, load management system, power distribution automation system, marketing system, GIS system and meteorological system;
[0018] S2. Automatically identify each heterogeneous data source, including: extracting sample data from the original data, wherein the number of sample data is 100; extracting structural features, content features, semantic features, temporal features, and correlation features of the sample data; quantizing and encoding the extracted features to generate feature vectors; matching the feature vectors with the pre-stored data source template feature vectors based on similarity, and determining the data source type when the similarity is greater than 0.75.
[0019] S3. Extract data schema information from the data source, including field names, field types, field constraints, and field semantics;
[0020] S4. Based on the data schema information and preset conversion rules, perform standardization conversion on the original data. The standardization conversion includes field mapping, type conversion, value range conversion, unit conversion, and format conversion to generate standardized data.
[0021] S5. Perform fusion processing on standardized data from different data sources, including: extracting entity identification features of data records; calculating the comprehensive similarity between different data records; clustering data records into entities based on the comprehensive similarity, and clustering data records with similarity greater than a preset threshold into the same entity; performing conflict detection on multiple data records belonging to the same entity; when a conflict is detected, calculating the credibility of each data source, and applying conflict resolution strategies based on the credibility.
[0022] S6. Organize and store the merged data according to a unified data model;
[0023] S7. Provide integrated data access services to external parties through the data service interface.
[0024] Furthermore, the similarity matching process in S2 includes: extracting structural features, including the number of fields, field names, field types, and nesting levels; extracting content features, including data type, numerical range, character length distribution, null value rate, and number of unique values; extracting semantic features by matching with a power grid terminology dictionary to identify the business meaning of the fields; extracting time-series features, including the format, time granularity, time range, and sampling frequency of time fields; extracting association features, including the correlation coefficient between fields and primary / foreign key relationships; and generating a fixed-dimensional feature vector after normalizing the above features, wherein the dimension of the feature vector is 100 to 200.
[0025] Furthermore, the conflict resolution strategies in S5 include: a credibility priority strategy, which selects the value of the data source with the highest credibility value as the fusion result; a weighted average strategy, which calculates the fusion value using a weighted average formula for numerical conflicts; a majority voting strategy, which selects the value with the highest frequency of occurrence as the fusion result for enumerated conflicts; and a timeliness priority strategy, which selects the value of the data source with the most recent data update time as the fusion result.
[0026] Furthermore, in step S5, for time-series data containing timestamps, a time-series alignment operation is performed before fusion processing, including: determining a unified time base and time granularity; normalizing the timestamps of each data source to unify the time zone and time format; identifying missing time points in the base time series; estimating the data for the missing time points using interpolation methods, including linear interpolation, spline interpolation, and forward filling; downsampling data with a sampling frequency higher than the base granularity by aggregating the original data within each base time interval; and synchronizing and aligning the time-series data from each data source according to the base time series.
[0027] The present invention also provides a fusion processing component for heterogeneous power grid data, comprising:
[0028] The data acquisition module is used to collect raw data from multiple heterogeneous data sources, including SCADA systems, load management systems, power distribution automation systems, marketing systems, GIS systems, and meteorological systems.
[0029] The data source identification module is used to automatically identify heterogeneous data sources. It includes a feature extraction unit, a feature vector generation unit, and a template matching unit. The feature extraction unit is used to extract structural features, content features, semantic features, temporal features, and correlation features of the data. The feature vector generation unit is used to quantize and encode the extracted features into feature vectors of fixed dimensions. The template matching unit is used to perform similarity matching between the feature vectors and pre-stored data source templates. When the similarity is greater than 0.75, the data source type is determined.
[0030] The schema extraction module is used to extract data schema information from the data source, including field names, field types, field constraints, and field semantics, and to generate a schema description document.
[0031] The data conversion module is used to perform standardized conversion on the original data according to the data schema information and preset conversion rules. The standardized conversion includes field mapping, type conversion, value range conversion, unit conversion, and format conversion.
[0032] The data fusion module is used to fuse standardized data from different data sources, including a similarity calculation unit, an entity clustering unit, a conflict detection unit, and a conflict resolution unit.
[0033] The data service module is used to organize and store the merged data according to a unified data model, and to provide data access services to the outside world through standardized interfaces.
[0034] Furthermore, the data source identification module also includes: a metadata management unit, used to store the feature vector and schema description document of the data source template, establish a mapping relationship between the data source identifier and the schema description, and perform version management on the schema description document.
[0035] Furthermore, the data transformation module includes: a rule engine for loading, parsing, and executing transformation rules, including field mapping rules, type conversion rules, value domain conversion rules, unit conversion rules, format conversion rules, and calculation derivation rules; a mapping converter for performing data transformation operations according to the rules parsed by the rule engine; a data cleaner for cleaning the transformed data, including null value handling, duplicate value removal, outlier correction, and format normalization; and a transformation log recorder for recording operation information, transformation results, and exception information during the transformation process.
[0036] Furthermore, in the data fusion module: the similarity calculation unit is used to calculate the comprehensive similarity, which is calculated by weighted summation, combining identifier similarity, attribute similarity, and spatiotemporal correlation; the conflict resolution unit selects a conflict resolution strategy based on the data source credibility and conflict type, and the conflict resolution strategy includes credibility priority strategy, weighted average strategy, majority voting strategy, and timeliness priority strategy.
[0037] Compared with the prior art, the beneficial effects of the present invention by adopting the above technical solution are as follows:
[0038] (1) This invention can automatically identify the type, format and structural characteristics of heterogeneous power grid data sources through data source feature extraction and pattern recognition technology. It can achieve rapid access to data sources without manual configuration. When a new data source is added or the data source structure changes, the system can automatically adapt, which greatly reduces the system maintenance cost and dependence on professional technicians.
[0039] (2) This invention establishes a dynamic mapping mechanism based on a rule engine, which decouples data transformation rules from the processing flow and supports flexible definition of transformation rules through configuration files. When data standards change, only the rule configuration needs to be modified without modifying the code. The system can respond quickly to changes, improving the system's flexibility and maintainability.
[0040] (3) This invention proposes a multi-dimensional data fusion algorithm that comprehensively considers data similarity, temporal alignment, semantic association and credibility. It can effectively handle the conflict and inconsistency between heterogeneous data. Through intelligent conflict resolution strategy, it generates unified and high-quality fused data, which improves the integrity, accuracy and consistency of data and provides reliable data support for power grid business applications.
[0041] (4) The present invention adopts the design concept of microservices and componentization, and encapsulates the functions of data collection, identification, transformation, fusion, and service into independent reusable components. The components communicate with each other through standard interfaces, with low coupling and high cohesion. This design allows each component to be deployed independently and combined flexibly. It is not only applicable to the power grid industry, but can also be quickly ported to other industries that need to process heterogeneous data, and has good versatility and scalability.
Description of the Drawings
[0042] Figure 1 is a schematic diagram of the overall system architecture of the present invention;
[0043] Figure 2 is a schematic diagram of the core processing flow of the present invention;
[0044] Figure 3 is a schematic diagram of the data source identification and schema extraction process of the present invention;
[0045] Figure 4 is a schematic diagram of the data standardization and transformation architecture of the present invention;
[0046] Figure 5 is a schematic diagram of the multi-dimensional data fusion processing flow of the present invention;
[0047] Figure 6 is a schematic diagram of the component-based implementation architecture of the present invention.
Detailed Implementation
[0048] The present invention will now be described in further detail with reference to the accompanying drawings and specific embodiments.
[0049] Example 1
[0050] As shown in Figures 1 to 6, this embodiment provides a method for fusing heterogeneous data from a power grid. This method is used to solve the problems of integrating, standardizing, and fusing multi-source heterogeneous data in a power grid system.
[0051] Terminology Explanation:
[0052] Heterogeneous data refers to data from different sources, in different formats, and with inconsistent structures, including structured data (such as relational databases), semi-structured data (such as XML and JSON), and unstructured data (such as text and images).
[0053] SCADA system: Supervisory Control And Data Acquisition, is a data acquisition and monitoring system used for real-time data acquisition and monitoring of the power grid.
[0054] Data fusion: The process of integrating, linking, and uniformly representing data from multiple data sources to obtain more accurate and complete information.
[0055] Data schema: A specification that describes the data structure, type, and constraint information.
[0056] Time-series data: Data sequences recorded in chronological order; measurement data in power grids are mostly time-series data.
[0057] I. Overall System Architecture
[0058] As shown in Figure 1, the overall system architecture of this invention adopts a layered, component-based design. Starting from the heterogeneous data sources at the bottom layer, data passes through the data acquisition layer, data identification layer, data conversion layer, data fusion layer, and data service layer, which ultimately provides a unified data access service for upper-layer applications.
[0059] The system architecture includes the following main layers:
[0060] Data source layer: This layer contains various heterogeneous data sources from the power grid, including SCADA systems, load management systems, distribution automation systems, marketing systems, GIS systems, and meteorological systems. These data sources have different data formats, including relational databases, time-series databases, XML files, JSON files, and CSV files.
[0061] Data Acquisition Layer: Responsible for acquiring raw data from different data sources and performing initial caching. It provides adapters for multiple protocols and supports data acquisition methods such as JDBC, HTTP, FTP, and MQTT.
[0062] Data identification layer: Automatically identifies data source types and extracts data schemas, enabling rapid identification of data sources without manual configuration.
[0063] Data Transformation Layer: Converts data into a standard format according to rules, using a dynamic mapping mechanism based on a rule engine, and supports flexible configuration of transformation rules.
[0064] Data fusion layer: Executes multi-dimensional data fusion algorithms, handles data conflicts and inconsistencies, and generates high-quality fused data.
[0065] Data Service Layer: Provides unified data access services, offering data query, subscription, and export functions through RESTful API interfaces.
[0066] Application layer: Various power grid business applications, including power grid monitoring, load forecasting, fault diagnosis, and asset management business systems.
[0067] The layers are connected through standardized interfaces, and data is processed sequentially from bottom to top. Each layer is an independent and reusable component.
[0068] II. Core Processing Flow
[0069] As shown in Figure 2, the core processing flow of the heterogeneous power grid data fusion processing method of the present invention includes the following steps:
[0070] Step S1: Receive heterogeneous data sources
[0071] The system receives raw data from multiple heterogeneous data sources, including SCADA systems, load management systems, distribution automation systems, marketing systems, GIS systems, and meteorological systems. Each data source has a different data format, interface protocol, and update frequency.
[0072] Step S2: Data Source Identification
[0073] The received raw data is identified by first determining whether the data source is of a known type. If it is, the corresponding data schema is loaded directly. If it is not, the automatic identification process is executed.
[0074] The automatic identification process includes: extracting 100 sample records from the original data; extracting multidimensional features of the sample data, including structural features, content features, semantic features, temporal features, and association features; generating feature vectors; matching them with pre-stored data source templates based on similarity; and determining the data source type when the similarity is greater than 0.75.
[0075] If automatic identification fails, the data source type needs to be manually labeled, and the feature vector of the new type needs to be stored in the template library.
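The template matching in step S2 can be sketched as follows. This is a minimal illustration, not the patented implementation: the text does not name the similarity measure, so cosine similarity is assumed here, and the function names, the template library `TEMPLATES`, and its 4-dimensional vectors are hypothetical (the text specifies 100 to 200 dimensions). Only the 0.75 threshold comes from the description.

```python
import math

SIMILARITY_THRESHOLD = 0.75  # threshold stated in the description


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length numeric vectors (assumed measure)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def identify_source(feature_vector, templates):
    """Return (source_type, similarity) for the best template.

    source_type is None when no template exceeds the threshold, which
    corresponds to the manual-labeling fallback in the text.
    """
    best_type, best_sim = None, 0.0
    for source_type, template_vector in templates.items():
        sim = cosine_similarity(feature_vector, template_vector)
        if sim > best_sim:
            best_type, best_sim = source_type, sim
    if best_sim > SIMILARITY_THRESHOLD:
        return best_type, best_sim
    return None, best_sim


# Hypothetical template library with toy 4-dimensional vectors.
TEMPLATES = {
    "scada": [0.9, 0.1, 0.8, 0.2],
    "marketing": [0.1, 0.9, 0.2, 0.7],
}
```

A vector close to the `scada` template is identified automatically, while a vector that resembles no template returns `None` and would be routed to manual labeling.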
[0076] Step S3: Schema Extraction
[0077] Extract data schema information from the data source, including field names, field types, field constraints, and field semantics. For relational databases, obtain table structure information by querying the system metadata table. For file-type data, obtain the field list by parsing the file header or traversing the file content.
[0078] Generate structured schema description documents and store them in the metadata database.
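For file-type data, parsing the header and traversing the content might look like the sketch below. It assumes CSV input and a deliberately coarse type inference (integer, float, string); the function names and the shape of the resulting schema description are illustrative assumptions, not taken from the source.

```python
import csv
import io


def infer_type(values):
    """Infer a coarse field type from the non-empty sample strings."""
    non_empty = [v for v in values if v != ""]
    if not non_empty:
        return "unknown"
    for caster, name in ((int, "integer"), (float, "float")):
        try:
            for v in non_empty:
                caster(v)
            return name
        except ValueError:
            continue
    return "string"


def extract_csv_schema(text):
    """Build a simple schema description from CSV text with a header row."""
    reader = csv.DictReader(io.StringIO(text))
    rows = list(reader)
    schema = []
    for field in reader.fieldnames or []:
        # Missing trailing cells are read as None; treat them as empty.
        values = [(row[field] or "") for row in rows]
        schema.append({
            "name": field,
            "type": infer_type(values),
            "nullable": any(v == "" for v in values),
        })
    return schema


SAMPLE = "device_id,voltage_kv,status\nT001,110.5,on\nT002,,off\n"
```

Calling `extract_csv_schema(SAMPLE)` yields one entry per column; field semantics and constraints beyond nullability would need the terminology dictionary described earlier and are not covered here.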
[0079] Step S4: Data Standardization Transformation
[0080] Based on the extracted data schema information and preset transformation rules, the original data undergoes a standardization transformation, which includes:
[0081] Field mapping: Maps source data fields to target standard fields;
[0082] Type conversion: Converts the source data type to the target data type;
[0083] Range transformation: Mapping source data values to a target range;
[0084] Unit conversion: Converting units for physical quantities;
[0085] Format conversion: unifying data formats, such as date and time formats and character encoding.
[0086] The transformation process is driven by a rules engine, and the transformation rules are stored in the form of configurations that can be dynamically loaded and updated.
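A configuration-driven transformation of this kind can be sketched as below. The rule format, the field names (`U`, `stat`, `ts`), and the assumption that source timestamps are already in UTC are all hypothetical; the point is only that the rules are data, so they can be reloaded without changing the code.

```python
from datetime import datetime, timezone

# Hypothetical rule configuration; in practice it could be loaded from a
# JSON or YAML file and replaced at runtime.
RULES = [
    # field mapping + type conversion + unit conversion (V -> kV)
    {"source": "U", "target": "voltage_kv", "type": "float", "scale": 0.001},
    # field mapping + value range conversion
    {"source": "stat", "target": "status",
     "value_map": {"1": "closed", "0": "open"}},
    # field mapping + format conversion (assumes the source time is UTC)
    {"source": "ts", "target": "timestamp",
     "datetime_format": "%Y/%m/%d %H:%M:%S"},
]


def apply_rules(record, rules):
    """Apply each rule in order and return the standardized record."""
    out = {}
    for rule in rules:
        if rule["source"] not in record:
            continue
        value = record[rule["source"]]
        if rule.get("type") == "float":
            value = float(value)
        if "scale" in rule:
            value = value * rule["scale"]
        if "value_map" in rule:
            value = rule["value_map"].get(value, value)
        if "datetime_format" in rule:
            parsed = datetime.strptime(value, rule["datetime_format"])
            value = parsed.replace(tzinfo=timezone.utc).isoformat()
        out[rule["target"]] = value
    return out
```

For example, `apply_rules({"U": "110000", "stat": "1", "ts": "2026/03/09 08:00:00"}, RULES)` produces a record with `voltage_kv`, `status`, and an ISO 8601 `timestamp`.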
[0087] Step S5: Data Quality Inspection
[0088] The standardized data after conversion is subjected to quality checks, including: completeness check (checking whether required fields are empty), consistency check (checking whether field values meet constraints), and accuracy check (checking whether values are within a reasonable range).
[0089] If a quality issue is detected, a quality issue log is recorded and an alert is issued, but the data processing flow is not interrupted.
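The three checks can be illustrated with a small function that collects issues instead of raising, so that processing continues as described. The parameter names and the example limits are assumptions made for illustration.

```python
def check_quality(record, required, allowed, ranges):
    """Return a list of (check_name, field) issues; never raises on bad data.

    required: field names that must be non-empty (completeness)
    allowed:  {field: set of permitted values} (consistency)
    ranges:   {field: (low, high)} plausible numeric range (accuracy)
    """
    issues = []
    for field in required:
        if record.get(field) in (None, ""):
            issues.append(("completeness", field))
    for field, values in allowed.items():
        value = record.get(field)
        if value not in (None, "") and value not in values:
            issues.append(("consistency", field))
    for field, (low, high) in ranges.items():
        value = record.get(field)
        if isinstance(value, (int, float)) and not (low <= value <= high):
            issues.append(("accuracy", field))
    return issues


# Hypothetical limits for a standardized measurement record.
REQUIRED = ["device_id", "voltage_kv"]
ALLOWED = {"status": {"open", "closed"}}
RANGES = {"voltage_kv": (0.0, 800.0)}
```

A caller would log each returned issue and raise an alert, then pass the record on to the next step regardless.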
[0090] Step S6: Entity Extraction
[0091] Entity identification features are extracted from the standardized data. Entity identifiers include the equipment number, line name, and substation ID. For data without a clear and unique identifier, a composite identifier is generated by combining multiple fields.
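One way to derive such an identifier is sketched below. The field names, the normalization (trimming and upper-casing), and the use of a truncated SHA-1 digest for the composite key are all illustrative assumptions; the source only states that multiple fields are combined.

```python
import hashlib


def entity_key(record,
               id_fields=("equipment_no", "substation_id"),
               composite_fields=("line_name", "voltage_level", "location")):
    """Return a stable entity key for a standardized record.

    Prefers an explicit identifier field; otherwise hashes a combination
    of descriptive fields into a composite identifier.
    """
    for name in id_fields:
        value = record.get(name)
        if value:
            return f"{name}:{str(value).strip().upper()}"
    parts = [str(record.get(name) or "").strip().upper()
             for name in composite_fields]
    if not any(parts):
        raise ValueError("record has no usable identifying fields")
    digest = hashlib.sha1("|".join(parts).encode("utf-8")).hexdigest()[:16]
    return f"composite:{digest}"
```

Because the fields are normalized before hashing, records that differ only in case or surrounding whitespace map to the same composite key.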
[0092] Step S7: Similarity Calculation
[0093] The overall similarity between different data records is calculated to determine whether they describe the same entity. The overall similarity takes into account identifier similarity, attribute similarity, and spatiotemporal relevance.
[0094] The formula for calculating the overall similarity is:
[0095] $S = w_1 S_{id} + w_2 S_{attr} + w_3 S_{st}$;
[0096] where $S$ represents the overall similarity score, with a value ranging from 0 to 1; $S_{id}$ represents the identifier similarity, calculated using the edit distance algorithm; $S_{attr}$ represents the attribute similarity, calculated using the cosine similarity algorithm; and $S_{st}$ represents the spatiotemporal correlation, calculated based on timestamp differences and geographical distance.
[0097] $w_1$, $w_2$, and $w_3$ represent the corresponding weight coefficients. $w_1$ ranges from 0.3 to 0.5, $w_2$ ranges from 0.3 to 0.5, $w_3$ ranges from 0.1 to 0.3, and $w_1 + w_2 + w_3 = 1$.
[0098] The weight coefficients are configured based on data characteristics and business needs. For data with complete and accurate identifier fields, $w_1$ takes a larger value; for data with rich attributes, $w_2$ takes a larger value; for data with strong spatiotemporal correlation, $w_3$ takes a larger value.
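The weighted summation of identifier similarity, attribute similarity, and spatiotemporal correlation can be sketched as follows. The edit distance and cosine similarity follow the text; the normalization of the edit distance, the linear decay used for the spatiotemporal term, its 300-second and 1000-metre cut-offs, and the default weights (0.4, 0.4, 0.2) are assumptions chosen to fall within the stated ranges.

```python
import math


def edit_distance(a, b):
    """Levenshtein distance between two strings (dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def identifier_similarity(a, b):
    """Edit distance normalized to [0, 1]; 1.0 means identical identifiers."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - edit_distance(a, b) / longest


def attribute_similarity(u, v):
    """Cosine similarity of two attribute vectors, clamped to [0, 1]."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return 0.0 if norm == 0 else max(0.0, dot / norm)


def spatiotemporal_correlation(dt_seconds, distance_m,
                               max_dt=300.0, max_dist=1000.0):
    """Assumed linear decay with time difference and geographic distance."""
    time_score = max(0.0, 1.0 - abs(dt_seconds) / max_dt)
    space_score = max(0.0, 1.0 - distance_m / max_dist)
    return (time_score + space_score) / 2


def overall_similarity(s_id, s_attr, s_st, weights=(0.4, 0.4, 0.2)):
    """Weighted sum of the three component scores; weights must sum to 1."""
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    return weights[0] * s_id + weights[1] * s_attr + weights[2] * s_st
```

Two records whose overall similarity exceeds the preset threshold would then be treated as describing the same entity.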
[0099] Step S8: Determine if it is time series data
[0100] Determine if the data contains a timestamp field. If it does, perform time-series alignment. If it does not contain a timestamp, perform conflict detection directly.
[0101] Step S9: Time-Series Alignment
[0102] For time-series data, timestamp alignment is performed. Time-series alignment includes:
[0103] Establish a unified time base and time granularity, and select a time granularity of 5 minutes;
[0104] The timestamps of each data source are standardized, with the time zone set to UTC and the time format set to ISO8601.
[0105] Identify missing time points in the baseline time series;
[0106] For missing time points, interpolation methods are used to estimate the data. Interpolation methods include linear interpolation, spline interpolation, and forward filling. The appropriate interpolation method is selected according to the characteristics of the data.
[0107] Data with a sampling frequency higher than the baseline granularity is downsampled, and the original data is aggregated by averaging within each 5-minute interval;
[0108] Synchronize and align the time-series data from each data source according to the baseline time series.
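The alignment steps above can be sketched with the standard library alone. This illustration uses the 5-minute granularity, UTC normalization, and averaging from the text; it implements only linear interpolation between known buckets with forward filling at the end of the series (spline interpolation is omitted), and the function names are hypothetical.

```python
from datetime import datetime, timedelta, timezone

GRANULARITY = timedelta(minutes=5)  # base granularity from the description


def floor_to_bucket(ts):
    """Floor a timezone-aware timestamp to the start of its 5-minute UTC bucket."""
    ts = ts.astimezone(timezone.utc)
    return ts.replace(minute=ts.minute - ts.minute % 5, second=0, microsecond=0)


def align(samples, start, end):
    """Align (aware datetime, float) samples to the base time series.

    Returns {bucket_start: value} for every bucket from start to end.
    """
    # Downsample: average all raw values that fall into the same bucket.
    buckets = {}
    for ts, value in samples:
        buckets.setdefault(floor_to_bucket(ts), []).append(value)
    averaged = {b: sum(v) / len(v) for b, v in buckets.items()}

    # Build the base time series.
    grid = []
    t = floor_to_bucket(start)
    last = floor_to_bucket(end)
    while t <= last:
        grid.append(t)
        t += GRANULARITY

    # Fill missing time points.
    known = sorted(averaged)
    aligned = {}
    for t in grid:
        if t in averaged:
            aligned[t] = averaged[t]
            continue
        before = [k for k in known if k < t]
        after = [k for k in known if k > t]
        if before and after:  # linear interpolation
            t0, t1 = before[-1], after[0]
            frac = (t - t0) / (t1 - t0)
            aligned[t] = averaged[t0] + frac * (averaged[t1] - averaged[t0])
        elif before:  # forward filling
            aligned[t] = averaged[before[-1]]
        else:  # nothing earlier to fill from
            aligned[t] = None
    return aligned
```

Each key can be rendered in ISO 8601 with `isoformat()`. Running `align` once per data source against the same `start` and `end` gives series that share identical time points and can be compared directly.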
[0109] Step S10: Conflict Detection
[0110] Conflict detection is performed on multiple data records belonging to the same entity. Conflict types include:
[0111] Numerical conflict: The values of numerical attributes are inconsistent and exceed the allowable error range;
[0112] Value conflict: Inconsistent values for enumerated properties;
[0113] Time conflict: Inconsistent timestamps;
[0114] Integrity conflict: Some data sources are missing critical attributes.
[0115] For each attribute field, determine whether there is a conflict. If there is, record the conflict type and the set of conflict values.
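A per-attribute check along these lines is sketched below. The record layout (dictionaries with a `source` key), the absolute tolerance of 0.5, and the `timestamp` field name are assumptions for illustration; the four conflict types come from the text.

```python
def detect_conflicts(records, tolerance=0.5, time_fields=("timestamp",)):
    """Return (field, conflict_type, values) tuples for records of one entity.

    records: list of dicts, each with a "source" key plus attribute fields.
    tolerance: allowable absolute spread for numerical attributes (assumed).
    """
    conflicts = []
    fields = sorted({name for r in records for name in r if name != "source"})
    for name in fields:
        present = [r[name] for r in records if r.get(name) is not None]
        # Integrity conflict: some sources do not provide this attribute.
        if len(present) < len(records):
            conflicts.append((name, "integrity", present))
        if len(set(present)) <= 1:
            continue
        distinct = sorted(set(present), key=str)
        is_numeric = all(
            isinstance(v, (int, float)) and not isinstance(v, bool)
            for v in present
        )
        if name in time_fields:
            conflicts.append((name, "time", distinct))
        elif is_numeric:
            if max(present) - min(present) > tolerance:
                conflicts.append((name, "numerical", sorted(set(present))))
        else:
            conflicts.append((name, "value", distinct))
    return conflicts


# Hypothetical records for one transformer from three sources.
RECORDS = [
    {"source": "scada", "voltage_kv": 110.0, "status": "closed"},
    {"source": "ledger", "voltage_kv": 121.0, "status": "open"},
    {"source": "gis", "voltage_kv": 110.2},
]
```

For `RECORDS`, the function reports a numerical conflict on `voltage_kv` and both an integrity and a value conflict on `status`.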
[0116] Step S11: Determine if a conflict exists.
[0117] If a conflict is detected, the conflict resolution process is executed; if no conflict exists, the data is merged directly.
[0118] Step S12: Calculate the credibility of the data source
[0119] To resolve the conflict, it is necessary to assess the credibility of each data source. The credibility of the data source should be considered in conjunction with the following factors:
[0120] Historical accuracy: The accuracy of historical data from this data source, calculated by comparing it with standard data;
[0121] Data integrity: The degree of completeness of the data provided by this data source;
[0122] Data timeliness: the frequency and delay of data updates;
[0123] Data source authority: the degree of official recognition of the data source.
[0124] Credibility is represented by a score between 0 and 1, with higher scores indicating higher credibility. The credibility calculation formula is as follows:
[0125] $C = \alpha A + \beta I + \gamma T + \delta U$;
[0126] where $C$ indicates the credibility of the data source; $A$ indicates the historical accuracy; $I$ indicates the integrity score; $T$ indicates the timeliness score; and $U$ indicates the authority score.
[0127] $\alpha$, $\beta$, $\gamma$, and $\delta$ are the weighting coefficients: $\alpha = 0.4$, $\beta = 0.2$, $\gamma = 0.2$, $\delta = 0.2$, and $\alpha + \beta + \gamma + \delta = 1$.
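As a worked example of this weighted sum, the sketch below uses the four weights given in the text (0.4, 0.2, 0.2, 0.2). The dictionary keys and the sample scores are hypothetical.

```python
# Weights from the description: accuracy 0.4, the other three 0.2 each.
WEIGHTS = {
    "accuracy": 0.4,
    "integrity": 0.2,
    "timeliness": 0.2,
    "authority": 0.2,
}


def credibility(scores):
    """Weighted sum of four factor scores, each expected in [0, 1]."""
    for name in WEIGHTS:
        if not 0.0 <= scores[name] <= 1.0:
            raise ValueError(f"{name} score must be between 0 and 1")
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS)


# Hypothetical factor scores for one data source.
SCADA_SCORES = {
    "accuracy": 0.95,
    "integrity": 0.90,
    "timeliness": 0.80,
    "authority": 1.00,
}
```

With these sample scores the result is 0.4 × 0.95 + 0.2 × 0.90 + 0.2 × 0.80 + 0.2 × 1.00 = 0.92.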
[0128] Step S13: Apply conflict resolution strategies
[0129] Choose an appropriate conflict resolution strategy based on the conflict type and the credibility of the data source:
[0130] Strategy 1: Credibility-first strategy. This strategy selects the value from the data source with the highest credibility as the fusion result. It is suitable for situations where there are significant differences in the credibility of the data sources.
[0131] Strategy 2: Weighted average strategy. For numerical conflicts, the fusion value is calculated using the following formula:
[0132] $V = \dfrac{\sum_{i=1}^{n} C_i V_i}{\sum_{i=1}^{n} C_i}$;
[0133] where $V$ indicates the fusion value; $n$ indicates the number of data sources; $C_i$ indicates the credibility of the $i$-th data source; and $V_i$ indicates the value from the $i$-th data source. This strategy comprehensively leverages information from multiple data sources and reduces the random error of any single data source.
[0134] Strategy 3: Majority voting strategy. For enumerated conflicts, count the frequency of each value and select the value with the highest frequency as the fusion result. If the total credibility of the data source corresponding to a certain value is dominant (more than 50% of the total credibility), then select that value.
[0135] Strategy 4: Timeliness-first strategy, which selects the data from the most recent data update time as the fusion result. This strategy is suitable for scenarios where data changes frequently and timeliness is critical.
[0136] If the conflict cannot be resolved automatically, it is marked for manual review and submitted to a human for judgment and processing.
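The first three strategies can be sketched as follows. This is a minimal illustration, not the component's implementation. Each reading is assumed to be a `(value, credibility)` pair, and all function names are hypothetical. For the majority voting strategy the sketch adopts one possible reading of the text: a value backed by more than 50% of the total credibility is selected first, otherwise the most frequent value is selected, and a tie returns `None` to signal manual review.

```python
from collections import Counter, defaultdict


def credibility_first(readings):
    """Strategy 1: return the value reported by the most credible source."""
    return max(readings, key=lambda reading: reading[1])[0]


def weighted_average(readings):
    """Strategy 2: credibility-weighted average of numerical values."""
    total = sum(cred for _, cred in readings)
    if total == 0:
        raise ValueError("total credibility must be positive")
    return sum(value * cred for value, cred in readings) / total


def majority_vote(readings):
    """Strategy 3 (assumed precedence): credibility-dominant value, else most frequent."""
    cred_by_value = defaultdict(float)
    for value, cred in readings:
        cred_by_value[value] += cred
    total = sum(cred_by_value.values())
    for value, cred in cred_by_value.items():
        if total > 0 and cred > 0.5 * total:
            return value
    ranked = Counter(value for value, _ in readings).most_common(2)
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return None  # tie: leave the decision to manual review
    return ranked[0][0]
```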
[0137] Step S14: Merge and generate fused data
[0138] The resolved attribute values are assembled into the final fused data record, which includes entity identifier, fusion timestamp, fused value of each attribute, data source identifier of each attribute, credibility of each attribute, and conflict resolution flag.
[0139] Step S15: Quality Assessment
[0140] The quality score of the fused data is calculated, taking into account factors such as data integrity, data consistency, and fusion reliability. The quality score is expressed on a 100-point scale, with higher scores indicating better data quality.
[0141] Step S16: Update the unified data model
[0142] The merged data is organized and stored according to a unified data model, which includes power grid equipment entities, power grid node entities, power grid line entities, measurement data entities, and event record entities. These entities are connected through association relationships.
[0143] Step S17: Publish the change event
[0144] Publish data change events to data subscribers to notify downstream applications of new fused data generation. Change events are pushed via message queues and can be received by multiple subscribers simultaneously.
[0145] Step S18: Process multiple data sources in a loop
[0146] Repeat the above process to process the next heterogeneous data source until all data sources have been processed.
[0147] III. Data Source Identification and Pattern Extraction
[0148] As shown in Figure 3, the detailed process of data source identification and pattern extraction is as follows:
[0149] Step A1: Receive raw data
[0150] The system receives raw data from heterogeneous data sources.
[0151] Step A2: Data format determination
[0152] Determine the format type of the original data. If it is structured data (such as a relational database table), extract the table structure. If it is semi-structured data (such as JSON or XML), parse the nested structure. If it is unstructured data (such as text), extract the text features.
[0153] Step A3: Extract the table structure
[0154] For structured data, table structure information, including table name, column name, data type, primary key, and foreign key, is obtained by querying the database metadata table.
[0155] Step A4: Parse the nested structure
[0156] For semi-structured data, parse its nested hierarchical structure and extract key names, value types, and hierarchical relationships.
[0157] Step A5: Extract text features
[0158] For unstructured data, extract keywords, word frequency, and text length from the text.
[0159] Step A6: Analyze field names
[0160] Analyze field names, extract field naming conventions and patterns, and identify the business meaning of fields by matching them with a dictionary of terms in the power grid field.
[0161] Step A7: Analyze data types
[0162] Analyze the data type of the field, including integer, floating-point, string, date, and boolean. For fields without an explicit type definition, infer the data type by attempting type conversion.
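Type inference by attempted conversion, as described in step A7, might look like the following sketch. The order in which the conversions are tried, the accepted boolean spellings, and the use of ISO-style date parsing are assumptions made for illustration. The function names are hypothetical.

```python
from datetime import datetime


def _converts(parser, text):
    """Return True if parser(text) succeeds without raising ValueError."""
    try:
        parser(text)
        return True
    except ValueError:
        return False


def infer_type(samples):
    """Infer a field type from string samples by trying conversions in order."""
    values = [s.strip() for s in samples if s is not None and s.strip() != ""]
    if not values:
        return "unknown"
    if all(v.lower() in {"true", "false"} for v in values):
        return "boolean"
    if all(_converts(int, v) for v in values):
        return "integer"
    if all(_converts(float, v) for v in values):
        return "float"
    if all(_converts(datetime.fromisoformat, v) for v in values):
        return "date"
    return "string"
```

The narrowest type is tried first, so a column of whole numbers is reported as integer rather than floating-point.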
[0163] Step A8: Analyze the content sample
[0164] Perform statistical analysis on the field content to extract the numerical range, character length distribution, null value rate, number of unique values, common values and their frequency.
[0165] Step A9: Compute the feature vector
[0166] The extracted structural features, content features, semantic features, temporal features, and association features are quantized and encoded to generate fixed-dimensional feature vectors with a dimension of 150.
[0167] The dimensions of the feature vector are allocated as follows: structural features account for 30 dimensions, including the number of fields, field type distribution, and nesting level; content features account for 40 dimensions, including numerical range, character length, and null value rate; semantic features account for 30 dimensions, including domain term matching degree; temporal features account for 30 dimensions, including time granularity and sampling frequency; and association features account for 20 dimensions, including the correlation coefficient between fields.
[0168] The feature vectors are normalized so that the values of each dimension are uniformly between 0 and 1.
[0169] Step A10: Match the data source template
[0170] The generated feature vectors are compared with the feature vectors of known data source templates stored in the metadata database. The similarity calculation uses the cosine similarity method to calculate the cosine value of the angle between the two feature vectors.
[0171] The calculation formula is:
[0172] $$S = \frac{\sum_{i=1}^{n} x_i y_i}{\sqrt{\sum_{i=1}^{n} x_i^2}\,\sqrt{\sum_{i=1}^{n} y_i^2}}$$
[0173] where $S$ is the similarity; $n$ is the dimension of the feature vector, with a value of 150; $x_i$ is the $i$-th component of the feature vector of the data to be identified; and $y_i$ is the $i$-th component of the template feature vector.
[0174] Step A11: Determine the matching degree
[0175] The data source template with the highest similarity value is selected as the matching result. If the similarity is greater than 0.75, the data source type is determined; if the similarity is less than 0.75, it is determined to be an unknown data source type.
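Steps A10 and A11 can be sketched as a cosine similarity search followed by the 0.75 threshold test. The template store is represented here by a plain dictionary, which is an assumption made for illustration; `match_template` is a hypothetical name. The text specifies 150 dimensions, but the functions work for any common vector length.

```python
import math

MATCH_THRESHOLD = 0.75  # threshold stated in step A11


def cosine_similarity(x, y):
    """Cosine of the angle between two equal-length vectors."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same dimension")
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    if norm_x == 0 or norm_y == 0:
        return 0.0  # treat an all-zero vector as matching nothing
    return dot / (norm_x * norm_y)


def match_template(features, templates):
    """Return (template_name, similarity), or (None, best_similarity) if unknown.

    templates maps a template name to its feature vector.
    """
    best_name, best_sim = None, 0.0
    for name, vector in templates.items():
        sim = cosine_similarity(features, vector)
        if sim > best_sim:
            best_name, best_sim = name, sim
    if best_sim > MATCH_THRESHOLD:
        return best_name, best_sim
    return None, best_sim
```

A result of `None` corresponds to the unknown data source type of step A13, which requires manual annotation.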
[0176] Step A12: Determine the data source type
[0177] For a successfully matched data source, load the corresponding data pattern description from the metadata database.
[0178] Step A13: Mark as unknown type
[0179] For data sources that fail to match, mark them as unknown types, requiring manual annotation and the creation of new templates.
[0180] Step A14: Extract Data Pattern
[0181] Extract complete data schema information, including the field list, field name, field type, business semantics, constraints, and example values for each field.
[0182] Step A15: Generate schema description document
[0183] All extracted pattern information is organized into a structured pattern description document, which includes data source identifier, data source type, data source description, field list, relationship list, and extraction time.
[0184] Step A16: Store in metadata database
[0185] The generated schema description documents are stored in the metadata management module's metadata database, establishing a mapping relationship between data source identifiers and schema descriptions, and version management is performed.
[0186] IV. Data Standardization and Conversion
[0187] As shown in Figure 4, the data standardization transformation processing architecture includes an input layer, a rule engine layer, a transformation processing layer, and an output layer.
[0188] The input layer consists of three elements:
[0189] 1. Source Data Pattern: A description of the data source pattern obtained from the pattern extraction module.
[0190] 2. Target data standard: Predefined unified data standard, including standard field definitions, data type specifications, and unit specifications.
[0191] 3. Conversion rule base: Stores configuration files or database tables for various conversion rules.
[0192] The rules engine layer is responsible for loading, parsing, and executing transformation rules. The workflow of the rules engine is as follows:
[0193] Step B1: Load conversion rules
[0194] Based on the data source type identifier of the source data, the corresponding conversion rule set is retrieved from the conversion rule library. The conversion rules are stored in the form of configuration files and include field mapping rules, type conversion rules, value range conversion rules, unit conversion rules, format conversion rules, and calculation derivation rules.
[0195] Step B2: Parse the conversion rules
[0196] The content of the transformation rules is parsed and a rule tree is constructed. The nodes of the rule tree include rule type, rule parameters, rule conditions, and rule operations.
[0197] Step B3: Match applicable rules
[0198] Based on the fields and data characteristics currently being processed, the applicable transformation rules are matched. If multiple matching rules exist, they are sorted according to rule priority.
[0199] Step B4: Perform the conversion operation
[0200] Perform conversion operations according to the defined rules, including field value retrieval, type conversion, value mapping, unit conversion, and format adjustment.
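Steps B3 and B4 can be illustrated with a very small rule matcher. The sketch assumes that every matching rule is applied in descending priority order and that a larger number means higher priority; the text only states that matching rules are sorted by priority. The `Rule` type and `apply_rules` function are hypothetical and do not reflect any particular rule engine.

```python
from dataclasses import dataclass
from typing import Any, Callable


@dataclass
class Rule:
    """One transformation rule (hypothetical structure)."""
    name: str
    priority: int                          # assumption: larger runs first
    applies: Callable[[str, Any], bool]    # rule condition
    operation: Callable[[Any], Any]        # rule operation


def apply_rules(field, value, rules):
    """Match rules against the field, sort by priority, and apply them in turn."""
    matched = sorted(
        (rule for rule in rules if rule.applies(field, value)),
        key=lambda rule: rule.priority,
        reverse=True,
    )
    applied = []
    for rule in matched:
        value = rule.operation(value)
        applied.append(rule.name)  # names can feed the transformation log
    return value, applied
```

For example, a type conversion rule with priority 10 and a unit conversion rule with priority 5 would turn the string "110000" into the number 110.0, in that order.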
[0201] The transformation processing layer includes a mapping converter, a data cleaner, and a transformation logger.
[0202] The mapping converter performs the following functions:
[0203] 1. Field mapping: Based on the field mapping rules, the source field is mapped to the target field. The mapping relationship includes one-to-one mapping, many-to-one mapping, and one-to-many mapping.
[0204] 2. Type Conversion: Based on the data type and type conversion rules of the target field, the source field value is converted to the target type. Type conversion includes string to number, number to string, and date and time format conversion.
[0205] 3. Format Conversion: Unify data formats, including date and time formats to ISO 8601 format, character encoding to UTF-8 encoding, and decimal places to 2 digits.
[0206] 4. Encoding Conversion: Perform encoding conversion on enumeration type fields, for example, convert "on" to 1 and "off" to 0.
[0207] 5. Unit conversion: Convert physical quantities to the standard unit, for example converting voltage from volts to kilovolts with a conversion factor of 0.001.
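The five mapping converter functions can be shown together on a single record. The source field names (`dev_no`, `volt`, `sw`, `ts`), the target field names, and the source timestamp layout are hypothetical, and the sketch assumes the source timestamp is already in UTC. The "on"/"off" codes, the factor 0.001, the two decimal places, and the ISO 8601 output come from the list above.

```python
from datetime import datetime, timezone

SWITCH_CODES = {"on": 1, "off": 0}   # encoding conversion from the text
VOLT_TO_KILOVOLT = 0.001             # unit conversion factor from the text
SOURCE_TIME_FORMAT = "%Y/%m/%d %H:%M:%S"  # hypothetical source layout


def convert_record(source):
    """Map one source record to the (hypothetical) standard field layout."""
    parsed = datetime.strptime(source["ts"], SOURCE_TIME_FORMAT)
    return {
        # field mapping: dev_no -> equipment_id
        "equipment_id": str(source["dev_no"]).strip(),
        # type conversion (string to number), unit conversion, 2 decimal places
        "voltage_kv": round(float(source["volt"]) * VOLT_TO_KILOVOLT, 2),
        # encoding conversion of an enumerated field
        "switch_state": SWITCH_CODES[source["sw"].strip().lower()],
        # format conversion to ISO 8601, assuming the input is UTC
        "timestamp": parsed.replace(tzinfo=timezone.utc).isoformat(),
    }
```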
[0208] Data cleaners perform cleaning processes on the transformed data:
[0209] 1. Null value handling: Based on the field's non-null constraint, decide whether to fill in the default value. For required fields, if the converted value is null, fill in the predefined default value.
[0210] 2. Duplicate value removal: For fields that require uniqueness, check for duplicate values; if any are found, mark them as exceptions.
[0211] 3. Outlier Correction: Based on the field's value range constraints, detect whether there are outliers that exceed the range. For numerical outliers, if the deviation is within 10%, correct it to the range boundary value; if the deviation exceeds 10%, mark it as an outlier and log it.
[0212] 4. Format standardization: unify character encoding, unify line breaks, and remove special characters.
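The null value handling and outlier correction rules can be sketched as below. The text does not state what the 10% deviation is measured against; this sketch assumes it is measured relative to the width of the allowed range, which is only one possible interpretation. `clean_value` and its status strings are hypothetical.

```python
def clean_value(value, low, high, default=None, tolerance=0.10):
    """Return (cleaned_value, status) for one numerical field value.

    Status is one of "ok", "filled_default", "null", "clamped", "outlier".
    """
    if high <= low:
        raise ValueError("high must be greater than low")
    if value is None:
        if default is not None:
            return default, "filled_default"
        return None, "null"
    if low <= value <= high:
        return value, "ok"
    bound = low if value < low else high
    # Assumption: deviation is expressed as a fraction of the range width.
    deviation = abs(value - bound) / (high - low)
    if deviation <= tolerance:
        return bound, "clamped"   # small overshoot: correct to the boundary
    return value, "outlier"       # large overshoot: keep the value and flag it
```

With a range of 0 to 120, a reading of 125 is corrected to 120, while a reading of 150 is kept and flagged so that it can be logged.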
[0213] The transformation logger records detailed information about the transformation process, including transformation details for each field, application of transformation rules, success or failure status of the transformation, and reasons for any exceptions. The transformation log is used for subsequent data quality analysis and problem tracing.
[0214] The output layer includes:
[0215] 1. Standardized data: The converted and cleaned data conforms to the format and requirements of a unified data standard.
[0216] 2. Transformation Log: Records detailed information about the transformation process, supporting data quality analysis and problem tracing.
[0217] V. Multi-dimensional data fusion processing
[0218] As shown in Figure 5, the complete process of multi-dimensional data fusion processing is as follows:
[0219] Step C1: Input standardized data
[0220] Receive standardized data from the data conversion module. The standardized data has already undergone format conversion and data cleaning.
[0221] Step C2: Entity Recognition
[0222] Extract entity identifier features from data records. Entity identifiers include equipment number, line name, and substation ID. Entity identifiers are standardized by removing spaces and unifying capitalization.
[0223] Step C3: Construct the entity candidate set
[0224] Data records with the same or similar entity identifiers are grouped into a single entity candidate set. Data records in the same entity candidate set may come from different data sources and describe the same entity.
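Steps C2 and C3 can be illustrated as follows. The sketch groups only records whose identifiers are identical after standardization; grouping of merely similar identifiers is left out. The field name `equipment_id` and both function names are hypothetical.

```python
from collections import defaultdict


def normalize_identifier(raw):
    """Remove all whitespace and unify capitalization."""
    return "".join(raw.split()).upper()


def build_candidate_sets(records, id_field="equipment_id"):
    """Group records by their standardized entity identifier."""
    groups = defaultdict(list)
    for record in records:
        groups[normalize_identifier(record[id_field])].append(record)
    return dict(groups)
```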
[0225] Step C4: Calculate the data similarity matrix
[0226] For N data records in the entity candidate set, calculate the comprehensive similarity between each pair and construct an N-by-N similarity matrix. The elements of the similarity matrix represent the similarity between the i-th record and the j-th record.
[0227] The formula for calculating the overall similarity is:
[0228] ;
[0229] The identifier similarity $S_{id}$ is calculated as follows:
[0230] $$S_{id} = 1 - \frac{d(a, b)}{\max(|a|, |b|)}$$
[0231] where $d(a, b)$ is the edit distance between the two identifier strings $a$ and $b$, and $|a|$ and $|b|$ are the lengths of the two strings.
[0232] The attribute similarity $S_{attr}$ is calculated as follows:
[0233] $$S_{attr} = \frac{\mathbf{u} \cdot \mathbf{v}}{\lVert \mathbf{u} \rVert \, \lVert \mathbf{v} \rVert}$$
[0234] where $\mathbf{u}$ and $\mathbf{v}$ are the attribute vectors of the two records, $\cdot$ denotes the vector dot product, and $\lVert \mathbf{u} \rVert$ and $\lVert \mathbf{v} \rVert$ are the magnitudes of the vectors.
[0235] The spatiotemporal correlation combines the temporal and spatial correlations:
[0236] ;
[0237] where the temporal correlation is calculated from the timestamp difference and the spatial correlation is calculated from the geographical distance.
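Two of the similarity components in step C4 can be sketched directly. The identifier similarity below assumes that the edit distance is normalized by the length of the longer string, and the attribute similarity is the cosine of the angle between the attribute vectors. The function names are hypothetical. The weighting that combines the components into the comprehensive similarity is not shown here.

```python
import math


def edit_distance(a, b):
    """Levenshtein distance between two strings (row-by-row dynamic programming)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def identifier_similarity(a, b):
    """1 minus the edit distance divided by the longer length (assumed form)."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - edit_distance(a, b) / longest


def attribute_similarity(u, v):
    """Cosine similarity of two equal-length attribute vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)
```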
[0238] Step C5: Entity Clustering
[0239] Based on the similarity matrix, a clustering algorithm is used to group data records with high similarity into one class. The clustering algorithm is the density-based DBSCAN algorithm, with a similarity threshold of 0.8. Records with similarity greater than 0.8 are classified into the same neighborhood.
[0240] Step C6: Determine if it is time series data
[0241] Determine if the data record contains a timestamp field. If it does, perform time-series alignment; otherwise, perform conflict detection directly.
[0242] Step C7: Time-Series Alignment
[0243] For time-series data, timestamp alignment is performed. The detailed steps for time-series alignment include:
[0244] 1. Determine the time base: Select a time granularity of 5 minutes as the base.
[0245] 2. Timestamp standardization: The time zone is unified to UTC time, and the time format is unified to ISO 8601 format.
[0246] 3. Missing point identification: Compare the timestamps of each data source with the baseline time series to identify missing time points.
[0247] 4. Interpolation: For missing time points, linear interpolation is used to estimate the data. The linear interpolation formula is:
[0248] $$v = v_1 + \frac{(v_2 - v_1)(t - t_1)}{t_2 - t_1}$$
[0249] where $v$ is the estimated value at the missing point; $v_1$ and $v_2$ are the values of the previous and next points, respectively; and $t$, $t_1$, and $t_2$ are the corresponding timestamps.
[0250] 5. Downsampling: Downsample data with a sampling frequency higher than the baseline granularity, and calculate the average value within each 5-minute interval as the value for that interval.
[0251] 6. Synchronization Alignment: Synchronize and align the time series data from each data source according to the baseline time series to generate a time-aligned data matrix.
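The downsampling and interpolation steps can be sketched with timestamps expressed as integer epoch seconds. Labeling each 5-minute interval by its start time is an assumption, and the function names are hypothetical. Missing points that lack a known neighbor on either side are left unfilled.

```python
from collections import defaultdict

STEP_SECONDS = 300  # 5-minute baseline granularity from the text


def linear_interpolate(t, t1, v1, t2, v2):
    """Estimate the value at time t between (t1, v1) and (t2, v2)."""
    return v1 + (v2 - v1) * (t - t1) / (t2 - t1)


def downsample(points, step=STEP_SECONDS):
    """Average (timestamp, value) points within each interval of length step."""
    buckets = defaultdict(list)
    for timestamp, value in points:
        buckets[timestamp - timestamp % step].append(value)
    return {start: sum(vals) / len(vals) for start, vals in sorted(buckets.items())}


def fill_missing(series, start, end, step=STEP_SECONDS):
    """Fill missing baseline points by linear interpolation between neighbors."""
    known = sorted(series)
    filled = dict(series)
    for t in range(start, end + 1, step):
        if t in filled:
            continue
        before = [k for k in known if k < t]
        after = [k for k in known if k > t]
        if before and after:
            t1, t2 = before[-1], after[0]
            filled[t] = linear_interpolate(t, t1, series[t1], t2, series[t2])
    return dict(sorted(filled.items()))
```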
[0252] Step C8: Conflict Detection
[0253] For multiple data records belonging to the same entity, conflict detection is performed by comparing the attribute values of each attribute field across different data sources.
[0254] 1. For numerical attributes, calculate the statistical characteristics of each data source value, including the mean and standard deviation, and determine whether each data source value is within a reasonable range. The reasonable range is defined as the mean plus or minus three times the standard deviation. Calculate the differences between pairs of values. If the relative error exceeds 5%, it is determined to be a numerical conflict.
[0255] 2. For enumerated attributes, compare the values from different data sources to see if they are consistent. If they are inconsistent, it is determined that there is a value conflict.
[0256] 3. For timestamps, compare the differences between timestamps from different data sources. If the difference exceeds 1 minute, it is determined to be a time conflict.
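The pairwise checks above can be sketched as three small predicates. The text does not define the denominator of the relative error; this sketch assumes the larger of the two magnitudes. Timestamps are assumed to be epoch seconds, and the three-standard-deviation range check is omitted. All function names are hypothetical.

```python
from itertools import combinations


def numerical_conflict(values, max_relative_error=0.05):
    """True if any pair of values differs by more than 5% (assumed denominator)."""
    for a, b in combinations(values, 2):
        reference = max(abs(a), abs(b))
        if reference == 0:
            continue  # both values are zero, so they agree
        if abs(a - b) / reference > max_relative_error:
            return True
    return False


def enumerated_conflict(values):
    """True if the sources do not all report the same enumerated value."""
    return len(set(values)) > 1


def time_conflict(timestamps, max_gap_seconds=60):
    """True if the source timestamps are spread over more than 1 minute."""
    return max(timestamps) - min(timestamps) > max_gap_seconds
```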
[0257] Step C9: Determine if a conflict exists.
[0258] If a conflict is detected, the conflict resolution process is executed; if no conflict exists, the data is merged directly.
[0259] Step C10: Analyze the conflict type
[0260] The specific types of conflicts are analyzed, including numerical conflicts, value conflicts, time conflicts, and integrity conflicts. Different conflict types are resolved using different strategies.
[0261] Step C11: Calculate the credibility of the data source
[0262] The credibility of each data source is assessed using the following formula:
[0263] $$C = w_1 A + w_2 I + w_3 T + w_4 U$$
[0264] where $C$ is the credibility of the data source; $A$ is the historical accuracy, calculated by statistically analyzing the consistency between historical data from this data source and standard data, with a value ranging from 0 to 1; $I$ is the integrity score, calculated from the completeness of the data provided by the data source; $T$ is the timeliness score, calculated from the data update frequency and latency; $U$ is the authority score, assigned according to the official recognition level of the data source; and $w_1$ to $w_4$ are the weighting coefficients specified in step S12.
[0265] Step C12: Apply conflict resolution strategies
[0266] Based on the conflict type and the credibility of the data source, select an appropriate conflict resolution strategy and execute the resolution operation.
[0267] Step C13: Determine if manual intervention is needed.
[0268] For conflicts that cannot be resolved automatically, determine whether manual intervention is required. If the reliability of each data source is comparable and the values differ significantly, manual processing is required; otherwise, generate fused data based on the automatic resolution results.
[0269] Step C14: Manual intervention
[0270] Mark the conflict as pending manual review and submit it to a human for judgment and processing. After manual processing, update the fusion results and the data source credibility.
[0271] Step C15: Merge and generate fused data
[0272] The resolved attribute values are assembled into the final fused data record, which includes entity identifier, fused timestamp, fused value of each attribute, data source identifier of each attribute, and credibility of each attribute.
[0273] Step C16: Calculate the quality score of the fused data
[0274] The quality score of the fused data is calculated using the following formula:
[0275] ;
[0276] where the quality score ranges from 0 to 100; the consistency score is calculated from the number of conflicts and the success rate of conflict resolution; the completeness score is calculated from the field fill rate; and the credibility score is calculated from the average credibility of the data sources.
[0277] Step C17: Update the unified data model
[0278] The merged data is organized and stored according to a unified data model. If a new entity is added, a new entity record is created. If an entity is updated, the attribute values of the corresponding entity are updated.
[0279] VI. Component-based architecture
[0280] As shown in Figure 6, this invention adopts a component-based architecture, and the system is divided into five core components:
[0281] 1. Data Acquisition Component
[0282] The data acquisition component is responsible for collecting raw data from heterogeneous data sources. Internally, it includes multiple protocol adapters: a JDBC adapter for relational database data acquisition, an HTTP adapter for RESTful API data acquisition, an FTP adapter for file data acquisition, and an MQTT adapter for IoT device data acquisition.
[0283] The acquisition component also includes a data cache queue for temporarily storing the raw data to be acquired, supporting batch processing and streaming processing. The scheduler is responsible for triggering acquisition tasks on a timed basis, supporting acquisition methods based on time intervals (such as every 5 minutes) or event-driven methods.
[0284] 2. Identification Components
[0285] The identification component is responsible for automatically identifying the data source type and extracting data patterns. It includes a feature extraction unit, responsible for extracting multi-dimensional features from the raw data; a pattern matching unit, responsible for matching feature vectors with data source templates; and a metadata management unit, responsible for storing and managing data source templates and pattern description documents.
[0286] The input to the identification component is raw data, and the output is a data source type identifier and a pattern description document.
[0287] 3. Conversion Components
[0288] The transformation component is responsible for converting the original data into a standard format according to the transformation rules. The transformation component includes a rule parsing and execution unit, which is responsible for loading, parsing and executing transformation rules; a field mapping and transformation unit, which is responsible for performing field mapping and data type conversion; a data quality detection unit, which is responsible for detecting the quality of the transformed data; and a cleaning and processing unit, which is responsible for cleaning and standardizing the data.
[0289] The input to the transformation component is raw data and a schema description document, and the output is standardized data.
[0290] 4. Fusion Components
[0291] The fusion component is responsible for the fusion processing of multi-source heterogeneous data. The fusion component includes a similarity calculation unit, which is responsible for calculating the comprehensive similarity between data records; a time sequence alignment unit, which is responsible for aligning time sequence data with timestamps; a conflict detection and resolution unit, which is responsible for detecting data conflicts and applying resolution strategies; and a fusion quality assessment unit, which is responsible for assessing the quality of the fused data.
[0292] The input to the fusion component is standardized data, and the output is fused data.
[0293] 5. Service Components
[0294] The service components are responsible for providing unified data access services to the outside world. The service components include a data query unit, which provides condition-based data query functions; a data subscription unit, which supports clients to subscribe to data change events; and an interface management unit, which manages the registration, authentication, and invocation of RESTful API interfaces.
[0295] The input to the service component is fused data, and the output is a data service interface.
[0296] The components communicate with each other through message passing or interface calls. The acquisition component and the recognition component exchange raw data through a data queue. The recognition component and the transformation component exchange pattern description documents through interface calls. The transformation component and the fusion component exchange standardized data through a data queue. The fusion component and the service component share fused data through a data storage layer.
[0297] The system also includes two auxiliary components: a configuration center and a monitoring center.
[0298] The configuration center centrally manages the configuration information of each component, including component parameter configurations such as collection frequency and cache size, conversion rule configurations such as field mapping rules and value domain conversion rules, and data model configurations such as entity definitions and relationship definitions of the unified data model. The configuration center supports dynamic configuration updates without requiring service restarts.
[0299] The monitoring center is responsible for the performance monitoring, log collection, and anomaly alarms of each component. The monitoring center collects the operating indicators of each component, such as acquisition rate, conversion success rate, and fusion time. It also collects the log information of each component, including operation logs, error logs, and audit logs. When anomalies are detected, such as acquisition failure, conversion error, or unresolved fusion conflicts, the monitoring center issues an alarm notification in a timely manner.
[0300] The overall architecture adopts a microservice design concept, where each component can be deployed and scaled independently. It supports containerized deployment, with each component packaged as a Docker image and orchestrated and managed through Kubernetes. It also supports horizontal scaling, dynamically increasing the number of component instances based on data volume and processing load.
[0301] Example 2
[0302] This embodiment provides a specific implementation of a fusion processing component for heterogeneous power grid data.
[0303] This component is developed in Java and built on the Spring Boot framework to create a microservice architecture. Each functional component is encapsulated as an independent microservice, and service governance is performed through Spring Cloud.
[0304] Data acquisition module implementation
[0305] The data acquisition module uses the Apache Camel framework to achieve multi-protocol adaptation. For relational databases, the JDBC component is used to periodically execute SQL query statements to collect data. For HTTP interfaces, the HTTP component is used to periodically call RESTful APIs to obtain data. For file data, the File component is used to monitor file changes in a specified directory and read file contents.
[0306] The collected raw data is temporarily stored in a Redis cache queue. A producer-consumer pattern is adopted, where the collection module acts as the producer and writes data into the queue, and the recognition module acts as the consumer and reads data from the queue.
[0307] The scheduling management uses the Quartz framework to implement timed task scheduling and supports configuring the collection time using Cron expressions.
[0308] Data source identification module implementation
[0309] The feature extraction of the data source identification module adopts natural language processing technology. For the semantic feature extraction of field names, the Word2Vec word embedding model is used to convert the field names into vector representations, and then the similarity is calculated with the term vectors in the power grid domain terminology dictionary.
[0310] Template matching employs vector retrieval technology, utilizing the Faiss vector retrieval library to achieve efficient similarity search. It pre-indexes the feature vectors of all data source templates, enabling rapid retrieval of highly similar templates during queries.
[0311] Metadata management stores the schema description documents in a PostgreSQL database and uses the JSONB data type to hold semi-structured schema information, supporting flexible querying and updating.
[0312] Data conversion module implementation
[0313] The data transformation module's rule engine is implemented using the Drools rule engine. The transformation rules are written in DRL (Drools Rule Language), and the rule files are stored in the configuration center, supporting dynamic loading at runtime.
[0314] Field mapping is implemented using an expression engine, which uses the Apache Commons JEXL expression language to support complex field combinations and calculations. For example, the target field "full name" can be generated from the source fields "last name" and "first name" using the expression "last name plus first name".
[0315] Data cleaning uses the Apache Commons Validator framework for data validation, defining validation rules including non-empty validation, length validation, range validation, and regular expression validation.
[0316] Data fusion module implementation
[0317] The similarity calculation of the data fusion module adopts a distributed computing framework. For large-scale datasets, Apache Spark is used for parallel computing. After partitioning the data, the similarity matrix is calculated in parallel on multiple computing nodes.
[0318] The clustering algorithm is implemented using DBSCAN from the Scikit-learn machine learning library. The similarity matrix is converted to a distance matrix (distance = 1 - similarity) and supplied as the input, so the similarity threshold of 0.8 corresponds to a distance threshold of 0.2.
[0319] The time-aligned interpolation calculations are implemented using interpolation functions from the Apache Commons Math library. Linear interpolation uses the LinearInterpolator class, and spline interpolation uses the SplineInterpolator class.
[0320] The credibility calculation for conflict resolution establishes a credibility assessment model. A logistic regression model is trained using historical data to predict the credibility of the data source. The input features of the model include historical accuracy, completeness, timeliness, and authority, and the output is a credibility score between 0 and 1.
[0321] Data service module implementation
[0322] The data service module uses the Spring MVC framework to implement the RESTful API interface. The query interface supports multiple combinations of query conditions and uses Spring Data JPA to build dynamic queries.
[0323] Data subscription uses WebSocket technology to achieve real-time push. The client connects to the server via WebSocket and subscribes to the data topics of interest. When new fused data is generated, the server pushes the data change event to the subscribing client via WebSocket connection.
[0324] Interface management uses the Spring Security framework for interface authentication and JWT (JSON Web Token) for user authentication and authorization.
[0325] Configuration center implementation
[0326] The configuration center is implemented using Spring Cloud Config. The configuration files are stored in a Git repository and support version control. Each component pulls the configuration information from the configuration center when it starts up and dynamically refreshes the configuration through Spring Cloud Bus.
[0327] Monitoring center implementation
[0328] The monitoring center uses Spring Boot Actuator to expose the component's runtime metrics, uses Prometheus to collect metric data, and uses Grafana for visualization.
[0329] Log collection uses the ELK (Elasticsearch, Logstash, Kibana) technology stack. Logs from each component are output to Logstash, which parses the logs and stores them in Elasticsearch. Kibana is then used for log querying and analysis.
[0330] Anomaly alerts are implemented using Alertmanager. When Prometheus detects anomalies, it triggers alert rules and sends alert notifications via Alertmanager, supporting multiple alert methods including email, SMS, and DingTalk.
[0331] Example 3
[0332] This embodiment illustrates the application effect of the present invention in a real power grid project.
[0333] A provincial power grid company has 10 municipal power supply bureaus. The information systems of these bureaus were built at different times and with different technical approaches, resulting in different data formats and standards. The provincial company needs to integrate the power grid data across the province and establish a unified data platform to support power grid dispatching, load forecasting, and equipment management applications.
[0334] Problems existing before application
[0335] 1. Numerous data sources: including SCADA systems, power distribution automation systems, GIS systems, and marketing systems from 10 cities, totaling 40 heterogeneous data sources.
[0336] 2. Inconsistent data formats: Some cities use Oracle databases, some use MySQL databases, some use SQL Server databases, some systems provide WebService interfaces, and some systems can only export Excel files.
[0337] 3. Inconsistent data standards: Equipment numbering rules vary from city to city, voltage level representation methods differ (some use kilovolts, some use volts), and timestamp formats differ.
[0338] 4. Inconsistent data quality: Some data sources have complete and accurate data, while others contain a large number of missing and outlier values.
[0339] Application of the present invention
[0340] Step 1: Deploy the fusion processing component
[0341] The fusion processing component of this invention is deployed in the provincial company's data center. It adopts a containerized deployment method and uses Kubernetes for orchestration and management. Based on the data volume and processing load, 3 collection component instances, 2 identification component instances, 3 conversion component instances, 2 fusion component instances, and 2 service component instances are deployed.
[0342] Step 2: Configure data source connection
[0343] Configure the connection information for 40 heterogeneous data sources in the configuration center, including the database connection string, WebService address, and file path. Configure the collection frequency to collect real-time data every 5 minutes and historical data once every morning.
[0344] Step 3: Automatically identify the data source
[0345] The system automatically identified the 40 data sources: 35 matched known templates, and 5 were marked as unknown types. The 5 unknown data sources were manually annotated to generate new data source templates, which were then added to the template library.
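The template matching in this step can be sketched as follows. This is a minimal illustration, not the invention's implementation: the 0.75 threshold is the one stated in the claims, but the use of cosine similarity for template matching, the template names, and the 4-dimensional toy vectors are assumptions made for the example (the claims describe feature vectors of 100 to 200 dimensions).

```python
import math

SIMILARITY_THRESHOLD = 0.75  # threshold stated in the claims


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors (assumed matching measure)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def identify_source(feature_vector, templates, threshold=SIMILARITY_THRESHOLD):
    """Return (template_name, similarity), or ("unknown", best_similarity)
    when no template exceeds the threshold."""
    best_name, best_sim = "unknown", 0.0
    for name, template_vector in templates.items():
        sim = cosine_similarity(feature_vector, template_vector)
        if sim > best_sim:
            best_name, best_sim = name, sim
    if best_sim > threshold:
        return best_name, best_sim
    return "unknown", best_sim


# Hypothetical template library with toy feature vectors.
TEMPLATES = {
    "scada_oracle": [1.0, 0.0, 1.0, 0.0],
    "gis_mysql": [0.0, 1.0, 0.0, 1.0],
}

matched = identify_source([1.0, 0.0, 0.9, 0.1], TEMPLATES)
unmatched = identify_source([1.0, 1.0, 1.0, 1.0], TEMPLATES)
```

In this sketch, a source reported as "unknown" would be routed to manual annotation, and the resulting vector would be added to `TEMPLATES` as a new entry.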
[0346] Step 4: Configure conversion rules
[0347] Based on the provincial company's unified data standards, conversion rules were configured. To address the issue of inconsistent equipment numbers, field mapping rules were configured to uniformly map the equipment number field from various cities to the standard field "Equipment ID". To address the issue of inconsistent voltage level units, unit conversion rules were configured to convert voltage values in volts to kilovolts. To address the issue of different timestamp formats, format conversion rules were configured to uniformly convert them to the ISO 8601 format.
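The three kinds of rules described above (field mapping, unit conversion, and format conversion) can be sketched as configuration plus a small converter. The city names, source field names, and timestamp formats below are hypothetical; only the target field "Equipment ID", the volts-to-kilovolts conversion, and the ISO 8601 target format come from the text. Time zone handling is omitted for brevity.

```python
from datetime import datetime

# Hypothetical per-city rule configuration (illustrative names and formats).
RULES = {
    "city_a": {
        "field_map": {"dev_no": "Equipment ID", "volt": "Voltage", "ts": "Timestamp"},
        "voltage_unit": "V",
        "timestamp_format": "%Y/%m/%d %H:%M:%S",
    },
    "city_b": {
        "field_map": {"equip_code": "Equipment ID", "voltage_kv": "Voltage", "time": "Timestamp"},
        "voltage_unit": "kV",
        "timestamp_format": "%d-%m-%Y %H:%M",
    },
}


def convert_record(record, rule):
    """Apply field mapping, unit conversion, and format conversion to one record."""
    # Field mapping: rename source fields to the standard field names.
    out = {rule["field_map"][k]: v for k, v in record.items() if k in rule["field_map"]}
    # Unit conversion: express every voltage in kilovolts.
    voltage = float(out["Voltage"])
    out["Voltage"] = voltage / 1000.0 if rule["voltage_unit"] == "V" else voltage
    # Format conversion: parse the local format and emit ISO 8601.
    parsed = datetime.strptime(out["Timestamp"], rule["timestamp_format"])
    out["Timestamp"] = parsed.isoformat()
    return out


record_a = convert_record(
    {"dev_no": "A-001", "volt": "110000", "ts": "2025/01/02 03:04:05"}, RULES["city_a"]
)
record_b = convert_record(
    {"equip_code": "B-7", "voltage_kv": "10.5", "time": "02-01-2025 03:04"}, RULES["city_b"]
)
```

Because the rules are plain data, adapting to a change in one city's standard means editing that city's entry in `RULES` rather than the converter code.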
[0348] Step 5: Perform data fusion
[0349] The system automatically collects data from the various data sources, performs standardized conversion, and then performs fusion processing. For substation equipment, the system identifies the records from the SCADA system, GIS system, and equipment ledger system that describe the same piece of equipment by calculating their similarity and clustering them. When conflicts are detected between equipment parameters from different data sources, the system selects the real-time data from the SCADA system as the fusion result based on the credibility assessment, because the SCADA system's historical accuracy rate of 98% gives it the highest credibility.
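The conflict handling in this step can be illustrated with a short sketch of a credibility-first selection for one numeric parameter. The credibility of 0.98 for SCADA reflects the 98% historical accuracy mentioned above; the other credibility values, the field name `rated_capacity_mva`, and the tolerance-based conflict test are assumptions made for the example.

```python
# SCADA's value follows the 98% accuracy in the text; the others are hypothetical.
CREDIBILITY = {"SCADA": 0.98, "GIS": 0.90, "LEDGER": 0.85}


def detect_conflict(records, field, tolerance=0.0):
    """Report a conflict when the known numeric values differ by more than `tolerance`."""
    values = [r[field] for r in records if r.get(field) is not None]
    if len(values) < 2:
        return False
    return max(values) - min(values) > tolerance


def fuse_field(records, field, credibility, tolerance=0.0):
    """Return the agreed value, or the value from the most credible source on conflict."""
    candidates = [r for r in records if r.get(field) is not None]
    if not candidates:
        return None
    if not detect_conflict(candidates, field, tolerance):
        return candidates[0][field]
    best = max(candidates, key=lambda r: credibility.get(r["source"], 0.0))
    return best[field]


# Records already clustered as describing the same piece of equipment.
records = [
    {"source": "SCADA", "rated_capacity_mva": 50.0},
    {"source": "GIS", "rated_capacity_mva": 63.0},
    {"source": "LEDGER", "rated_capacity_mva": None},
]
fused_value = fuse_field(records, "rated_capacity_mva", CREDIBILITY)
```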
[0350] Step 6: Provide data services
[0351] The system organizes the fused data according to a unified data model and provides services externally through a RESTful API. The provincial company's power grid dispatching system, load forecasting system, and equipment management system can access the fused data through the API, eliminating the need to connect to each city's data sources separately.
[0352] Application effect
[0353] 1. Significantly improved data integration efficiency: The time from data source access to data availability has been shortened from an average of 2 weeks to 2 days. When a new data source is added, the system can automatically identify and quickly access it without writing special adaptation code.
[0354] 2. Significantly improved data quality: The completeness of the fused data increased from 75% to 95%, and the accuracy increased from 80% to 92%. Through multi-source data fusion and conflict resolution, data errors and inconsistencies were effectively reduced.
[0355] 3. Significantly reduced system maintenance costs: When the data standards of a certain city change, only the conversion rule configuration needs to be modified to adapt, without modifying the program code. The number of system maintenance personnel has been reduced from 8 to 2.
[0356] 4. Enhanced business application support capabilities: Unified data services make business application development more convenient, shorten the development cycle of new business systems by 30%, and improve data consistency and accuracy, making data-based analysis and decision-making more reliable.
[0357] In summary, the method and components for fusion processing of heterogeneous power grid data provided by this invention effectively solve the technical challenges of integrating heterogeneous data in power grid systems through adaptive data source identification, dynamic transformation based on rule engines, multi-dimensional data fusion algorithms, and a component-based system architecture, providing strong technical support for the digital transformation of power grids.
Claims
1. A method for fusing heterogeneous power grid data, characterized in that it includes the following steps:
S1. Receive raw data from multiple heterogeneous data sources, including a SCADA system, a load management system, a power distribution automation system, a marketing system, a GIS system, and a meteorological system;
S2. Automatically identify each heterogeneous data source, including: extract sample data from the raw data, the number of sample data records being 100; extract structural features, content features, semantic features, temporal features, and correlation features from the sample data; quantize and encode the extracted features to generate a feature vector; match the feature vector against the pre-stored data source template feature vectors for similarity, and determine the data source type when the similarity is greater than 0.75;
S3. Extract data schema information from the data source, including field names, field types, field constraints, and field semantics;
S4. Based on the data schema information and preset conversion rules, perform standardized conversion on the raw data to generate standardized data, the standardized conversion including field mapping, type conversion, value range conversion, unit conversion, and format conversion;
S5. Perform fusion processing on the standardized data from different data sources, including: extract entity identification features from the data records; calculate the overall similarity between different data records as follows:
$$Sim = w_1 \cdot Sim_{id} + w_2 \cdot Sim_{attr} + w_3 \cdot Corr_{st}$$
where $Sim$ denotes the overall similarity, $Sim_{id}$ denotes the identifier similarity, and $w_1$, $w_2$, $w_3$ denote the corresponding weight coefficients; $w_1$ ranges from 0.3 to 0.5, $w_2$ ranges from 0.3 to 0.5, $w_3$ ranges from 0.1 to 0.3, and $w_1 + w_2 + w_3 = 1$; $Sim_{id}$ is calculated using the edit distance algorithm, and the formula is:
$$Sim_{id} = 1 - \frac{ED(s_1, s_2)}{\max(|s_1|, |s_2|)}$$
wherein $ED(s_1, s_2)$ denotes the edit distance of the two strings, and $|s_1|$ and $|s_2|$ denote the lengths of the two strings, respectively; the attribute similarity $Sim_{attr}$ is calculated using the cosine similarity algorithm, and the formula is:
$$Sim_{attr} = \frac{A \cdot B}{\|A\| \, \|B\|}$$
where $A$ and $B$ denote the attribute vectors of the two records, $\cdot$ denotes the vector dot product, and $\|A\|$ and $\|B\|$ denote the magnitudes of the vectors, respectively; $Corr_{st}$ denotes the spatio-temporal correlation, which is a combination of the temporal correlation and the spatial correlation, and the calculation formula is: […]; wherein $Corr_t$ denotes the temporal correlation, computed from the timestamp difference, and $Corr_s$ denotes the spatial correlation, computed from the geographical distance; based on the overall similarity, cluster the data records into entities, data records with a similarity greater than a preset threshold being clustered into the same entity; perform conflict detection on the multiple data records belonging to the same entity; when a conflict is detected, calculate the credibility of each data source and apply a conflict resolution strategy based on the credibility;
S6. Organize and store the fused data according to a unified data model;
S7. Provide fused data access services externally through a data service interface.
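As an illustration of the similarity calculation in claim 1, the following sketch computes an identifier similarity from the edit distance, an attribute similarity from the cosine of two attribute vectors, and a weighted overall similarity. Several points are assumptions rather than statements of the claimed method: the edit distance is normalized by the longer string length, the weights (0.4, 0.4, 0.2) are one choice within the claimed ranges, and the spatio-temporal correlation is passed in as a precomputed number because its combining formula is not given here.

```python
import math


def edit_distance(s1, s2):
    """Levenshtein distance computed with a rolling dynamic-programming row."""
    prev = list(range(len(s2) + 1))
    for i, c1 in enumerate(s1, start=1):
        curr = [i]
        for j, c2 in enumerate(s2, start=1):
            cost = 0 if c1 == c2 else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def identifier_similarity(s1, s2):
    """1 - distance / longer length (normalization is an assumption)."""
    longest = max(len(s1), len(s2))
    if longest == 0:
        return 1.0
    return 1.0 - edit_distance(s1, s2) / longest


def attribute_similarity(a, b):
    """Cosine similarity of two attribute vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def overall_similarity(id_sim, attr_sim, st_corr, weights=(0.4, 0.4, 0.2)):
    """Weighted sum of the three components; the weights must sum to 1."""
    w1, w2, w3 = weights
    if not math.isclose(w1 + w2 + w3, 1.0):
        raise ValueError("weights must sum to 1")
    return w1 * id_sim + w2 * attr_sim + w3 * st_corr


# Hypothetical equipment identifiers and attribute vectors from two sources.
id_sim = identifier_similarity("T-110-01", "T-110-02")
attr_sim = attribute_similarity([1.0, 2.0], [2.0, 4.0])
score = overall_similarity(id_sim, attr_sim, st_corr=0.5)
```

Records whose `score` exceeds the preset clustering threshold would then be grouped into the same entity.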
2. The method of claim 1, wherein: the similarity matching process in S2 includes: extracting structural features, including the number of fields, field names, field types, and nesting levels; extracting content features, including data type, numerical range, character length distribution, null value rate, and number of unique values; extracting semantic features by matching against a dictionary of power grid domain terms to identify the business meaning of the fields; extracting temporal features, including the format, granularity, range, and sampling frequency of the time field; extracting correlation features, including correlation coefficients between fields and primary/foreign key relationships; and, after normalizing the above features, generating a fixed-dimensional feature vector with a dimension of 100 to 200.
3. The method of claim 1, wherein: the conflict resolution strategies in S5 include: a credibility-first strategy: select the value from the data source with the highest credibility as the fusion result; a weighted averaging strategy: for numerical conflicts, the fusion value is calculated using the following formula:
$$V = \frac{\sum_{i=1}^{n} C_i V_i}{\sum_{i=1}^{n} C_i}$$
where $V$ denotes the fusion value, $n$ denotes the number of data sources, $C_i$ denotes the credibility of the $i$-th data source, and $V_i$ denotes the value from the $i$-th data source; a majority voting strategy: for enumerated conflicts, select the value with the highest frequency of occurrence as the fusion result; a timeliness-first strategy: select the value from the data source with the most recent update time as the fusion result.
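The weighted averaging, majority voting, and timeliness-first strategies of claim 3 can be sketched as three small functions. This is an illustrative sketch only: it assumes the weighted average is normalized by the sum of the credibilities, and the sample values, credibilities, and timestamps are hypothetical.

```python
from collections import Counter
from datetime import datetime


def weighted_average(values, credibilities):
    """Credibility-weighted mean of numeric values (normalized by total credibility)."""
    total = sum(credibilities)
    if total == 0:
        raise ValueError("total credibility must be positive")
    return sum(c * v for c, v in zip(credibilities, values)) / total


def majority_vote(values):
    """Most frequent value among enumerated values."""
    return Counter(values).most_common(1)[0][0]


def most_recent(timestamped_values):
    """Value whose update time is latest; input is a list of (update_time, value)."""
    return max(timestamped_values, key=lambda item: item[0])[1]


fused_numeric = weighted_average([10.0, 12.0], [0.75, 0.25])
fused_state = majority_vote(["closed", "open", "closed"])
fused_latest = most_recent([
    (datetime(2025, 1, 1, 8, 0), "in service"),
    (datetime(2025, 1, 2, 8, 0), "under maintenance"),
])
```

Which function applies would depend on the conflict type: numeric fields suit the weighted average, enumerated fields suit the vote, and fast-changing fields suit the most recent value.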
4. The method of claim 1, wherein: in step S5, for time-series data containing timestamps, a time-series alignment operation is performed before fusion processing, including: establishing a unified time base and time granularity; standardizing the timestamps from the various data sources to unify time zones and time formats; identifying missing time points in the baseline time series; estimating data for the missing time points using interpolation methods, including linear interpolation, spline interpolation, and forward filling; downsampling data whose sampling frequency is higher than the baseline granularity by aggregating the original data within each baseline time interval; and synchronizing and aligning the time-series data from each data source according to the baseline time series.
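A minimal sketch of the alignment in claim 4 follows, covering only two of the listed operations: downsampling by aggregation within each baseline interval, and linear interpolation of missing baseline points. The 5-minute baseline granularity, the use of the mean as the aggregate, and the nearest-value handling of gaps at the series edges are assumptions made for the example; spline interpolation and time zone normalization are not shown.

```python
from datetime import datetime, timedelta


def downsample_mean(samples, start, step, count):
    """Average (timestamp, value) samples inside each of `count` baseline
    intervals of width `step`; intervals with no samples yield None."""
    buckets = [[] for _ in range(count)]
    for ts, value in samples:
        index = (ts - start) // step  # timedelta // timedelta gives an int
        if 0 <= index < count:
            buckets[index].append(value)
    return [sum(b) / len(b) if b else None for b in buckets]


def fill_linear(series):
    """Fill None gaps by linear interpolation between the nearest known
    neighbors; gaps at either edge take the nearest known value."""
    filled = list(series)
    known = [i for i, v in enumerate(series) if v is not None]
    if not known:
        return filled
    for i, value in enumerate(series):
        if value is not None:
            continue
        left = max((k for k in known if k < i), default=None)
        right = min((k for k in known if k > i), default=None)
        if left is None:
            filled[i] = series[right]
        elif right is None:
            filled[i] = series[left]
        else:
            ratio = (i - left) / (right - left)
            filled[i] = series[left] + ratio * (series[right] - series[left])
    return filled


start = datetime(2025, 1, 1, 0, 0)
step = timedelta(minutes=5)  # assumed baseline granularity
samples = [
    (start, 10.0),
    (start + timedelta(minutes=1), 12.0),
    (start + timedelta(minutes=10), 20.0),
]
baseline = downsample_mean(samples, start, step, count=3)
aligned = fill_linear(baseline)
```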
5. A fusion processing component for heterogeneous power grid data, used to implement the method of claim 1, characterized in that it includes: a data acquisition module, used to collect raw data from multiple heterogeneous data sources, including SCADA systems, load management systems, power distribution automation systems, marketing systems, GIS systems, and meteorological systems; a data source identification module, used to automatically identify heterogeneous data sources, which includes a feature extraction unit, a feature vector generation unit, and a template matching unit, wherein the feature extraction unit is used to extract structural features, content features, semantic features, temporal features, and correlation features of the data, the feature vector generation unit is used to quantize and encode the extracted features into feature vectors of fixed dimension, and the template matching unit is used to perform similarity matching between the feature vectors and pre-stored data source templates, the data source type being determined when the similarity is greater than 0.75; a schema extraction module, used to extract data schema information from the data source, including field names, field types, field constraints, and field semantics, and to generate a schema description document; a data conversion module, used to perform standardized conversion on the raw data according to the data schema information and preset conversion rules, the standardized conversion including field mapping, type conversion, value range conversion, unit conversion, and format conversion; a data fusion module, used to fuse standardized data from different data sources, which includes a similarity calculation unit, an entity clustering unit, a conflict detection unit, and a conflict resolution unit; and a data service module, used to organize and store the fused data according to a unified data model and to provide data access services externally through standardized interfaces.
6. The fusion processing component for heterogeneous power grid data according to claim 5, characterized in that: the data source identification module also includes a metadata management unit, used to store the feature vectors and schema description documents of the data source templates, establish the mapping relationship between each data source identifier and its schema description, and manage the versions of the schema description documents.
7. The fusion processing component for heterogeneous power grid data according to claim 5, characterized in that: the data conversion module includes: a rule engine, used to load, parse, and execute conversion rules, which include field mapping rules, type conversion rules, value range conversion rules, unit conversion rules, format conversion rules, and calculation derivation rules; a mapping converter, used to perform data conversion operations based on the rules parsed by the rule engine; a data cleaner, used to clean the converted data, including handling null values, removing duplicate values, correcting outliers, and normalizing formats; and a conversion logger, used to record operation information, conversion results, and exception information during the conversion process.
8. The fusion processing component for heterogeneous power grid data according to claim 5, characterized in that: the conflict resolution unit selects a conflict resolution strategy based on the data source credibility and the conflict type, the conflict resolution strategies including a credibility-first strategy, a weighted averaging strategy, a majority voting strategy, and a timeliness-first strategy.