Multi-source data fusion analysis method and device, electronic equipment and storage medium
By constructing a virtualized metadata catalog that classifies and unifies the semantic representation of multi-source heterogeneous data, the problems of redundant data storage and poor scalability in traditional ETL are solved, and efficient and low-cost data fusion analysis is achieved.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Applications (China)
- Current Assignee / Owner
- SHENHUA RAIL & FREIGHT WAGONS TRANSPORT
- Filing Date
- 2026-02-13
- Publication Date
- 2026-06-19
AI Technical Summary
Traditional ETL (Extract, Transform, Load) technology results in the repeated storage of similar data, poor scalability, high maintenance costs, and requires manual reconstruction of the ETL pipeline and configuration of mapping rules.
By classifying multi-source heterogeneous data into time-series dynamic streaming data and time-series static data, implementing time-slicing windowing processing and structure transformation, and constructing a virtualized metadata catalog based on unified semantic representation, data copying and duplicate storage are avoided. Domain ontology is used to construct mapping rules, supporting dynamic expansion and automated management.
It reduces data storage redundancy, improves processing efficiency and scalability, ensures the accuracy and reliability of data analysis, reduces maintenance costs, and solves the problem of data silos.
Figure CN122240702A_ABST
Abstract
Description
Technical Field
[0001] This application relates to the field of data processing technology, and in particular to a method, apparatus, electronic device and storage medium for multi-source data fusion analysis.
Background Technology
[0002] In the field of data processing technology, multi-source heterogeneous data fusion analysis has become the core support for intelligent decision-making. Heterogeneous data usually refers to data with different structures, formats, and storage methods. Multi-source heterogeneous data refers to heterogeneous data from multiple different sources.
[0003] In related technologies, traditional multi-source data integration (ETL) technology extracts raw data from various business systems, then cleans, standardizes, and removes redundant fields based on predefined rules, and finally stores the processed data in a centralized data warehouse.
[0004] However, traditional ETL relies on physical data replication mechanisms, which leads to the repeated storage of similar data across multiple systems, resulting in high redundancy. Furthermore, for new data sources, the ETL pipeline needs to be manually reconstructed and mapping rules configured, resulting in poor scalability and high maintenance costs.
Summary of the Invention
[0005] This application provides at least one method, apparatus, electronic device, and storage medium for multi-source data fusion analysis, aiming to solve one of the aforementioned technical problems.
[0006] To address the aforementioned technical problems, at least one embodiment of this application provides a multi-source data fusion analysis method, comprising: acquiring multi-source heterogeneous raw data and classifying the multi-source heterogeneous raw data into time-series dynamic stream data and time-series static data; performing time-slice windowing processing on the time-series dynamic stream data and caching the processed time-series data fragments in a memory acceleration channel; converting the time-series static data into a structured multi-dimensional data cube and performing data cleaning processing, storing the processed multi-dimensional data cube in a distributed archive repository; constructing a virtualized metadata directory for the data in the time-series data fragments and the data corresponding to the multi-dimensional data cube, wherein the virtualized metadata directory records the metadata of the data based on a unified semantic representation, so as to perform data analysis processing based on the virtualized metadata directory.
[0007] By constructing a virtualized metadata catalog based on unified semantic representation, the traditional ETL process of copying and standardizing data formats is avoided, thus preventing the repeated storage of similar data and reducing data storage redundancy. Furthermore, recording data metadata in the virtualized metadata catalog eliminates the need to build ETL pipelines and configure mapping rules for each data source. New data sources can be directly recorded in the virtualized metadata catalog, making this approach more scalable and less costly to maintain for data fusion and analysis.
[0008] Furthermore, by dividing the multi-source heterogeneous raw data into time-series dynamic streaming data and time-series static data, a foundation is laid for implementing differentiated processing strategies for data with different characteristics, thereby improving overall processing efficiency. Applying time-slicing windowing processing to dynamic streaming data and caching it in memory helps ensure low-latency access and computation of real-time data; structuring static data and storing it in a distributed warehouse helps support efficient storage and batch analysis of massive historical data. By constructing a virtualized metadata catalog based on unified semantic representation, a unified logical view and access interface are provided for correlation queries and fusion analysis of data from different sources (streaming and static) and with different structures, solving the data silo problem.
[0009] In some optional embodiments, the time-slicing windowing process for the time-series dynamic stream data includes: for each data stream in the time-series dynamic stream data, dividing the data stream into a periodic data stream or a session data stream according to the time interval between the generation of adjacent data in the data stream; segmenting the periodic data stream using a tumbling window or a sliding window; and taking adjacent data in the session data stream whose generation time interval is greater than an interval threshold as target adjacent data, and segmenting the session data stream at the target adjacent data.
[0010] By identifying the inherent patterns (periodic or session-based) of data streams and adapting different windowing strategies (such as tumbling windows for fixed-period aggregation statistics and session windows for event-interval-based behavioral analysis), time slices are made more closely aligned with the actual business scenarios in which the data is generated, thereby improving the accuracy of subsequent aggregation or analysis results.
[0011] In some optional embodiments, constructing a virtualized metadata catalog for the data in the time-series data segment and the data corresponding to the multidimensional data cube includes: mapping the metadata of the data in the time-series data segment and the metadata of the data corresponding to the multidimensional data cube based on mapping rules, so as to provide a semantically unified representation of the metadata of different data, wherein the mapping rules are constructed based on domain ontology.
[0012] By introducing a global semantic framework based on domain ontology to drive the generation of mapping rules, the unified representation of data semantics possesses clear business meaning and logical consistency, avoiding ambiguity caused by arbitrary mapping and improving the quality and reliability of the virtualized metadata catalog. Because the global semantic framework defines data entities and their relationships, the constructed metadata catalog not only includes field mappings but also reflects the business logic relationships between data, thereby supporting more complex, semantic-based relational queries and reasoning analysis.
[0013] In some optional embodiments, the step of mapping the metadata of the data in the time-series data segment and the metadata of the corresponding data in the multidimensional data cube based on the mapping rules includes: identifying whether there are data from different data sources whose metadata has the same field name; if there are data from different data sources whose metadata has the same field name, then reconstructing the same field name from different data sources to obtain a reconstructed field name corresponding to each same field name, wherein the reconstructed field names corresponding to the same field name from different data sources are different; and mapping the metadata of the data in the time-series data segment and the metadata of the corresponding data in the multidimensional data cube based on the reconstructed field name and the mapping rules.
[0014] By generating different reconstructed field names for the same field names from different sources, it is possible to clearly distinguish and preserve their original semantic context while maintaining a unified semantic representation. This fundamentally avoids incorrect mapping and data confusion caused by identical field names, improves the adaptability and construction accuracy of the virtualized metadata catalog when facing complex and chaotic real data sources, and ensures the reliability of query and analysis results based on the virtualized metadata catalog.
[0015] In some optional embodiments, the method further includes: in response to detecting a new data source among the multiple data sources corresponding to the multi-source heterogeneous original data, acquiring metadata corresponding to the new data source; determining the similarity between each field in the metadata corresponding to the new data source and each value in the mapping rule, obtaining multiple similarities corresponding to each field; selecting the maximum value from the multiple similarities corresponding to each field to obtain the maximum similarity corresponding to each field; when the maximum similarity is greater than a first threshold, adding a mapping relationship between the field and value corresponding to the maximum similarity to the mapping rule; when the maximum similarity is greater than a second threshold but not greater than the first threshold, generating and displaying a mapping generation query including the maximum similarity and the corresponding field and value; in response to receiving feedback indicating agreement to generate the mapping relationship in response to the mapping generation query, adding a mapping relationship between the field and value corresponding to the maximum similarity to the mapping rule; in response to receiving feedback indicating disagreement to generate the mapping relationship in response to the mapping generation query, adding a mapping rule from the field to the corresponding general description to the mapping rule; when the maximum similarity is not greater than the second threshold, adding a mapping rule from the field to the corresponding general description to the mapping rule.
[0016] By automating the monitoring of new data sources and the handling of their field mappings, the virtualized metadata catalog can expand dynamically to accommodate new data sources without service interruption or manual intervention. Multi-level thresholds (automatic mapping at high confidence, manual confirmation at medium confidence, and assignment to a general description at low confidence) pursue automation efficiency while introducing manual review for uncertain mappings. The general description serves as a fallback when a mapping remains uncertain, forming a human-machine collaborative rule iteration mechanism that maintains quality control of the virtualized metadata catalog during dynamic expansion.
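The three-tier decision described above can be sketched as follows. This is only an illustrative sketch: the source does not specify the similarity measure, so `difflib.SequenceMatcher` is swapped in as an assumption, and the threshold values, the `GENERAL` placeholder, and the `confirm` callback (standing in for the displayed mapping generation query) are all hypothetical.

```python
from difflib import SequenceMatcher

GENERAL = "general_description"  # hypothetical fallback target for unmapped fields


def similarity(a: str, b: str) -> float:
    # Assumed similarity measure: character-level ratio in [0, 1].
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def propose_mapping(field, values, first_threshold=0.85, second_threshold=0.6,
                    confirm=lambda field, value, score: False):
    """Return the value the new field should map to, or GENERAL as fallback."""
    # Maximum similarity between the field and every value in the mapping rule.
    best_value = max(values, key=lambda v: similarity(field, v))
    best = similarity(field, best_value)
    if best > first_threshold:
        return best_value  # high confidence: map automatically
    if best > second_threshold and confirm(field, best_value, best):
        return best_value  # medium confidence: mapped after user agreement
    return GENERAL  # disagreement or low confidence: general description
```

With the assumed thresholds, `propose_mapping("wagon_no", ["wagon_id", "speed"])` falls in the middle band, so the outcome depends on the feedback returned by `confirm`.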
[0017] In some optional embodiments, the data analysis processing includes data querying. The process of performing data analysis processing based on the virtualized metadata catalog includes: receiving a data query request for querying at least one piece of data; when the at least one piece of data includes multi-field data, determining multiple query paths for querying the at least one piece of data, wherein the multi-field data is data belonging to multiple fields; calculating the cost value of each query path according to the metadata corresponding to the at least one piece of data in the virtualized metadata catalog, and selecting the query path corresponding to the lowest cost value as the target query path; performing data query based on the target query path to obtain the query result of the query request.
[0018] By leveraging global metadata (such as data location, size, statistics, and relationships) aggregated in a virtualized metadata catalog to evaluate the cost of each query path (such as computational and data transfer volume) and selecting the path with the lowest cost, unnecessary computational overhead and network transmission can be directly reduced, thereby effectively reducing query response time and improving resource utilization efficiency.
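As a rough illustration of cost-based path selection, the sketch below scores each candidate query path from per-step estimates that a metadata catalog might hold. The `Step` fields, the linear cost formula, and the weights are assumptions made for the example; the source does not define a cost model.

```python
from dataclasses import dataclass


@dataclass
class Step:
    source: str       # where this step reads from (hypothetical label)
    rows: int         # estimated rows scanned, taken from catalog statistics
    bytes_moved: int  # estimated bytes transferred over the network


def path_cost(path, cpu_weight=1.0, net_weight=0.01):
    # Assumed linear cost: weighted computation plus weighted data transfer.
    return sum(cpu_weight * s.rows + net_weight * s.bytes_moved for s in path)


def choose_path(paths):
    # The target query path is the candidate with the lowest cost value.
    return min(paths, key=path_cost)
```

For example, a path that scans 1,100 rows and moves 50,000 bytes costs 1,600 under these weights and would be preferred over a path that scans 5,000 rows locally.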
[0019] At least one embodiment of this application also provides a multi-source data fusion analysis apparatus, comprising: an acquisition module for acquiring multi-source heterogeneous raw data and classifying the multi-source heterogeneous raw data into time-series dynamic stream data and time-series static data; a first processing module for performing time-slice windowing processing on the time-series dynamic stream data and caching the processed time-series data fragments in a memory acceleration channel; a second processing module for converting the time-series static data into a structured multi-dimensional data cube and performing data cleaning processing, storing the processed multi-dimensional data cube in a distributed archive repository; and a construction module for constructing a virtualized metadata directory for the data in the time-series data fragments and the data corresponding to the multi-dimensional data cube, wherein the virtualized metadata directory records the metadata of the data based on a unified semantic representation, so as to perform data analysis processing based on the virtualized metadata directory.
[0020] In some optional embodiments, the first processing module is configured to, for each data stream in the time-series dynamic streaming data, divide the data stream into a periodic data stream or a session data stream according to the time interval between the generation of adjacent data in the data stream; segment the periodic data stream using a tumbling window or a sliding window; and take adjacent data in the session data stream whose generation time interval is greater than an interval threshold as target adjacent data, and segment the session data stream at the target adjacent data.
[0021] In some optional embodiments, the construction module is used to map the metadata of the data in the time-series data segment and the metadata of the corresponding data in the multidimensional data cube based on mapping rules, so as to provide a semantically unified representation of the metadata of different data. The mapping rules are constructed based on the domain ontology.
[0022] In some optional embodiments, the construction module is used to identify whether there are identical field names in the metadata of the data in the time-series data segment and the metadata of the data corresponding to the multidimensional data cube from different data sources; if identical field names are found in the metadata of the data from different data sources, the identical field names from different data sources are reconstructed to obtain reconstructed field names corresponding to each identical field name, and the reconstructed field names corresponding to the identical field names from different data sources are different; based on the reconstructed field names and the mapping rules, the metadata of the data in the time-series data segment and the metadata of the data corresponding to the multidimensional data cube are mapped respectively.
[0023] In some optional embodiments, the apparatus further includes an update module, configured to: in response to detecting a new data source among the multiple data sources corresponding to the multi-source heterogeneous original data, obtain metadata corresponding to the new data source; determine the similarity between each field in the metadata corresponding to the new data source and each value in the mapping rule, obtaining multiple similarities for each field; select the maximum value from the multiple similarities for each field, obtaining the maximum similarity for each field; when the maximum similarity is greater than a first threshold, add a mapping relationship between the field and value corresponding to the maximum similarity to the mapping rule; when the maximum similarity is greater than a second threshold but not greater than the first threshold, generate and display a mapping generation query including the maximum similarity and the corresponding field and value; in response to receiving feedback indicating agreement to generate the mapping relationship in response to the mapping generation query, add a mapping relationship between the field and value corresponding to the maximum similarity to the mapping rule; in response to receiving feedback indicating disagreement to generate the mapping relationship in response to the mapping generation query, add a mapping rule from the field to the corresponding general description to the mapping rule; when the maximum similarity is not greater than the second threshold, add a mapping rule from the field to the corresponding general description to the mapping rule.
[0024] In some optional embodiments, the apparatus further includes a query module, configured to: receive a data query request for querying at least one piece of data; when the at least one piece of data includes multi-field data, determine multiple query paths for querying the at least one piece of data, wherein the multi-field data is data belonging to multiple fields; calculate the cost value of each query path based on the metadata corresponding to the at least one piece of data in the virtualized metadata directory, and select the query path corresponding to the lowest cost value as the target query path; and perform a data query based on the target query path to obtain the query result of the query request.
[0025] At least one embodiment of this application also provides an electronic device, including: at least one processor; and a memory communicatively connected to the at least one processor; wherein the memory stores instructions executable by the at least one processor, the instructions being executed by the at least one processor to enable the at least one processor to perform the multi-source data fusion analysis method described above.
[0026] At least one embodiment of this application also provides a computer-readable storage medium storing a computer program that, when executed by a processor, implements the above-described multi-source data fusion analysis method.
[0027] At least one embodiment of this application also provides a computer program product, including a computer program that, when executed by a processor, implements the above-described multi-source data fusion analysis method.
Attached Figure Description
[0028] One or more embodiments are illustrated by way of example with reference to the accompanying drawings, and these illustrative descriptions do not constitute a limitation on the embodiments.
[0029] Figure 1 is a flowchart of a multi-source data fusion analysis method provided in one embodiment of this application; Figure 2 is a schematic diagram of a multi-source data fusion analysis apparatus provided in another embodiment of this application; Figure 3 is a schematic structural diagram of an electronic device provided in another embodiment of this application.
Detailed Implementation
[0030] To make the objectives, technical solutions, and advantages of the embodiments of this application clearer, the various embodiments of this application will be described in detail below with reference to the accompanying drawings. Those skilled in the art will understand that many technical details are provided in the various embodiments to help readers better understand this application. However, the technical solutions claimed in this application can be implemented even without these technical details, and with various changes and modifications based on the following embodiments. The division of the various embodiments below is for convenience of description and should not constitute any limitation on the specific implementation of this application. The various embodiments can be combined with and referenced by each other without contradiction.
[0031] Example 1: The multi-source data fusion analysis method of this embodiment can be applied to electronic devices with communication, computing, and data storage capabilities. Its specific process can be as shown in Figure 1, and includes steps S101 to S104.
[0032] S101, acquire multi-source heterogeneous raw data, and classify the multi-source heterogeneous raw data into time-series dynamic stream data and time-series static data.
[0033] Heterogeneous data refers to data that differs in at least one of the following: structure, format, or storage method. Multi-source heterogeneous data refers to heterogeneous data from multiple different sources. The embodiments of this application do not limit the specific types of data included in multi-source heterogeneous data.
[0034] Time-series dynamic streaming data are data sequences that are generated in real time, are continuous, and change dynamically over time. Examples include continuous session data and continuously generated sensor data.
[0035] Time-series static data is time-series data that is generated within a specific time period, remains unchanged, and can be accessed repeatedly. Once generated, the data is not updated, so it is stable and historical in nature. Examples include historical order data.
[0036] The embodiments of this application do not limit how to obtain multi-source heterogeneous raw data. In some examples, the electronic device is configured with multiple interfaces, and data from different data sources is continuously obtained through different interfaces to obtain the multi-source heterogeneous raw data.
[0037] The embodiments of this application do not limit how multi-source heterogeneous raw data is classified into time-series dynamic streaming data and time-series static data. In some examples, the classification can be based on the interface corresponding to each data source. This works because each data source generates either time-series dynamic streaming data or time-series static data, so once an interface is configured for a data source, the data received through that interface is of the corresponding type.
[0038] In other examples, classifying multi-source heterogeneous raw data into time-series dynamic streaming data and time-series static data can include: using the respective data formats of the multi-source heterogeneous raw data to classify the multi-source heterogeneous raw data into time-series dynamic streaming data and time-series static data.
[0039] By analyzing the structural features, time-field characteristics, and data update methods of the multi-source heterogeneous raw data, the system identifies whether the data is continuously generated, incrementally appended, or pushed in real time, and classifies it as time-series dynamic stream data or time-series static data accordingly. When accessing a data source, the system first parses the data structure and extracts structural features, such as whether the data arrives in stream form or in dataset form.
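A minimal sketch of such a feature-based classifier is given below. The descriptor keys (`form`, `update_mode`, `has_incremental_timestamp`) and the decision rule are hypothetical; the source names the kinds of features considered but does not state how they are combined.

```python
def classify_source(descriptor: dict) -> str:
    """Classify a data source as 'dynamic_stream' or 'static' (assumed rule)."""
    # Structural feature: does the data arrive in stream form or dataset form?
    is_stream = descriptor.get("form") == "stream"
    # Time-field feature: are there timestamps that keep incrementing?
    incremental = descriptor.get("has_incremental_timestamp", False)
    # Update method: incremental appending or real-time push.
    append_or_push = descriptor.get("update_mode") in {"append", "push"}
    if is_stream or (incremental and append_or_push):
        return "dynamic_stream"
    return "static"
```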
[0040] Additionally, the system can identify whether time-related fields exist in the data and analyze their semantic roles, whether they are strongly bound to data generation, and whether there are incrementing timestamps. These can all serve as criteria for distinguishing time-series dynamic stream data from time-series static data.
S102, perform time-slice windowing processing on the time-series dynamic stream data, and cache the processed time-series data fragments in the memory acceleration channel.
[0041] Time-slice windowing is the process of dividing continuous time-series dynamic stream data into windows.
[0042] In some examples, time-slicing windowing processing of time-series dynamic streaming data may include: for each data stream in the time-series dynamic streaming data, dividing the data stream into a periodic data stream or a session data stream based on the time interval between the generation of adjacent data in the data stream; using a tumbling window or sliding window to segment the periodic data stream; and taking adjacent data in the session data stream whose generation time interval is greater than the interval threshold as target adjacent data, and segmenting the session data stream at the target adjacent data.
[0043] By identifying the inherent patterns (periodic or session-based) of data streams and adapting different windowing strategies (such as tumbling windows for fixed-period aggregation statistics and session windows for event-interval-based behavioral analysis), time slices are made more closely aligned with the actual business scenarios in which the data is generated, thereby improving the accuracy of subsequent aggregation or analysis results.
[0044] A periodic data stream is used to represent data that is generated periodically at a certain frequency. For example, a sensor periodically generates sampling data at a certain sampling frequency; the sampling data generated by the sensor is a periodic data stream.
[0045] A session data stream represents data that is not generated at a fixed frequency but in sessions, with data produced once a session begins. For example, a conversation log gains session data whenever a party to the conversation sends a message.
[0046] In some examples, dividing the data stream into a periodic data stream or a session data stream based on the time interval between the generation of adjacent data can include: if the time intervals between adjacent data in the data stream remain stable (the difference between different time intervals is less than a preset value), classifying the data stream as a periodic data stream; if the time intervals are unstable (the difference between different time intervals is not less than the preset value), classifying the data stream as a session data stream.
[0047] The embodiments of this application do not limit the specific value of the preset value. For example, the preset value may be 3 minutes, 10 minutes, etc.
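The stability test above can be sketched as follows, assuming timestamps in seconds and interpreting "the difference between different time intervals" as the spread between the largest and smallest gap. That interpretation and the 180-second default (3 minutes) are assumptions for illustration.

```python
def classify_stream(timestamps, preset=180.0):
    """Return 'periodic' if inter-arrival gaps are stable, else 'session'."""
    if len(timestamps) < 2:
        raise ValueError("need at least two records to measure an interval")
    # Time interval between the generation of each pair of adjacent records.
    gaps = [b - a for a, b in zip(timestamps, timestamps[1:])]
    # Stable if the gaps differ from one another by less than the preset value.
    spread = max(gaps) - min(gaps)
    return "periodic" if spread < preset else "session"
```

A sensor sampled every 60 seconds yields a spread of 0 and is classified as periodic, while a chat log with bursts separated by an hour is classified as a session stream.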
[0048] In some cases, when using a tumbling window to segment a periodic data stream, a watermark-based out-of-order repair mechanism can be applied with a delay threshold, so that the window waits for late data to arrive before closing.
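One way to picture a fixed, non-overlapping window that waits for late data is the simulation below. It is a simplified sketch rather than the patented mechanism: integer event times, the rule "watermark = largest event time seen minus `max_delay`", and dropping records that arrive after their window has closed are all assumptions.

```python
from collections import defaultdict


def tumbling_windows(events, size, max_delay):
    """events: (event_time, value) pairs in arrival order.

    Returns (closed, dropped): closed maps window start -> values,
    dropped lists records that arrived after their window had closed.
    """
    open_windows = defaultdict(list)
    closed, dropped = {}, []
    watermark = float("-inf")
    for t, value in events:
        start = (t // size) * size  # window [start, start + size)
        if start + size <= watermark:
            dropped.append((t, value))  # too late: window already closed
        else:
            open_windows[start].append(value)
        # The watermark trails the newest event time by the delay threshold.
        watermark = max(watermark, t - max_delay)
        for s in sorted(open_windows):
            if s + size <= watermark:
                closed[s] = open_windows.pop(s)
    closed.update(open_windows)  # flush windows still open at end of input
    return closed, dropped
```

In the usage below, the record at time 8 arrives after the record at time 12 but is still accepted, because the window starting at 0 stays open until the watermark reaches 10.

```python
events = [(1, "a"), (12, "b"), (8, "c"), (17, "d"), (3, "e"), (25, "f")]
closed, dropped = tumbling_windows(events, size=10, max_delay=5)
```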
[0049] In some cases, when using a sliding window to segment a periodic data stream, it is possible to configure the data covered by the sliding window to overlap between adjacent segments.
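The overlap between adjacent segments can be illustrated with a small sketch in which the window advances by a step smaller than its size. The function name, the half-open interval convention, and aligning the first window to the first record are assumptions for the example.

```python
def sliding_windows(events, size, step):
    """events: (time, value) pairs sorted by time.

    Returns a list of (window_start, values) for windows [start, start + size).
    When step < size, consecutive windows share records.
    """
    if not events:
        return []
    windows = []
    start, last = events[0][0], events[-1][0]
    while start <= last:
        values = [v for t, v in events if start <= t < start + size]
        windows.append((start, values))
        start += step
    return windows
```

With `size=4` and `step=2`, the records at times 2 and 4 each appear in two consecutive windows.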
[0050] Taking adjacent data in the session data stream whose generation time interval is greater than the interval threshold as target adjacent data serves to identify the gap between two consecutive conversations in the session data, so that the session data stream can be segmented at that gap.
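Gap-based session segmentation can be sketched in a few lines. The function name and the use of bare timestamps (rather than full records) are simplifications for illustration.

```python
def split_sessions(timestamps, interval_threshold):
    """Split sorted timestamps wherever the gap exceeds interval_threshold."""
    sessions, current = [], []
    for t in timestamps:
        # A gap larger than the threshold marks a pair of target adjacent
        # records: the stream is cut between them.
        if current and t - current[-1] > interval_threshold:
            sessions.append(current)
            current = []
        current.append(t)
    if current:
        sessions.append(current)
    return sessions
```

For instance, `split_sessions([0, 10, 20, 500, 510, 2000], 60)` yields three sessions, cut at the gaps 20 to 500 and 510 to 2000.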
[0051] In some cases, fault tolerance management can be implemented during time-slicing windowing of time-series dynamic streaming data. For example, data can be backed up periodically, and the backup data can be used for data recovery in the event of a device failure.
[0052] By caching time-series dynamic stream data through a memory acceleration channel, low-latency data read and write can be guaranteed.
[0053] S103, convert the time-series static data into a structured multidimensional data cube and perform data cleaning processing, then store the processed multidimensional data cube in a distributed archive repository.
[0054] A piece of data typically belongs to multiple fields (labels). For example, the fields to which "Xiaoming" (name data) belongs can include "third grade" (that is, "Xiaoming" can be found under the field "third grade") and "primary school student" (that is, "Xiaoming" can be found under the field "primary school student").
[0055] In a multidimensional data cube, "multidimensional" refers to the multiple fields to which the data belongs. A multidimensional data cube organizes data into a multidimensional array of dimensions and measures, enabling rapid interactive querying, summarization, and analysis of the data from multiple perspectives and levels.
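To make the dimensions-and-measures idea concrete, the sketch below aggregates flat records into a cube keyed by dimension values and then summarizes it along a subset of dimensions. The record layout (`region`, `month`, `amount`) and the choice of summation as the measure are hypothetical.

```python
from collections import defaultdict


def build_cube(records, dimensions, measure):
    """Aggregate a measure into cells keyed by a tuple of dimension values."""
    cube = defaultdict(float)
    for r in records:
        cube[tuple(r[d] for d in dimensions)] += r[measure]
    return dict(cube)


def roll_up(cube, dimensions, keep):
    """Summarize the cube over the dimensions that are not in `keep`."""
    idx = [dimensions.index(d) for d in keep]
    out = defaultdict(float)
    for key, value in cube.items():
        out[tuple(key[i] for i in idx)] += value
    return dict(out)
```

Rolling a `(region, month)` cube up to `region` alone gives per-region totals without touching the original records again.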
[0056] The embodiments of this application do not limit the methods for data cleaning. In some examples, the data cleaning process may include deduplication, outlier removal, missing value imputation, and noise reduction.
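A minimal cleaning pass covering three of these operations might look like the sketch below. The specific choices are assumptions: duplicates are detected by a key field, outliers are values outside caller-supplied plausibility bounds, and missing values are filled with the median of the remaining values. Noise reduction is not shown.

```python
from statistics import median


def clean(rows, key, value, lo, hi):
    """Deduplicate by `key`, drop out-of-range values, impute missing ones."""
    seen, unique = set(), []
    for r in rows:
        if r[key] not in seen:  # deduplication: keep first occurrence
            seen.add(r[key])
            unique.append(dict(r))
    # Outlier removal: keep missing values for now, drop out-of-range ones.
    kept = [r for r in unique if r[value] is None or lo <= r[value] <= hi]
    known = [r[value] for r in kept if r[value] is not None]
    fill = median(known) if known else None
    for r in kept:
        if r[value] is None:  # missing value imputation
            r[value] = fill
    return kept
```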
[0057] A distributed archive repository is a type of distributed storage system. In some cases, when multidimensional data cubes are stored in a distributed archive repository, columnar storage optimization and compressed storage are performed. Data in the same column usually belongs to the same field and has a similar data format, which facilitates data compression and storage.
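The column-wise layout can be illustrated with a toy example that pivots rows into columns and compresses each column independently. JSON serialization and `zlib` are stand-ins chosen for the sketch; the source does not name a storage format or compression algorithm.

```python
import json
import zlib


def to_columns(rows):
    # Pivot row dicts into one list per field, so similar values sit together.
    return {name: [r[name] for r in rows] for name in rows[0]}


def compress_columns(rows):
    # Compress each column separately (assumed encoding: JSON + zlib).
    return {name: zlib.compress(json.dumps(values).encode("utf-8"))
            for name, values in to_columns(rows).items()}


def decompress_columns(blobs):
    cols = {name: json.loads(zlib.decompress(blob)) for name, blob in blobs.items()}
    n = len(next(iter(cols.values())))
    # Reassemble rows from the decoded columns.
    return [{name: cols[name][i] for name in cols} for i in range(n)]
```

Because each column is stored on its own, a query that needs only one field can decompress that column without reading the others.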
[0058] S104, construct a virtualized metadata catalog for the data in the time-series data segment and the corresponding data in the multi-dimensional data cube, wherein the virtualized metadata catalog records the metadata of the data based on a unified semantic representation, so as to perform data analysis and processing based on the virtualized metadata catalog.
[0059] In some examples, constructing the virtualized metadata catalog for the data in the time-series data segments and the data corresponding to the multidimensional data cubes includes: mapping the metadata of the data in the time-series data segments and the metadata of the data corresponding to the multidimensional data cubes based on mapping rules, so as to provide a semantically unified representation of the metadata of different data. The mapping rules are constructed based on the domain ontology.
[0060] Domain ontology is a standardized knowledge model used to describe the core concepts, attributes and their interrelationships in a specific business domain (such as finance and transportation), providing an authoritative benchmark and logical constraints for the unification of data semantics.
[0061] Constructing mapping rules using domain ontology can be achieved by using entities in the domain ontology as the basis for semantically identical representation (converting all other representations into the basis form). By introducing mapping rule generation based on domain ontology, the unified representation of data semantics possesses clear business meaning and logical consistency, avoiding ambiguity caused by arbitrary mapping and improving the quality and reliability of the virtualized metadata catalog. Since the domain ontology defines data entities and their relationships, the constructed metadata catalog not only includes field mappings but also reflects the business logic relationships between data, thus supporting more complex, semantic-based relational queries and reasoning analysis.
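As a toy illustration of ontology-driven mapping rules, the sketch below treats each ontology entity as the canonical form and lists the source field names that should be converted to it. The entity names and synonyms are invented for the example and do not come from the source.

```python
# Hypothetical ontology fragment: canonical entity -> source field names.
ONTOLOGY_SYNONYMS = {
    "Wagon.identifier": {"wagon_id", "car_no", "vehicle_number"},
    "Wagon.speed": {"speed", "velocity_kmh"},
}


def build_mapping_rules(ontology):
    # Invert the ontology so every known field name points at its entity.
    return {syn: entity for entity, syns in ontology.items() for syn in syns}


def unify(metadata_fields, rules):
    # Map each source field to its unified representation (None if unknown).
    return {field: rules.get(field) for field in metadata_fields}
```

Fields from different sources, such as `car_no` and `wagon_id`, end up under the same entity, while an unrecognized field maps to `None` and can be handled separately.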
[0062] If different data have the same field name in their metadata but the field has different meanings, and mapping is performed directly according to the mapping rules, then the fields with the same name will be mapped to the same representation in the virtualized metadata directory, resulting in mapping errors.
[0063] In some examples, based on mapping rules, the metadata of the data in the time-series data segment and the metadata of the corresponding data in the multidimensional data cube are mapped separately. This may include: identifying whether there are data from different data sources whose metadata has the same field name; if there are data from different data sources whose metadata has the same field name, then the same field name from different data sources is reconstructed to obtain the reconstructed field name corresponding to each same field name, and the reconstructed field names corresponding to the same field name from different data sources are different; based on the reconstructed field name and the mapping rules, the metadata of the data in the time-series data segment and the metadata of the corresponding data in the multidimensional data cube are mapped separately.
[0064] By directly comparing whether there are fields with the same field name in the metadata corresponding to different data, it can be determined whether there are data from different data sources with the same field name in the metadata.
[0065] The embodiments of this application do not limit how to reconstruct the same field name under different data sources. In some examples, the same field name of the data and the table name of the table to which the data belongs can be combined, and the combined result can be used as the reconstructed field name of the same field name.
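A minimal sketch of the table-name combination approach follows. The input layout (a dict keyed by `(source, table)`), the underscore separator, and the function name are assumptions made for illustration; the sketch also assumes that table names differ across data sources, since otherwise the combined names would collide and a source identifier would have to be included as well.

```python
from collections import defaultdict


def reconstruct_field_names(source_metadata):
    """Rename fields that share a name across different data sources.

    source_metadata: {(source, table): [field_name, ...]}  (hypothetical layout)
    Returns {(source, table, field_name): reconstructed_name}.
    """
    # Record which data sources use each field name.
    sources_by_field = defaultdict(set)
    for (source, _table), fields in source_metadata.items():
        for field in fields:
            sources_by_field[field].add(source)

    reconstructed = {}
    for (source, table), fields in source_metadata.items():
        for field in fields:
            if len(sources_by_field[field]) > 1:
                # Same name in more than one source: combine table and field name.
                reconstructed[(source, table, field)] = f"{table}_{field}"
            else:
                # Unique name: keep it unchanged.
                reconstructed[(source, table, field)] = field
    return reconstructed
```

The reconstructed names, rather than the raw field names, would then be used as the keys looked up in the mapping rules.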
[0066] In other examples, a reconstruction model can be used to identify the semantics of data with the same field name, and the same field name can be reconstructed based on the semantics of the data to obtain the reconstructed field names of each data.
[0067] In some examples, the electronic device stores a reconstructed field name table that records multiple field groups. Each field group includes multiple field names, and the field names in a group are ones that tend to be given the same field name in data.
[0068] After the reconstruction model performs semantic recognition on the data, it can take into account the semantics of each field name in the reconstructed field name table and select, as the reconstructed field name, the field name whose semantics are closest to those of the data carrying the "same field name".
[0069] The embodiments of this application do not limit the specific type of the reconstruction model. For example, the reconstruction model can be a Transformer-based semantic encoding model.
[0070] By generating different reconstructed field names for the same field names from different sources, it is possible to clearly distinguish and preserve their original semantic context while maintaining a unified semantic representation. This fundamentally avoids incorrect mapping and data confusion caused by identical field names, improves the adaptability and construction accuracy of the virtualized metadata catalog when facing complex and chaotic real data sources, and ensures the reliability of query and analysis results based on the virtualized metadata catalog.
[0071] In some examples, in response to detecting a new data source among multiple data sources corresponding to multi-source heterogeneous raw data, the electronic device acquires the metadata corresponding to the new data source; determines the similarity between each field in the metadata corresponding to the new data source and each value in the mapping rule (the mapping rule is a key-to-value mapping rule), obtaining multiple similarities for each field; for each field, the maximum value is selected from the multiple similarities to obtain the maximum similarity for each field; when the maximum similarity is greater than a first threshold, a mapping relationship between the field and value corresponding to the maximum similarity is added to the mapping rule; when the maximum similarity is greater than a second threshold but not greater than the first threshold, a mapping generation query including the maximum similarity and the corresponding field and value is generated and displayed; in response to receiving feedback indicating agreement to generate the mapping relationship in response to the mapping generation query, a mapping relationship between the field and value corresponding to the maximum similarity is added to the mapping rule; in response to receiving feedback indicating disagreement to generate the mapping relationship in response to the mapping generation query, a mapping rule from the field to the corresponding general description is added to the mapping rule; when the maximum similarity is not greater than the second threshold, a mapping rule from the field to the corresponding general description is added to the mapping rule.
[0072] The embodiments of this application do not limit the specific values of the first threshold and the second threshold, wherein the second threshold is less than the first threshold.
[0073] In some cases, the electronic device stores a general description table that records multiple general descriptions. Adding, to the mapping rules, a mapping rule from a field to the corresponding general description may include: determining, based on semantic similarity, the similarity between the field and each of the multiple general descriptions, and using the general description with the maximum similarity as the general description corresponding to the field.
[0074] By automating the monitoring and processing of new data sources and their field mappings, the virtualized metadata catalog can dynamically expand to accommodate new data sources without service interruption or manual intervention. By setting multi-level thresholds (such as automatic mapping for high confidence, manual confirmation for medium confidence, and categorization into general descriptions for low confidence), while pursuing automation efficiency, a manual review process is introduced for uncertain mappings. A degradation solution (general description) is provided when uncertainty arises, forming a human-machine collaborative rule iteration mechanism that ensures the quality control of the virtualized metadata catalog during dynamic expansion.
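The three-tier decision described above can be sketched as follows. This is only an illustration: the threshold values 0.8 and 0.5 are arbitrary placeholders (the application does not limit them), the `confirm` callback stands in for displaying the mapping generation query and receiving feedback, and `string_similarity` uses character-level matching from the Python standard library as a stand-in for a real semantic similarity measure.

```python
from difflib import SequenceMatcher


def string_similarity(a, b):
    """Stand-in similarity in [0, 1]; a real system would use a semantic model."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def best_match(field, candidates, similarity):
    """Return (max_similarity, candidate) for the most similar candidate."""
    return max(((similarity(field, c), c) for c in candidates), key=lambda p: p[0])


def map_new_field(field, rule_values, general_descriptions, confirm,
                  first_threshold=0.8, second_threshold=0.5,
                  similarity=string_similarity):
    """Decide how a field from a new data source enters the mapping rules.

    Returns (target, mode), where mode is "auto", "confirmed" or "general".
    """
    score, value = best_match(field, rule_values, similarity)
    if score > first_threshold:
        # High confidence: add the mapping automatically.
        return value, "auto"
    if score > second_threshold and confirm(field, value, score):
        # Medium confidence: add the mapping only if the reviewer agrees.
        return value, "confirmed"
    # Low confidence, or the reviewer disagreed: fall back to the
    # closest general description.
    _, description = best_match(field, general_descriptions, similarity)
    return description, "general"
```

Because the confirmation callback is evaluated only when the score lies between the two thresholds, a reviewer is never asked about clearly good or clearly poor matches.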
[0075] In some examples, data analysis processing includes data querying. The process of data analysis processing based on the virtualized metadata catalog may include: receiving a data query request for querying at least one piece of data; when the at least one piece of data includes multi-field data, determining multiple query paths for querying the at least one piece of data, where multi-field data refers to data belonging to multiple fields; calculating the cost value of each query path based on the metadata corresponding to the at least one piece of data in the virtualized metadata catalog, and selecting the query path corresponding to the lowest cost value as the target query path; and performing data query based on the target query path to obtain the query results of the query request.
[0076] By leveraging global metadata (such as data location, size, statistics, and relationships) aggregated in a virtualized metadata catalog to evaluate the cost of each query path (such as computational and data transfer volume) and selecting the path with the lowest cost, unnecessary computational overhead and network transmission can be directly reduced, thereby effectively reducing query response time and improving resource utilization efficiency.
[0077] In some cases, the cost of a query path can be calculated using the following formula:
Cost = α × network transmission volume + β × CPU computation + γ × I/O latency + λ × penalty term
where Cost is the cost value of the query path; the network transmission volume is the amount of data to be transmitted along the query path; the CPU computation is the computing resources used by the CPU to process the query request (the more complex the query statement in the query request, the greater the CPU computation); the I/O latency is the latency required to query and obtain the data along the query path; the penalty term penalizes network latency (the greater the network latency, the greater the penalty term and the corresponding cost); and α, β, γ and λ are preset coefficients that can be set based on experience.
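A minimal sketch of the weighted cost and lowest-cost selection is given below. The `QueryPath` record, its field names, and the default coefficients of 1.0 are assumptions for illustration; the sketch also assumes the four terms have already been normalized to comparable scales, which the application does not specify.

```python
from dataclasses import dataclass


@dataclass
class QueryPath:
    """Hypothetical per-path estimates derived from catalog metadata."""
    name: str
    network_volume: float   # amount of data transmitted along the path
    cpu_computation: float  # CPU resources needed to process the query
    io_latency: float       # latency to query and obtain the data
    penalty: float          # penalty term that grows with network latency


def path_cost(path, alpha=1.0, beta=1.0, gamma=1.0, lam=1.0):
    """Cost = alpha*network + beta*CPU + gamma*I/O latency + lambda*penalty."""
    return (alpha * path.network_volume
            + beta * path.cpu_computation
            + gamma * path.io_latency
            + lam * path.penalty)


def select_target_path(paths, **coefficients):
    """Return the query path with the lowest cost value."""
    return min(paths, key=lambda p: path_cost(p, **coefficients))
```

With all coefficients equal, a path that moves less data can win even if it needs more CPU; changing the coefficients shifts that trade-off, which is why they are left as tunable presets.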
[0078] Example 2: Another embodiment of this application relates to a multi-source data fusion analysis device. The implementation details of the device are described below; they are provided for ease of understanding and are not essential for implementing this solution. As shown in Figure 2, the multi-source data fusion analysis device of this embodiment includes: an acquisition module 21, used to acquire multi-source heterogeneous raw data and classify the multi-source heterogeneous raw data into time-series dynamic stream data and time-series static data; a first processing module 22, used to perform time-slice windowing processing on the time-series dynamic stream data and cache the processed time-series data segments in a memory acceleration channel; a second processing module 23, used to convert the time-series static data into a structured multi-dimensional data cube and perform data cleaning processing, and store the processed multi-dimensional data cube in a distributed archive repository; and a construction module 24, used to construct a virtualized metadata directory for the data in the time-series data segments and the data corresponding to the multi-dimensional data cube, wherein the metadata of the data is recorded in the virtualized metadata directory based on a unified semantic representation, so as to perform data analysis processing based on the virtualized metadata directory.
[0079] By constructing a virtualized metadata catalog based on unified semantic representation, the traditional ETL process of copying and standardizing data formats is avoided, thus preventing the repeated storage of similar data and reducing data storage redundancy. Furthermore, recording data metadata in the virtualized metadata catalog eliminates the need to build ETL pipelines and configure mapping rules for each data source. New data sources can be directly recorded in the virtualized metadata catalog, making this approach more scalable and less costly to maintain for data fusion and analysis.
[0080] Furthermore, by dividing the multi-source heterogeneous raw data into time-series dynamic streaming data and time-series static data, a foundation is laid for implementing differentiated processing strategies for data with different characteristics, thereby improving overall processing efficiency. Applying time-slicing windowing processing to dynamic streaming data and caching it in memory helps ensure low-latency access and computation of real-time data; structuring static data and storing it in a distributed warehouse helps support efficient storage and batch analysis of massive historical data. By constructing a virtualized metadata catalog based on unified semantic representation, a unified logical view and access interface are provided for correlation queries and fusion analysis of data from different sources (streaming and static) and with different structures, solving the data silo problem.
[0081] In some optional embodiments, the first processing module 22 is used to: classify each data stream in the time-series dynamic stream data as a periodic data stream or a session data stream according to the time interval between the generation of adjacent data in the data stream; segment the periodic data stream using a scrolling window or a sliding window; and take adjacent data in the session data stream whose generation time interval is greater than an interval threshold as target adjacent data, and segment the session data stream at the target adjacent data.
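The two segmentation strategies can be sketched as follows, assuming each data item is represented only by its numeric generation time and that the times are sorted. The function names are hypothetical, and the sketch interprets a scrolling window as a fixed-size, non-overlapping window, which is an assumption about the intended meaning.

```python
def split_sessions(timestamps, interval_threshold):
    """Cut a session stream wherever adjacent items are further apart
    than interval_threshold (the "target adjacent data")."""
    segments = []
    current = []
    for ts in timestamps:
        if current and ts - current[-1] > interval_threshold:
            # Gap exceeds the threshold: close the current segment.
            segments.append(current)
            current = []
        current.append(ts)
    if current:
        segments.append(current)
    return segments


def fixed_windows(timestamps, window_size):
    """Group a periodic stream into fixed-size, non-overlapping windows."""
    windows = {}
    for ts in timestamps:
        # Items with the same window index fall into the same window.
        windows.setdefault(int(ts // window_size), []).append(ts)
    return [windows[k] for k in sorted(windows)]
```

A sliding window would differ from `fixed_windows` only in that consecutive windows overlap, so one item may appear in more than one segment.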
[0082] In some optional embodiments, the construction module 24 is used to map the metadata of data in time-series data segments and the metadata of data corresponding to multi-dimensional data cubes based on mapping rules, so as to provide a semantically unified representation of the metadata of different data. The mapping rules are constructed based on the domain ontology.
[0083] In some optional embodiments, the construction module 24 is used to identify whether there are identical field names in the metadata of the data in the time-series data segment and the metadata of the data corresponding to the multidimensional data cube from different data sources; if identical field names are found in the metadata of the data from different data sources, the identical field names from different data sources are reconstructed to obtain reconstructed field names corresponding to each identical field name, and the reconstructed field names corresponding to the identical field names from different data sources are different; based on the reconstructed field names and mapping rules, the metadata of the data in the time-series data segment and the metadata of the data corresponding to the multidimensional data cube are mapped respectively.
[0084] In some optional embodiments, the system further includes: an update module 25, configured to: in response to detecting a new data source among multiple data sources corresponding to multi-source heterogeneous original data, acquire metadata corresponding to the new data source; determine the similarity between each field in the metadata corresponding to the new data source and each value in the mapping rule, and obtain multiple similarities corresponding to each field; for the multiple similarities corresponding to each field, select the maximum value from the multiple similarities to obtain the maximum similarity corresponding to each field; when the maximum similarity is greater than a first threshold, add a mapping relationship between the field and the value corresponding to the maximum similarity in the mapping rule; when the maximum similarity is greater than a second threshold but not greater than the first threshold, generate and display a mapping generation query including the maximum similarity and the corresponding field and value; in response to receiving feedback indicating agreement to generate a mapping relationship in response to the mapping generation query, add a mapping relationship between the field and the value corresponding to the maximum similarity in the mapping rule; in response to receiving feedback indicating disagreement to generate a mapping relationship in response to the mapping generation query, add a mapping rule from the field to the corresponding general description in the mapping rule; when the maximum similarity is not greater than the second threshold, add a mapping rule from the field to the corresponding general description in the mapping rule.
[0085] In some optional embodiments, the system further includes: a query module 26, configured to receive a data query request for querying at least one piece of data; when the at least one piece of data includes multi-field data, determine multiple query paths for querying at least one piece of data, wherein multi-field data refers to data belonging to multiple fields; calculate the cost value of each query path based on the metadata corresponding to at least one piece of data in the virtualized metadata directory, and select the query path corresponding to the lowest cost value as the target query path; and perform a data query based on the target query path to obtain the query result of the query request.
[0086] It is worth mentioning that all modules involved in this embodiment are logical modules. In practical applications, a logical unit can be a physical unit, a part of a physical unit, or a combination of multiple physical units. Furthermore, to highlight the innovative aspects of this application, this embodiment does not introduce units that are not closely related to solving the technical problems proposed in this application; however, this does not mean that other units are absent in this embodiment.
[0087] Example 3: Another embodiment of this application relates to an electronic device. As shown in Figure 3, the electronic device includes: at least one processor 31; and a memory 32 communicatively connected to the at least one processor 31; wherein the memory 32 stores instructions executable by the at least one processor 31, and the instructions are executed by the at least one processor 31 to enable the at least one processor 31 to perform the multi-source data fusion analysis method in the above embodiments.
[0088] The memory 32 and processor 31 are connected via a bus, which can include any number of interconnecting buses and bridges, connecting various circuits of one or more processors 31 and memory 32. The bus can also connect various other circuits, such as peripheral devices, voltage regulators, and power management circuits, which are well known in the art and therefore will not be described further herein. A bus interface provides an interface between the bus and the transceiver. The transceiver can be a single element or multiple elements, such as multiple receivers and transmitters, providing a unit for communicating with various other devices over a transmission medium. Data processed by the processor is transmitted over the wireless medium via an antenna, which further receives data and transmits it to the processor.
[0089] Processor 31 is responsible for managing the bus and general processing, and can also provide various functions, including timing, peripheral interfaces, voltage regulation, power management, and other control functions. Memory can be used to store data used by the processor during operation.
[0090] Example 4: Another embodiment of this application relates to a computer-readable storage medium storing a computer program. When executed by a processor, the computer program implements the method embodiments described above.
[0091] That is, those skilled in the art will understand that all or part of the steps in the methods of the above embodiments can be implemented by a program instructing related hardware. This program is stored in a storage medium and includes several instructions to cause a device (which may be a microcontroller, chip, etc.) or processor to execute all or part of the steps of the methods described in the various embodiments of this application. The aforementioned storage medium includes various media capable of storing program code, such as USB flash drives, portable hard drives, read-only memory (ROM), random access memory (RAM), magnetic disks, or optical disks.
[0092] Example 5: Another embodiment of this application relates to a computer program product, including a computer program. When executed by a processor, the computer program implements the method embodiments described above.
[0093] Those skilled in the art will understand that the above embodiments are specific embodiments for implementing this application, and in practical applications, various changes can be made to them in form and detail without departing from the spirit and scope of this application.
Claims
1. A multi-source data fusion analysis method, characterized in that it comprises: acquiring multi-source heterogeneous raw data and classifying the multi-source heterogeneous raw data into time-series dynamic stream data and time-series static data; performing time-slice windowing processing on the time-series dynamic stream data, and caching the processed time-series data segments in a memory acceleration channel; converting the time-series static data into a structured multi-dimensional data cube and performing data cleaning processing, and storing the processed multi-dimensional data cube in a distributed archive repository; and constructing a virtualized metadata directory for the data in the time-series data segments and the data corresponding to the multi-dimensional data cube, wherein the metadata of the data is recorded in the virtualized metadata directory based on a unified semantic representation, so as to perform data analysis processing based on the virtualized metadata directory.
2. The method according to claim 1, characterized in that performing time-slice windowing processing on the time-series dynamic stream data comprises: for each data stream in the time-series dynamic stream data, classifying the data stream as a periodic data stream or a session data stream according to the time interval between the generation of adjacent data in the data stream; segmenting the periodic data stream using a scrolling window or a sliding window; and taking adjacent data in the session data stream whose generation time interval is greater than an interval threshold as target adjacent data, and segmenting the session data stream at the target adjacent data.
3. The method according to claim 1, characterized in that constructing the virtualized metadata directory for the data in the time-series data segments and the data corresponding to the multi-dimensional data cube comprises: mapping, based on mapping rules, the metadata of the data in the time-series data segments and the metadata of the data corresponding to the multi-dimensional data cube respectively, so as to provide a semantically unified representation of the metadata of different data, wherein the mapping rules are constructed based on a domain ontology.
4. The method according to claim 3, characterized in that mapping, based on the mapping rules, the metadata of the data in the time-series data segments and the metadata of the data corresponding to the multi-dimensional data cube respectively comprises: identifying whether, among the metadata of the data in the time-series data segments and the metadata of the data corresponding to the multi-dimensional data cube, there are data from different data sources whose metadata contains the same field name; if there are data from different data sources whose metadata contains the same field name, reconstructing the same field name under the different data sources to obtain a reconstructed field name corresponding to each same field name, wherein the reconstructed field names corresponding to the same field name under different data sources are different; and mapping, based on the reconstructed field names and the mapping rules, the metadata of the data in the time-series data segments and the metadata of the data corresponding to the multi-dimensional data cube respectively.
5. The method according to claim 3, characterized in that the method further comprises: in response to detecting a new data source among the multiple data sources corresponding to the multi-source heterogeneous raw data, acquiring the metadata corresponding to the new data source; determining the similarity between each field in the metadata corresponding to the new data source and each value in the mapping rules, to obtain multiple similarities for each field; for each field, selecting the maximum value from the multiple similarities to obtain the maximum similarity for the field; when the maximum similarity is greater than a first threshold, adding to the mapping rules a mapping relationship between the field and the value corresponding to the maximum similarity; when the maximum similarity is greater than a second threshold and not greater than the first threshold, generating and displaying a mapping generation query that includes the maximum similarity and the corresponding field and value; in response to receiving feedback on the mapping generation query indicating agreement to generate the mapping relationship, adding to the mapping rules the mapping relationship between the field and the value corresponding to the maximum similarity; in response to receiving feedback on the mapping generation query indicating disagreement to generate the mapping relationship, adding to the mapping rules a mapping rule from the field to a corresponding general description; and when the maximum similarity is not greater than the second threshold, adding to the mapping rules a mapping rule from the field to the corresponding general description.
6. The method according to claim 1, characterized in that the data analysis processing includes data querying, and performing data analysis processing based on the virtualized metadata directory comprises: receiving a data query request for querying at least one piece of data; when the at least one piece of data includes multi-field data, determining multiple query paths for querying the at least one piece of data, wherein the multi-field data is data belonging to multiple fields; calculating the cost value of each query path based on the metadata corresponding to the at least one piece of data in the virtualized metadata directory, and selecting the query path corresponding to the lowest cost value as the target query path; and performing a data query based on the target query path to obtain the query result of the data query request.
7. A multi-source data fusion analysis device, characterized in that it comprises: an acquisition module, used to acquire multi-source heterogeneous raw data and classify the multi-source heterogeneous raw data into time-series dynamic stream data and time-series static data; a first processing module, used to perform time-slice windowing processing on the time-series dynamic stream data and cache the processed time-series data segments in a memory acceleration channel; a second processing module, used to convert the time-series static data into a structured multi-dimensional data cube and perform data cleaning processing, and store the processed multi-dimensional data cube in a distributed archive repository; and a construction module, used to construct a virtualized metadata directory for the data in the time-series data segments and the data corresponding to the multi-dimensional data cube, wherein the metadata of the data is recorded in the virtualized metadata directory based on a unified semantic representation, so as to perform data analysis processing based on the virtualized metadata directory.
8. An electronic device, characterized in that it comprises: at least one processor; and a memory communicatively connected to the at least one processor; wherein the memory stores instructions executable by the at least one processor, and the instructions are executed by the at least one processor to enable the at least one processor to perform the multi-source data fusion analysis method according to any one of claims 1 to 6.
9. A computer-readable storage medium storing a computer program, characterized in that, when the computer program is executed by a processor, it implements the multi-source data fusion analysis method according to any one of claims 1 to 6.
10. A computer program product, comprising a computer program, characterized in that, when the computer program is executed by a processor, it implements the multi-source data fusion analysis method according to any one of claims 1 to 6.