Visual ETL data processing method and device, electronic equipment and medium
By using a collaborative caching mechanism between the WASM container and browser memory on the browser side and integrating ETL services, the problem of server dependence in existing technologies is solved, and fully in-memory visualized data processing on the browser side is realized.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- CHINA MOBILE INFORMATION TECHNOLOGY CO LTD
- Filing Date
- 2022-02-17
- Publication Date
- 2026-06-30
AI Technical Summary
Existing ETL data processing methods rely on server resources and require front-end and back-end deployment, making it impossible to implement fully in-memory visualization application services on the browser side.
By using a collaborative caching mechanism between the WebAssembly (WASM) container and browser memory on the browser side, and integrating ETL atomic services, data processing and visualization can be achieved, reducing reliance on servers.
It enables fully in-memory visual application services on the browser side, reducing dependence on server resources and improving the efficiency and flexibility of data processing.
Figure CN116662687B_ABST
Description
Technical Field
[0001] This application relates to the fields of the Internet and big data, and in particular to a visual ETL data processing method, apparatus, electronic device and medium.
Background Technology
[0002] Current ETL data processing methods are entirely based on the three server-side data processing steps: data extraction (E), data cleaning and transformation (T), and data loading (L). The main ETL data processing methods currently include: ETL tool methods, SQL methods, and methods combining ETL tools and SQL. Interaction between the ETL service and the front-end JavaScript is typically limited to online input on the web interface, followed by information transmission to the backend for ETL data processing, and then the results are transmitted back to the frontend for visualization.
[0003] ETL services based on current mainstream ETL data processing methods often heavily rely on server resources and require front-end and back-end deployment to support users' visual ETL operations.
Summary of the Invention
[0004] This application provides a visual ETL data processing method, apparatus, electronic device, and medium, which aims to reduce reliance on the server side and realize a fully in-memory visual application service on the browser side.
[0005] In a first aspect, this application provides a visual ETL data processing method, including:
[0006] The system identifies the data source to be processed and the ETL rules in the ETL visualization interface, and manages the configuration using the ETL rules and the data source to be processed to obtain browser data.
[0007] Determine the algorithm rule base to be used in the WASM container for the browser data;
[0008] The browser data is processed by ETL according to the algorithm rule base to be used to obtain the target display data, and the data volume of the target display data is determined.
[0009] Based on the data volume, the target display data is cached in browser memory and / or WASM memory, and the target display data cached in browser memory and / or WASM memory is visualized through the WASM container.
[0010] In one embodiment, the step of obtaining browser data through configuration management of the ETL rules and the data source to be processed includes:
[0011] Based on the WebSQL in the ETL rules and the configuration parser of the WASM container, the data source to be processed is managed, configured, and parsed to obtain parsed data, and the parsed data is stored in a JS variable;
[0012] The WASM module is instantiated by calling a preset method in the WASM container, and the JS variable is passed as parameter data to the instantiated WASM module.
[0013] The configuration parser and the source protocol converter of the WASM container are used to parse, transform, and load the JS variables in the instantiated WASM module into memory to obtain the browser data.
[0014] The step of managing, configuring, and parsing the data source to be processed, based on the WebSQL in the ETL rules and the configuration parser of the WASM container, to obtain parsed data includes:
[0015] The websql is used to perform SQL-based management and configuration of the data source to be processed, thereby obtaining the configuration data to be processed.
[0016] The semantic parsing capability of the websql and the configuration parser are combined to perform the first configuration parsing on the configuration data to be processed, and the parsed data to be processed is obtained.
[0017] The semantic parsing data of the WebSQL and the configuration parser are combined to perform a second configuration parsing on the parsed data to be processed, thereby obtaining the parsed data.
[0018] The step of caching the target display data based on the data volume using browser memory and / or WASM memory includes:
[0019] If the amount of data is greater than the preset amount of data, the target display data is cached in a dual collaborative manner using the browser memory and the WASM memory;
[0020] If the data volume is less than or equal to the preset data volume, the target display data is cached using the browser memory or the WASM memory in a single collaborative manner.
[0021] The step of performing ETL data processing on the browser data according to the algorithm rule base to be used to obtain the target display data includes:
[0022] The algorithm rule base to be used and the browser data are loaded into the sandboxed execution environment running the WASM container for transformation and extraction to obtain the first data to be displayed;
[0023] The first data to be displayed is cleaned and repaired using a high-dimensional time series misalignment detection and repair algorithm to obtain the second data to be displayed.
[0024] The second data to be displayed is grouped to generate data objects for each group, and the data objects of each group are transformed using the built-in function library in the WASM container to obtain the transformation results of each group.
[0025] The main process aggregates the transformation results of each group to obtain the target display data.
[0026] The process of loading the algorithm rule base to be used and the browser data into the sandboxed execution environment running the WASM container for transformation and extraction to obtain the first data to be displayed includes:
[0027] The Reader data access module loads the algorithm rule base to be used and the browser data into the sandboxed execution environment running the WASM container;
[0028] The browser data is encapsulated by combining the sandboxed execution environment with the data type of the browser data to obtain the encapsulated data of the browser data;
[0029] The Writer data reading module extracts the encapsulated data from the WASM container to obtain the first data to be displayed.
[0030] The process of cleaning and repairing the first data to be displayed using a high-dimensional time series misalignment detection and repair algorithm to obtain the second data to be displayed includes:
[0031] Model each first data sequence of the first data to be displayed, and obtain the matching degree value of each first data sequence through normal pattern feature analysis;
[0032] The intermediate data from the data source to be processed is divided into a preset number of second data sequences according to the time dimension, and a data cleaning matrix is constructed based on each of the second data sequences.
[0033] Based on the data cleaning matrix and the matching degree value of each of the first data sequences, sequence anomaly pattern detection is performed to determine the abnormal data in the first data to be displayed.
[0034] Based on the anomaly filtering rules and the abnormal data, misalignment matching is performed on the first data to be displayed to obtain the second data to be displayed.
[0035] In a second aspect, this application also provides a visual ETL data processing apparatus, comprising:
[0036] The configuration module is used to determine the data source to be processed and the ETL rules in the ETL visualization operation interface, and to manage the configuration through the ETL rules and the data source to be processed to obtain browser data.
[0037] The determination module is used to determine the algorithm rule base to be used in the WASM container for the browser data;
[0038] The processing and determination module is used to perform ETL data processing on the browser data according to the algorithm rule library to be used, to obtain the target display data, and to determine the data volume of the target display data;
[0039] The cached display module is used to cache the target display data in browser memory and / or WASM memory according to the data volume, and to visualize the target display data cached in browser memory and / or WASM memory through the WASM container.
[0040] In a third aspect, this application also provides an electronic device, including a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor executes the program to implement the visual ETL data processing method described in the first aspect.
[0041] In a fourth aspect, this application also provides a non-transitory computer-readable storage medium, the non-transitory computer-readable storage medium including a computer program which, when executed by a processor, implements the visual ETL data processing method described in the first aspect.
[0042] In a fifth aspect, this application also provides a computer program product, which includes a computer program that, when executed by a processor, implements the visual ETL data processing method described in the first aspect.
[0043] The visual ETL data processing method, apparatus, electronic device, and medium provided in this application utilize the independent and efficient loading characteristics of WASM on the web during visual ETL data processing. Through the collaboration of WASM memory and browser memory caching, the atomic services in the mainstream ETL mode are integrated with WASM, reducing the dependence on the server side and realizing a fully in-memory visual application service on the browser side.
Attached Figure Description
[0044] To more clearly illustrate the technical solutions in this application or the prior art, the drawings used in the description of the embodiments or the prior art will be briefly introduced below. Obviously, the drawings described below are some embodiments of this application. For those skilled in the art, other drawings can be obtained based on these drawings without creative effort.
[0045] Figure 1 is a flowchart of the visual ETL data processing method provided in this application;
[0046] Figure 2 is a schematic structural diagram of the visual ETL data processing apparatus provided in this application;
[0047] Figure 3 is a schematic structural diagram of the electronic device provided in this application.
Detailed Implementation
[0048] To make the objectives, technical solutions, and advantages of this application clearer, the technical solutions of this application will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some embodiments of this application, not all embodiments. Based on the embodiments of this application, all other embodiments obtained by those skilled in the art without creative effort are within the scope of protection of this application.
[0049] The visual ETL data processing method, apparatus, electronic device, and medium provided in this application are described below with reference to Figures 1 to 3.
[0050] This application provides an embodiment of a visual ETL data processing method. It should be noted that although the logical order is shown in the flowchart, under certain data conditions, the steps shown or described may be performed in a different order than that shown here.
[0051] This application takes an electronic device as the executing entity by way of example, with a browser as one form in which the electronic device is embodied; this does not limit the electronic device.
[0052] Referring to Figure 1, a flowchart of the visual ETL data processing method provided in this application, the method includes:
[0053] Step S10: Determine the data source to be processed and the ETL rules in the ETL visualization operation interface, and configure and manage the ETL rules and the data source to be processed to obtain browser data.
[0054] It should be noted that the ETL visual operation interface is the same as the ETL online visual operation management interface, which is a web-based functional service from the user's perspective. Users can operate, configure, and manage the ETL data processing workflow through the ETL online visual operation management interface, including but not limited to the visual operation management of the entire process of configuring data sources, data relationship SQL, and cleaning rules. Visual operation management includes, but is not limited to, adding, deleting, modifying, and querying operations. ETL (Extract-Transform-Load) describes the process of extracting, transforming, and loading data from the source to the destination.
[0055] Before a user selects the data source to be processed and the ETL rules in the ETL online visual operation management interface, the browser needs to load and cache the cleaning algorithm and rule base. In other words, the browser loads and caches the cleaning algorithm and rule base at startup, preparing the initialization environment before the ETL online visual operation management interface is used. This is necessary because the variables of the customized rule base and algorithm must be cached initially on top of the native WASM service. The caching includes connection management for several types of caches, laying the groundwork for the subsequent collaboration of the dual-caching mechanism. The cleaning algorithms and rule bases involved in the embodiments of this application include, but are not limited to, heterogeneous multi-protocol data transformation rules, abnormal data detection / filtering rules, and mismatch detection algorithms (covariance matrix calculation).
[0056] Furthermore, a WASM file in WebAssembly (WASM) format is constructed based on the cleaning algorithm and rule base. In addition, the WASM file is streamed in the page using the native JS API (fetch) method.
[0057] Furthermore, the browser loads the localForage JS library and calls the localForage's setItem method to persist the WASM file on the client side. localForageJS can automatically select or determine whether to use webSql / localStorage for client-side cache persistence based on the client browser version and API support. This ensures that when the browser needs multiple instances of the WASM module or when the browser is refreshed, it can directly read the corresponding WASM module from the local database cache and quickly restore the page state.
[0058] Furthermore, a cache pool and a cache pool manager for WASM are constructed. The cache pool manager implements the caching and persistent storage management of intermediate and result data.
[0059] Further, after the browser loads and caches the cleaning algorithm and rule base, it determines the data source to be processed and the ETL rules in the ETL visualization interface. The data source to be processed supports multiple protocols, including but not limited to datasets of different formats, structures, and types, such as files, database tables, and network streaming data. Next, the browser manages the ETL rules and the data source to be processed online through visual configuration, obtaining browser data. This online configuration management includes visual configuration, configuration loading, and data loading, as described in steps S101 to S103.
[0060] Step S20: Determine the algorithm rule base to be used in the WASM container for the browser data;
[0061] Step S30: Perform ETL data processing on the browser data according to the algorithm rule base to be used to obtain target display data and determine the data volume of the target display data.
[0062] Furthermore, the browser determines, within the WASM container, the algorithm rule base to be used for the browser data. The algorithm rule base to be used includes, but is not limited to, heterogeneous multi-protocol data transformation rules, abnormal data detection / filtering rules, and mismatch detection algorithms (covariance matrix calculation).
[0063] It should be noted that the core process of ETL data processing involves data extraction, transformation, and cleansing services. Therefore, the browser uses heterogeneous multi-protocol data transformation rules, abnormal data detection / filtering rules, mismatch detection algorithms, and resource environment to first extract and transform the browser data, then filter, detect, repair, clean, and summarize the data to obtain the target display data, as described in steps S301 to S304. Further, after obtaining the target display data, the browser needs to determine the amount of data in the target display data.
[0064] Step S40: Based on the data volume, cache the target display data in browser memory and / or WASM memory, and visualize the target display data cached in browser memory and / or WASM memory through the WASM container.
[0065] The browser compares the current data size with a preset data size, obtaining a comparison result. The result can be that the current data size is greater than the preset data size, or it can be that the current data size is less than or equal to the preset data size. If the comparison result indicates that the current data size is greater than the preset data size, the browser performs dual collaborative caching of the target display data using both browser memory and WASM memory. The preset data size is set based on the browser's performance. If the comparison result indicates that the current data size is less than or equal to the preset data size, the browser performs single collaborative caching of the target display data using either browser memory or WASM memory.
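The comparison in step S40 can be sketched as follows. This is a minimal illustration, not the patented implementation: the names `planCache`, `CacheStore`, `CachePlan`, and the 8 MiB threshold are all hypothetical, since the application only states that the preset data size is set based on the browser's performance.

```typescript
// Hypothetical names throughout; the application does not define an API.
type CacheStore = "browser" | "wasm";

interface CachePlan {
  mode: "dual" | "single";
  stores: CacheStore[];
}

// Decide where the target display data is cached, given its size in bytes
// and a preset threshold tuned to the browser's performance.
function planCache(
  dataBytes: number,
  presetBytes: number,
  preferred: CacheStore = "browser",
): CachePlan {
  if (dataBytes > presetBytes) {
    // Larger than the threshold: browser memory and WASM memory cooperate.
    return { mode: "dual", stores: ["browser", "wasm"] };
  }
  // At or below the threshold: a single memory is used.
  return { mode: "single", stores: [preferred] };
}

const PRESET_BYTES = 8 * 1024 * 1024; // assumed threshold, not from the source
const smallPlan = planCache(1024, PRESET_BYTES);
const largePlan = planCache(PRESET_BYTES + 1, PRESET_BYTES);
```

Note that a data size exactly equal to the preset size falls into the single-cache branch, matching the "less than or equal to" wording above.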
[0066] After the ETL data processing in steps S10 to S30, the target display data cached in the browser memory and / or WASM memory is displayed through the WASM container. It should be noted that this embodiment supports multiple data formats and types because the WASM container supports and communicates with web pages via binary streams, thus enabling the display of various document types and supporting document download. Furthermore, the browser aggregates the target display data obtained after the ETL data processing according to the needs of the data object. The final data display format may include, but is not limited to, chart data display, relational dataset display, streaming file display, and readable document display.
[0067] Furthermore, after displaying the target display data, the browser destroys the task set of this ETL data processing run and resets the WASM memory space, but keeps the default configuration and the complete data processing rules cached, so that the rules can still be checked after the results have been presented to the user.
[0068] Furthermore, after the data has been processed in the WASM virtual environment, a binary data stream is returned to the browser. The front-end JavaScript receives the binary data as an ArrayBuffer and parses and transforms it. After obtaining the data object result from WASM, the web page presents it in different ways depending on the data result, including UI rendering, file export and download, etc. When displaying complex data, WebGL can be used to render the returned data, since it can render large numbers of elements or pixels while maintaining good performance. Regular datasets or list information can be displayed by rendering DOM elements in HTML, which offers more flexible presentation options and saves resources.
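As an illustration of parsing a binary result on the JavaScript side, the sketch below decodes an ArrayBuffer with a DataView. The byte layout (a little-endian uint32 count followed by that many little-endian float64 values) and the function names are assumptions made for this example; the application does not specify a wire format.

```typescript
// Assumed layout (hypothetical): little-endian uint32 count, then `count`
// little-endian float64 values.
function encodeSeries(values: number[]): ArrayBuffer {
  const buffer = new ArrayBuffer(4 + values.length * 8);
  const view = new DataView(buffer);
  view.setUint32(0, values.length, true);
  values.forEach((v, i) => view.setFloat64(4 + i * 8, v, true));
  return buffer;
}

// Parse the binary result back into plain numbers for rendering.
function decodeSeries(buffer: ArrayBuffer): number[] {
  const view = new DataView(buffer);
  const count = view.getUint32(0, true);
  if (buffer.byteLength < 4 + count * 8) {
    throw new Error("buffer is shorter than its declared length");
  }
  const out: number[] = [];
  for (let i = 0; i < count; i++) {
    out.push(view.getFloat64(4 + i * 8, true));
  }
  return out;
}

const roundTrip = decodeSeries(encodeSeries([1.5, -2, 42]));
```

In a real page the buffer would come from the WASM module's exported memory rather than from `encodeSeries`, which stands in for the producer here.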
[0069] This embodiment provides a visual ETL data processing method. During the visual ETL data processing, it leverages WASM's independent and efficient loading capabilities on the web. Through the collaborative caching of WASM memory and browser memory, it integrates the atomic services in mainstream ETL models with WASM, reducing reliance on the server side and achieving a fully in-memory visual application service on the browser side. Further, it can be understood as a new service model for ETL that integrates and customizes WASM. Combining WASM's characteristics, the entire ETL process is deeply customized. During initialization, the WASM container loads the necessary libraries and environment resources for ETL. Data and rules during the ETL process are stored through a dual-caching collaboration between WASM memory and browser memory. Result data or file conversions are processed using independently loaded WASM, and then visualized and interacted with on the web using JavaScript. In other words, users can perform online ETL data processing anytime, anywhere by opening a web browser to meet their data needs, such as self-service data analysis, without relying on server capabilities, running the processing entirely on the user's local resources.
[0070] Further, steps S101 to S103 are described as follows:
[0071] Step S101: Based on the ETL rules, the websql and the configuration parser of the WASM container manage, configure and parse the data source to be processed to obtain parsed data, and store the parsed data in a JS variable;
[0072] Step S102: Invoke the preset method in the WASM container to instantiate the WASM module, and pass the JS variable as parameter data to the instantiated WASM module;
[0073] Step S103: Combine the configuration parser and the source protocol converter of the WASM container to parse, convert, and load the JS variables in the instantiated WASM module into memory to obtain the browser data.
[0074] Specifically, the browser performs SQL-based management and configuration of the data source to be processed according to the WebSQL in the ETL rules. Simultaneously, the browser utilizes WebSQL's SQL semantic parsing capabilities, combined with the configuration parser built into the WASM container, to perform the first configuration parsing. Then, the configuration parser obtains the WebSQL syntax tree for a second configuration parsing, resulting in parsed data, as described in steps S1031 to S1033. The browser then stores the parsed data in a JS variable.
[0075] Furthermore, the browser calls the preset method of the WASM container to instantiate the WASM module. In this embodiment, the preset method can be the instantiate method. That is, the browser calls the instantiate method in the WASM container to compile and instantiate the WASM module from the underlying source, and memory space is created when the WASM module is instantiated.
[0076] Furthermore, the browser passes the JS variables of the parsed data as parameters to the instantiated WASM module. The parser then performs the parsing, and the source protocol converter within the WASM container completes the transformation and memory loading of the JS variables, resulting in the browser data. It's important to note that data from different protocols is loaded into WASM memory via the source protocol converter, while the JS variables of the parsed data are persistently stored using localForageJS. The source protocol converter achieves unified conversion of files to memory, streams to memory, and table data to memory based on heterogeneous data sources and multi-protocol data conversion.
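The unified conversion performed by the source protocol converter (file to memory, stream to memory, table to memory) might look roughly like the sketch below. Everything here is an assumption for illustration: the `InputSource` variants, the `DataRow` shape, and the naive CSV handling (no quoted fields) are not taken from the application.

```typescript
// Hypothetical unified in-memory row shape.
type DataRow = Record<string, string>;

// Hypothetical heterogeneous inputs: a CSV file, a JSON-lines stream, a table.
type InputSource =
  | { kind: "file"; csvText: string }
  | { kind: "stream"; jsonLines: string[] }
  | { kind: "table"; columns: string[]; rows: string[][] };

// Pair column names with cell values; missing cells become empty strings.
function zipRow(columns: string[], cells: string[]): DataRow {
  const row: DataRow = {};
  columns.forEach((column, i) => {
    row[column] = cells[i] ?? "";
  });
  return row;
}

// Convert any supported source into the same in-memory representation.
function toMemory(source: InputSource): DataRow[] {
  switch (source.kind) {
    case "file": {
      // Naive CSV: first line is the header, fields contain no commas.
      const [headerLine = "", ...body] = source.csvText.trim().split("\n");
      const header = headerLine.split(",");
      return body.map((line) => zipRow(header, line.split(",")));
    }
    case "stream":
      return source.jsonLines.map((line) => {
        const parsed = JSON.parse(line) as Record<string, unknown>;
        const row: DataRow = {};
        for (const key of Object.keys(parsed)) {
          row[key] = String(parsed[key]);
        }
        return row;
      });
    case "table":
      return source.rows.map((cells) => zipRow(source.columns, cells));
  }
}
```

The point of the sketch is only that downstream steps see one row format regardless of where the data came from.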
[0077] It should be noted that, in order to ensure the input of mainstream data types, the data types in this embodiment are only used for the configuration submission of the dataset, and both formatted and unformatted data are supported, such as SQL, Excel, JSON, HDFS, and XML formats.
[0078] This application embodiment adopts a visualized management approach for the entire ETL process. Instead of combining front-end and back-end operations for online ETL, the entire ETL service processing process is migrated to the web for visualized operation and management. The process status can be managed, and manual intervention can be performed online at any time. Cached or temporarily stored data during the ETL process can be viewed at any time.
[0079] Further, steps S1031 to S1033 are described as follows:
[0080] Step S1031: Perform SQL-based management and configuration of the data source to be processed using the websql to obtain the configuration data to be processed;
[0081] Step S1032: Combining the semantic parsing capability of the websql and the configuration parser, perform the first configuration parsing on the configuration data to be processed to obtain the parsed data to be processed;
[0082] Step S1033: Combine the semantic parsing data of the websql with the configuration parser to perform a second configuration parsing on the parsed data to be processed, and obtain the parsed data.
[0083] Specifically, the browser performs SQL-based management and configuration of the data source to be processed according to the WebSQL in the ETL rules, obtaining the configuration data to be processed. Next, the browser uses WebSQL's SQL semantic parsing capabilities, combined with the configuration parser built into the WASM container, to perform the first configuration parsing of the configuration data to be processed, obtaining the parsed data to be processed. Finally, the browser uses the configuration parser to obtain the WebSQL syntax tree to perform a second configuration parsing of the parsed data to be processed, obtaining the parsed data.
[0084] This application embodiment uses online ETL data processing, which can be accessed and operated online at any time, and is executed entirely locally. It does not depend on the communication efficiency and resource pressure of the server, and the use of data in the cache is not affected when the page is refreshed.
[0085] Further, steps S301 to S304 are described as follows:
[0086] Step S301: Load the algorithm rule base to be used and the browser data into the sandboxed execution environment running the WASM container for conversion and extraction to obtain the first data to be displayed;
[0087] Step S302: The first data to be displayed is cleaned and repaired using a high-dimensional time series misalignment detection and repair algorithm to obtain the second data to be displayed;
[0088] Step S303: Group the second data to be displayed, generate data objects for each group, and perform data transformation on the data objects of each group using the built-in function library in the WASM container to obtain the transformation results of each group;
[0089] Step S304: The main process aggregates the transformation results of each group to obtain the target display data.
[0090] Specifically, the browser calls the corresponding methods exposed in the instantiated WASM module object via JavaScript to load the algorithm rule base to be used and the browser data into the sandboxed execution environment in which the WASM container runs. After loading the algorithm rule base to be used and the browser data into the WASM container, the browser encapsulates the data according to the data type of the browser data, encapsulating it into the first data to be displayed in different data objects, as described in steps S3011 to S3013.
[0091] Furthermore, because data sources are of many types, this application embodiment uses the DataX framework to address heterogeneous data source synchronization and multi-protocol data source conversion: the complex mesh-structured data links are converted into star-shaped data links, which simplifies the links while enabling seamless connection between different data sources. The WASM container described in this application embodiment acts as the intermediate transmission carrier responsible for connecting the various data sources. Specifically, the Reader data access module sends the algorithm rule base to be used and the browser data to the WASM container, while the Writer data reading module reads data from the WASM container and writes it to the destination, that is, a unified output source, WASM memory. The Reader data access module and the Writer data reading module are responsible for converting heterogeneous data and protocols, and a protocol converter interfaces with both modules. As the core component, the protocol converter employs mechanisms such as double-buffered queues and sliding window flow control to handle the problems that arise in high-speed data exchange.
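A double-buffered queue of the kind mentioned for the protocol converter can be sketched as below. This is a generic illustration under stated assumptions: the class name `DoubleBufferQueue`, its `offer`/`poll` methods, and the fixed capacity used as a crude form of flow control are hypothetical, and sliding window flow control is not modelled.

```typescript
// Hypothetical double-buffered queue: the producer fills one buffer while the
// consumer drains the other; the buffers swap when the read side runs dry.
class DoubleBufferQueue<T> {
  private writeBuffer: T[] = [];
  private readBuffer: T[] = [];
  private readonly capacity: number;

  constructor(capacity: number) {
    this.capacity = capacity;
  }

  // Returns false when the write buffer is full, signalling back-pressure.
  offer(item: T): boolean {
    if (this.writeBuffer.length >= this.capacity) {
      return false;
    }
    this.writeBuffer.push(item);
    return true;
  }

  // Takes the oldest item; swaps the buffers first if the read side is empty.
  poll(): T | undefined {
    if (this.readBuffer.length === 0) {
      const drained = this.readBuffer;
      this.readBuffer = this.writeBuffer;
      this.writeBuffer = drained;
    }
    return this.readBuffer.shift();
  }
}
```

Because the swap only happens when the read buffer is empty, items are still delivered in first-in, first-out order.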
[0092] Furthermore, the browser performs data cleaning and transformation based on the first data to be displayed obtained after conversion by the protocol converter and the algorithm rule base to be used. In this embodiment, the algorithm rule base to be used may include a misalignment detection and repair algorithm in high-dimensional time series. Therefore, it can be understood that the browser cleans and transforms the first data to be displayed using a misalignment detection and repair algorithm in high-dimensional time series to obtain the second data to be displayed. The data cleaning and transformation includes sequence anomaly pattern detection, misalignment matching detection algorithm, and misalignment matching filtering, as described in steps S3021 to S3024. It should be noted that the purpose of data cleaning and transformation is to remove abnormal data. Therefore, data cleaning and transformation can be understood as the browser removing abnormal data (abnormal data items and abnormal data columns) from the first data to be displayed.
[0093] Furthermore, the browser groups the second data to be displayed according to its key value, generating different data objects for each group. Next, the browser uses the WASM container's built-in function library to transform the data objects of each group, obtaining the transformation results for each group, and stores these results in the cache nodes of the WASM cache pool. Once all group transformation tasks are complete, the main JS process is notified. Finally, the browser aggregates the transformation results of each group through the main JS process to obtain the target display data.
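The group, transform, and aggregate flow of steps S303 and S304 can be sketched as follows. The types `KeyedItem` and `GroupResult` and the count/sum transformation are placeholders chosen for this example; in the embodiment the per-group transformation would be performed by the WASM container's built-in function library, which is not specified here.

```typescript
// Hypothetical record and result shapes.
interface KeyedItem {
  key: string;
  value: number;
}

interface GroupResult {
  key: string;
  count: number;
  sum: number;
}

// Step S303, part 1: group the cleaned data by key value.
function groupByKey(items: KeyedItem[]): Map<string, KeyedItem[]> {
  const groups = new Map<string, KeyedItem[]>();
  for (const item of items) {
    const bucket = groups.get(item.key);
    if (bucket) {
      bucket.push(item);
    } else {
      groups.set(item.key, [item]);
    }
  }
  return groups;
}

// Step S303, part 2: stand-in for a built-in transformation of one group.
function transformGroup(key: string, group: KeyedItem[]): GroupResult {
  const sum = group.reduce((acc, item) => acc + item.value, 0);
  return { key, count: group.length, sum };
}

// Step S304: the main process aggregates the per-group results.
function runPipeline(items: KeyedItem[]): Map<string, GroupResult> {
  const aggregated = new Map<string, GroupResult>();
  groupByKey(items).forEach((group, key) => {
    aggregated.set(key, transformGroup(key, group));
  });
  return aggregated;
}
```

Here the groups are processed sequentially; the embodiment instead stores each group's result in a cache node and notifies the main JS process once all group tasks have finished.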
[0094] This application demonstrates the application of a mismatch detection algorithm in a new scenario of WASM processing ETL. The algorithm is customized and modified for the WASM front-end, and abnormal columns are filtered to achieve a lightweight ETL cleaning algorithm, thus completing the fusion cleaning on WASM.
[0095] Further, steps S3011 to S3013 are described as follows:
[0096] Step S3011: Load the algorithm rule base to be used and the browser data, through the Reader data access module, into the sandboxed execution environment in which the WASM container runs;
[0097] Step S3012: The browser data is encapsulated by combining the sandboxed execution environment with the data type of the browser data to obtain the encapsulated browser data.
[0098] Step S3013: Extract the encapsulated data from the WASM container using the Writer data reading module to obtain the first data to be displayed.
[0099] Specifically, the browser loads the algorithm rule base and browser data to be used into the sandboxed execution environment where the WASM container runs through the Reader data access module. Next, the browser encapsulates the browser data based on the data type within the sandboxed execution environment, resulting in encapsulated browser data. Finally, the browser extracts the encapsulated data from the WASM container through the Writer data reading module, obtaining the first set of data to be displayed.
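One way to picture steps S3011 to S3013 is a Reader that writes typed values into a flat byte buffer and a Writer that reads them back. The sketch below is an assumption-laden illustration in Python: the `bytearray` stands in for WASM linear memory, and the one-byte type tags, the 9-byte record layout, and the names `reader_load` and `writer_extract` are all hypothetical.

```python
import struct
from typing import List, Union

Number = Union[int, float]

# Hypothetical type tags used by the encapsulation step.
TAG_INT, TAG_FLOAT = 0, 1
RECORD_SIZE = 9  # 1-byte tag + 8-byte little-endian payload


def reader_load(memory: bytearray, values: List[Number]) -> int:
    """Encapsulate each value by its data type and write it into `memory`.
    Returns the number of bytes used."""
    offset = 0
    for v in values:
        if isinstance(v, int):
            struct.pack_into("<Bq", memory, offset, TAG_INT, v)
        else:
            struct.pack_into("<Bd", memory, offset, TAG_FLOAT, v)
        offset += RECORD_SIZE
    return offset


def writer_extract(memory: bytearray, used: int) -> List[Number]:
    """Read the encapsulated records back out of `memory`."""
    out: List[Number] = []
    for offset in range(0, used, RECORD_SIZE):
        tag = memory[offset]
        fmt = "<q" if tag == TAG_INT else "<d"
        (value,) = struct.unpack_from(fmt, memory, offset + 1)
        out.append(value)
    return out


memory = bytearray(64)           # stand-in for the sandboxed linear memory
used = reader_load(memory, [7, 2.5, -1])
first_data = writer_extract(memory, used)
```

Tagging each record with its type lets the Writer recover integers and floats without any out-of-band schema, which is the point of encapsulating by data type.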
[0100] Further, steps S3021 to S3024 are described as follows:
[0101] Step S3021: Model each first data sequence of the first data to be displayed, and obtain the matching degree value of each first data sequence through normal pattern feature analysis;
[0102] Step S3022: Divide the intermediate data of the data source to be processed into a preset number of second data sequences according to the time dimension, and construct a data cleaning matrix based on each second data sequence;
[0103] Step S3023: Based on the data cleaning matrix and the matching degree value of each of the first data sequences, perform sequence abnormality pattern detection to determine abnormal data in the first data to be displayed;
[0104] Step S3024: According to the abnormal filtering rules, the first data to be displayed is matched against the abnormal data to obtain the second data to be displayed.
[0105] Specifically, the browser first models each first data sequence in the first data to be displayed. During detection, normal pattern feature analysis takes the predicted degree to which the first data sequence of a column matches the model for that column as the matching degree. At the same time, the browser records the set of mismatched columns and persists it. Next, the browser extracts the intermediate data of the data source to be processed and divides it along the time dimension into a preset number k of distinct second data sequences. These k second data sequences can be denoted S_k^m, where m denotes a data group, divided according to user-defined business lines. The browser then places the k distinct second data sequences into the l-th segment of S, so that the k-th sequence within the l-th segment is represented as... It should be noted that, in this embodiment, the correlation of the data sequences is calculated using a covariance matrix, denoted ST_l, which gives the correlation of the k sequences within the l-th segment of group S. The correlation of the k-th sequence within the l-th segment of group S is:
[0106]
[0107] Here the element R_ij represents the correlation parameter between the i-th sequence and the j-th sequence, with values in the range [-1, 1]. The values of R in the matrix are calculated by the following formula.
[0108]
[0109]
[0110] Here S_average represents the average value of all sequence data points. The correlation cleaning result data matrix ST of the k data sequences can be obtained from the above formula.
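A correlation matrix with entries in [-1, 1] can be sketched as below. This is a minimal illustration that assumes R is the standard Pearson correlation coefficient computed from each sequence's own mean; the exact formulas of this embodiment are not reproduced here, so the choice of Pearson correlation, the per-sequence mean, and the function name `correlation_matrix` are assumptions.

```python
import math
from typing import List


def correlation_matrix(sequences: List[List[float]]) -> List[List[float]]:
    """Return the k x k matrix of Pearson correlations between k
    equal-length sequences. Constant sequences get correlation 0.0
    (a convention chosen for this sketch)."""
    k = len(sequences)
    n = len(sequences[0])
    means = [sum(s) / n for s in sequences]
    centered = [[x - m for x in s] for s, m in zip(sequences, means)]
    norms = [math.sqrt(sum(d * d for d in c)) for c in centered]

    matrix = [[0.0] * k for _ in range(k)]
    for i in range(k):
        for j in range(k):
            if norms[i] == 0 or norms[j] == 0:
                continue  # correlation undefined; leave 0.0
            dot = sum(a * b for a, b in zip(centered[i], centered[j]))
            matrix[i][j] = dot / (norms[i] * norms[j])
    return matrix


# Three short sequences: the second rises with the first, the third falls.
st = correlation_matrix([[1.0, 2.0, 3.0], [2.0, 4.0, 6.0], [3.0, 2.0, 1.0]])
```

Here `st[0][1]` is close to 1 and `st[0][2]` is close to -1, which matches the stated range of R.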
[0111] Furthermore, the browser performs sequence anomaly pattern detection based on the data cleaning matrix and the matching degree value of each first data sequence to identify abnormal data in the first data to be displayed. Cleaning the data structure against the data cleaning matrix and then calculating the data sequence matching degree improves the efficiency of this fast anomaly detection method for data cleaning.
[0112] After the browser identifies abnormal data in the first data to be displayed, it checks the abnormal data (abnormal items and abnormal columns) in batches against the abnormal filtering rules. This check yields all abnormal items to be cleaned. Once all of these abnormal items have been filtered out of the data's sequence set, the required normal columns remain, which completes the cleaning of the entire data sequence and yields the second data to be displayed.
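The filtering step can be illustrated with a small sketch. The rule used here is an assumption made only for illustration: a column is treated as abnormal when its matching degree falls below `min_match` or its mean absolute correlation with the other columns falls below `min_corr`. The thresholds, the column names, and both function names are hypothetical and are not taken from this embodiment.

```python
from typing import Dict, List


def find_abnormal_columns(
    names: List[str],
    match_degree: Dict[str, float],
    corr: List[List[float]],
    min_match: float = 0.8,
    min_corr: float = 0.5,
) -> List[str]:
    """Flag columns that match their model poorly or correlate weakly
    with the other columns (hypothetical filtering rule)."""
    abnormal: List[str] = []
    k = len(names)
    for i, name in enumerate(names):
        others = [abs(corr[i][j]) for j in range(k) if j != i]
        mean_corr = sum(others) / len(others) if others else 1.0
        if match_degree[name] < min_match or mean_corr < min_corr:
            abnormal.append(name)
    return abnormal


def drop_columns(table: Dict[str, List[float]], abnormal: List[str]) -> Dict[str, List[float]]:
    """Return the table with the abnormal columns filtered out."""
    return {name: col for name, col in table.items() if name not in abnormal}


names = ["cpu", "mem", "noise"]
match_degree = {"cpu": 0.95, "mem": 0.9, "noise": 0.4}
corr = [
    [1.0, 0.9, 0.2],
    [0.9, 1.0, 0.2],
    [0.2, 0.2, 1.0],
]
table = {"cpu": [1.0, 2.0], "mem": [2.0, 4.0], "noise": [9.0, -3.0]}

abnormal = find_abnormal_columns(names, match_degree, corr)
second_data = drop_columns(table, abnormal)
```

With these inputs only the `noise` column is flagged, and `second_data` keeps the `cpu` and `mem` columns.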
[0113] This application demonstrates the application of a mismatch detection algorithm in a novel ETL processing scenario using WASM. The algorithm has been customized for the WASM front-end. Correlation analysis of the data sequence is calculated using the covariance matrix, and the average value of all sequence data points is matched against the target. Abnormal columns are filtered out to achieve a lightweight ETL cleaning algorithm, completing the fusion cleaning process on WASM.
[0114] Further, steps S401 to S402 are described as follows:
[0115] Step S401: If the amount of data is greater than the preset amount of data, then the target display data is cached in a dual collaborative manner using the browser memory and the WASM memory.
[0116] Step S402: If the data volume is less than or equal to the preset data volume, then the target display data is cached using the browser memory or the WASM memory.
[0117] Specifically, if the data volume is determined to be greater than the preset data volume, the browser performs dual collaborative caching of the target display data using both browser memory and the WASM memory. If the data volume is determined to be less than or equal to the preset data volume, the browser performs single collaborative caching of the target display data using either browser memory or WASM memory.
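The threshold decision in steps S401 and S402 can be sketched as follows. The comparison against the preset data volume follows the text; how the data is divided between the two memories in dual mode is not specified, so the split at `preset_volume` bytes, the use of browser memory in single mode, and all names below are assumptions of this sketch.

```python
from enum import Enum
from typing import Dict


class CacheMode(Enum):
    SINGLE = "single"  # browser memory or WASM memory
    DUAL = "dual"      # browser memory and WASM memory together


def choose_cache_mode(data_volume: int, preset_volume: int) -> CacheMode:
    """Dual caching only when the volume exceeds the preset volume."""
    return CacheMode.DUAL if data_volume > preset_volume else CacheMode.SINGLE


def cache_target_data(
    data: bytes,
    preset_volume: int,
    browser_cache: Dict[str, bytes],
    wasm_cache: Dict[str, bytes],
    key: str = "target",
) -> CacheMode:
    """Store `data` according to the chosen mode. In dual mode the first
    `preset_volume` bytes go to browser memory and the rest to WASM
    memory (a hypothetical split policy)."""
    mode = choose_cache_mode(len(data), preset_volume)
    if mode is CacheMode.DUAL:
        browser_cache[key] = data[:preset_volume]
        wasm_cache[key] = data[preset_volume:]
    else:
        browser_cache[key] = data
    return mode


browser_cache: Dict[str, bytes] = {}
wasm_cache: Dict[str, bytes] = {}
mode = cache_target_data(b"x" * 10, 4, browser_cache, wasm_cache)
```

Note that a volume exactly equal to the preset volume takes the single-cache branch, matching the "less than or equal to" wording of step S402.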
[0118] This application embodiment utilizes WASM's independent and efficient loading characteristics on the web, and through the collaborative caching of WASM memory and browser memory, integrates the atomic services in the mainstream ETL model with WASM, weakens the dependence on the server side, and realizes a fully in-memory visual application service on the browser side.
[0119] In this embodiment, since the application services in steps S10 to S30 are located on the page and within the WASM within the page, a mechanism needs to be designed to support their mutual invocation during visual operations. Furthermore, to handle larger data volumes during ETL processing in WASM, it's necessary to leverage the characteristics of different caches as much as possible. Therefore, a dual-caching collaborative mechanism is implemented, involving cooperation between WASM memory and browser persistent cache. Further, because the client automatically manages memory through the browser engine's garbage collection mechanism, each browser limits the amount of memory that JavaScript can use, although these limits vary slightly. This undoubtedly limits the browser's ability to process large amounts of data. Therefore, a dual-caching collaborative processing mode using browser memory and WASM memory is considered. Relying on WASM's high computational power, allocable large-capacity memory mechanism, and memory management mechanism that allows manual creation and destruction of memory, combined with localForageJS for client-side persistent storage, the memory usage limitations imposed by the browser's inherent characteristics are overcome.
[0120] Furthermore, after the ETL dataset is extracted, it must be stored temporarily on the client side. This embodiment uses WASM-based Virtual Memory System (VAS) management, which shields the underlying layers and brings several benefits: programming against virtual addresses is easier; access control prevents illegal memory accesses; efficiency is high, since the virtual memory address space can be larger than physical memory; and allocation is flexible (caching, LRU). Besides segmented management of memory, the ETL data cleaning process can also be segmented, where segmentation mainly describes how the program organizes its data. Virtual memory alignment management is also implemented: if the effective address of a memory access is a multiple of the access's alignment attribute, the access is aligned; otherwise, it is unaligned. Aligned and unaligned accesses behave the same, but aligned accesses improve CPU processing speed.
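The alignment rule stated above reduces to a divisibility check. The sketch below illustrates it in Python; the helper names are hypothetical, and the restriction to power-of-two alignments reflects how WebAssembly expresses alignment rather than anything stated in this embodiment.

```python
def _check_alignment(alignment: int) -> None:
    """Reject alignments that are not positive powers of two."""
    if alignment <= 0 or alignment & (alignment - 1):
        raise ValueError("alignment must be a positive power of two")


def is_aligned(address: int, alignment: int) -> bool:
    """An access is aligned when its effective address is a multiple
    of the alignment attribute."""
    _check_alignment(alignment)
    return address % alignment == 0


def align_up(address: int, alignment: int) -> int:
    """Round `address` up to the next multiple of `alignment`."""
    _check_alignment(alignment)
    return (address + alignment - 1) & ~(alignment - 1)


aligned_offset = align_up(13, 8)
```

An allocator can use `align_up` to place each buffer at an aligned offset, so that subsequent 8-byte loads and stores take the faster aligned path.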
[0121] During the data processing and transformation process in step S30 above, based on the size and frequency of use of the original data, intermediate data, and result data, the data is processed and transformed from the browser's persistent cache to the memory in WASM, and then back to the browser's persistent cache. This collaborative process maximizes the utilization of the client's capabilities, overcomes the limitations of the browser's own characteristics, and completes the processing and transformation of large amounts of data.
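The movement of data between the browser's persistent cache and WASM memory can be sketched as a two-tier cache with least-recently-used eviction. This is a minimal illustration under stated assumptions: the class name `TieredCache`, the item-count capacity, and the use of LRU as the eviction policy are choices made for the sketch, and plain Python containers stand in for WASM memory and the persistent cache.

```python
from collections import OrderedDict
from typing import Any, Dict


class TieredCache:
    """Hot tier (stand-in for WASM memory) backed by a cold tier
    (stand-in for the browser's persistent cache)."""

    def __init__(self, wasm_capacity: int) -> None:
        self.wasm_capacity = wasm_capacity
        self.wasm_memory: "OrderedDict[str, Any]" = OrderedDict()
        self.persistent: Dict[str, Any] = {}

    def put(self, key: str, value: Any) -> None:
        """Place a value in the hot tier, spilling the least recently
        used entries to the cold tier when capacity is exceeded."""
        self.wasm_memory[key] = value
        self.wasm_memory.move_to_end(key)
        self.persistent.pop(key, None)
        while len(self.wasm_memory) > self.wasm_capacity:
            old_key, old_value = self.wasm_memory.popitem(last=False)
            self.persistent[old_key] = old_value

    def get(self, key: str) -> Any:
        """Return a value, promoting it from the cold tier if needed.
        Raises KeyError for unknown keys."""
        if key in self.wasm_memory:
            self.wasm_memory.move_to_end(key)
            return self.wasm_memory[key]
        value = self.persistent.pop(key)
        self.put(key, value)
        return value


cache = TieredCache(wasm_capacity=2)
cache.put("raw", [1, 2, 3])
cache.put("intermediate", [2, 4, 6])
cache.put("result", [12])      # "raw" spills to the persistent tier
raw = cache.get("raw")         # "raw" is promoted, "intermediate" spills
```

A real implementation would weigh entries by byte size and access frequency, as the text describes, rather than by item count.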
[0122] Furthermore, the visualization ETL data processing apparatus provided in this application will be described below. The visualization ETL data processing apparatus described below can be referred to in correspondence with the visualization ETL data processing method described above.
[0123] Figure 2 is a schematic diagram of the structure of the visual ETL data processing device provided in this application. As shown in Figure 2, the visual ETL data processing device includes:
[0124] The configuration module 201 is used to determine the data source to be processed and the ETL rules in the ETL visualization operation interface, and to perform configuration management through the ETL rules and the data source to be processed to obtain browser data.
[0125] The determination module 202 is used to determine the algorithm rule base to be used in the browser data in the WASM container;
[0126] The processing and determination module 203 is used to perform ETL data processing on the browser data according to the algorithm rule library to be used, to obtain the target display data, and to determine the data volume of the target display data;
[0127] The cache display module 204 is used to cache the target display data in browser memory and / or WASM memory according to the data volume, and to visualize the target display data cached in browser memory and / or WASM memory through the WASM container.
[0128] Furthermore, the configuration module 201 is also used for:
[0129] Based on the WebSQL in the ETL rules and the configuration parser of the WASM container, the data source to be processed is managed, configured, and parsed to obtain parsed data, and the parsed data is stored in JS variables;
[0130] The WASM module is instantiated by calling a preset method in the WASM container, and the JS variable is passed as parameter data to the instantiated WASM module.
[0131] The configuration parser and the source protocol converter of the WASM container are used to parse, transform, and load the JS variables in the instantiated WASM module into memory to obtain the browser data.
[0132] Furthermore, the configuration module 201 is also used for:
[0133] The WebSQL is used to perform SQL-based management and configuration of the data source to be processed, thereby obtaining the configuration data to be processed.
[0134] The semantic parsing capability of the WebSQL and the configuration parser are combined to perform the first configuration parsing on the configuration data to be processed, and the parsed data to be processed is obtained.
[0135] The semantic parsing data of the WebSQL and the configuration parser are combined to perform a second configuration parsing on the parsing data to be processed, thereby obtaining the parsed data.
[0136] Furthermore, the cache display module 204 is also used for:
[0137] If the amount of data is greater than the preset amount of data, the target display data is cached in a dual collaborative manner using the browser memory and the WASM memory;
[0138] If the data volume is less than or equal to the preset data volume, the target display data is cached using the browser memory or the WASM memory in a single collaborative manner.
[0139] Furthermore, the processing and determination module 203 is also used for:
[0140] The algorithm rule base to be used and the browser data are loaded into the sandboxed execution environment running the WASM container for transformation and extraction to obtain the first data to be displayed;
[0141] The first data to be displayed is cleaned and repaired using a high-dimensional time series misalignment detection and repair algorithm to obtain the second data to be displayed.
[0142] The second data to be displayed is grouped to generate data objects for each group, and the data objects of each group are transformed using the built-in function library in the WASM container to obtain the transformation results of each group.
[0143] The main process aggregates the transformation results of each group to obtain the target display data.
[0144] Furthermore, the processing and determination module 203 is also used for:
[0145] The Reader data access module loads the algorithm rule base to be used and the browser data into the sandboxed execution environment running the WASM container;
[0146] The browser data is encapsulated by combining the sandboxed execution environment with the data type of the browser data to obtain the encapsulated data of the browser data;
[0147] The Writer data reading module extracts the encapsulated data from the WASM container to obtain the first data to be displayed.
[0148] Furthermore, the processing and determination module 203 is also used for:
[0149] Model each first data sequence of the first data to be displayed, and obtain the matching degree value of each first data sequence through normal pattern feature analysis;
[0150] The intermediate data from the data source to be processed is divided into a preset number of second data sequences according to the time dimension, and a data cleaning matrix is constructed based on each of the second data sequences.
[0151] Based on the data cleaning matrix and the matching degree value of each of the first data sequences, sequence anomaly pattern detection is performed to determine the abnormal data in the first data to be displayed.
[0152] Based on the anomaly filtering rules and the abnormal data, misalignment matching is performed on the first data to be displayed to obtain the second data to be displayed.
[0153] The specific embodiments of the visualization ETL data processing device provided in this application are basically the same as the embodiments of the visualization ETL data processing method described above, and will not be repeated here.
[0154] Figure 3 is a schematic diagram of the physical structure of an example electronic device. As shown in Figure 3, the electronic device may include: a processor 310, a communication interface 320, a memory 330, and a communication bus 340, wherein the processor 310, the communication interface 320, and the memory 330 communicate with each other via the communication bus 340. The processor 310 can call logical instructions in the memory 330 to execute a visual ETL data processing method, which includes:
[0155] The system identifies the data source to be processed and the ETL rules in the ETL visualization interface, and manages the configuration using the ETL rules and the data source to be processed to obtain browser data.
[0156] Determine the algorithm rule base to be used in the WASM container for the browser data;
[0157] The browser data is processed by ETL according to the algorithm rule base to be used to obtain the target display data, and the data volume of the target display data is determined.
[0158] Based on the data volume, the target display data is cached in browser memory and / or WASM memory, and the target display data cached in browser memory and / or WASM memory is visualized through the WASM container.
[0159] Furthermore, the logical instructions in the aforementioned memory 330 can be implemented as software functional units and, when sold or used as independent products, can be stored in a computer-readable storage medium. Based on this understanding, the technical solution of this application, in essence, or the part that contributes to the prior art, or a part of the technical solution, can be embodied in the form of a software product. This computer software product is stored in a storage medium and includes several instructions to cause a computer device (which may be a personal computer, server, or network device, etc.) to execute all or part of the steps of the methods described in the various embodiments of this application. The aforementioned storage medium includes various media capable of storing program code, such as USB flash drives, portable hard drives, read-only memory (ROM), random access memory (RAM), magnetic disks, or optical disks.
[0160] On the other hand, this application also provides a computer program product, which includes a computer program stored on a non-transitory computer-readable storage medium. The computer program includes program instructions, and when the program instructions are executed by a computer, the computer is able to execute the visualization ETL data processing method provided by the above methods, the method including:
[0161] The system identifies the data source to be processed and the ETL rules in the ETL visualization interface, and manages the configuration using the ETL rules and the data source to be processed to obtain browser data.
[0162] Determine the algorithm rule base to be used in the WASM container for the browser data;
[0163] The browser data is processed by ETL according to the algorithm rule base to be used to obtain the target display data, and the data volume of the target display data is determined.
[0164] Based on the data volume, the target display data is cached in browser memory and / or WASM memory, and the target display data cached in browser memory and / or WASM memory is visualized through the WASM container.
[0165] In another aspect, this application also provides a non-transitory computer-readable storage medium storing a computer program which, when executed by a processor, performs the aforementioned visualization ETL data processing method, the method comprising:
[0166] The system identifies the data source to be processed and the ETL rules in the ETL visualization interface, and manages the configuration using the ETL rules and the data source to be processed to obtain browser data.
[0167] Determine the algorithm rule base to be used in the WASM container for the browser data;
[0168] The browser data is processed by ETL according to the algorithm rule base to be used to obtain the target display data, and the data volume of the target display data is determined.
[0169] Based on the data volume, the target display data is cached in browser memory and / or WASM memory, and the target display data cached in browser memory and / or WASM memory is visualized through the WASM container.
[0170] The device embodiments described above are merely illustrative. The units described as separate components may or may not be physically separate. The components shown as units may or may not be physical units; that is, they may be located in one place or distributed across multiple network units. Some or all of the modules can be selected to achieve the purpose of this embodiment according to actual needs. Those skilled in the art can understand and implement this without any creative effort.
[0171] Through the above description of the embodiments, those skilled in the art can clearly understand that each embodiment can be implemented by means of software plus necessary general-purpose hardware platforms, and of course, it can also be implemented by hardware. Based on this understanding, the above technical solutions, in essence or the part that contributes to the prior art, can be embodied in the form of a software product. This computer software product can be stored in a computer-readable storage medium, such as ROM / RAM, magnetic disk, optical disk, etc., and includes several instructions to cause a computer device (which may be a personal computer, server, or network device, etc.) to execute the methods described in the various embodiments or some parts of the embodiments.
[0172] Finally, it should be noted that the above embodiments are only used to illustrate the technical solutions of this application, and are not intended to limit them. Although this application has been described in detail with reference to the foregoing embodiments, those skilled in the art should understand that modifications can still be made to the technical solutions described in the foregoing embodiments, or equivalent substitutions can be made to some of the technical features. Such modifications or substitutions do not cause the essence of the corresponding technical solutions to deviate from the spirit and scope of the technical solutions of the embodiments of this application.
Claims
1. A visual ETL data processing method, characterized in that it includes: determining the data source to be processed and the ETL rules in the ETL visualization operation interface, and performing configuration management through the ETL rules and the data source to be processed to obtain browser data; determining the algorithm rule base to be used in the WASM container for the browser data; performing ETL data processing on the browser data according to the algorithm rule base to be used to obtain the target display data, and determining the data volume of the target display data; and caching the target display data in browser memory and / or WASM memory according to the data volume, and visualizing the target display data cached in browser memory and / or WASM memory through the WASM container; wherein the step of caching the target display data in browser memory and / or WASM memory according to the data volume includes: if the data volume is greater than the preset data volume, caching the target display data in a dual collaborative manner using the browser memory and the WASM memory; and if the data volume is less than or equal to the preset data volume, caching the target display data in a single collaborative manner using the browser memory or the WASM memory.
2. The visual ETL data processing method according to claim 1, characterized in that the process of performing configuration management through the ETL rules and the data source to be processed to obtain browser data includes: based on the WebSQL in the ETL rules and the configuration parser of the WASM container, managing, configuring, and parsing the data source to be processed to obtain parsed data, and storing the parsed data in JS variables; instantiating the WASM module by calling a preset method in the WASM container, and passing the JS variables as parameter data to the instantiated WASM module; and using the configuration parser and the source protocol converter of the WASM container to parse, transform, and load the JS variables in the instantiated WASM module into memory to obtain the browser data.
3. The visual ETL data processing method according to claim 2, characterized in that managing, configuring, and parsing the data source to be processed based on the WebSQL in the ETL rules and the configuration parser of the WASM container to obtain parsed data includes: using the WebSQL to perform SQL-based management and configuration of the data source to be processed, thereby obtaining the configuration data to be processed; combining the semantic parsing capability of the WebSQL and the configuration parser to perform the first configuration parsing on the configuration data to be processed, thereby obtaining the parsed data to be processed; and combining the semantic parsing data of the WebSQL and the configuration parser to perform a second configuration parsing on the parsed data to be processed, thereby obtaining the parsed data.
4. The visual ETL data processing method according to claim 1, characterized in that the step of performing ETL data processing on the browser data according to the algorithm rule base to be used to obtain the target display data includes: loading the algorithm rule base to be used and the browser data into the sandboxed execution environment run by the WASM container for transformation and extraction to obtain the first data to be displayed; cleaning and repairing the first data to be displayed using a high-dimensional time series misalignment detection and repair algorithm to obtain the second data to be displayed; grouping the second data to be displayed to generate data objects for each group, and transforming the data objects of each group using the built-in function library in the WASM container to obtain the transformation results of each group; and aggregating, by the main process, the transformation results of each group to obtain the target display data.
5. The visual ETL data processing method according to claim 4, characterized in that the process of loading the algorithm rule base to be used and the browser data into the sandboxed execution environment run by the WASM container for transformation and extraction to obtain the first data to be displayed includes: loading, by the Reader data access module, the algorithm rule base to be used and the browser data into the sandboxed execution environment run by the WASM container; encapsulating the browser data by combining the sandboxed execution environment with the data type of the browser data to obtain the encapsulated data of the browser data; and extracting, by the Writer data reading module, the encapsulated data from the WASM container to obtain the first data to be displayed.
6. The visual ETL data processing method according to claim 4, characterized in that the process of cleaning and repairing the first data to be displayed using a high-dimensional time series misalignment detection and repair algorithm to obtain the second data to be displayed includes: modeling each first data sequence of the first data to be displayed, and obtaining the matching degree value of each first data sequence through normal pattern feature analysis; dividing the intermediate data of the data source to be processed into a preset number of second data sequences according to the time dimension, and constructing a data cleaning matrix based on each of the second data sequences; performing sequence anomaly pattern detection based on the data cleaning matrix and the matching degree value of each of the first data sequences to determine the abnormal data in the first data to be displayed; and performing misalignment matching on the first data to be displayed based on the anomaly filtering rules and the abnormal data to obtain the second data to be displayed.
7. A visual ETL data processing device, characterized in that it includes: a configuration module, used to determine the data source to be processed and the ETL rules in the ETL visualization operation interface, and to perform configuration management through the ETL rules and the data source to be processed to obtain browser data; a determination module, used to determine the algorithm rule base to be used in the WASM container for the browser data; a processing and determination module, used to perform ETL data processing on the browser data according to the algorithm rule base to be used, to obtain the target display data, and to determine the data volume of the target display data; and a cache display module, used to cache the target display data in browser memory and / or WASM memory according to the data volume, and to visualize the target display data cached in browser memory and / or WASM memory through the WASM container; wherein the cache display module is also used to: if the data volume is greater than a preset data volume, cache the target display data in a dual collaborative manner using the browser memory and the WASM memory; and if the data volume is less than or equal to the preset data volume, cache the target display data in a single collaborative manner using the browser memory or the WASM memory.
8. An electronic device, the electronic device comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, characterized in that, when the processor executes the computer program, it implements the visual ETL data processing method according to any one of claims 1 to 6.
9. A non-transitory computer-readable storage medium, the non-transitory computer-readable storage medium comprising a computer program, characterized in that, when the computer program is executed by a processor, it implements the visual ETL data processing method according to any one of claims 1 to 6.