Data lake index creation methods, apparatus, electronic devices, and computer storage media

By acquiring dynamic data change information from the data lake, extracting data features, and automatically creating indexes, the problem of inflexible data lake index creation is solved, achieving efficient management of data lake indexes and improving query performance.

CN116186041B · Active · Publication Date: 2026-06-30 · CHINA MOBILE INFORMATION TECHNOLOGY CO LTD +1

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Patents(China)
Current Assignee / Owner
CHINA MOBILE INFORMATION TECHNOLOGY CO LTD
Filing Date
2023-02-21
Publication Date
2026-06-30

AI Technical Summary

Technical Problem

Existing data lake index creation is inflexible: indexes cannot be created automatically from the data distribution and value characteristics of massive datasets, so query analysis lacks flexibility and gains little performance benefit.

Method used

By acquiring dynamic data change information of the target data, extracting data features, and automatically creating data lake indexes based on data features, the system utilizes automatic feature engineering and real-time metadata exploration services to achieve automatic management and optimization of the indexes.

Benefits of technology

It improves the flexibility of data lake index creation and data retrieval performance, ensures that index creation matches data characteristics, and enhances the efficiency of data querying.

✦ Generated by Eureka AI based on patent content.

Abstract

This application discloses a method, apparatus, electronic device, and computer storage medium for creating a data lake index. When target data is acquired from the data lake, dynamic change information of the target data is obtained, and then data features of the target data are extracted based on this dynamic change information. Finally, a data lake index is created based on the extracted data features. In this way, during the dynamic data inflow process, the change records of the target data are collected and explored in real time, constructing and generating data features of the target data, thereby triggering the index management service to automatically create the index. This improves the flexibility of data lake index creation, and because the index is created specifically based on the data features of the inflow data, data retrieval performance can be improved when querying data based on the created data lake index.

Description

Technical Field

[0001] This application belongs to the field of big data technology, and in particular relates to a method, apparatus, electronic device and computer storage medium for creating a data lake index.

Background Technology

[0002] A data lake is defined as a highly scalable data storage area that stores large amounts of raw data in its original format until it is needed. Data lakes can store all types of data, with no fixed limits on account size or file size, and no defined specific purpose. The data comes from diverse sources and can be structured, semi-structured, or even unstructured, and can be queried on demand.

[0003] To achieve better data query and processing performance, existing data lakes typically rely on fixed indexes selected at the code level or fixed indexes defined by users. Because these indexes must be set in advance, index creation in current data lakes is inflexible.

Summary of the Invention

[0004] This application provides a data lake index creation method, apparatus, electronic device, and computer storage medium, which can automatically trigger index creation based on the data distribution and data value characteristics of massive data, thereby improving the flexibility of data lake index creation.

[0005] In a first aspect, embodiments of this application provide a method for creating a data lake index, which may include:

[0006] When target data is acquired from the data lake, obtain dynamic change information of the target data. The dynamic change information maps the transaction actions of the target data as it enters the data lake.

[0007] Extract data features of the target data based on the dynamic change information of the target data.

[0008] Create a data lake index based on the extracted data features of the target data.

[0009] In one embodiment, the above-mentioned extraction of data features of the target data based on the dynamic change information of the target data includes:

[0010] Generate a data change log file based on the dynamic changes in the target data;

[0011] Based on the data change log file, data features of the target data are extracted.

In one embodiment, the above-mentioned process of generating a data change log file based on the dynamic change information of the target data includes:

[0012] Record the action-type change information and the data statistics-type change information contained in the dynamic change information of the target data, and generate a data change log file. Action-type change information indicates insert, delete, or update actions on the target data; data statistics-type change information indicates changes arising from statistical analysis of the target data.

[0013] In one embodiment, the data change log file mentioned above includes at least one data change record of the target data;

[0014] Based on the data change log file, extract the data features of the target data, including:

[0015] Parse the data change log file to obtain at least one data change record;

[0016] Based on the at least one data change record and the type of the target data, feature construction is performed on the target data through automatic feature engineering to obtain the data features of the target data. The type of the target data is any one of the following: text type, numerical type, category type, geospatial type, date and time type, and multidimensional type.

[0017] In one embodiment, the automated feature engineering described above includes at least one feature primitive;

[0018] Based on at least one data change record and the type of the target data, feature construction is performed on the target data using automatic feature engineering to obtain the data features of the target data, including:

[0019] Based on at least one data change record, and for the type of target data, feature construction is performed on the target data using at least one feature primitive in automatic feature engineering to extract the data features of the target data.

[0020] In one embodiment, after creating a data lake index based on the data characteristics of the extracted target data, the process further includes:

[0021] When a query request is received, the query cost of each of the multiple preset query paths is calculated based on the data lake index, and the preset query path whose query cost meets the preset conditions is selected as the target query path.

[0022] Based on the target query path, search for the data corresponding to the query request.

[0023] In one embodiment, after creating a data lake index based on the data characteristics of the extracted target data, the process further includes:

[0024] Perform data feature analysis on the target data to obtain the data feature analysis results;

[0025] Based on the data feature analysis results of the target data, the preset data lake index is updated. The preset data lake index is the initial index set for the data lake.

[0026] In a second aspect, embodiments of this application provide a data lake index creation apparatus, which may include:

[0027] The acquisition module is used to acquire dynamic data change information of the target data when the target data is acquired from the data lake. The dynamic data change information is used to map the transaction actions of the target data during the process of entering the data lake.

[0028] The extraction module is used to extract data features from the target data based on the dynamic changes in the target data.

[0029] The creation module is used to create a data lake index based on the data characteristics of the extracted target data.

[0030] In a third aspect, embodiments of this application provide an electronic device, the device comprising:

[0031] a processor;

[0032] a memory for storing processor-executable instructions;

[0033] wherein the processor is configured to execute the instructions to implement the data lake index creation method described in any embodiment of the first aspect.

[0034] In a fourth aspect, embodiments of this application provide a computer storage medium on which a computer program is stored, which, when executed by a processor, implements the data lake index creation method described in any embodiment of the first aspect.

[0035] In a fifth aspect, embodiments of this application also provide a computer program product comprising a computer program stored in a readable storage medium, wherein at least one processor of the device reads the computer program from the storage medium and executes it, causing the device to perform the data lake index creation method described in any embodiment of the first aspect.

[0036] This application provides a data lake index creation method, apparatus, electronic device, and computer storage medium. Compared with the prior art, this application has the following advantages:

[0037] This application discloses a data lake index creation method, apparatus, electronic device, and computer storage medium. When target data is obtained from the data lake, dynamic change information of the target data is acquired, and data features of the target data are extracted based on this dynamic change information. Finally, a data lake index is created based on the extracted data features of the target data.

[0038] In this way, during the dynamic data inflow process, the change records of the target data are collected and explored in real time to construct the data characteristics of the target data, thereby triggering the index management service to automatically create the index. This improves the flexibility of data lake index creation. Furthermore, since the index is created specifically based on the data characteristics of the inflow data, the data retrieval performance can be improved when querying data based on the created data lake index.

Attached Figure Description

[0039] To more clearly illustrate the technical solutions of the embodiments of this application, the accompanying drawings used in the embodiments of this application will be briefly introduced below. For those skilled in the art, other drawings can be obtained based on these drawings without creative effort.

[0040] Figure 1 is a flowchart of a data lake index creation method provided in an embodiment of this application;

[0041] Figure 2 is a flowchart of another data lake index creation method provided in an embodiment of this application;

[0042] Figure 3 is a schematic diagram of the architecture of a data lake index creation system provided in an embodiment of this application;

[0043] Figure 4 is a schematic diagram of the structure of a data lake index creation apparatus provided in an embodiment of this application;

[0044] Figure 5 is a schematic diagram of the structure of an electronic device provided in an embodiment of this application.

Detailed Implementation

[0045] The features and exemplary embodiments of various aspects of this application will be described in detail below. To make the objectives, technical solutions, and advantages of this application clearer, the application will be further described in detail below with reference to the accompanying drawings and specific embodiments. It should be understood that the specific embodiments described herein are only intended to explain this application and not to limit it. For those skilled in the art, this application can be implemented without some of these specific details. The following description of the embodiments is merely to provide a better understanding of this application by illustrating examples.

[0046] It should be noted that, in this document, relational terms such as "first" and "second" are used merely to distinguish one entity or operation from another, and do not necessarily require or imply any such actual relationship or order between these entities or operations. Furthermore, the terms "comprising," "including," or any other variations thereof are intended to cover non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a process, method, article, or apparatus. Without further limitations, an element defined by the phrase "comprising..." does not exclude the presence of additional identical elements in the process, method, article, or apparatus that includes the element.

[0047] First, the technical terms used in this application are introduced.

[0048] 1. Data organization methods of data lakes

[0049] The data lake is a flexible and highly compatible centralized storage system proposed in the big data field in recent years. It is generally composed of storage systems such as HDFS, OSS, and S3, file formats such as ORC and Parquet, and open organization formats such as Iceberg and Hudi, which together form a unified data lake storage stack from the bottom up.

[0050] At the open organization layer, data lake metadata management covers information such as physical data file directories, statistics, version management, and file assembly. For example, the metadata contains statistical information such as the Min/Max values of the data in each data file, which speeds up data retrieval.
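The way Min/Max statistics speed up retrieval can be illustrated with a minimal Python sketch. The `FileStats` layout, the file names, and the `prune_files` helper are assumptions made for illustration only; they do not reproduce the metadata format of Iceberg, Hudi, or this application.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FileStats:
    """Per-file statistics for one column (hypothetical metadata layout)."""
    path: str
    min_value: int
    max_value: int


def prune_files(files, lower, upper):
    """Keep only files whose [min, max] range overlaps the query range [lower, upper]."""
    return [f for f in files if f.max_value >= lower and f.min_value <= upper]


files = [
    FileStats("part-0.parquet", 0, 99),
    FileStats("part-1.parquet", 100, 199),
    FileStats("part-2.parquet", 200, 299),
]

# A query such as "WHERE id BETWEEN 150 AND 220" never needs to open part-0.
candidates = prune_files(files, 150, 220)
print([f.path for f in candidates])
```

Only the files whose value range can contain matching rows are read; the rest are skipped using metadata alone.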

[0051] 2. Indexing Technology

[0052] Indexing technology is widely used in the database field. It speeds up data retrieval, reduces unnecessary data reads, and significantly lowers disk I/O and CPU load. Different index categories involve different implementation techniques, and in certain scenarios an index directly represents the original data, so data can be retrieved directly from the index.

[0053] However, in the field of big data, because of the massive data volumes, special storage formats, and unique application scenarios, big data indexes differ significantly from database indexes in their technical implementation. Big data indexes primarily focus on reducing access to data files, skipping irrelevant data files, and accelerating queries. Given the sheer volume of data, effective index design can greatly improve the performance of accessing large datasets.

[0054] 3. Cost-Based Optimizer (CBO)

[0055] In open engines or data warehouses such as ClickHouse and Snowflake, the SQL engine first performs lexical and syntax analysis on the SQL statement submitted by the client to generate an abstract syntax tree (AST). After the tree is traversed to generate a logical execution plan, the plan enters the optimizer, which prunes it to reduce execution overhead and improve query performance. Finally, a physical execution plan is generated and passed to the execution engine for query execution. Optimizers are generally divided into two types: the Rule-Based Optimizer (RBO) and the Cost-Based Optimizer (CBO). An RBO rewrites the plan according to built-in rules, such as predicate pushdown and constant folding. However, this optimization method is relatively simple and fixed, with low flexibility. A CBO is data-sensitive: based on data statistics and a cost model, it calculates the resource consumption of each candidate execution plan, typically covering disk I/O, CPU, and memory, and selects the plan with the lowest resource consumption as the optimal plan.

[0056] Optimizers are widely used in existing relational databases such as MySQL and Oracle, batch processing engines such as Hive and Spark, and stream processing engines such as Flink, but they differ in design and functionality. For example, Hive does not enforce the writing of table statistics, so statistics are often missing or outdated, making it impossible to optimize the query plan based on execution cost.

[0057] Metadata is defined as data that describes data. While essentially still data, it can be viewed as an electronic catalog used to describe the attributes or content of data and information resources, and to assist users in retrieving and using data. It is understood that, in this embodiment, metadata refers to data within the data lake used to describe the attributes or content of target data and information resources, and to assist users in retrieving and using target data. The target data refers to data entering the data lake in real-time and / or in batches.

[0058] In related technologies, to achieve optimal data query and processing performance, existing database and data warehouse systems tightly couple the query engine with storage so that computation and storage work in an optimal combination. For example, proprietary storage formats are tailored to the engine's processing capabilities to enable thorough SQL optimization; ClickHouse, for instance, periodically merges files to a specific size or sorts data in the background to suit engine-side queries. Many optimization methods for execution engines are also emerging. However, this performance advantage comes with problems such as strong binding between engine and storage and special customization. As a result, multiple copies of the same data must be transferred between different data warehouses when data is shared openly. IT system construction often becomes siloed and unregulated, lacking unified control, which compromises data consistency and significantly reduces resource utilization. In this context, the concept of a data lake emerged.

[0059] The new stage of big data development—the data lake stage—is characterized by its flexible, open, and shared standard format. This means that data lakes cannot optimally adapt to all engines at the data storage organization level to achieve performance similar to data warehouses. Therefore, the aforementioned solution of strongly binding storage and engines to achieve performance improvements is not suitable for data lakes. One existing approach to optimizing open data lakes is to optimize storage and engines separately in parallel.

[0060] In terms of data lake storage organization optimization, fixed indexes such as Min/Max, Bloom filter, and bitmap indexes, along with other data organization and sorting methods, are predefined and created in advance.
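As background for one of the fixed index types named above, the following is a minimal Python sketch of a Bloom filter. The class name, the default sizes, and the use of SHA-256 with a one-byte seed are illustrative assumptions and are not taken from this application or from any particular data lake format.

```python
import hashlib


class BloomFilter:
    """Toy Bloom filter: may report false positives, never false negatives."""

    def __init__(self, size_bits=1024, num_hashes=3):
        self.size_bits = size_bits
        self.num_hashes = num_hashes
        self.bits = 0  # a Python int used as a bit array

    def _positions(self, value):
        # Derive num_hashes bit positions by hashing the value with different seeds.
        data = str(value).encode("utf-8")
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(bytes([seed]) + data).digest()
            yield int.from_bytes(digest[:8], "big") % self.size_bits

    def add(self, value):
        for pos in self._positions(value):
            self.bits |= 1 << pos

    def might_contain(self, value):
        # True means "possibly present"; False means "definitely absent".
        return all((self.bits >> pos) & 1 for pos in self._positions(value))


bloom = BloomFilter()
for user_id in ("u-1001", "u-1002", "u-1003"):
    bloom.add(user_id)

print(bloom.might_contain("u-1002"))
```

A query engine can keep one such filter per data file and skip any file for which `might_contain` returns `False` for the value being searched.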

[0061] At the open engine level, significant optimizations have been made at the SQL runtime level to minimize the performance gap with the data warehouse. However, it still remains strongly tied to the storage query engine in terms of performance, and there is still a certain gap with highly customized databases.

[0062] Current data lakes typically use fixed indexes selected at the code level or fixed indexes defined by users. Because they cannot collect metadata statistics in real time or build indexes automatically from the distribution of data features, their indexes and data organization cannot adapt to the feature distribution of real data. Indexes therefore cannot be created automatically and must be set in advance, which makes data lake query and analysis inflexible, and few solutions are designed to accelerate queries in cooperation with the upper-layer engine.

[0063] Disadvantages of existing technical solutions:

[0064] 1. There is no automated index management system; index types are fixed and limited, and must be specified manually when the schema is created;

[0065] 2. Specific indexes cannot be selected automatically, and their creation and deletion cannot be triggered, based on the data distribution and data value characteristics of massive data;

[0066] 3. There is no real-time metadata exploration and statistics service, so statistical features of the data distribution are not extracted in a timely manner;

[0067] 4. There is no open indexing interface, so flexible index customization is not supported;

[0068] 5. The limited variety of index types makes it impossible to offer the engine-side CBO multiple execution paths from which to evaluate the optimal execution plan.

[0069] To address the problems existing in the prior art, embodiments of this application provide a data lake index creation method, apparatus, electronic device, and computer storage medium. When target data is obtained from the data lake, dynamic change information of the target data is acquired, and then data features of the target data are extracted based on this dynamic change information. Finally, a data lake index is created based on the extracted data features of the target data.

[0070] In this way, during the dynamic data inflow process, the change records of the target data are collected and explored in real time to construct the data characteristics of the target data. This triggers the index management service to automatically create the index, improving the flexibility of data lake index creation. Furthermore, since the index is created specifically based on the data characteristics of the inflow data, data retrieval performance can be improved when querying data based on the created data lake index.

[0071] This application provides a method, apparatus, electronic device, and computer storage medium for creating a data lake index. The method is described first. As shown in Figure 1, the data lake index creation method provided in the embodiments of this application includes the following steps:

[0072] S101: When the target data is obtained from the data lake, obtain the dynamic change information of the target data. The dynamic change information is used to map the transaction actions of the target data in the process of entering the data lake.

[0073] S102: Extract the data features of the target data based on the dynamic change information of the target data;

[0074] S103: Create a data lake index based on the data characteristics of the extracted target data.

[0075] This application provides a method, apparatus, electronic device, and computer storage medium for creating a data lake index. When target data is acquired from the data lake, dynamic change information of the target data is obtained, and data features of the target data are extracted based on this dynamic change information. Finally, a data lake index is created based on the extracted data features. In this way, during the dynamic data inflow process, the target data change records are collected and explored in real time, constructing and generating data features of the target data, thereby triggering the index management service to automatically create the index. This improves the flexibility of data lake index creation. Furthermore, because the index is created specifically based on the data features of the inflow data, data retrieval performance can be improved when querying data based on the created data lake index.

[0076] In S101, during the process of the target data being added to the lake in real time and in batches, there will be different transaction actions such as insertion, update, and deletion. The scope, distribution, and type of the target data are also changing dynamically. All of the above behaviors will be mapped in the data lake metadata.

[0077] In one example, during the initial stage of data inflow into the data lake, the amount of data is small, so creating an index based on data characteristics is not advisable. In one possible implementation, users can manually create an initial index in the background by specifying the index fields, index type, and data sorting method based on business needs and experience. As more data flows in, the initial index is dynamically updated by performing data feature analysis on the incoming data, resulting in a more complete data lake index. It should be noted that this way of creating an initial index is not mandatory. In practice, other methods can be used, such as directly using the initial user data or using a pre-created regular index as the initial index.

[0078] In S102, changes to the target data are recorded in the dynamic change information. A data change log file is then generated from this information. The file includes at least one data change record of the target data and records the dynamic changes of the target data. In a specific embodiment, these changes are mapped in the data lake metadata: change records are first written to a log file using a write-ahead log (WAL), and the log file is then persisted to the manifest list in the data lake metadata.
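The write-ahead pattern described here can be sketched in Python as follows. This is a simplified illustration under stated assumptions: the `ChangeLog` class, the JSON-lines encoding, and the in-memory `manifest_list` are hypothetical stand-ins, not the storage layout used by this application.

```python
import json
import os
import tempfile


class ChangeLog:
    """Toy write-ahead log for data change records (hypothetical class)."""

    def __init__(self, log_path):
        self.log_path = log_path
        self.manifest_list = []  # stand-in for the metadata manifest list

    def append(self, record):
        # Write the record and force it to disk before anything else relies on it.
        with open(self.log_path, "a", encoding="utf-8") as log:
            log.write(json.dumps(record) + "\n")
            log.flush()
            os.fsync(log.fileno())

    def commit(self):
        # "Persist" the log by registering its path in the manifest list.
        if self.log_path not in self.manifest_list:
            self.manifest_list.append(self.log_path)

    def replay(self):
        # Re-read every logged record, for example after a restart.
        with open(self.log_path, encoding="utf-8") as log:
            return [json.loads(line) for line in log if line.strip()]


with tempfile.TemporaryDirectory() as tmp:
    wal = ChangeLog(os.path.join(tmp, "changes.log"))
    wal.append({"action": "insert", "table": "orders", "rows": 3})
    wal.append({"action": "delete", "table": "orders", "rows": 1})
    wal.commit()
    replayed = wal.replay()

print(replayed)
```

Because each record reaches disk before it is registered in the metadata, a consumer that replays the log sees every change that was acknowledged.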

[0079] In one example, based on the dynamic changes in the target data, data features are extracted, including:

[0080] Generate a data change log file based on the dynamic changes in the target data;

[0081] Based on the data change log file, extract the data features of the target data.

[0082] In one example, a data change log file is generated based on the dynamic changes in the target data, including:

[0083] Record the action-type change information and the data statistics-type change information contained in the dynamic change information of the target data, and generate a data change log file. Action-type change information indicates insert, delete, or update actions on the target data; data statistics-type change information indicates changes arising from statistical analysis of the target data.
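One possible shape for the two kinds of change records is sketched below in Python. The `ChangeRecord` fields, the helper names, and the JSON-lines log format are assumptions chosen for illustration; the application does not specify a record schema.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class ChangeRecord:
    kind: str      # "action" or "statistics"
    table: str
    payload: dict


def action_record(table, action, row_count):
    """Action-type record: an insert, delete, or update on the target data."""
    if action not in {"insert", "delete", "update"}:
        raise ValueError(f"unsupported action: {action}")
    return ChangeRecord("action", table, {"action": action, "row_count": row_count})


def statistics_record(table, column, values):
    """Statistics-type record: summary statistics observed for one column."""
    payload = {"column": column, "min": min(values), "max": max(values),
               "distinct": len(set(values))}
    return ChangeRecord("statistics", table, payload)


def to_log_text(records):
    """Serialize records as one JSON object per line."""
    return "\n".join(json.dumps(asdict(r), sort_keys=True) for r in records)


def parse_log_text(text):
    """Parse the log text back into change records."""
    return [ChangeRecord(**json.loads(line)) for line in text.splitlines() if line.strip()]


records = [
    action_record("orders", "insert", 3),
    statistics_record("orders", "amount", [12, 40, 12]),
]
log_text = to_log_text(records)
parsed = parse_log_text(log_text)
print(parsed[1].payload)
```

Parsing the log text yields the same records that were written, which is the input the feature extraction step works from.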

[0084] Specifically, features can be constructed from the data change log file to extract the data features of the target data. In one example, the data change log file includes at least one data change record of the target data, and extracting the data features of the target data based on the data change log file includes:

[0085] Parse the data change log file to obtain at least one data change record;

[0086] Based on the at least one data change record and the type of the target data, feature construction is performed on the target data through automatic feature engineering to obtain the data features of the target data. The type of the target data is any one of the following: text type, numerical type, category type, geospatial type, date and time type, and multidimensional type.

[0087] Automatic feature engineering, acting as a consumer of the message queue, parses and retrieves target data change records in real time. For different types of data in the data lake, automatic feature engineering supports the following feature extraction methods: text (word2vec, tf-idf); numerical (normalized, statistical, offline, etc.); categorical (hash, one-hot); geospatial (latitude and longitude, altitude); date and time (tsfresh); and multidimensional (dimensionality reduction analysis, etc.).
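Dispatching on the data type can be sketched in Python for two of the listed types. The function names and the `EXTRACTORS` table are hypothetical, and only min-max normalization (numerical) and one-hot encoding (categorical) are shown; the text, geospatial, date and time, and multidimensional methods are omitted.

```python
def normalize(values):
    """Min-max normalization of a numerical column into [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def one_hot(values):
    """One-hot encoding of a categorical column, categories in sorted order."""
    categories = sorted(set(values))
    return [[1 if v == c else 0 for c in categories] for v in values]


# Hypothetical mapping from data type to feature extraction method.
EXTRACTORS = {
    "numerical": normalize,
    "categorical": one_hot,
}


def extract(column_type, values):
    """Pick the extraction method that matches the column's data type."""
    try:
        extractor = EXTRACTORS[column_type]
    except KeyError:
        raise ValueError(f"no extractor for type {column_type!r}") from None
    return extractor(values)


print(extract("numerical", [10, 20, 30]))
print(extract("categorical", ["a", "b", "a"]))
```

Further types can be supported by adding entries to the mapping without changing the dispatch logic.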

[0088] In one example, automated feature engineering includes at least one feature primitive;

[0089] Based on at least one data change record and the type of the target data, feature construction is performed on the target data using automatic feature engineering to obtain the data features of the target data, including:

[0090] Based on at least one data change record, and for the type of target data, feature construction is performed on the target data using at least one feature primitive in automatic feature engineering to extract the data features of the target data.

[0091] In this process, feature primitives, as the smallest units in automated feature engineering, are divided into aggregation primitives and transformation primitives, with the computational logic for each primitive built into the backend. Action-type metadata and data-type metadata together serve as the original dataset in the computation of feature primitives. Data features are constructed by stacking one or more feature primitives.
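The distinction between the two primitive kinds, and how they stack, can be sketched in Python. The specific primitives (`diff`, `absolute`, and the entries in `AGGREGATIONS`) and the `build_feature` helper are illustrative assumptions; the application does not list its built-in primitives.

```python
from statistics import mean


# Transformation primitives: map a column to a new column.
def absolute(values):
    return [abs(v) for v in values]


def diff(values):
    """Difference between each value and its predecessor."""
    return [b - a for a, b in zip(values, values[1:])]


# Aggregation primitives: reduce a column to a single value.
AGGREGATIONS = {
    "mean": mean,
    "max": max,
    "count": len,
    "n_unique": lambda values: len(set(values)),
}


def build_feature(values, transforms, aggregation):
    """Stack zero or more transformation primitives, then apply one aggregation."""
    for transform in transforms:
        values = transform(values)
    return AGGREGATIONS[aggregation](values)


# "Largest absolute change between consecutive values" as a stacked feature.
largest_jump = build_feature([3, 7, 4, 10], [diff, absolute], "max")
print(largest_jump)
```

Here `diff` yields `[4, -3, 6]`, `absolute` yields `[4, 3, 6]`, and the `max` aggregation reduces that to a single feature value.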

[0092] In S103, in one example, a data lake index is created in the data lake when the data features of the target data match a preset index creation type. In another example, a mapping between data features and the index types they trigger is preset; based on the data features extracted in S102 and this mapping, a data lake index for the target data is created in the data lake when the data features satisfy the trigger condition of an index type.
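A preset mapping from data features to triggered index types might look like the following Python sketch. The feature fields, the thresholds (`min_rows`, the 1% and 50% cardinality ratios), and the resulting index choices are all assumptions made for illustration and are not specified in this application.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ColumnFeatures:
    """Features extracted for one column (hypothetical field set)."""
    row_count: int
    distinct_count: int
    is_sorted: bool


def select_index_types(features, min_rows=10_000):
    """Return the index types whose trigger conditions the features satisfy."""
    if features.row_count < min_rows:
        return []  # too little data to justify a feature-driven index
    chosen = []
    ratio = features.distinct_count / features.row_count
    if features.is_sorted:
        chosen.append("min_max")       # sorted data gives tight per-file ranges
    if ratio <= 0.01:
        chosen.append("bitmap")        # few distinct values
    elif ratio >= 0.5:
        chosen.append("bloom_filter")  # many distinct values, point lookups
    return chosen


print(select_index_types(ColumnFeatures(1_000_000, 12, False)))
print(select_index_types(ColumnFeatures(1_000_000, 900_000, True)))
print(select_index_types(ColumnFeatures(500, 500, True)))
```

The index management service would then create each returned index type for the column, and could drop indexes whose conditions no longer hold.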

[0093] In one example, after creating a data lake index based on the data characteristics of the extracted target data, the process also includes:

[0094] Perform data feature analysis on the target data to obtain the data feature analysis results; based on the data feature analysis results of the target data, update the preset data lake index, which is the initial index set for the data lake.

[0095] After manually setting the initial index, as data gradually enters the lake, the initial index can be dynamically updated by performing data feature analysis on the data entering the lake, resulting in a more complete data lake index.

[0096] To improve the retrieval performance of the data lake while maintaining its flexibility, as shown in Figure 2, the following steps may also be included after S103:

[0097] S201: When a query request is received, the query cost of each of the multiple preset query paths is calculated based on the data lake index, and the preset query path whose query cost meets the preset conditions is selected as the target query path.

[0098] S202: Based on the target query path, search for the data corresponding to the query request.

[0099] By using a cost optimizer to determine the target query path from multiple preset query paths, and then searching for the target query data according to the target query path, the flexibility of the data lake can be guaranteed while improving the retrieval performance of the data lake.

[0100] In step S201, the query cost of each preset query path among multiple preset query paths is calculated based on the data lake index. The preset query path whose query cost meets preset conditions is selected as the target query path. These preset conditions can be set according to actual needs; for example, the preset query path with the lowest query cost can be selected as the target query path. This is not limited here. For instance, if the query cost of index A is less than the query cost of index B during a data query, then the preset query path corresponding to index A is selected as the target query path. In a specific embodiment, the index storage interface design must be backward compatible with the CBO interface in the query engine, enabling the CBO optimizer to calculate the query cost of each preset query path among multiple preset query paths and select the optimal query path.
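Selecting the target query path by cost can be sketched in Python. The `QueryPath` fields, the two cost weights, and the linear cost formula are hypothetical; a real CBO uses the statistics and cost model of its own engine.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class QueryPath:
    """One preset query path and its estimated work (hypothetical fields)."""
    name: str
    files_scanned: int
    rows_scanned: int


# Hypothetical cost weights for I/O and CPU.
IO_COST_PER_FILE = 10.0
CPU_COST_PER_ROW = 0.001


def query_cost(path):
    """Estimate the cost of a path as weighted I/O plus weighted CPU work."""
    return path.files_scanned * IO_COST_PER_FILE + path.rows_scanned * CPU_COST_PER_ROW


def choose_target_path(paths):
    """Preset condition used here: pick the path with the lowest estimated cost."""
    if not paths:
        raise ValueError("at least one query path is required")
    return min(paths, key=query_cost)


paths = [
    QueryPath("full_scan", 1000, 50_000_000),
    QueryPath("min_max_index", 120, 6_000_000),
    QueryPath("bloom_filter_index", 15, 750_000),
]
target = choose_target_path(paths)
print(target.name)
```

The more index types exist for a table, the more candidate paths the optimizer can compare, which is why index diversity matters for this step.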

[0101] In S202, in a specific embodiment, the data lake index structure built by the above method includes multiple types of indexes. When an SQL query is executed and the query engine obtains multiple possible query paths, it can use the CBO to identify the path with the lowest cost and execute the SQL task along that optimal path.

[0102] To facilitate understanding of the data lake index creation method provided in this application, the following specific embodiment will be used to describe the method provided in this application.

[0103] This application proposes an index set management design that extracts data feature distribution in real time and automatically creates and optimizes data lake indexes and data organization to accelerate data lake query retrieval. While ensuring the flexibility and compatibility of the data lake, it narrows the performance gap between the data lake and the dedicated data warehouse.

[0104] In the implementation of the SQL engine, the SQL statement submitted by the client undergoes lexical and syntactic parsing by the parser. The generated syntax tree nodes are then traversed to form an initial logical execution plan. Most engines have built-in rule-based optimization strategies, using a rule optimizer to perform rule replacement and form the logical execution plan. This process is not covered in this embodiment and will not be described in detail here. The logical execution plan generated by the above process will be evaluated for physical execution costs by a cost-based optimizer (CBO). The path with the minimum resource consumption will be selected as the actual physical execution path. Finally, the execution engine will complete the physical data reading, writing, and querying. The query engine will then return the results to the client, completing one data query and analysis process.

[0105] This solution is designed to collect and detect metadata changes in real time during the dynamic data ingestion process, map and extract them as dynamic features of the data, generate key data features through an automatic feature extraction system, and then trigger the index management service to automatically create and delete indexes. This promotes the diversity of index structures in the process of determining the optimal path for data lake query and retrieval by the CBO, ensuring the flexibility of the data lake while improving the retrieval performance of the data lake.

[0106] The technical architecture of the embodiments of this application is shown in Figure 3. The technical implementation details of the embodiments of this application will be described below in four aspects:

[0107] 1. The index interface is exposed via SPI, which supports custom development of additional index types.

[0108] The data lake stores index files and data files as files and exposes its index interfaces through an SPI (Service Provider Interface) model, which makes index types pluggable and allows additional index types to be custom developed. Currently supported common index types include: B+ trees, hash indexes, Bloom filters, bitmap indexes, simple indexes, and HBase indexes.
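The pluggable index interface can be sketched as a small provider registry. This is only an analogy to an SPI model written in Python; the names `Index`, `register_index`, `create_index`, and `HashIndex` are hypothetical and are not taken from this application or from any particular data lake product.

```python
from abc import ABC, abstractmethod

# Registry of index providers, keyed by index type name (hypothetical design).
INDEX_REGISTRY: dict[str, type] = {}


def register_index(kind: str):
    """Class decorator that plugs a new index type into the registry."""
    def decorator(cls):
        INDEX_REGISTRY[kind] = cls
        return cls
    return decorator


class Index(ABC):
    """Interface every pluggable index type must implement."""

    @abstractmethod
    def add(self, key, file_name: str) -> None:
        """Record that `key` occurs in `file_name`."""

    @abstractmethod
    def candidate_files(self, key) -> set:
        """Return the files that may contain `key`."""


@register_index("hash")
class HashIndex(Index):
    """Simplest possible provider: an in-memory key -> files mapping."""

    def __init__(self):
        self._files = {}

    def add(self, key, file_name: str) -> None:
        self._files.setdefault(key, set()).add(file_name)

    def candidate_files(self, key) -> set:
        return set(self._files.get(key, set()))


def create_index(kind: str) -> Index:
    """Instantiate whichever provider is registered for `kind`."""
    try:
        return INDEX_REGISTRY[kind]()
    except KeyError:
        raise ValueError(f"no index provider registered for {kind!r}") from None
```

A custom index type would be added by defining another `Index` subclass and decorating it with `register_index`, without changing the code that calls `create_index`.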

[0109] In a data lake, since the actual data is stored in files, the distribution of data within the files is usually defined in order to accelerate queries. Common methods include sorting the data so that each file covers a narrow minimum / maximum (Min / Max) value range, optionally combined with sorting along a Z-order curve (Z-ordering), in order to reduce data file scanning, prune the search space, reduce disk I/O, and improve search efficiency.
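A minimal sketch of the two techniques named above, under the assumption that each data file carries per-column Min / Max statistics. `FileStats`, `prune_files`, and `z_order_key` are hypothetical names used only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FileStats:
    """Per-file Min / Max statistics for one column (illustrative)."""
    path: str
    min_value: int
    max_value: int


def prune_files(files, low, high):
    """Keep only files whose [min, max] range overlaps the query range [low, high]."""
    return [f.path for f in files if f.max_value >= low and f.min_value <= high]


def z_order_key(x: int, y: int, bits: int = 16) -> int:
    """Interleave the low `bits` bits of x and y into a single Z-order (Morton) key."""
    key = 0
    for i in range(bits):
        key |= ((x >> i) & 1) << (2 * i)       # bits of x go to even positions
        key |= ((y >> i) & 1) << (2 * i + 1)   # bits of y go to odd positions
    return key


files = [
    FileStats("part-0", 0, 99),
    FileStats("part-1", 100, 199),
    FileStats("part-2", 200, 299),
]
print(prune_files(files, 150, 210))  # ['part-1', 'part-2']
```

Sorting rows by `z_order_key` before writing files keeps rows that are close in both columns in the same files, so Min / Max pruning stays effective for filters on either column.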

[0110] In the initial stage of data ingestion into the data lake, due to the small data volume, it is not advisable to create indexes based on data characteristics. One possible approach is for users to manually create an initial index in the background, specifying the index fields, index type, and data sorting method based on business needs and experience. As more data is ingested, the initial index is dynamically updated through data feature analysis of the ingested data, resulting in a more comprehensive data lake index. Of course, the aforementioned method of creating an initial index is not mandatory. In practice, other methods can be used to create the initial index, such as directly using the initial ingested data or using a pre-created regular index as the initial index.

[0111] 2. Real-time metadata exploration service

[0112] As data flows into the data lake in real time and in batches, various transaction actions such as insert, update, and delete occur, and the range, distribution, and type of the data change dynamically. All of these behaviors are reflected in the data lake metadata. Change records are first written to log files through a Write-Ahead Log (WAL), and the log files are then persisted into the data lake metadata manifest list.

[0113] The metadata exploration service monitors log file changes in real time and uses Change Data Capture (CDC), implemented with Debezium, to transform log changes into time-driven data and write them to a message queue. Each message is uniquely identified by an event ID.
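The flow from log changes to uniquely identified messages can be sketched as follows. This sketch does not use Debezium: an in-process `queue.Queue` stands in for the message queue, and the log line format `op,table,column,value` is a hypothetical simplification introduced only for illustration.

```python
import json
import queue
import uuid


def to_change_event(log_line: str) -> dict:
    """Parse one hypothetical log line 'op,table,column,value' into a change event."""
    op, table, column, value = log_line.strip().split(",", 3)
    if op not in {"insert", "update", "delete"}:
        raise ValueError(f"unknown operation: {op!r}")
    return {
        "event_id": str(uuid.uuid4()),  # unique identifier of the message
        "op": op,
        "table": table,
        "column": column,
        "value": value,
    }


def publish(log_lines, message_queue) -> int:
    """Convert each non-empty log line to an event and put it on the queue."""
    count = 0
    for line in log_lines:
        if line.strip():
            message_queue.put(json.dumps(to_change_event(line)))
            count += 1
    return count
```

For example, `publish(["insert,orders,amount,42"], queue.Queue())` places one JSON message on the queue and returns 1.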

[0114] 3. Use automated feature engineering to explore metadata distribution and automatically trigger the index management service to run.

[0115] Automatic feature engineering, acting as a consumer of the message queue, parses and retrieves metadata change records in real time. These change records reflect both action-type changes, such as insert, update, and delete, and data-statistics-type changes. Based on the metadata change records, automatic feature engineering methods are used to periodically construct features for the data. When the constructed features meet an index trigger condition, the downstream index management service is scheduled to perform index management, including index creation and deletion.
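The consumer side can be sketched as below. The trigger rule (create a bitmap index for a column with enough live rows and few distinct values, otherwise drop it) and the thresholds `min_rows` and `max_distinct` are assumptions made for illustration; the actual mapping between features and index types is the one given in Table 1. `IndexManager` is a hypothetical stand-in for the index management service, and an update is assumed to arrive as a delete followed by an insert.

```python
from collections import Counter


class IndexManager:
    """Hypothetical stand-in for the downstream index management service."""

    def __init__(self):
        self.indexes = set()

    def create(self, column: str, kind: str) -> None:
        self.indexes.add((column, kind))

    def drop(self, column: str, kind: str) -> None:
        self.indexes.discard((column, kind))


def consume(records, manager, min_rows: int = 3, max_distinct: int = 2) -> None:
    """Build per-column features from change records and trigger index management."""
    live = {}  # column -> Counter of live values
    for record in records:
        values = live.setdefault(record["column"], Counter())
        if record["op"] == "insert":
            values[record["value"]] += 1
        elif record["op"] == "delete":
            values[record["value"]] -= 1
    for column, values in live.items():
        values = +values                 # keep only values with a positive count
        rows = sum(values.values())      # feature: number of live rows
        distinct = len(values)           # feature: number of distinct values
        if rows >= min_rows and distinct <= max_distinct:
            manager.create(column, "bitmap")
        else:
            manager.drop(column, "bitmap")
```

In this sketch each call to `consume` evaluates one batch of records, which corresponds to one periodic feature construction run.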

[0116] For different types of data in a data lake, automatic feature engineering supports the following feature extraction methods:

[0117] • Text types: word2vec, tf-idf

[0118] • Numerical data types: normalized values, statistical values, outlier values, etc.

[0119] • Category type: hash, one-hot

[0120] • Geographic spatial types: latitude and longitude, altitude

[0121] • Date and time type: tsfresh

[0122] • Multidimensional: Dimensionality reduction analysis, etc.
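Two of the listed methods, normalized and statistical values for numerical data and one-hot encoding for category data, can be sketched with the Python standard library alone. The function names `numeric_features` and `one_hot` are hypothetical; the other listed methods (word2vec, tf-idf, tsfresh, dimensionality reduction) rely on dedicated libraries and are not shown.

```python
import statistics


def numeric_features(values):
    """Statistical values plus min-max normalized values for a numerical column."""
    lo, hi = min(values), max(values)
    span = hi - lo
    # A constant column has no spread, so every normalized value is 0.0.
    normalized = [(v - lo) / span if span else 0.0 for v in values]
    return {
        "min": lo,
        "max": hi,
        "mean": statistics.fmean(values),
        "normalized": normalized,
    }


def one_hot(values):
    """One-hot encode a category column; returns (categories, encoded rows)."""
    categories = sorted(set(values))
    rows = [[1 if v == c else 0 for c in categories] for v in values]
    return categories, rows


print(numeric_features([10, 20, 30])["normalized"])  # [0.0, 0.5, 1.0]
print(one_hot(["b", "a", "b"]))  # (['a', 'b'], [[0, 1], [1, 0], [0, 1]])
```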

[0123] In implementation, pandas can be used to abstract each metadata table into an entity. Multiple metadata tables constitute multiple entities, and the entities together with the relationships between them form an entity set.

[0124] Feature primitives, as the smallest units in feature engineering, are divided into aggregation primitives and transformation primitives, with built-in computational logic for each primitive. Action-type metadata and data-type metadata together serve as the original dataset in the computation of feature primitives. Data features are constructed by stacking one or more feature primitives, using the following construction methods:

[0125] Aggregation method: Calculate statistics over the rows of a child table, grouped by the parent table row to which they belong.

[0126] Transformation method: Perform operations on one or more columns in a table.
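The two construction methods, and the stacking of primitives, can be sketched as follows. Plain Python dictionaries are used here in place of pandas tables so that the sketch stays self-contained; the table layout (`file_id`, `value`) and the helper names `aggregate` and `transform` are hypothetical.

```python
from collections import defaultdict


def aggregate(child_rows, parent_key, column, func):
    """Aggregation primitive: apply `func` to `column`, grouped by the parent key."""
    groups = defaultdict(list)
    for row in child_rows:
        groups[row[parent_key]].append(row[column])
    return {key: func(vals) for key, vals in groups.items()}


def transform(rows, new_column, func):
    """Transformation primitive: derive a new column from existing columns."""
    return [{**row, new_column: func(row)} for row in rows]


# Child table: change records; parent table: data files (both illustrative).
changes = [
    {"file_id": "f1", "value": 5},
    {"file_id": "f1", "value": 9},
    {"file_id": "f2", "value": 3},
]
max_per_file = aggregate(changes, "file_id", "value", max)
min_per_file = aggregate(changes, "file_id", "value", min)
files = [
    {"file_id": k, "min": min_per_file[k], "max": max_per_file[k]}
    for k in max_per_file
]
# Stacking: a transformation primitive applied on top of two aggregation primitives.
stacked = transform(files, "range", lambda r: r["max"] - r["min"])
print(stacked[0])  # {'file_id': 'f1', 'min': 5, 'max': 9, 'range': 4}
```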

[0127] The key data features extracted using the above methods reduce redundancy and noise interference in the original data distribution, which improves the accuracy of the mapping between data features and indexes described below. The mapping relationship between data features and the index types they trigger is shown in Table 1.

[0128] Table 1: Mapping Relationship Between Data Characteristics and Types that Trigger Index Creation

[0129]

[0130] As more data is added to the lake, data characteristics become more pronounced and more reliable. Combined with the open index customization interface, it becomes possible to gradually support the creation of indexes that better match the data characteristics.

[0131] Meanwhile, when data characteristics change, for example when large data is gradually deleted for business reasons, the real-time metadata exploration service can quickly detect the change in data distribution. Feature engineering automatically constructs the max feature and finds values below the index trigger threshold. The index management system then automatically updates the min / max index structure to keep the index reasonable, reduce the storage and retrieval pressure imposed by the index, and maintain the index set at a reasonable size.
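One possible reading of this update flow is sketched below. It assumes that the min / max index is refreshed when the max feature of the remaining data falls below a trigger threshold; that interpretation, the threshold value, and the names `MinMaxIndex`, `apply_deletes`, and `update_on_change` are assumptions introduced for illustration only.

```python
class MinMaxIndex:
    """Hypothetical min / max index: column -> (min, max) of the live values."""

    def __init__(self):
        self.bounds = {}

    def refresh(self, column, live_values):
        if live_values:
            self.bounds[column] = (min(live_values), max(live_values))
        else:
            self.bounds.pop(column, None)  # nothing left to index


def apply_deletes(live_values, deleted):
    """Return the values that remain after the delete actions are applied."""
    remaining = list(live_values)
    for value in deleted:
        remaining.remove(value)  # raises ValueError if the value is not present
    return remaining


def update_on_change(index, column, live_values, trigger_threshold):
    """Refresh the index when the max feature is below the trigger threshold."""
    max_feature = max(live_values) if live_values else None
    if max_feature is None or max_feature < trigger_threshold:
        index.refresh(column, live_values)
        return True
    return False


index = MinMaxIndex()
index.refresh("amount", [1, 50, 900])
remaining = apply_deletes([1, 50, 900], [900])
updated = update_on_change(index, "amount", remaining, trigger_threshold=100)
print(updated, index.bounds["amount"])  # True (1, 50)
```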

[0132] 4. Compatible with open-source CBO interface

[0133] As an automated management system for data lake indexes, this system's index storage interface design must be backward compatible with the CBO interface in the query engine. This allows the CBO optimizer to directly determine the query plan and select the optimal query path. The data lake index structure built based on this approach includes various index types. When executing SQL queries, the query engine, upon obtaining multiple possible query paths, can use the CBO to select the path with the lowest cost as the optimal query path for executing the SQL task.

[0134] Based on the data lake index creation method provided in the above embodiments, and as shown in Figure 4, an embodiment of this application correspondingly provides a data lake index creation apparatus 400, which may include:

[0135] The acquisition module 401 is used to acquire dynamic data change information of the target data when the target data is acquired in the data lake. The dynamic data change information is used to map the transaction actions of the target data in the process of entering the data lake.

[0136] The extraction module 402 is used to extract data features of the target data based on the dynamic data change information of the target data.

[0137] The creation module 403 is used to create a data lake index for the target data based on the extracted data features of the target data.

[0138] In one embodiment, the extraction module 402 may include:

[0139] The generation unit is used to generate a data change record file based on the dynamic data change information of the target data.

[0140] The extraction unit is used to extract the data features of the target data based on the data change record file.

[0141] In one embodiment, the generation unit may be specifically used for:

[0142] Recording the action-type change information and the data-statistics-type change information of the target data contained in the dynamic data change information, and generating the data change record file. The action-type change information indicates change actions that insert, delete, or update the target data, and the data-statistics-type change information indicates change actions that statistically analyze the target data.

[0143] In one embodiment, the extraction unit may include:

[0144] The parsing unit is used to parse the data change record file to obtain at least one data change record.

[0145] The construction unit is used to construct features of the target data through automatic feature engineering based on at least one data change record and the type of the target data, thereby obtaining the data features of the target data. The type of the target data includes any one of the following: text type, numerical data type, category type, geospatial type, date and time type, and dimension type.

[0146] In one embodiment, the construction unit may specifically be used for:

[0147] Based on at least one data change record, and for the type of target data, feature construction is performed on the target data using at least one feature primitive in automatic feature engineering to extract the data features of the target data.

[0148] In one embodiment, the data lake index creation apparatus 400 may further include:

[0149] The first search module is used to calculate the query cost of each preset query path among multiple preset query paths when a query request is obtained, and select the preset query path whose query cost meets the preset conditions as the target query path.

[0150] The second search module is used to search for the data corresponding to the query request based on the target query path and the data lake index.

[0151] In one embodiment, the data lake index creation apparatus 400 may further include:

[0152] The first update module is used to perform data feature analysis on the target data and obtain the data feature analysis results.

[0153] The second update module is used to update the preset data lake index based on the data feature analysis results of the target data. The preset data lake index is the initial index set for the data lake.

[0154] Based on the data lake index creation method and apparatus provided in the above embodiments, this application also provides an electronic device 500, as shown in Figure 5:

[0155] It includes a processor 501, a memory 502, and a computer program stored in the memory 502 and executable on the processor 501. When the computer program is executed by the processor 501, it implements the various processes of the above-described data lake index creation method embodiment and achieves the same technical effect.

[0156] Specifically, the processor 501 may include a central processing unit (CPU), an application-specific integrated circuit (ASIC), or one or more integrated circuits that can be configured to implement the embodiments of this application.

[0157] Memory 502 may include mass storage for data or instructions. By way of example and not limitation, memory 502 may include a hard disk drive (HDD), floppy disk drive, flash memory, optical disk, magneto-optical disk, magnetic tape, or Universal Serial Bus (USB) drive, or a combination of two or more of these. Where appropriate, memory 502 may include removable or non-removable (or fixed) media. Where appropriate, memory 502 may be internal or external to the electronic device 500. In a particular embodiment, memory 502 is non-volatile solid-state memory.

[0158] In certain embodiments, the memory may include read-only memory (ROM), random access memory (RAM), disk storage media devices, optical storage media devices, flash memory devices, and electrical, optical, or other physical / tangible memory storage devices. Thus, typically, memory includes one or more tangible (non-transitory) computer-readable storage media (e.g., memory devices) encoded with software including computer-executable instructions, and when the software is executed (e.g., by one or more processors), it is operable to perform the operations described with reference to the method according to one aspect of this application.

[0159] The processor 501 implements any of the data lake index creation methods in the above embodiments by reading and executing computer program instructions stored in the memory 502.

[0160] In one example, the electronic device may also include a communication interface 503 and a bus 510. As shown in Figure 5, the processor 501, memory 502, and communication interface 503 are connected through bus 510 and communicate with each other over it.

[0161] The communication interface 503 is mainly used to realize communication between various modules, devices, units and / or equipment in the embodiments of this application.

[0162] Bus 510 includes hardware, software, or both, and couples the components of the electronic device together. By way of example and not limitation, the bus may include an Accelerated Graphics Port (AGP) or other graphics bus, an Enhanced Industry Standard Architecture (EISA) bus, a Front Side Bus (FSB), a HyperTransport (HT) interconnect, an Industry Standard Architecture (ISA) bus, an InfiniBand interconnect, a Low Pin Count (LPC) bus, a memory bus, a Micro Channel Architecture (MCA) bus, a Peripheral Component Interconnect (PCI) bus, a PCI-Express (PCIe) bus, a Serial Advanced Technology Attachment (SATA) bus, a Video Electronics Standards Association Local Bus (VLB), or other suitable buses, or a combination of two or more of these. Where appropriate, bus 510 may include one or more buses. Although specific buses are described and illustrated in embodiments of this application, any suitable bus or interconnect is contemplated herein.

[0163] This application also provides a computer-readable storage medium storing a computer program. When executed by a processor, this computer program implements the various processes of the above-described data lake index creation method embodiments and achieves the same technical effects. To avoid repetition, it will not be described again here. The computer-readable storage medium may be a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disk.

[0164] It should be clarified that this application is not limited to the specific configurations and processes described above and shown in the figures. For the sake of brevity, detailed descriptions of known methods are omitted here. In the above embodiments, several specific steps are described and shown as examples. However, the method process of this application is not limited to the specific steps described and shown. Those skilled in the art can make various changes, modifications, and additions, or change the order of steps, after understanding the spirit of this application.

[0165] The functional blocks shown in the above block diagram can be implemented as hardware, software, firmware, or a combination thereof. When implemented in hardware, they can be, for example, electronic circuits, application-specific integrated circuits (ASICs), appropriate firmware, plug-ins, function cards, etc. When implemented in software, the elements of this application are programs or code segments used to perform the required tasks. Programs or code segments can be stored on a machine-readable medium or transmitted over a transmission medium or communication link via data signals carried on a carrier wave. "Machine-readable medium" can include any medium capable of storing or transmitting information. Examples of machine-readable media include electronic circuits, semiconductor memory devices, ROM, flash memory, erasable ROM (EROM), floppy disks, CD-ROMs, optical disks, hard disks, fiber optic media, radio frequency (RF) links, etc. Code segments can be downloaded via computer networks such as the Internet, intranets, etc.

[0166] It should also be noted that the exemplary embodiments mentioned in this application describe methods or systems based on a series of steps or apparatus. However, this application is not limited to the order of the above steps; that is, the steps can be performed in the order mentioned in the embodiments, or in a different order, or several steps can be performed simultaneously.

[0167] The aspects of this application have been described above with reference to flowchart illustrations and / or block diagrams of methods, apparatus, and computer program products according to embodiments of this application. It should be understood that each block in the flowchart illustrations and / or block diagrams, and combinations of blocks in the flowchart illustrations and / or block diagrams, can be implemented by computer program instructions. These computer program instructions can be provided to a processor of a general-purpose computer, a special-purpose computer, or other programmable data processing apparatus to produce a machine such that these instructions, executable via the processor of the computer or other programmable data processing apparatus, enable the implementation of the functions / actions specified in one or more blocks of the flowchart illustrations and / or block diagrams. Such a processor can be, but is not limited to, a general-purpose processor, a special-purpose processor, a special application processor, or a field-programmable logic circuit. It is also understood that each block in the block diagrams and / or flowcharts, and combinations of blocks in the block diagrams and / or flowcharts, can also be implemented by dedicated hardware performing the specified functions or actions, or can be implemented by a combination of dedicated hardware and computer instructions.

[0168] The above are merely specific embodiments of this application. Those skilled in the art will clearly understand that, for convenience and brevity, reference may be made to the corresponding processes in the foregoing method embodiments for the specific working processes of the systems, modules, and units described above, which will not be repeated here. It should be understood that the protection scope of this application is not limited thereto. Any person skilled in the art can easily conceive of various equivalent modifications or substitutions within the technical scope disclosed in this application, and these modifications or substitutions should all be covered within the protection scope of this application.

Claims

1. A data lake index creation method, characterized in that the method includes: when target data is acquired in the data lake, acquiring dynamic data change information of the target data, wherein the dynamic data change information is used to map the transaction actions of the target data during the process of entering the data lake; extracting data features of the target data based on the dynamic data change information of the target data; and creating a data lake index based on the extracted data features of the target data; wherein creating the data lake index based on the extracted data features of the target data includes: performing data feature analysis on the target data to obtain data feature analysis results; and updating a preset data lake index based on the data feature analysis results of the target data, wherein the preset data lake index is the initial index set for the data lake.

2. The method according to claim 1, characterized in that the step of extracting data features of the target data based on the dynamic data change information of the target data includes: generating a data change record file based on the dynamic data change information of the target data; and extracting the data features of the target data based on the data change record file.

3. The method of claim 2, wherein, The step of generating a data change record file based on the dynamic data change information of the target data includes: Record the action-type change information and data statistics-type change information of the target data in the dynamic change information of the data, and generate the data change record file. The action-type change information is used to indicate the change action of inserting, deleting or updating the target data, and the data statistics-type change information is used to indicate the change action of statistically analyzing the target data.

4. The method according to claim 2, characterized in that the data change record file includes at least one data change record of the target data; and the step of extracting the data features of the target data based on the data change record file includes: parsing the data change record file to obtain the at least one data change record; and performing feature construction on the target data through automatic feature engineering, based on the at least one data change record and the type of the target data, to obtain the data features of the target data, wherein the type of the target data includes any one of text type, numerical data type, category type, geospatial type, date and time type, and dimension type.

5. The method according to claim 4, characterized in that, The automatic feature engineering includes at least one feature primitive; The step of constructing features for the target data through automatic feature engineering based on the at least one data change record and the type of the target data to obtain the data features of the target data includes: Based on the at least one data change record, and for the type of the target data, the system uses at least one feature primitive in the automatic feature engineering to overlay features onto the target data and extract the data features of the target data.

6. The method according to claim 1, characterized in that, After creating the data lake index based on the extracted data features of the target data, the process further includes: Upon receiving a query request, the query cost of each of the multiple preset query paths is calculated based on the data lake index, and the preset query path whose query cost meets the preset conditions is selected as the target query path. Based on the target query path, the data corresponding to the query request is searched.

7. A data lake index creation apparatus, characterized in that the apparatus includes: an acquisition module, used to acquire dynamic data change information of target data when the target data is acquired in the data lake, wherein the dynamic data change information is used to map the transaction actions of the target data during the process of entering the data lake; an extraction module, used to extract data features of the target data based on the dynamic data change information of the target data; and a creation module, used to create a data lake index based on the extracted data features of the target data; wherein creating the data lake index based on the extracted data features of the target data includes: performing data feature analysis on the target data to obtain data feature analysis results; and updating a preset data lake index based on the data feature analysis results of the target data, wherein the preset data lake index is the initial index set for the data lake.

8. An electronic device, characterized in that, The device includes: a processor and a memory storing computer program instructions; When the processor executes the computer program instructions, it implements the data lake index creation method as described in any one of claims 1-6.

9. A computer-readable storage medium, characterized in that, The computer-readable storage medium stores computer program instructions that, when executed by a processor, implement the data lake index creation method as described in any one of claims 1-6.