Data processing method and device based on lake-warehouse integration and electronic equipment
By pre-storing protocol files corresponding to different data processing engines and parsing and obtaining metadata, the problems of large data storage volume and insufficient security in data query solutions are solved, thereby reducing data storage volume and improving query response speed.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- BEIJING VOLCANO ENGINE TECH CO LTD
- Filing Date
- 2023-10-24
- Publication Date
- 2026-06-19
AI Technical Summary
In existing technologies, data query solutions suffer from problems such as large data storage requirements and poor user experience in data querying.
At least two protocol files are pre-stored, each corresponding to a different data processing engine. Each protocol file is used to parse the metadata retrieval request sent by its data processing engine and to retrieve metadata from the metadata storage space.
This reduces data storage requirements, improves data security and query response speed, and meets users' data query needs.
Figure CN117633320B_ABST
Abstract
Description
Technical Field
[0001] This disclosure relates to the field of computer technology, and in particular to a data processing method, apparatus and electronic device based on lake-warehouse integration.
Background Technology
[0002] With the development of information technology, data lakes are being used in more and more scenarios. In practice, users may issue a large number of data query tasks every day, some of which require fast responses over small data volumes. However, due to flaws in some data query solutions, the user data query experience is often poor.
Summary of the Invention
[0003] This disclosure is provided to briefly introduce the concepts, which will be described in detail in the subsequent Detailed Description section. This disclosure is not intended to identify key or essential features of the claimed technical solution, nor is it intended to limit the scope of the claimed technical solution.
[0004] In a first aspect, embodiments of this disclosure provide a data processing method based on a lake warehouse, the method comprising: pre-storing at least two protocol files, wherein each protocol file corresponds to a different data processing engine, the protocol file being used to parse a metadata acquisition request sent by its corresponding data processing engine and to acquire metadata from a metadata storage space; receiving a first metadata acquisition request sent by a data processing engine; determining a target protocol file for processing the first metadata acquisition request from the at least two protocol files according to the engine type of the data processing engine that sent the first metadata acquisition request; and parsing the first metadata acquisition request and acquiring the metadata corresponding to the first metadata acquisition request from the metadata storage space based on the target protocol file.
[0005] In a second aspect, embodiments of this disclosure provide a data processing device based on lake-warehouse integration, comprising: a storage unit for pre-storing at least two protocol files, wherein each protocol file corresponds to a different data processing engine, and the protocol file is used to parse metadata acquisition requests sent by its corresponding data processing engine and to acquire metadata from a metadata storage space; a receiving unit for receiving a first metadata acquisition request sent by a data processing engine; a determining unit for determining a target protocol file for processing the first metadata acquisition request from the at least two protocol files based on the engine type of the data processing engine that sent the first metadata acquisition request; and a parsing unit for parsing the first metadata acquisition request and acquiring the metadata corresponding to the first metadata acquisition request from the metadata storage space based on the target protocol file.
[0006] In a third aspect, embodiments of this disclosure provide an electronic device, including: one or more processors; and a storage device for storing one or more programs, wherein when the one or more programs are executed by the one or more processors, the one or more processors implement the data processing method based on lake-warehouse integration as described in the first aspect.
[0007] In a fourth aspect, embodiments of this disclosure provide a computer-readable medium having a computer program stored thereon that, when executed by a processor, implements the steps of the lake-warehouse integrated data processing method as described in the first aspect.
[0008] The data processing method, apparatus, and electronic device based on lake warehouse integration disclosed herein, for scenarios combining two data processing engines, utilize a single set of metadata and pre-store at least two protocol files. Each protocol file corresponds to a different data processing engine and is used to parse metadata retrieval requests sent by its corresponding data processing engine and retrieve metadata from the metadata storage space. The method involves receiving a first metadata retrieval request from a data processing engine; determining a target protocol file to process the first metadata retrieval request from the at least two protocol files based on the engine type of the data processing engine that sent the first metadata retrieval request; and parsing the first metadata retrieval request and retrieving the corresponding metadata from the metadata storage space based on the target protocol file. This allows for a single metadata storage and at least two sets of external protocol implementations, reducing data storage volume. Furthermore, by setting up a single metadata storage, a unified entry point for metadata access management can be established, improving data security.
Attached Figure Description
[0009] The above and other features, advantages, and aspects of the embodiments of this disclosure will become more apparent from the accompanying drawings and the following detailed description. Throughout the drawings, the same or similar reference numerals denote the same or similar elements. It should be understood that the drawings are schematic, and the components and elements are not necessarily drawn to scale.
[0010] Figure 1 is a flowchart of an embodiment of the lake-warehouse integrated data processing method according to this disclosure;
[0011] Figure 2 is a flowchart of an implementation of the lake-warehouse integrated data processing method according to this disclosure;
[0012] Figure 3 and Figure 4 are schematic diagrams illustrating application scenarios of the lake-warehouse integrated data processing method disclosed herein;
[0013] Figure 5 is a schematic diagram of the structure of one embodiment of the data processing apparatus according to this disclosure;
[0014] Figure 6 is an exemplary system architecture to which the lake-warehouse integrated data processing method of one embodiment of this disclosure can be applied;
[0015] Figure 7 is a schematic diagram of the basic structure of an electronic device provided according to an embodiment of this disclosure.
Detailed Implementation
[0016] Embodiments of this disclosure will now be described in more detail with reference to the accompanying drawings. While some embodiments of this disclosure are shown in the drawings, it should be understood that this disclosure can be implemented in various forms and should not be construed as limited to the embodiments set forth herein. Rather, these embodiments are provided to provide a more thorough and complete understanding of this disclosure. It should be understood that the accompanying drawings and embodiments of this disclosure are for illustrative purposes only and are not intended to limit the scope of protection of this disclosure.
[0017] It should be understood that the steps described in the method embodiments of this disclosure may be performed in different orders and / or in parallel. Furthermore, the method embodiments may include additional steps and / or omit the steps shown. The scope of this disclosure is not limited in this respect.
[0018] The term "comprising" and its variations as used herein are open-ended inclusions, meaning "including but not limited to". The term "based on" means "at least partially based on". The term "one embodiment" means "at least one embodiment"; the term "another embodiment" means "at least one additional embodiment"; the term "some embodiments" means "at least some embodiments". Definitions of other terms will be given in the description below.
[0019] It should be noted that the concepts of "first" and "second" mentioned in this disclosure are used only to distinguish different devices, modules or units, and are not used to limit the order of functions performed by these devices, modules or units or their interdependencies.
[0020] It should be noted that the terms "a" and "a plurality of" used in this disclosure are illustrative rather than restrictive, and those skilled in the art should understand that, unless otherwise expressly indicated in the context, they should be understood as "one or more".
[0021] The names of messages or information exchanged between multiple devices in the embodiments of this disclosure are for illustrative purposes only and are not intended to limit the scope of such messages or information.
[0022] In one or more embodiments of this disclosure, the OLAP database (e.g., Doris, StarRocks, Clickhouse, etc.) is primarily designed for real-time data warehouse analysis, while the big data processing engine (e.g., Hive / Spark / Presto, with Spark used as an example) is primarily designed for offline batch data processing and analysis. OLAP, being largely based on an in-memory MPP architecture, offers higher performance than Spark; however, OLAP databases often face stability issues when processing large volumes of data. Therefore, integrating an OLAP database and a big data processing engine is possible. However, integrating these two systems, such as connecting Spark to an OLAP database, increases operational costs, and also introduces the issue of multiple copies of data and metadata. This disclosure primarily describes how to combine the real-time capabilities of OLAP with the offline processing capabilities of Spark, achieving a unified lake warehouse capability by ensuring that data and metadata are stored only once through a unified metadata approach.
[0023] In one or more embodiments of this disclosure, the OLAP database may include a database frontend (FE) and a database backend (BE). In a scenario where Spark is integrated into the OLAP database, the FE of the OLAP database may include an access layer responsible for receiving query statements, parsing the query statements, and then sending the query statements to the corresponding data processing engine, i.e., the data processing engine corresponding to OLAP or the data processing engine corresponding to Spark; the BE of the OLAP database may be the database execution layer and storage layer, used to store the database fact tables.
[0024] In one or more embodiments of this disclosure, the request statements (e.g., metadata retrieval requests) sent by the client to the server's FE are categorized into two types: simple queries with fast response times and low data volume, and complex queries requiring large-scale ETL and involving high data volume. When the request statement type sent by the client to the server's FE is a simple query, the query request is processed based on the OLAP data processing engine; the OLAP data processing engine requests the corresponding metadata from the metadata storage space through the MySQL protocol file. When the request statement type sent by the client to the server's FE is a complex query, the data query request is processed based on the Spark data processing engine; the Spark data processing engine requests the corresponding metadata from the metadata storage space through a protocol file compatible with HiveMetastore.
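The routing rule in the preceding paragraph can be sketched as follows. This is a minimal illustration under stated assumptions, not the disclosed implementation: the disclosure does not say how a request is classified as simple or complex, so the `QueryProfile` fields, the `SIMPLE_QUERY_MAX_ROWS` threshold, and the function names are hypothetical. Only the engine-to-protocol pairing (OLAP with MySQL, Spark with a HiveMetastore-compatible protocol) comes from the text.

```python
from dataclasses import dataclass

# Hypothetical threshold: the disclosure does not define "low data volume".
SIMPLE_QUERY_MAX_ROWS = 1_000_000

# Engine type -> protocol used to reach the shared metadata storage space
# (pairing taken from the description above).
ENGINE_PROTOCOL = {
    "olap": "mysql",
    "spark": "hive_metastore",
}


@dataclass
class QueryProfile:
    estimated_rows: int  # assumed estimate of the data volume touched
    needs_etl: bool      # True for large-scale ETL style requests


def choose_engine(profile: QueryProfile) -> str:
    """Return 'olap' for simple queries and 'spark' for complex ones."""
    if profile.needs_etl or profile.estimated_rows > SIMPLE_QUERY_MAX_ROWS:
        return "spark"
    return "olap"


def route(profile: QueryProfile) -> tuple:
    """Return the (engine, protocol) pair that would serve the request."""
    engine = choose_engine(profile)
    return engine, ENGINE_PROTOCOL[engine]
```

For instance, `route(QueryProfile(estimated_rows=500, needs_etl=False))` selects the OLAP engine with the MySQL protocol, while any request flagged as ETL is sent to Spark.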
[0025] In one or more embodiments of this disclosure, user authentication is required when the Spark engine interacts with the OLAP database. An authentication plugin can be pre-installed on the Spark client. During client authentication, the authentication plugin sends predefined verification information, such as user identifier, query request identifier, token validity period, and verification code bound to the electronic device. The verification plugin on the FE side then returns a token to the client. Finally, the client returns the token, interface request (e.g., getTable), and permission information to the FE to complete the authentication operation.
[0026] In one or more embodiments of this disclosure, an access mechanism compatible with the HMS protocol is provided in the OLAP system to ensure that Hadoop systems such as Hive, Spark, and Presto can easily achieve metadata compatibility with the OLAP system. It is understood that this disclosure uses OLAP and Spark as examples, but in actual application scenarios, it can be a combination of any number of data processing engines.
[0027] Refer to Figure 1, which illustrates the flow of one embodiment of the lake-warehouse integrated data processing method according to this disclosure. The data processing method based on lake warehouse integration shown in Figure 1 includes the following steps:
[0028] Step 101: Pre-store at least two protocol files.
[0029] Here, each protocol file corresponds to a different data processing engine. The protocol file is used to parse the metadata retrieval request corresponding to the data processing engine and retrieve metadata from the metadata storage space.
[0030] Optionally, the protocol file can correspond one-to-one with a data processing engine; alternatively, one protocol file can correspond to multiple data processing engines, meaning one protocol file can be compatible with multiple data processing engines.
[0031] As an example, as shown in Figure 3, the OLAP FE (frontend) stores a first protocol and a second protocol, such as the MySQL protocol and a protocol compatible with HiveMetastore (also called the HiveMetastore protocol). The MySQL protocol corresponds to the OLAP data processing engine, and the HiveMetastore protocol corresponds to the Spark data processing engine. The MySQL protocol file is used to parse the metadata retrieval request corresponding to the OLAP data processing engine and retrieve metadata from the metadata storage space, Meta Manager. The HiveMetastore protocol file is used to parse the metadata retrieval request corresponding to the Spark data processing engine and retrieve metadata from the metadata storage space, Meta Manager.
[0032] Step 102: Receive the first metadata retrieval request sent by the data processing engine.
[0033] As an example, as shown in Figure 3, the FE (frontend) receives the first metadata retrieval request that the OLAP or Spark data engine sends to the metadata storage space, Meta Manager. The node in the OLAP data processing engine that parses the SQL statement and generates the metadata retrieval request is located in the FE and is not shown in the figure.
[0034] Step 103: Based on the engine type of the data processing engine that sent the first metadata retrieval request, determine the target protocol file for processing the metadata retrieval request from the at least two protocol files.
[0035] As an example, if the data processing engine sending the first metadata retrieval request is an OLAP data engine, the target protocol file for processing the metadata retrieval request is determined to be the MySQL protocol file from both the MySQL protocol file and the HiveMetastore protocol file. Similarly, if the data processing engine sending the first metadata retrieval request is a Spark data engine, the target protocol file for processing the metadata retrieval request is determined to be the HiveMetastore protocol file from both the MySQL protocol file and the HiveMetastore protocol file.
[0036] Step 104: Based on the target protocol file, parse the first metadata acquisition request and obtain the metadata corresponding to the first metadata acquisition request from the metadata storage space.
[0037] As an example, as shown in Figure 3, depending on whether the target protocol file is the MySQL protocol file or the HiveMetastore protocol file, the first metadata retrieval request is parsed and the corresponding metadata is retrieved from the metadata storage space. Specifically, if the target protocol file is the MySQL protocol file, after parsing the first metadata retrieval request, the metadata corresponding to the first metadata retrieval request is retrieved from the metadata storage space (Meta Manager) via the BE. If the target protocol file is the HiveMetastore protocol file, after parsing the first metadata retrieval request, the metadata corresponding to the first metadata retrieval request is retrieved from the metadata storage space (Meta Manager).
[0038] It should be noted that the data processing method provided in this embodiment, for scenarios combining two data processing engines, stores a single set of metadata and pre-stores at least two protocol files. Each protocol file corresponds to a different data processing engine and is used to parse the metadata retrieval request sent by its corresponding data processing engine and to retrieve metadata from the metadata storage space. The method receives a first metadata retrieval request sent by a data processing engine; determines the target protocol file for processing the first metadata retrieval request from the at least two protocol files based on the engine type of the data processing engine that sent the request; and, based on the target protocol file, parses the first metadata retrieval request and retrieves the corresponding metadata from the metadata storage space. In this way, a single metadata store is served through at least two external protocol implementations, which reduces data storage volume. In addition, keeping a single copy of metadata establishes a unified entry point for metadata access management, which improves data security.
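Steps 101 to 104 can be illustrated with a small dispatch table. The sketch below is an assumption-laden stand-in rather than the patented implementation: the two parser functions, the request shapes, the `presto` entry, and the sample metadata are hypothetical, and real MySQL or HiveMetastore wire protocols are far richer than these string and dictionary placeholders.

```python
from typing import Any, Callable, Dict

# Single metadata storage space shared by every engine (sample content is made up).
META_STORE: Dict[str, dict] = {
    "sales.orders": {"columns": ["id", "amount"]},
}


def parse_mysql_style(request: str) -> str:
    """Hypothetical parser: 'SHOW COLUMNS FROM db.table' -> 'db.table'."""
    return request.rsplit(" ", 1)[-1]


def parse_hms_style(request: dict) -> str:
    """Hypothetical parser for a HiveMetastore-like get_table call."""
    return f"{request['db']}.{request['table']}"


# Step 101: pre-store at least two protocol handlers, keyed by engine type.
PROTOCOLS: Dict[str, Callable[[Any], str]] = {
    "olap": parse_mysql_style,
    "spark": parse_hms_style,
    "presto": parse_hms_style,  # one protocol may serve several engines
}


def handle_metadata_request(engine_type: str, request: Any) -> dict:
    """Steps 102-104: receive a request, pick the protocol, parse, fetch metadata."""
    # Step 103: determine the target protocol from the sender's engine type.
    parser = PROTOCOLS.get(engine_type)
    if parser is None:
        raise ValueError(f"no protocol registered for engine type {engine_type!r}")
    # Step 104: parse the request, then read from the single metadata store.
    table_key = parser(request)
    return META_STORE[table_key]
```

Both `handle_metadata_request("olap", "SHOW COLUMNS FROM sales.orders")` and `handle_metadata_request("spark", {"db": "sales", "table": "orders"})` resolve to the same entry in `META_STORE`, which is the point of keeping one copy of metadata behind several protocols.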
[0039] In some embodiments, user authentication is required when the Spark engine interacts with the OLAP database. An authentication plugin can be pre-installed on the Spark client. The Spark client authentication process can be performed according to the steps shown in Figure 2.
[0040] Step 201: In response to receiving a data processing request associated with predefined verification information, verify the first electronic device that sent the data acquisition request based on the predefined verification information.
[0041] The first electronic device could be, for example, a Spark client.
[0042] As an example, referring to Figure 4, the FE (frontend) responds to a data processing request sent from the Spark engine that is associated with predefined verification information. Based on the predefined verification information, the FE verifies the first electronic device that sent the data acquisition request. The first electronic device is a client that includes the Spark engine. The predefined verification information includes at least one of the following: userID, token TTL, verification code (private keys) bound to the first electronic device, and queryID. As an example, queryID is mainly used for query verification; userID identifies the user who submitted the query; TTL ensures that the token expires even if it is maliciously obtained; and the private keys bound to the machine ensure that the query originates from a manageable machine in the cloud.
[0043] Step 202: In response to successful verification, return the first token.
[0044] As an example, referring to Figure 4, when the predefined verification information sent from the first electronic device is verified, the FE returns the first token to the Spark client.
[0045] In response to receiving the first token, the first electronic device sends the first metadata retrieval request bound to the first token.
[0046] Specifically, referring to Figure 4, this includes the Spark engine client sending the first token, the interface request (getTable) bound to the first token, and the permission information (checkPrivilege) to the FE.
[0047] The first electronic device has a pre-installed authentication plugin. When it receives a data processing request instruction, the authentication plugin associates predefined verification information with the data processing request and sends it.
[0048] Correspondingly, the server has a pre-installed verification plugin, which is used to verify the data processing request when it receives a data processing request sent by the authentication plugin.
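The token exchange in steps 201 and 202 and the subsequent token-bound request can be sketched as below. Everything concrete here is an assumption made for illustration: the registries `KNOWN_MACHINE_CODES` and `PRIVILEGES`, the `MAX_TTL_SECONDS` cap, the in-memory token table, and the function names are hypothetical, and the disclosure does not specify how the verification code is checked or how tokens are generated.

```python
import secrets
import time

# Hypothetical server-side registries; a real deployment would use a secure store.
KNOWN_MACHINE_CODES = {"machine-code-123"}
PRIVILEGES = {("alice", "sales.orders")}
MAX_TTL_SECONDS = 3600  # assumed upper bound on token lifetime

_tokens = {}  # token -> (user_id, expiry timestamp)


def issue_token(user_id, query_id, ttl_seconds, machine_code, now=None):
    """Steps 201-202: verify the predefined verification info, then return a token."""
    now = time.time() if now is None else now
    if machine_code not in KNOWN_MACHINE_CODES:
        raise PermissionError("request does not come from a managed machine")
    if not user_id or not query_id or not (0 < ttl_seconds <= MAX_TTL_SECONDS):
        raise PermissionError("invalid verification information")
    token = secrets.token_urlsafe(32)
    _tokens[token] = (user_id, now + ttl_seconds)
    return token


def get_table(token, table, now=None):
    """Serve a token-bound metadata request after checking expiry and privilege."""
    now = time.time() if now is None else now
    entry = _tokens.get(token)
    if entry is None:
        raise PermissionError("unknown token")
    user_id, expires_at = entry
    if now >= expires_at:
        raise PermissionError("token expired")
    if (user_id, table) not in PRIVILEGES:
        raise PermissionError("no privilege on table")
    return {"table": table, "requested_by": user_id}
```

Because verification and token issuance both happen on the server side, a client that tampers with its local plugin still cannot obtain metadata without presenting a token the server itself issued.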
[0049] In some embodiments, the metadata storage space is located outside the database front end.
[0050] It should be noted that the metadata management, originally built into the FE, is now externalized to a storage space outside the FE to resolve the limitations imposed on the FE from the master node. This allows stateful information to be stored outside the FE master-slave nodes, eliminating the need for query requests to be bound to the FE. In high-frequency operation scenarios (such as insert or update operations), this avoids the performance bottleneck of the FE alone, thus improving operational efficiency.
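The effect of moving metadata out of the FE can be shown with two frontend objects that share one external store. This is a toy sketch under assumptions: the class names `MetaManager` and `Frontend`, their methods, and the sample table are hypothetical, and an in-process dictionary stands in for what would be a separate storage service.

```python
class MetaManager:
    """Stand-in for the metadata storage space located outside the FE."""

    def __init__(self):
        self._tables = {}

    def put_table(self, name, columns):
        self._tables[name] = list(columns)

    def get_table(self, name):
        return self._tables[name]


class Frontend:
    """A stateless FE node: it holds a reference to the store, not the metadata."""

    def __init__(self, name, meta):
        self.name = name
        self._meta = meta

    def create_table(self, table, columns):
        self._meta.put_table(table, columns)

    def describe(self, table):
        return self._meta.get_table(table)


# Two FE nodes backed by the same external metadata store.
meta = MetaManager()
fe_a = Frontend("fe-a", meta)
fe_b = Frontend("fe-b", meta)

# A write through one FE is visible through the other, so a request
# does not need to be bound to a particular FE node.
fe_a.create_table("sales.orders", ["id", "amount"])
columns_seen_by_b = fe_b.describe("sales.orders")
```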
[0051] In some embodiments, the method further includes: in response to receiving a first query request sent by a client, determining a target data processing engine from at least two data processing engines to process the first query request; sending the first query request to the target data processing engine, wherein the target data processing engine generates the first metadata retrieval request during the parsing of the first query request.
[0052] In some embodiments, the target processing engine queries the fact table of the database for the target data corresponding to the first query request based on the acquired metadata.
[0053] As an example, when a user submits a Spark SQL statement on the client, the FE will submit a Spark SQL request. At this time, the Spark client receives the SQL statement and needs to obtain the metadata in the Analyzer. The metadata of the table can be obtained through the HiveMetastore protocol, but authentication and login are required first.
[0054] The metadata authentication provided in this disclosure is a token-based authentication method, not a super administrator authentication method. This authentication method needs to be integrated and injected into the user's code. It therefore avoids having users authenticate through a super administrator account, and it secures the connection through a custom authentication method. Open-source Spark authentication is local authentication, which has relatively low security because users can modify the code when customizing JAR packages, so this solution adopts server-side authentication.
[0055] With further reference to Figure 5, as an implementation of the methods shown in the above figures, this disclosure provides an embodiment of a data processing device based on lake-warehouse integration. This device embodiment corresponds to the method embodiment shown in Figure 1, and the device can be specifically applied to various electronic devices.
[0056] As shown in Figure 5, the lake-warehouse integrated data processing device of this embodiment includes: a storage unit 501, a receiving unit 502, a determining unit 503, and a parsing unit 504. The storage unit is used to pre-store at least two protocol files, each corresponding to a different data processing engine. The protocol files are used to parse metadata retrieval requests sent by their corresponding data processing engines and to retrieve metadata from the metadata storage space. The receiving unit is used to receive a first metadata retrieval request sent by a data processing engine. The determining unit is used to determine, based on the engine type of the data processing engine that sent the first metadata retrieval request, a target protocol file from the at least two protocol files to process the first metadata retrieval request. The parsing unit is used to parse the first metadata retrieval request based on the target protocol file and to retrieve the metadata corresponding to the first metadata retrieval request from the metadata storage space.
[0057] In this embodiment, for the specific processing of the storage unit 501, receiving unit 502, determining unit 503, and parsing unit 504 of the integrated lakehouse data processing device and the resulting technical effects, refer respectively to the descriptions of steps 101, 102, 103, and 104 in the embodiment corresponding to Figure 1; they will not be repeated here.
[0058] In some embodiments, the first metadata acquisition request is bound to a first token; and the device is further configured to: in response to receiving a data processing request associated with predefined verification information, verify the first electronic device that sent the data acquisition request based on the predefined verification information; in response to successful verification, return the first token; and in response to receiving the first token, send the first metadata acquisition request bound to the first token.
[0059] In some embodiments, the predefined verification information includes at least one of the following: user identifier, token duration, and verification code bound to the first electronic device.
[0060] In some embodiments, the first electronic device is pre-installed with an authentication plugin, which is used to associate predefined verification information with the data processing request and send it when a data processing request instruction is received.
[0061] In some embodiments, the metadata storage space is located outside the database front end.
[0062] In some embodiments, the apparatus is further configured to: in response to receiving a first query request sent by a client, determine a target data processing engine from at least two data processing engines to process the first query request; and send the first query request to the target data processing engine, wherein the target data processing engine generates the first metadata retrieval request during the parsing of the first query request.
[0063] In some embodiments, the target processing engine queries the fact table of the database for target data corresponding to the first query request based on the acquired metadata.
[0064] This allows for the integration of OLAP's real-time capabilities and Spark's offline processing capabilities, while ensuring that data and metadata are stored in a single copy through a unified metadata approach, thus achieving lake warehouse integration.
[0065] Please refer to Figure 6, which shows an exemplary system architecture to which the data processing method based on lake warehouse integration according to an embodiment of this disclosure can be applied.
[0066] As shown in Figure 6, the system architecture may include terminal devices 601, 602, and 603, a network 604, and a server 605. Network 604 serves as the medium for providing a communication link between terminal devices 601, 602, and 603 and server 605. Network 604 may include various connection types, such as wired or wireless communication links or fiber optic cables.
[0067] Terminal devices 601, 602, and 603 can interact with server 605 via network 604 to receive or send messages, etc. Various client applications, such as web browsers, search engines, and news apps, can be installed on terminal devices 601, 602, and 603. These client applications can receive user commands and perform corresponding functions, such as adding information to a message based on user instructions.
[0068] Terminal devices 601, 602, and 603 can be either hardware or software. When terminal devices 601, 602, and 603 are hardware, they can be various electronic devices with a display screen and supporting web browsing, including but not limited to smartphones, tablets, e-book readers, MP3 players (Moving Picture Experts Group Audio Layer III), MP4 players (Moving Picture Experts Group Audio Layer IV), laptops, and desktop computers, etc. When terminal devices 601, 602, and 603 are software, they can be installed in the aforementioned electronic devices. They can be implemented as multiple software programs or software modules (e.g., software programs or software modules used to provide distributed services) or as a single software program or software module. No specific limitations are imposed here.
[0069] Server 605 can be a server that provides various services, such as receiving information retrieval requests sent by terminal devices 601, 602, and 603, and retrieving the corresponding display information through various methods based on the information retrieval requests. The relevant data for the display information is then sent to terminal devices 601, 602, and 603.
[0070] It should be noted that the lake-warehouse integrated data processing method provided in this embodiment can be executed by a terminal device, and correspondingly, the lake-warehouse integrated data processing device can be installed in terminal devices 601, 602, and 603. Furthermore, the lake-warehouse integrated data processing method provided in this embodiment can also be executed by a server 605, and correspondingly, the lake-warehouse integrated data processing device can be installed in server 605.
[0071] It should be understood that the number of terminal devices, networks, and servers shown in Figure 6 is merely illustrative. Depending on implementation needs, any number of terminal devices, networks, and servers can be included.
[0072] Reference is now made to Figure 7, which shows a schematic structural diagram of an electronic device (e.g., the terminal device or server in Figure 6) suitable for implementing embodiments of this disclosure. The terminal device in this embodiment may include, but is not limited to, mobile terminals such as mobile phones, laptops, digital broadcast receivers, PDAs (personal digital assistants), PADs (tablet computers), PMPs (portable multimedia players), vehicle terminals (e.g., vehicle navigation terminals), and fixed terminals such as digital TVs and desktop computers. The electronic device shown in Figure 7 is merely an example and should not be construed as limiting the functionality and scope of the embodiments disclosed herein.
[0073] As shown in Figure 7, the electronic device may include a processing unit (e.g., a central processing unit, a graphics processing unit, etc.) 701, which can perform various appropriate actions and processes according to a program stored in a read-only memory (ROM) 702 or a program loaded from a storage device 708 into a random access memory (RAM) 703. The RAM 703 also stores various programs and data required for the operation of the electronic device 700. The processing unit 701, ROM 702, and RAM 703 are interconnected via a bus 704. An input / output (I / O) interface 705 is also connected to the bus 704.
[0074] Typically, the following devices can be connected to I / O interface 705: input devices 706 including, for example, touchscreens, touchpads, keyboards, mice, cameras, microphones, accelerometers, gyroscopes, etc.; output devices 707 including, for example, liquid crystal displays (LCDs), speakers, vibrators, etc.; storage devices 708 including, for example, magnetic tapes, hard disks, etc.; and communication devices 709. Communication device 709 allows the electronic device to communicate wirelessly or by wire with other devices to exchange data. Although Figure 7 shows an electronic device with various devices, it should be understood that not all of the devices shown are required to be implemented or present. More or fewer devices may alternatively be implemented or present.
[0075] In particular, according to embodiments of this disclosure, the processes described above with reference to the flowcharts can be implemented as computer software programs. For example, embodiments of this disclosure include a computer program product comprising a computer program carried on a non-transitory computer-readable medium, the computer program containing program code for performing the methods shown in the flowcharts. In such embodiments, the computer program can be downloaded and installed from a network via communication device 709, or installed from storage device 708, or installed from ROM 702. When the computer program is executed by processing device 701, it performs the functions defined in the methods of embodiments of this disclosure.
[0076] It should be noted that the computer-readable medium described in this disclosure can be a computer-readable signal medium or a computer-readable storage medium, or any combination thereof. A computer-readable storage medium can be, for example, but is not limited to, an electrical, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination thereof. More specific examples of a computer-readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer disk, a hard disk, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), optical fiber, portable compact disk read-only memory (CD-ROM), optical storage device, magnetic storage device, or any suitable combination thereof. In this disclosure, a computer-readable storage medium can be any tangible medium containing or storing a program that can be used by or in connection with an instruction execution system, apparatus, or device. In this disclosure, a computer-readable signal medium can include a data signal propagated in baseband or as part of a carrier wave, carrying computer-readable program code. Such propagated data signals can take various forms, including but not limited to electromagnetic signals, optical signals, or any suitable combination thereof. A computer-readable signal medium can also be any computer-readable medium, other than a computer-readable storage medium, that can send, propagate, or transmit a program for use by or in connection with an instruction execution system, apparatus, or device. The program code contained on the computer-readable medium can be transmitted using any suitable medium, including but not limited to: wires, optical fibers, RF (radio frequency), etc., or any suitable combination thereof.
[0077] In some implementations, clients and servers can communicate using any currently known or future-developed network protocol, such as HTTP (Hypertext Transfer Protocol), and can be interconnected by digital data communication of any form or medium (e.g., a communication network). Examples of communication networks include local area networks (“LANs”), wide area networks (“WANs”), the Internet (e.g., the Internet of Things), and peer-to-peer networks (e.g., ad hoc peer-to-peer networks), as well as any currently known or future-developed networks.
[0078] The aforementioned computer-readable medium may be included in the aforementioned electronic device, or it may exist separately without being assembled into the electronic device.
[0079] The aforementioned computer-readable medium carries one or more programs. When the one or more programs are executed by the electronic device, the electronic device is caused to: pre-store at least two protocol files, each corresponding to a different data processing engine, where the protocol files are used to parse metadata retrieval requests sent by their corresponding data processing engines in order to retrieve metadata from a metadata storage space; receive a first metadata retrieval request sent by a data processing engine; determine, from the at least two protocol files and based on the engine type of the data processing engine that sent the first metadata retrieval request, a target protocol file for processing the first metadata retrieval request; and, based on the target protocol file, parse the first metadata retrieval request and retrieve the metadata corresponding to the first metadata retrieval request from the metadata storage space.
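By way of a non-limiting illustration, the steps above can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the implementation of this disclosure: the engine type names (`engine_a`, `engine_b`), the two request wire formats, the `ProtocolFile` class, the `lake://` location string, and the in-memory `METADATA_STORE` are all hypothetical and chosen only to show a request being dispatched to a per-engine protocol file and resolved against a single shared metadata store.

```python
import json
from dataclasses import dataclass
from typing import Callable, Dict

# Hypothetical metadata storage space: one shared copy, kept outside any engine.
METADATA_STORE: Dict[str, dict] = {
    "sales.orders": {
        "columns": ["order_id", "amount"],
        "location": "lake://sales/orders",  # made-up location string
    },
}


@dataclass(frozen=True)
class ProtocolFile:
    """Hypothetical stand-in for a pre-stored protocol file of one engine."""
    engine_type: str
    parse: Callable[[bytes], str]  # raw request -> key into METADATA_STORE


def parse_json_request(raw: bytes) -> str:
    # Assumed format for engine_a: {"db": "...", "table": "..."}
    body = json.loads(raw.decode("utf-8"))
    return f"{body['db']}.{body['table']}"


def parse_delimited_request(raw: bytes) -> str:
    # Assumed format for engine_b: GET_TABLE|<db>|<table>
    _, db, table = raw.decode("utf-8").split("|")
    return f"{db}.{table}"


# Pre-stored protocol files, one per engine type.
PROTOCOL_FILES: Dict[str, ProtocolFile] = {
    "engine_a": ProtocolFile("engine_a", parse_json_request),
    "engine_b": ProtocolFile("engine_b", parse_delimited_request),
}


def handle_metadata_request(engine_type: str, raw: bytes) -> dict:
    """Select the target protocol file by engine type, parse, then look up metadata."""
    try:
        protocol = PROTOCOL_FILES[engine_type]
    except KeyError:
        raise ValueError(
            f"no protocol file registered for engine type {engine_type!r}"
        ) from None
    table_key = protocol.parse(raw)
    return METADATA_STORE[table_key]
```

In this sketch, requests from both hypothetical engines resolve to the same stored object, which mirrors the idea that the metadata is stored only once and that only the parsing step differs per engine.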
[0080] Computer program code for performing the operations of this disclosure can be written in one or more programming languages or a combination thereof, including but not limited to object-oriented programming languages such as Java, Smalltalk, and C++, as well as conventional procedural programming languages such as the "C" language or similar programming languages. The program code can be executed entirely on the user's computer, partially on the user's computer, as a standalone software package, partially on the user's computer and partially on a remote computer, or entirely on a remote computer or server. In cases involving remote computers, the remote computer can be connected to the user's computer via any type of network—including a local area network (LAN) or a wide area network (WAN)—or can be connected to an external computer (e.g., via the Internet using an Internet service provider).
[0081] The flowcharts and block diagrams in the accompanying drawings illustrate the architecture, functionality, and operation of possible implementations of systems, methods, and computer program products according to various embodiments of this disclosure. In this regard, each block in a flowchart or block diagram may represent a module, segment, or portion of code containing one or more executable instructions for implementing a specified logical function. It should also be noted that in some alternative implementations, the functions indicated in the blocks may occur in a different order than those indicated in the drawings. For example, two consecutively indicated blocks may actually be executed substantially in parallel, and they may sometimes be executed in reverse order, depending on the functions involved. It should also be noted that each block in the block diagrams and / or flowcharts, and combinations of blocks in the block diagrams and / or flowcharts, can be implemented using a dedicated hardware-based system that performs the specified function or operation, or using a combination of dedicated hardware and computer instructions.
[0082] The units described in the embodiments of this disclosure can be implemented in software or hardware. The name of a unit does not, in certain circumstances, limit the unit itself; for example, a receiving unit can also be described as a "unit that receives requests." The functions described above can be performed at least in part by one or more hardware logic components. For example, exemplary types of hardware logic components that can be used include, without limitation: Field Programmable Gate Arrays (FPGAs), Application-Specific Integrated Circuits (ASICs), Application-Specific Standard Products (ASSPs), Systems-on-Chip (SoCs), Complex Programmable Logic Devices (CPLDs), etc.
[0083] In the context of this disclosure, a machine-readable medium can be a tangible medium that may contain or store a program for use by or in conjunction with an instruction execution system, apparatus, or device. A machine-readable medium can be a machine-readable signal medium or a machine-readable storage medium. A machine-readable medium can be, but is not limited to, electronic, magnetic, optical, electromagnetic, infrared, or semiconductor systems, apparatus, or devices, or any suitable combination of the foregoing. More specific examples of machine-readable storage media include electrical connections based on one or more wires, portable computer disks, hard disks, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), optical fiber, portable compact disk read-only memory (CD-ROM), optical storage devices, magnetic storage devices, or any suitable combination of the foregoing.
[0084] The above description is merely a preferred embodiment of this disclosure and an explanation of the technical principles employed. Those skilled in the art should understand that the scope of this disclosure is not limited to technical solutions formed by specific combinations of the above-described technical features, but should also cover other technical solutions formed by arbitrary combinations of the above-described technical features or their equivalents without departing from the above-described concept. For example, it covers technical solutions formed by replacing the above features with technical features having similar functions that are disclosed in (but not limited to) this disclosure.
[0085] Furthermore, while the operations are described in a specific order, this should not be construed as requiring these operations to be performed in the specific order shown or in a sequential order. In certain environments, multitasking and parallel processing may be advantageous. Similarly, while several specific implementation details are included in the above discussion, these should not be construed as limiting the scope of this disclosure. Certain features described in the context of individual embodiments may also be implemented in combination in a single embodiment. Conversely, various features described in the context of a single embodiment may also be implemented individually or in any suitable sub-combination in multiple embodiments.
[0086] Although the subject matter has been described using language specific to structural features and / or methodological logic, it should be understood that the subject matter defined in the appended claims is not necessarily limited to the specific features or actions described above. Rather, the specific features and actions described above are merely illustrative examples of implementing the claims.
Claims
1. A data processing method based on lake-warehouse integration, characterized in that it comprises: pre-storing at least two protocol files, each corresponding to a different data processing engine, wherein the protocol files are used to parse metadata retrieval requests sent by their corresponding data processing engines and to retrieve metadata from a metadata storage space; receiving a first metadata retrieval request sent by a data processing engine; determining, from the at least two protocol files and based on the engine type of the data processing engine that sent the first metadata retrieval request, a target protocol file for processing the first metadata retrieval request; and, based on the target protocol file, parsing the first metadata retrieval request and retrieving the metadata corresponding to the first metadata retrieval request from the metadata storage space; wherein the metadata storage space is located outside the database front-end, and the metadata is stored only once; the method further comprises: in response to receiving a first query request from a client, determining, from at least two data processing engines, a target data processing engine for processing the first query request; and sending the first query request to the target data processing engine, wherein the target data processing engine generates the first metadata retrieval request during the parsing of the first query request.
2. The method according to claim 1, characterized in that the first metadata retrieval request is bound to a first token; and the method further comprises: in response to receiving a data processing request associated with predefined verification information, verifying, based on the predefined verification information, a first electronic device that sent the data processing request; and, upon successful verification, returning the first token; wherein, in response to receiving the first token, the first electronic device sends the first metadata retrieval request bound to the first token.
3. The method according to claim 2, characterized in that the predefined verification information includes at least one of the following: a user identifier, a token duration, and a verification code bound to the first electronic device.
4. The method according to claim 2, characterized in that the first electronic device has a pre-installed authentication plugin; when a data processing request instruction is received, the authentication plugin associates the predefined verification information with the data processing request and sends it.
5. The method according to claim 1, characterized in that the target data processing engine queries, based on the acquired metadata, a fact table in the database for the target data corresponding to the first query request.
6. A data processing device based on lake-warehouse integration, characterized in that it comprises: a storage unit, configured to pre-store at least two protocol files, each corresponding to a different data processing engine, wherein the protocol files are used to parse metadata retrieval requests sent by their corresponding data processing engines and to retrieve metadata from a metadata storage space; a receiving unit, configured to receive a first metadata retrieval request sent by a data processing engine; a determining unit, configured to determine, from the at least two protocol files and based on the engine type of the data processing engine that sent the first metadata retrieval request, a target protocol file for processing the first metadata retrieval request; and a parsing unit, configured to parse the first metadata retrieval request and retrieve the metadata corresponding to the first metadata retrieval request from the metadata storage space based on the target protocol file; wherein the metadata storage space is located outside the database front-end, and the metadata is stored only once; the device is further configured to: in response to receiving a first query request sent by a client, determine, from at least two data processing engines, a target data processing engine for processing the first query request; and send the first query request to the target data processing engine, wherein the target data processing engine generates the first metadata retrieval request during the parsing of the first query request.
7. An electronic device, characterized in that it comprises: one or more processors; and a storage device for storing one or more programs, wherein, when the one or more programs are executed by the one or more processors, the one or more processors implement the method according to any one of claims 1-5.
8. A computer-readable medium having a computer program stored thereon, characterized in that, when the program is executed by a processor, it implements the method according to any one of claims 1-5.
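As an informal, non-limiting aid to reading claims 2 to 4, the token flow can be sketched in Python as follows. This is a sketch under stated assumptions and forms no part of the claims: the function names `issue_token` and `is_token_valid`, the `DEVICE_CODES` registry, and the use of all three kinds of verification information together (user identifier, token duration, device-bound verification code) are hypothetical choices made only for illustration.

```python
import secrets
import time
from typing import Dict, Optional, Tuple

# Hypothetical registry of verification codes bound to known devices.
DEVICE_CODES: Dict[str, str] = {"device-1": "code-abc"}

# Issued tokens mapped to (user identifier, expiry timestamp).
_issued: Dict[str, Tuple[str, float]] = {}


def issue_token(
    user_id: str,
    device_id: str,
    code: str,
    ttl_seconds: float,
    now: Optional[float] = None,
) -> Optional[str]:
    """Verify the device-bound code; on success return a token valid for ttl_seconds."""
    now = time.time() if now is None else now
    expected = DEVICE_CODES.get(device_id)
    # Constant-time comparison avoids leaking how much of the code matched.
    if expected is None or not secrets.compare_digest(expected, code):
        return None
    token = secrets.token_urlsafe(32)
    _issued[token] = (user_id, now + ttl_seconds)
    return token


def is_token_valid(token: str, now: Optional[float] = None) -> bool:
    """Check that a token bound to a metadata retrieval request is known and unexpired."""
    now = time.time() if now is None else now
    entry = _issued.get(token)
    return entry is not None and now < entry[1]
```

In such a sketch, a metadata retrieval request would carry the returned token, and the receiving side would call `is_token_valid` before parsing the request with the target protocol file.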