A data processing method and apparatus
By employing a hash data distribution strategy when the parallelism of upstream and downstream operators is inconsistent, the data to be loaded is determined from the dimension table middleware and cached, which solves the problems of low caching efficiency and high memory consumption in the existing technology, and achieves a higher cache hit rate and framework scalability.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- BEIJING WODONG TIANJUN INFORMATION TECH CO LTD
- Filing Date
- 2022-08-25
- Publication Date
- 2026-06-16
Figure CN115470245B_ABST
Abstract
Description
Technical Field
[0001] This invention relates to the field of computer technology, and in particular to a data processing method and apparatus.
Background Technology
[0002] Flink SQL is a development language designed by Flink (an open-source computing engine) to simplify the computational model and lower the barrier to entry for users of real-time computing. It conforms to the semantics of standard SQL (Structured Query Language). Currently, the open-source Flink framework distributes data using a forward data distribution strategy, meaning that upstream and downstream operators have the same degree of parallelism, and data from upstream subtasks will enter downstream subtasks with the same subtask ID. Furthermore, the open-source Flink framework requires different dimension table implementation schemes for different dimension table middleware.
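The forward strategy described above can be sketched in a few lines. This is an illustrative model only, not Flink code; the function name `forward_target` and its parameters are hypothetical. It shows why the strategy cannot route data once upstream and downstream parallelism differ.

```python
def forward_target(upstream_subtask: int,
                   upstream_parallelism: int,
                   downstream_parallelism: int) -> int:
    """Return the downstream subtask index under a forward strategy."""
    # Forward distribution pairs subtask i with subtask i, so both sides
    # must run with the same parallelism.
    if upstream_parallelism != downstream_parallelism:
        raise ValueError("forward distribution needs equal parallelism on both sides")
    if not 0 <= upstream_subtask < upstream_parallelism:
        raise ValueError("subtask index out of range")
    return upstream_subtask
```

With equal parallelism the call simply returns the same index; with unequal parallelism it raises, which mirrors the limitation the invention addresses.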
[0003] In the process of realizing this invention, the inventors discovered at least the following problems in the prior art:
[0004] The inability to distribute data when the parallelism of upstream and downstream operators is inconsistent, and the uncertainty of data distribution rules, result in low cache hit rate, high memory consumption, low caching efficiency, low scalability of open source frameworks, low code reusability, and unsuitability for secondary development.
Summary of the Invention
[0005] In view of this, embodiments of the present invention provide a data processing method and apparatus that can distribute data when the parallelism of upstream and downstream operators is inconsistent, support hash data distribution strategies, improve the cache hit rate and caching efficiency of dimension table middleware, reduce memory consumption, improve the scalability and code reusability of open source frameworks, and allow for flexible secondary development.
[0006] To achieve the above objectives, a data processing method is provided according to one aspect of the present invention.
[0007] A data processing method includes: in response to a data processing request, determining whether a filtering strategy has been enabled; in response to the filtering strategy being enabled, using a preset algorithm to determine data to be loaded from a dimension table middleware that satisfies the filtering strategy, and loading the data to be loaded into a cache; in response to the filtering strategy not being enabled, using all data in the dimension table middleware as the data to be loaded, and loading the data to be loaded into the cache; and processing the data to be loaded in the cache to generate a processing result of the data processing request.
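The four steps of the method can be sketched as follows. This is a minimal sketch under stated assumptions: the dimension table middleware is modeled as an in-memory list of rows, and the names `load_cache`, `handle_request`, `satisfies_filter`, and `process` are hypothetical, not taken from the patent.

```python
from typing import Callable, Dict, Iterable, List

Row = Dict[str, object]


def load_cache(dimension_rows: Iterable[Row],
               filter_enabled: bool,
               satisfies_filter: Callable[[Row], bool]) -> List[Row]:
    """Choose the data to be loaded and return it as the cache contents."""
    if filter_enabled:
        # Filtering strategy enabled: keep only the rows that satisfy it.
        return [row for row in dimension_rows if satisfies_filter(row)]
    # Filtering strategy not enabled: all rows are the data to be loaded.
    return list(dimension_rows)


def handle_request(dimension_rows: Iterable[Row],
                   filter_enabled: bool,
                   satisfies_filter: Callable[[Row], bool],
                   process: Callable[[List[Row]], object]) -> object:
    """Load the cache, then process the cached data into a result."""
    cache = load_cache(dimension_rows, filter_enabled, satisfies_filter)
    return process(cache)
```

Here `process` stands in for whatever processing the request needs; passing `len`, for instance, returns how many rows were cached.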
[0008] Optionally, the filtering strategy includes a filtering field name and a filtering field value. The step of using a preset algorithm to determine the data to be loaded from the dimension table middleware that satisfies the filtering strategy and loading the data to be loaded into the cache includes: encoding the field value corresponding to the filtering field name in the data of the dimension table middleware according to the preset algorithm to obtain a first field value corresponding to the filtering field name; calculating a second field value corresponding to the filtering field name based on the first field value and the parallelism of the dimension table operator corresponding to the dimension table middleware; determining the data to be loaded from the data of the dimension table middleware that satisfies the filtering strategy according to the second field value corresponding to the filtering field name and the filtering field value, and loading the data to be loaded into the cache.
[0009] Optionally, the step of encoding the field value corresponding to the filter field name in the data of the dimension table middleware according to a preset algorithm to obtain the first field value corresponding to the filter field name includes: encoding the field value corresponding to the filter field name in the data of the dimension table middleware using a hash algorithm to obtain the first field value corresponding to the filter field name.
[0010] Optionally, calculating the second field value corresponding to the filter field name based on the first field value and the parallelism of the dimension table operator corresponding to the dimension table middleware includes: performing a modulo operation on the first field value with respect to the parallelism to obtain a remainder, and using the remainder as the second field value corresponding to the filter field name.
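The encoding and modulo steps can be sketched as below. The patent only says "a hash algorithm"; using CRC32 here is an assumption made so that the result is stable across processes, and the function names are hypothetical.

```python
import zlib


def first_field_value(field_value: object) -> int:
    """Encode a field value with a hash algorithm (CRC32 is an assumption)."""
    return zlib.crc32(str(field_value).encode("utf-8"))


def second_field_value(field_value: object, parallelism: int) -> int:
    """Take the first field value modulo the parallelism and return the remainder."""
    if parallelism <= 0:
        raise ValueError("parallelism must be positive")
    return first_field_value(field_value) % parallelism
```

The remainder always lies in the range 0 to parallelism - 1, so it can be used directly as the index of a dimension table sub-operator.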
[0011] Optionally, determining the data to be loaded that satisfies the filtering strategy from the data in the dimension table middleware based on the second field value corresponding to the filtering field name and the filtering field value includes: adding data with the same second field value in the dimension table middleware to the same dimension table sub-operator based on the second field value corresponding to the filtering field name, wherein the number of dimension table sub-operators is determined by the parallelism of the dimension table operators; determining a target sub-operator from the dimension table sub-operators based on the filtering field value; and determining the data in the target sub-operator as the data to be loaded that satisfies the filtering strategy.
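Grouping rows by their second field value and then picking the target sub-operator might look like the sketch below. The function names are hypothetical, and the hash function is passed in explicitly because the patent does not fix a particular one.

```python
from typing import Callable, Dict, Iterable, List

Row = Dict[str, object]


def assign_to_sub_operators(rows: Iterable[Row],
                            filter_field_name: str,
                            parallelism: int,
                            hash_fn: Callable[[object], int]) -> Dict[int, List[Row]]:
    """Group rows so that equal second field values share a sub-operator."""
    # One group per sub-operator; the count equals the parallelism.
    groups: Dict[int, List[Row]] = {index: [] for index in range(parallelism)}
    for row in rows:
        second_value = hash_fn(row[filter_field_name]) % parallelism
        groups[second_value].append(row)
    return groups


def data_to_load(groups: Dict[int, List[Row]], filter_field_value: int) -> List[Row]:
    """Select the target sub-operator's rows by the filter field value."""
    return groups[filter_field_value]
```

Every sub-operator index is present in the result even when no row maps to it, so selecting a target never fails for a valid filter field value.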
[0012] Optionally, the method further includes, when the data loading strategy is a full loading strategy, using all data in the dimension table middleware as data to be loaded, and loading the data to be loaded into the cache.
[0013] Optionally, the filtering strategy further includes a processing algorithm, wherein processing the data to be loaded in the cache to generate the processing result of the data processing request includes: processing the data to be loaded in the cache through the target sub-operator according to the processing algorithm to generate the processing result of the data processing request.
[0014] Optionally, the data processing request may be a periodic loading request sent via a timer or a preloading request during startup initialization.
[0015] According to another aspect of the present invention, a data processing apparatus is provided.
[0016] A data processing apparatus includes: a filtering strategy determination module, configured to determine whether a filtering strategy has been enabled in response to a data processing request; a first data to be loaded determination module, configured to determine, in response to the enabled filtering strategy, data to be loaded that meets the filtering strategy from a dimension table middleware using a preset algorithm, and load the data to be loaded into a cache; a second data to be loaded determination module, configured to, in response to the disabled filtering strategy, use all data in the dimension table middleware as the data to be loaded, and load the data to be loaded into the cache; and a data processing module, configured to process the data to be loaded in the cache and generate a processing result of the data processing request.
[0017] Optionally, the filtering strategy includes a filtering field name and a filtering field value. The first data to be loaded determination module is further configured to: encode the field value corresponding to the filtering field name in the data of the dimension table middleware according to a preset algorithm to obtain a first field value corresponding to the filtering field name; calculate a second field value corresponding to the filtering field name based on the first field value and the parallelism of the dimension table operator corresponding to the dimension table middleware; determine the data to be loaded that satisfies the filtering strategy from the data of the dimension table middleware according to the second field value corresponding to the filtering field name and the filtering field value, and load the data to be loaded into the cache.
[0018] Optionally, the first data to be loaded determination module is further configured to: encode the field value corresponding to the filter field name in the data of the dimension table middleware using a hash algorithm to obtain the first field value corresponding to the filter field name.
[0019] Optionally, the first data to be loaded determination module is further configured to: perform a modulo operation on the first field value with respect to the parallelism to obtain a remainder, and use the remainder as the second field value corresponding to the filter field name.
[0020] Optionally, the first data to be loaded determination module is further configured to: add data with the same second field value in the data of the dimension table middleware to the same dimension table sub-operator according to the second field value corresponding to the filter field name, wherein the number of dimension table sub-operators is determined by the parallelism of the dimension table operators; determine the target sub-operator from the dimension table sub-operators according to the filter field value; and determine the data in the target sub-operator as the data to be loaded that satisfies the filter strategy.
[0021] Optionally, the first data to be loaded determination module is further configured to: when the data loading strategy is a full loading strategy, take all the data in the dimension table middleware as the data to be loaded, and load the data to be loaded into the cache.
[0022] Optionally, the filtering strategy further includes a processing algorithm, and the data processing module is further configured to: process the data to be loaded in the cache through the target sub-operator according to the processing algorithm, and generate the processing result of the data processing request.
[0023] Optionally, the data processing request may be a periodic loading request sent via a timer or a preloading request during startup initialization.
[0024] According to another aspect of the present invention, an electronic device is provided.
[0025] An electronic device includes: one or more processors; and a memory for storing one or more programs, which, when executed by the one or more processors, cause the one or more processors to implement the data processing method provided in the embodiments of the present invention.
[0026] According to another aspect of the present invention, a computer-readable medium is provided.
[0027] A computer-readable medium having a computer program stored thereon, which, when executed by a processor, implements the data processing method provided in the embodiments of the present invention.
[0028] One embodiment of the above invention has the following advantages or beneficial effects. In response to a data processing request, it is determined whether a filtering strategy has been enabled. If the filtering strategy is enabled, a preset algorithm is used to determine, from the dimension table middleware, the data to be loaded that meets the filtering strategy, and that data is loaded into the cache. If the filtering strategy is not enabled, all data in the dimension table middleware is used as the data to be loaded and is loaded into the cache. The data to be loaded in the cache is then processed to generate the processing result of the data processing request. This technical solution enables data distribution when the parallelism of upstream and downstream operators is inconsistent, supports hash data distribution strategies, improves the cache hit rate and cache efficiency of the dimension table middleware, reduces memory consumption, improves the scalability and code reusability of the open-source framework, and allows for flexible secondary development.
[0029] The further effects of the aforementioned non-conventional optional implementations are explained below in conjunction with the specific embodiments.
Description of the Drawings
[0030] The accompanying drawings are provided to better understand the invention and are not intended to unduly limit the scope of the invention. In the drawings:
[0031] Figure 1 is a schematic diagram of the main steps of a data processing method according to an embodiment of the present invention;
[0032] Figure 2 is a flowchart illustrating a data processing method according to an embodiment of the present invention;
[0033] Figure 3 is a flowchart illustrating a hash algorithm according to an embodiment of the present invention;
[0034] Figure 4 is a schematic diagram of the architecture of a public interface according to an embodiment of the present invention;
[0035] Figure 5 is a schematic diagram of the main modules of a data processing apparatus according to an embodiment of the present invention;
[0036] Figure 6 is an exemplary system architecture diagram in which embodiments of the present invention can be applied;
[0037] Figure 7 is a schematic diagram of the structure of a computer system suitable for implementing terminal devices or servers of the present invention.
Detailed Implementation
[0038] The following description, in conjunction with the accompanying drawings, illustrates exemplary embodiments of the present invention, including various details to aid understanding. These details should be considered merely exemplary. Therefore, those skilled in the art will recognize that various changes and modifications can be made to the embodiments described herein without departing from the scope and spirit of the invention. Similarly, for clarity and brevity, descriptions of well-known functions and structures are omitted in the following description.
[0039] Figure 1 is a schematic diagram of the main steps of a data processing method according to an embodiment of the present invention.
[0040] As shown in Figure 1, a data processing method according to an embodiment of the present invention mainly includes the following steps S101 to S104.
[0041] Step S101: In response to the data processing request, determine whether the filtering strategy has been enabled.
[0042] Data processing requests can be periodic loading requests sent via timers or preloading requests during startup initialization.
[0043] Figure 2 is a flowchart illustrating a data processing method according to an embodiment of the present invention.
[0044] As shown in Figure 2, in one embodiment, the data processing request can be a periodic full data loading request during initialization, a periodic loading request sent via a timer, or a preloading request during startup initialization. If the data processing request is a periodic loading request sent via a timer, the data is loaded periodically via the timer; if the data processing request is a preloading request during startup initialization, the data is loaded via preloading.
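The two request types can be modeled with a small loader that accepts either a preloading call or a timer tick. This is a sketch, not the patent's implementation: the class name `DimensionCacheLoader`, the injected `now` timestamp, and the interval handling are all assumptions chosen to keep the example deterministic.

```python
from typing import Callable, Dict, Optional


class DimensionCacheLoader:
    """Loads a dimension-table cache on startup or on timer ticks."""

    def __init__(self, load_fn: Callable[[], Dict[str, object]],
                 interval_seconds: float) -> None:
        self._load_fn = load_fn
        self._interval = interval_seconds
        self._last_loaded: Optional[float] = None
        self.cache: Dict[str, object] = {}

    def preload(self, now: float) -> None:
        """Preloading request: fill the cache once during initialization."""
        self._reload(now)

    def on_timer(self, now: float) -> bool:
        """Periodic request: reload only if the interval has elapsed."""
        due = self._last_loaded is None or now - self._last_loaded >= self._interval
        if due:
            self._reload(now)
        return due

    def _reload(self, now: float) -> None:
        self.cache = dict(self._load_fn())
        self._last_loaded = now
```

Passing the current time in as an argument, rather than reading a clock inside the class, is a design choice that makes the reload behavior easy to test.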
[0045] Step S102: In response to the filtering strategy being enabled, a preset algorithm is used to determine, from the dimension table middleware, the data to be loaded that meets the filtering strategy, and the data to be loaded is loaded into the cache.
[0046] Step S103: In response to the filtering strategy not being enabled, all data in the dimension table middleware is used as the data to be loaded, and the data to be loaded is loaded into the cache.
[0047] As shown in Figure 2, in one embodiment, the data loading strategy of each dimension table middleware is obtained, and data is loaded according to it. The data loading strategy can be a filtering strategy or a full loading strategy; specifically, the filtering strategy can be a hash distribution strategy. A lookup.hash.enable parameter, which represents the data filtering strategy attribute, is added to each dimension table middleware. If lookup.hash.enable is true, the filtering strategy is enabled and the data loading strategy of that dimension table middleware is the filtering strategy (i.e., the hash distribution strategy is enabled). If lookup.hash.enable is false, the filtering strategy is not enabled and the data loading strategy of that dimension table middleware is the full loading strategy. Under the full loading strategy, all data in the dimension table middleware is used as the data to be loaded and is loaded into the cache. Under the filtering strategy, the data to be loaded is determined from the dimension table middleware according to the filtering rules of the filtering strategy and is loaded into the cache. The filtering rules can be implemented by adding a rule to Flink's internal RULE set (RULE is the general term for the rules Flink SQL uses internally to optimize SQL; custom rules can also be defined), such as StreamExecLookupHashJoinRule. The added StreamExecLookupHashJoinRule optimizes the StreamExecLookupJoin node (the node used to execute the filtering strategy on the data): it checks whether the dimension table middleware has the lookup.hash.enable property and, if the property is true, enables the hash data distribution strategy. The optimization adds a StreamExecChange operator (i.e., a dimension table operator provided by Flink SQL) between the StreamExecLookupJoin node and its input node to control the data loading and filtering strategy. Setting the filtering strategy and controlling whether each dimension table middleware enables it enriches the functionality of the dimension table. The filtering strategy can be enabled or disabled for a dimension table middleware according to specific business needs; in most business scenarios, enabling it can significantly improve performance.
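The mapping from the lookup.hash.enable parameter to a data loading strategy can be sketched as below. The table options are modeled as a plain dictionary; the function name `resolve_loading_strategy`, the returned labels, and the default of full loading when the parameter is absent are assumptions, since the text only defines the true and false cases.

```python
from typing import Mapping


def resolve_loading_strategy(table_options: Mapping[str, object]) -> str:
    """Return "filter" when lookup.hash.enable is true, otherwise "full"."""
    # Assumption: a missing parameter is treated like false (full loading).
    raw = str(table_options.get("lookup.hash.enable", "false")).strip().lower()
    return "filter" if raw == "true" else "full"
```

For example, `resolve_loading_strategy({"lookup.hash.enable": "true"})` selects the filtering (hash distribution) strategy, while an options dictionary without the parameter falls back to full loading.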
[0048] In one embodiment, the filtering strategy may include filter field names and filter field values.
[0049] Using a preset algorithm to determine the data to be loaded from the dimension table middleware that meets the filtering strategy, and loading the data to be loaded into the cache, may include: encoding the field values corresponding to the filtering field names in the data of the dimension table middleware according to the preset algorithm to obtain the first field value corresponding to the filtering field name; calculating the second field value corresponding to the filtering field name based on the first field value and the parallelism of the dimension table operator corresponding to the dimension table middleware; determining the data to be loaded from the data of the dimension table middleware that meets the filtering strategy according to the second field value corresponding to the filtering field name and the filtering field value, and loading the data to be loaded into the cache.
[0050] According to the preset algorithm, the field value corresponding to the filter field name in the data of the dimension table middleware is encoded to obtain the first field value corresponding to the filter field name. This can include: encoding the field value corresponding to the filter field name in the data of the dimension table middleware using a hash algorithm to obtain the first field value corresponding to the filter field name.
[0051] Calculating the second field value corresponding to the filter field name based on the first field value and the parallelism of the dimension table operator corresponding to the dimension table middleware can include: performing a modulo operation on the first field value with respect to the parallelism to obtain the remainder, and using the remainder as the second field value corresponding to the filter field name.
[0052] Based on the second field value corresponding to the filter field name and the filter field value, determine the data to be loaded that meets the filter strategy from the data in the dimension table middleware. This may include: adding data with the same second field value in the dimension table middleware to the same dimension table sub-operator based on the second field value corresponding to the filter field name, where the number of dimension table sub-operators is determined by the parallelism of the dimension table operators; determining the target sub-operator from the dimension table sub-operators based on the filter field value; and determining the data in the target sub-operator as the data to be loaded that meets the filter strategy.
[0053] Figure 3 is a flowchart illustrating a hash algorithm according to an embodiment of the present invention.
[0054] As shown in Figure 3, in one embodiment, the dimension table operator (i.e., task) corresponding to the dimension table middleware includes multiple dimension table sub-operators (i.e., subtasks), and the number of dimension table sub-operators is equal to the parallelism of the dimension table operator. According to a preset algorithm, the field value corresponding to the filter field name in the dimension table middleware is encoded to obtain the first field value corresponding to the filter field name; based on the first field value and the parallelism of the dimension table operator corresponding to the dimension table middleware, the second field value corresponding to the filter field name is calculated; according to the second field value corresponding to the filter field name and the filter field value, the data to be loaded that meets the filtering strategy is determined from the dimension table middleware, and the data to be loaded is loaded into the cache. Taking Figure 3 as an example, the dimension table middleware can be MySQL (a relational database management system), and the corresponding dimension table operator is dimension table operator 2 in Figure 3. Dimension table operator 2 has a parallelism of 2 and includes two dimension table sub-operators (subtask0 and subtask1). Operator 1 is the upstream operator of dimension table operator 2, and operator 3 is the downstream operator of dimension table operator 2. The filter field name is "Class". The data in operator 1 has four fields: name, gender, age, and class. For the data in operator 1, the field value corresponding to the filter field name is 1 or 2 (i.e., class 1 or class 2). The field value corresponding to the filter field name is encoded using a hash algorithm to obtain the first field value (i.e., the key value) corresponding to the filter field name. The hash value of 1 is 1, and the hash value of 2 is 2. The remainder of the first field value divided by the parallelism is calculated, and the remainder is used as the second field value corresponding to the filter field name. That is, 1 % 2 is 1 and 2 % 2 is 0, so the second field value of class 1 is 1, and the second field value of class 2 is 0. Based on the second field value corresponding to the filter field name, data with the same second field value in the dimension table middleware is added to the same dimension table sub-operator. Specifically, data with a second field value of 1 is added to subtask1, and data with a second field value of 0 is added to subtask0. The filter field value in the filtering strategy can be 0 or 1. When the filter field value is 0, subtask0 is used as the target sub-operator, the data in subtask0 is taken as the data to be loaded that meets the filtering strategy, and this data is loaded into the cache. When the filter field value is 1, subtask1 is used as the target sub-operator, the data in subtask1 is taken as the data to be loaded that meets the filtering strategy, and this data is loaded into the cache.
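The Figure 3 arithmetic can be replayed in code to show why each lookup lands in the local cache. This is a sketch: the row contents below are made-up placeholders, and only the field names, the parallelism of 2, and the identity hash on keys 1 and 2 come from the example.

```python
PARALLELISM = 2  # parallelism of dimension table operator 2 in the example


def subtask_index(class_id: int) -> int:
    """Second field value; the example uses hash(1) = 1 and hash(2) = 2."""
    return class_id % PARALLELISM


# Placeholder dimension rows; each subtask caches only its own share.
dimension_rows = [
    {"class": 1, "info": "row for class 1"},
    {"class": 2, "info": "row for class 2"},
]
caches = {
    index: [row for row in dimension_rows if subtask_index(row["class"]) == index]
    for index in range(PARALLELISM)
}

# Placeholder upstream records from operator 1 (name, gender, age, class).
upstream_records = [
    {"name": "a", "gender": "F", "age": 20, "class": 1},
    {"name": "b", "gender": "M", "age": 21, "class": 2},
]


def is_cache_hit(record: dict) -> bool:
    """Route the record by the same rule and look in that subtask's cache."""
    local_cache = caches[subtask_index(record["class"])]
    return any(row["class"] == record["class"] for row in local_cache)
```

Because upstream records and dimension rows are assigned by the same rule, each subtask holds half of the dimension rows in this toy data set, yet every record still finds its match locally.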
[0055] Step S104: Process the data to be loaded in the cache and generate the processing result of the data processing request.
[0056] According to one embodiment of the present invention, the filtering strategy may further include a processing algorithm, and processing the data to be loaded in the cache to generate a processing result of the data processing request may include: processing the data to be loaded in the cache through a target sub-operator according to the processing algorithm to generate a processing result of the data processing request.
[0057] In one embodiment, the processing algorithm may include retrieving the key data corresponding to the filter field value, and processing the data to be loaded in the cache using the key data to generate the processing result of the data processing request. Specifically, taking Figure 3 as an example, when the filter field value is 1, the key data corresponding to filter field value 1 is "55, No. 1". When the filter field value is 0, the key data corresponding to filter field value 0 is "54, No. 1". In either case, the target sub-operator corresponding to the filter field value concatenates the data to be loaded in the cache with the key data to obtain the processing result of the data processing request. The processing result of the data processing request is the data in operator 3.
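One plausible reading of the concatenation step is a lookup-join style merge, sketched below. The function names, the dictionary-shaped records, and the choice to skip records without matching key data are all assumptions; the patent does not specify the record layout.

```python
from typing import Dict, Iterable, List

Record = Dict[str, object]


def concatenate(record: Record, key_data: Record) -> Record:
    """Merge a record with the key data found for it."""
    merged = dict(record)
    merged.update(key_data)
    return merged


def process_with_cache(records: Iterable[Record],
                       cache: Dict[object, Record],
                       key_field: str) -> List[Record]:
    """Concatenate each record with the cached key data for its key."""
    results: List[Record] = []
    for record in records:
        key_data = cache.get(record[key_field])
        # Assumption: records without matching key data are skipped.
        if key_data is not None:
            results.append(concatenate(record, key_data))
    return results
```

In this model the returned list plays the role of the data handed to the downstream operator.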
[0058] Figure 4 is a schematic diagram of the architecture of a public interface according to an embodiment of the present invention.
[0059] As shown in Figure 4, in one embodiment, the common functionalities of multiple dimension table middleware (such as MySQL, HBase, Redis, etc.) associated with the Flink framework are abstracted into a common interface (i.e., BaseLookupTableFunction). Different dimension table middleware only need to use the eval method of the common interface (a method used to associate data in the current data stream with data in the dimension table middleware) to process data, which achieves unified cache management. Each dimension table middleware can implement common methods as needed, such as the reloadAllData method (a method used to load data from the dimension table into the cache) to load all data. The periodic loading function, however, is handled by the common interface, so each dimension table middleware does not need to reimplement it. Likewise, caching is implemented through the common interface, eliminating the need for each dimension table middleware to implement cache support individually. When implementing dimension table functionality, each middleware implements this interface and overrides the relevant methods, which further reduces code duplication and enhances code readability and scalability.
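The division of labor described above can be illustrated with an abstract base class. This is a Python analog, not Flink's API: `BaseLookupTable`, `refresh`, and `DictBackedLookupTable` are hypothetical names, while `eval` and `reload_all_data` mirror the eval and reloadAllData methods named in the text.

```python
from abc import ABC, abstractmethod
from typing import Dict, Optional


class BaseLookupTable(ABC):
    """Shared cache handling; subclasses only supply the data source."""

    def __init__(self) -> None:
        self._cache: Dict[str, str] = {}

    @abstractmethod
    def reload_all_data(self) -> Dict[str, str]:
        """Middleware-specific: read every row from the dimension table."""

    def refresh(self) -> None:
        """Common code path that a periodic timer or preloading would call."""
        self._cache = dict(self.reload_all_data())

    def eval(self, key: str) -> Optional[str]:
        """Associate a stream key with cached dimension data."""
        return self._cache.get(key)


class DictBackedLookupTable(BaseLookupTable):
    """Stand-in for one middleware; here the 'database' is a dict."""

    def __init__(self, rows: Dict[str, str]) -> None:
        super().__init__()
        self._rows = rows

    def reload_all_data(self) -> Dict[str, str]:
        return self._rows
```

A second middleware would add another small subclass overriding `reload_all_data`, while cache management and lookup stay in the base class.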
[0060] Through its preloading function, this invention initializes the cached data, which reduces the access pressure on the dimension table middleware; combined with the hash data distribution strategy, performance improves further. Preloading is implemented similarly to full loading, but the two differ in how they handle cache misses. Full loading is executed periodically: upon receiving a data access request, it checks the cache, and if the data is not in the cache, it does not read the data from the dimension table middleware. Preloading initializes the cache once during task initialization: upon receiving a data access request, if the data is not in the cache, it reads the data from the dimension table middleware and caches it. Because the cache is populated by a single read at initialization, preloading avoids accessing the dimension table middleware and filling the cache on each request, thereby reducing the access pressure on the external middleware.
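The difference in cache-miss behavior between the two modes can be sketched as follows. The class names and the dictionary standing in for the dimension table middleware are assumptions made for illustration.

```python
from typing import Dict, Optional


class FullLoadCache:
    """Periodic full load: a miss is NOT forwarded to the middleware."""

    def __init__(self, source: Dict[str, int]) -> None:
        self._source = source
        self._cache: Dict[str, int] = {}

    def reload(self) -> None:
        """Called periodically to replace the whole cache."""
        self._cache = dict(self._source)

    def lookup(self, key: str) -> Optional[int]:
        return self._cache.get(key)


class PreloadCache:
    """Preload once at startup: a miss falls through to the middleware."""

    def __init__(self, source: Dict[str, int]) -> None:
        self._source = source
        self._cache: Dict[str, int] = dict(source)  # one-time initialization
        self.source_reads = 0

    def lookup(self, key: str) -> Optional[int]:
        if key in self._cache:
            return self._cache[key]
        self.source_reads += 1
        value = self._source.get(key)
        if value is not None:
            self._cache[key] = value  # cache it for later requests
        return value
```

In this sketch a row added to the source after startup is visible to `PreloadCache` on first access, while `FullLoadCache` only sees it after its next periodic reload.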
[0061] Figure 5 is a schematic diagram of the main modules of a data processing apparatus according to an embodiment of the present invention.
[0062] As shown in Figure 5, a data processing device 500 according to an embodiment of the present invention mainly includes: a filtering strategy judgment module 501, a first data to be loaded determination module 502, a second data to be loaded determination module 503, and a data processing module 504.
[0063] The filtering strategy judgment module 501 is used to determine whether a filtering strategy has been enabled in response to a data processing request.
[0064] The first data to be loaded determination module 502 is used to determine the data to be loaded that meets the filtering strategy from the dimension table middleware in response to the filtering strategy being enabled, and load the data to be loaded into the cache.
[0065] The second data to be loaded determination module 503 is used to determine all data in the dimension table middleware as data to be loaded in response to the fact that the filtering strategy is not enabled, and load the data to be loaded into the cache.
[0066] The data processing module 504 is used to process the data to be loaded in the cache and generate the processing result of the data processing request.
[0067] In one embodiment, the filtering strategy may include a filtering field name and a filtering field value. The first data to be loaded determination module 502 is specifically used to: encode the field value corresponding to the filtering field name in the data of the dimension table middleware according to a preset algorithm to obtain the first field value corresponding to the filtering field name; calculate the second field value corresponding to the filtering field name based on the first field value and the parallelism of the dimension table operator corresponding to the dimension table middleware; determine the data to be loaded that meets the filtering strategy from the data of the dimension table middleware according to the second field value corresponding to the filtering field name and the filtering field value, and load the data to be loaded into the cache.
[0068] In one embodiment, the first data to be loaded determination module 502 is specifically used to: encode the field value corresponding to the filter field name in the data of the dimension table middleware using a hash algorithm to obtain the first field value corresponding to the filter field name.
[0069] In one embodiment, the first data to be loaded determination module 502 is specifically used to: perform a modulo operation on the first field value with respect to the parallelism to obtain the remainder, and use the remainder as the second field value corresponding to the filter field name.
[0070] In one embodiment, the first data to be loaded determination module 502 is specifically used to: add data with the same second field value in the dimension table middleware to the same dimension table sub-operator according to the second field value corresponding to the filter field name, wherein the number of dimension table sub-operators is determined by the parallelism of the dimension table operators; determine the target sub-operator from the dimension table sub-operators according to the filter field value; and determine the data in the target sub-operator as the data to be loaded that satisfies the filter strategy.
[0071] In one embodiment, the first data to be loaded determination module 502 can also be used to: when the data loading strategy is a full loading strategy, take all the data in the dimension table middleware as the data to be loaded, and load the data to be loaded into the cache.
[0072] In one embodiment, the filtering strategy may further include a processing algorithm, and the data processing module 504 is specifically used to: process the data to be loaded in the cache through the target sub-operator according to the processing algorithm, and generate the processing result of the data processing request.
[0073] In one embodiment, the data processing request can be a periodic loading request sent via a timer or a preloading request during startup initialization.
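The two request sources named in this embodiment can be sketched with a standard-library timer. This is a minimal sketch, not the implementation of the invention: the class name, the `load_fn` callback, and the reload interval are all hypothetical.

```python
import threading


class CacheReloader:
    """Preload a cache at startup, then refresh it on a timer."""

    def __init__(self, load_fn, interval_seconds):
        self._load_fn = load_fn          # returns a fresh dict of cached rows
        self._interval = interval_seconds
        self._lock = threading.Lock()
        self._timer = None
        self._stopped = False
        self.cache = {}

    def reload(self):
        """Load fresh data and swap it in as the current cache."""
        fresh = self._load_fn()
        with self._lock:
            self.cache = fresh

    def start(self):
        """Preloading request at startup, followed by periodic requests."""
        self._stopped = False
        self.reload()
        self._schedule()

    def _schedule(self):
        if self._stopped:
            return
        self._timer = threading.Timer(self._interval, self._tick)
        self._timer.daemon = True        # do not keep the process alive
        self._timer.start()

    def _tick(self):
        self.reload()
        self._schedule()

    def stop(self):
        """Cancel any pending periodic reload."""
        self._stopped = True
        if self._timer is not None:
            self._timer.cancel()
```

Calling `start()` performs the preload synchronously, so the cache is populated before the first lookup; later refreshes replace the cache as a whole rather than updating it row by row.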
[0074] Furthermore, the specific implementation details of the data processing device in the embodiments of the present invention have been described in detail in the above data processing method, so the details will not be repeated here.
[0075] Figure 6 shows an exemplary system architecture 600 to which the data processing method or data processing apparatus of the present invention can be applied.
[0076] As shown in Figure 6, system architecture 600 may include terminal devices 601, 602, and 603, a network 604, and a server 605. Network 604 serves as the medium for providing communication links between terminal devices 601, 602, and 603 and server 605. Network 604 may include various connection types, such as wired communication links, wireless communication links, or fiber optic cables.
[0077] Users can use terminal devices 601, 602, and 603 to interact with server 605 via network 604 to receive or send messages, etc. Various communication client applications can be installed on terminal devices 601, 602, and 603, such as shopping applications, web browser applications, search applications, instant messaging tools, email clients, social media platform software, etc. (for example only).
[0078] Terminal devices 601, 602, and 603 can be various electronic devices with displays and web browsing capabilities, including but not limited to smartphones, tablets, laptops, and desktop computers.
[0079] Server 605 can be a server providing various services, such as a backend management server supporting shopping websites browsed by users using terminal devices 601, 602, and 603 (for example only). The backend management server can respond to a received data processing request by determining whether a filtering strategy is enabled. If the filtering strategy is enabled, it uses a preset algorithm to determine, from the dimension table middleware, the data to be loaded that meets the filtering strategy, and loads the data to be loaded into the cache. If the filtering strategy is not enabled, it uses all data in the dimension table middleware as the data to be loaded and loads it into the cache. It then processes the cached data to be loaded, generates the processing result of the data processing request, and feeds back the processing result to the terminal device (for example only).
[0080] It should be noted that the data processing method provided in the embodiments of the present invention is generally executed by server 605, and correspondingly, the data processing device is generally located in server 605.
[0081] It should be understood that the numbers of terminal devices, networks, and servers shown in Figure 6 are merely illustrative. Depending on implementation needs, any number of terminal devices, networks, and servers can be included.
[0082] Referring now to Figure 7, a schematic structural diagram is shown of a computer system 700 suitable for implementing a terminal device or server of the present invention. The terminal device or server shown in Figure 7 is merely an example and should not impose any limitation on the functionality and scope of use of the embodiments of the present invention.
[0083] As shown in Figure 7, the computer system 700 includes a central processing unit (CPU) 701, which can perform various appropriate actions and processes based on programs stored in read-only memory (ROM) 702 or programs loaded from storage section 708 into random access memory (RAM) 703. The RAM 703 also stores various programs and data required for the operation of the system 700. The CPU 701, ROM 702, and RAM 703 are interconnected via a bus 704. An input/output (I/O) interface 705 is also connected to the bus 704.
[0084] The following components are connected to the I / O interface 705: an input section 706 including a keyboard, mouse, etc.; an output section 707 including a cathode ray tube (CRT), liquid crystal display (LCD), etc., and speakers, etc.; a storage section 708 including a hard disk, etc.; and a communication section 709 including a network interface card such as a LAN card, modem, etc. The communication section 709 performs communication processing via a network such as the Internet. A drive 710 is also connected to the I / O interface 705 as needed. A removable medium 711, such as a disk, optical disk, magneto-optical disk, semiconductor memory, etc., is installed on the drive 710 as needed so that computer programs read from it can be installed into the storage section 708 as needed.
[0085] In particular, according to the embodiments disclosed in this invention, the processes described above with reference to the flowcharts can be implemented as computer software programs. For example, embodiments disclosed in this invention include a computer program product comprising a computer program carried on a computer-readable medium, the computer program containing program code for performing the methods shown in the flowcharts. In such embodiments, the computer program can be downloaded and installed from a network via communication section 709, and / or installed from removable medium 711. When the computer program is executed by central processing unit (CPU) 701, it performs the functions defined above in the system of this invention.
[0086] It should be noted that the computer-readable medium shown in this invention can be a computer-readable signal medium or a computer-readable storage medium, or any combination thereof. A computer-readable storage medium can be, for example but not limited to, an electrical, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination thereof. More specific examples of a computer-readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer disk, a hard disk, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), optical fiber, portable compact disk read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination thereof. In this invention, a computer-readable storage medium can be any tangible medium containing or storing a program that can be used by or in conjunction with an instruction execution system, apparatus, or device. In this invention, a computer-readable signal medium can include a data signal propagated in baseband or as part of a carrier wave, carrying computer-readable program code. Such propagated data signals can take various forms, including but not limited to electromagnetic signals, optical signals, or any suitable combination thereof. A computer-readable signal medium can also be any computer-readable medium other than a computer-readable storage medium that can send, propagate, or transmit a program for use by or in connection with an instruction execution system, apparatus, or device. The program code contained on the computer-readable medium can be transmitted using any suitable medium, including but not limited to: wireless, wire, optical fiber, RF, etc., or any suitable combination thereof.
[0087] The flowcharts and block diagrams in the accompanying drawings illustrate the architecture, functionality, and operation of possible implementations of systems, methods, and computer program products according to various embodiments of the present invention. In this regard, each block in a flowchart or block diagram may represent a module, segment, or portion of code containing one or more executable instructions for implementing a specified logical function. It should also be noted that in some alternative implementations, the functions indicated in the blocks may occur in a different order than those indicated in the drawings. For example, two consecutively indicated blocks may actually be executed substantially in parallel, and they may sometimes be executed in reverse order, depending on the functions involved. It should also be noted that each block in a block diagram or flowchart, and combinations of blocks in a block diagram or flowchart, may be implemented using a dedicated hardware-based system that performs the specified function or operation, or using a combination of dedicated hardware and computer instructions.
[0088] The modules described in the embodiments of the present invention can be implemented in software or hardware. The described modules can also be located in a processor; for example, a processor can be described as including a data loading strategy acquisition module, a data to be loaded determination module, and a data processing module. The names of these modules do not necessarily limit the module itself; for example, a filtering strategy determination module can also be described as "used to determine whether a filtering strategy has been enabled in response to a data processing request."
[0089] In another aspect, the present invention also provides a computer-readable medium, which may be included in the device described in the above embodiments, or may exist independently without being assembled into the device. The computer-readable medium carries one or more programs which, when executed by the device, cause the device to perform the following: in response to a data processing request, determining whether a filtering strategy has been enabled; in response to the filtering strategy being enabled, using a preset algorithm to determine, from the dimension table middleware, data to be loaded that satisfies the filtering strategy, and loading the data to be loaded into a cache; in response to the filtering strategy not being enabled, using all data in the dimension table middleware as data to be loaded, and loading the data to be loaded into the cache; and processing the data to be loaded in the cache to generate a processing result for the data processing request.
[0090] According to the technical solution of this embodiment of the invention, in response to a data processing request, it is determined whether a filtering strategy has been enabled; if the filtering strategy has been enabled, a preset algorithm is used to determine the data to be loaded from the dimension table middleware that meets the filtering strategy, and the data to be loaded is loaded into the cache; if the filtering strategy has not been enabled, all data in the dimension table middleware is used as the data to be loaded, and the data to be loaded is loaded into the cache; the data to be loaded in the cache is processed to generate the processing result of the data processing request. This method enables data distribution even when the parallelism of upstream and downstream operators is inconsistent, supports hash data distribution strategies, improves the cache hit rate and cache efficiency of the dimension table middleware, reduces memory consumption, improves the scalability and code reusability of the open-source framework, and allows for flexible secondary development.
[0091] The specific embodiments described above do not constitute a limitation on the scope of protection of this invention. Those skilled in the art should understand that various modifications, combinations, sub-combinations, and substitutions can occur depending on design requirements and other factors. Any modifications, equivalent substitutions, and improvements made within the spirit and principles of this invention should be included within the scope of protection of this invention.
Claims
1. A data processing method, characterized in that it comprises: in response to a data processing request, determining whether a filtering strategy has been enabled based on the parameters of the data filtering strategy attribute in the dimension table middleware; in response to the filtering strategy being enabled, using a preset algorithm to determine, from the dimension table middleware, the data to be loaded that satisfies the filtering strategy, and loading the data to be loaded into the cache; in response to the filtering strategy not being enabled, using all data in the dimension table middleware as the data to be loaded, and loading the data to be loaded into the cache; and processing the data to be loaded in the cache to generate the processing result of the data processing request; wherein the filtering strategy includes a filtering field name and a filtering field value, and the step of using a preset algorithm to determine, from the dimension table middleware, the data to be loaded that satisfies the filtering strategy and loading the data to be loaded into the cache includes: encoding the field value corresponding to the filtering field name in the data of the dimension table middleware according to the preset algorithm to obtain a first field value corresponding to the filtering field name; calculating a second field value corresponding to the filtering field name based on the first field value and the parallelism of the dimension table operator corresponding to the dimension table middleware; adding data with the same second field value in the dimension table middleware to the same dimension table sub-operator according to the second field value corresponding to the filtering field name, the number of dimension table sub-operators being determined by the parallelism of the dimension table operator; determining a target sub-operator from the dimension table sub-operators according to the filtering field value; and determining the data in the target sub-operator as the data to be loaded that satisfies the filtering strategy, and loading the data to be loaded into the cache.
2. The method according to claim 1, characterized in that the step of encoding the field value corresponding to the filtering field name in the data of the dimension table middleware according to the preset algorithm to obtain the first field value corresponding to the filtering field name includes: encoding the field value corresponding to the filtering field name in the data of the dimension table middleware using a hash algorithm to obtain the first field value corresponding to the filtering field name.
3. The method according to claim 1, characterized in that the step of calculating the second field value corresponding to the filtering field name based on the first field value and the parallelism of the dimension table operator corresponding to the dimension table middleware includes: performing a modulo operation on the first field value with respect to the parallelism to obtain a remainder, and using the remainder as the second field value corresponding to the filtering field name.
4. The method according to claim 1, characterized in that the filtering strategy further includes a processing algorithm, and the step of processing the data to be loaded in the cache to generate the processing result of the data processing request includes: processing, by the target sub-operator and according to the processing algorithm, the data to be loaded in the cache to generate the processing result of the data processing request.
5. The method according to claim 1, characterized in that the data processing request is either a periodic loading request sent via a timer or a preloading request during startup initialization.
6. A data processing apparatus, characterized in that it comprises: a filtering strategy determination module, used to determine, in response to a data processing request, whether a filtering strategy has been enabled based on the parameters of the data filtering strategy attribute in the dimension table middleware; a first data to be loaded determination module, used to determine, in response to the filtering strategy being enabled, the data to be loaded that satisfies the filtering strategy from the dimension table middleware, and to load the data to be loaded into the cache; a second data to be loaded determination module, used to take, in response to the filtering strategy not being enabled, all the data in the dimension table middleware as the data to be loaded, and to load the data to be loaded into the cache; and a data processing module, used to process the data to be loaded in the cache and generate the processing result of the data processing request; wherein the filtering strategy includes a filtering field name and a filtering field value, and the first data to be loaded determination module is further configured to: encode the field value corresponding to the filtering field name in the data of the dimension table middleware according to a preset algorithm to obtain a first field value corresponding to the filtering field name; calculate a second field value corresponding to the filtering field name based on the first field value and the parallelism of the dimension table operator corresponding to the dimension table middleware; add data with the same second field value in the dimension table middleware to the same dimension table sub-operator according to the second field value corresponding to the filtering field name, the number of dimension table sub-operators being determined by the parallelism of the dimension table operator; determine a target sub-operator from the dimension table sub-operators according to the filtering field value; and determine the data in the target sub-operator as the data to be loaded that satisfies the filtering strategy, and load the data to be loaded into the cache.
7. An electronic device, characterized in that it comprises: one or more processors; and a storage device for storing one or more programs, wherein, when the one or more programs are executed by the one or more processors, the one or more processors implement the method as described in any one of claims 1-5.
8. A computer-readable medium having a computer program stored thereon, characterized in that, when the computer program is executed by a processor, it implements the method as described in any one of claims 1-5.