Calculation method and apparatus

A calculation method and apparatus for configuration files, applied in the field of big data. It solves problems such as code that cannot meet changing requirements, business content that is not fixed, and high performance requirements, and achieves the effects of a flexible and convenient configuration process, improved development efficiency, and reduced workload.

Publication Date: 2018-12-28 (Inactive)
Owner: 广东惠禾科技发展有限公司
Cites: 3 | Cited by: 6

AI-Extracted Technical Summary

Problems solved by technology

This is a task whose business structure is fixed but whose specific content is not. Writing a separate program for each such requirement is too cumbersome and cannot satisfy the principle of writing code once and using it multiple times.

Abstract

The invention relates to the technical field of big data and provides a calculation method and apparatus. The method comprises the following steps: reading and parsing a configuration file, the content of which comprises a plurality of nodes and at least one edge connecting the plurality of nodes, wherein each node represents a data processing unit in a business process and each edge represents a data flow direction between two nodes; creating the plurality of nodes and constructing, based on the plurality of nodes and the at least one edge, a directed acyclic graph representing the business process, wherein each node and the data processing operation corresponding to each node are defined in a pre-generated program package; and executing the data processing operation corresponding to each node according to the data flow direction in the directed acyclic graph, until the data processing operations corresponding to all nodes are completed. When computing programs are developed for different business requirements, only the configuration file needs to be modified and no changes need to be made to the code in the package, which significantly improves development efficiency.

Application Domain

Special data processing applications

Technology Topic

Data processing · Directed acyclic graph (+6)


Examples

  • Experimental program (4)

Example

[0068] First embodiment
[0069] Figure 2 shows a flowchart of the calculation method provided by the first embodiment of the present invention. The calculation method can be applied to, but is not limited to, Spark programs. In the following description, applying the method in a Spark program is taken as the example, but this does not limit the protection scope of the present invention. Referring to Figure 2, the calculation method includes:
[0070] Step S10: The processor of the electronic device reads and parses the configuration file.
[0071] The configuration file is configured for a business requirement; the business requirement referred to here is usually a data processing task.
[0072] In a Spark program, after the SparkContext is initialized, the storage location of the configuration file on HDFS is passed in through a shell command; the configuration file is then read with IO and its content parsed according to the file's format. The configuration file can be in, but is not limited to, formats such as JSON.
[0073] The content of the configuration file includes multiple nodes and at least one edge connecting the multiple nodes, where each node represents a data processing unit in the business process and each edge represents the data flow between two nodes. For each node, the configuration file also defines the parameters the node uses to complete its corresponding data processing operation.
[0074] In actual implementation, the configuration file can be written manually or generated automatically from a visual editor. For example, in a visual editing interface, the user only needs to draw the nodes and connect them, and the configuration file can be generated automatically from the drawing; an illustrative sketch of such a file is given below.
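Purely as an illustration (the patent fixes no schema: every key name below is hypothetical, and the table and fields are borrowed from the student example of the second embodiment), a JSON configuration of this shape might read:

```json
{
  "nodes": [
    {
      "id": "DataNode1",
      "type": "dataSource",
      "params": {
        "sourceType": "hive",
        "table": "student",
        "fields": ["sno", "sname", "sage", "sex"]
      }
    },
    {
      "id": "ActionNode1",
      "type": "conditionFilter",
      "params": { "field": "sex", "condition": "equal", "value": "female" }
    },
    {
      "id": "ActionNode2",
      "type": "save",
      "params": { "storeType": "hdfs", "path": "/output/female_students" }
    }
  ],
  "edges": [
    { "from": "DataNode1", "to": "ActionNode1" },
    { "from": "ActionNode1", "to": "ActionNode2" }
  ]
}
```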
[0075] Step S11: The processor of the electronic device creates multiple nodes, and constructs a directed acyclic graph for representing the business process based on the multiple nodes and at least one edge.
[0076] Before step S10 is executed, the nodes and the data processing operations corresponding to the nodes are first defined in the program source files, and the source files are then packaged into a program package for use by the method provided in this embodiment of the present invention. In a Spark program, the source files are developed in Java, so the program package is a JAR package.
[0077] In step S11, the multiple nodes can be created according to the definitions in the program package. It should be pointed out that creating a node in step S11 means creating the object corresponding to the node. In one implementation of the first embodiment, the nodes are of at least two types: data source nodes and action nodes.
[0078] The data source node is used to read data from a data source and output it, based on the data source parameters specified in the configuration file. Data source parameters may include, but are not limited to, the data source type, the data path or table name, the fields, and the field types. For a Spark program, the data source type generally covers two cases, Hive tables and HDFS files, which correspond to different data sources.
[0079] The action node is used to perform arithmetic processing on the data, based on the action parameters specified in the configuration file. Action parameters may include, but are not limited to, the action type, the fields involved in the operation, and the constraints the fields must satisfy. According to the action type, the action nodes include at least conditional filter nodes, spatiotemporal filter nodes, frequency statistics nodes, field filter nodes, field splicing nodes, intersection nodes, union nodes, subtraction nodes, and save nodes. The data processing operations corresponding to the different types of action nodes are described in detail later.
[0080] According to the edges in the configuration file, the in-degree and out-degree of each node can be determined: a node with in-degree 0 is a data source node, and a node with in-degree greater than 0 is an action node. There may be one or more data source nodes, and one or more action nodes. At the same time, based on the node and edge information, a directed acyclic graph containing the nodes and edges can be constructed; this directed acyclic graph represents the entire business process (a sketch of this construction follows). Figure 3 shows the directed acyclic graph provided by the first embodiment of the present invention. Referring to Figure 3, it contains a total of 3 data source nodes: the data source type of DataNode1 and DataNode3 is Hive table, and the data source type of DataNode2 is HDFS file. Figure 3 also includes 8 action nodes, covering every action node type other than the spatiotemporal filter node. The connecting lines with arrows between the nodes indicate the flow of data. It should be understood that Figure 3 is just an example; the structure of the directed acyclic graph shown there is for one specific business only and does not limit the protection scope of the present invention.
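A minimal Scala sketch of this classification and construction, assuming node ids and edges have already been parsed from the configuration file (the Edge shape and the function name are hypothetical):

```scala
import scala.collection.mutable

// Hypothetical edge shape, parsed from the configuration file.
case class Edge(from: String, to: String)

// Classify nodes by in-degree and build adjacency lists for the DAG:
// in-degree 0 => data source node, in-degree > 0 => action node.
def buildGraph(nodeIds: Seq[String], edges: Seq[Edge])
    : (Seq[String], Seq[String], mutable.HashMap[String, mutable.ListBuffer[String]]) = {
  val inDegree  = mutable.HashMap(nodeIds.map(_ -> 0): _*)
  val adjacency = mutable.HashMap(nodeIds.map(id => id -> mutable.ListBuffer[String]()): _*)
  for (e <- edges) {
    inDegree(e.to) += 1        // one more incoming edge for the target
    adjacency(e.from) += e.to  // record the data flow direction
  }
  val (dataSources, actionNodes) = nodeIds.partition(inDegree(_) == 0)
  (dataSources, actionNodes, adjacency)
}
```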
[0081] Step S12: The processor of the electronic device executes the data processing operation corresponding to each node according to the data flow direction in the directed acyclic graph, until the data processing operation corresponding to each node is executed.
[0082] First, we introduce three global hash tables that may be used in step S12, together with the per-rdd metadata.
[0083] The first hash table, preMap, is of type HashMap[String, ListBuffer[String]]; its key is the identification (id) of each action node, and its value is a ListBuffer formed from that action node's predecessor nodes. Through preMap, the predecessor nodes of each action node can be accessed quickly. The preMap can be constructed before the data processing operations of the action nodes begin.
[0084] In Spark, data is encapsulated as RDDs, and the rdd output by each node (lowercase rdd denotes a specific RDD object) is of type RDD[String]. Therefore, to operate on the rdd at field granularity, one must know how to split each of its rows again; metadata serves this purpose. For each rdd, create metadata represented by a variable metaMap of type HashMap[String, Int], where the key is a field name and the value is the position (index) of that field in each row of the rdd. Parse the fields of the rdd in order, number them starting from 0, and save the field names and numbers into metaMap.
[0085] The second hash table, resutRddMap, is of type HashMap[String, RDD[String]]; its key is the identification of each node, and its value is the rdd output by that node.
[0086] The third hash table, resutRddMetaMap, is of type HashMap[String, HashMap[String, Int]]; its key is the identification of each node, and its value is the metadata of the rdd output by that node.
[0087] Through resutRddMap and resutRddMetaMap, the rdd output by a node can be accessed and operated on. Both tables are updated after the data processing operation of each node completes. Declarations matching these types are sketched below.
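A minimal sketch consistent with the types stated above (the identifiers resutRddMap and resutRddMetaMap are kept exactly as spelled in this document; buildMeta is a hypothetical helper):

```scala
import scala.collection.mutable.{HashMap, ListBuffer}
import org.apache.spark.rdd.RDD

// id of each action node -> ListBuffer of its predecessor node ids
val preMap = new HashMap[String, ListBuffer[String]]()

// node id -> the RDD[String] output by that node
val resutRddMap = new HashMap[String, RDD[String]]()

// node id -> metadata of that output (field name -> index in each row)
val resutRddMetaMap = new HashMap[String, HashMap[String, Int]]()

// Build the metadata for one rdd: parse fields in order, numbering from 0.
def buildMeta(fields: Seq[String]): HashMap[String, Int] = {
  val metaMap = new HashMap[String, Int]()
  fields.zipWithIndex.foreach { case (name, idx) => metaMap(name) = idx }
  metaMap
}
```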
[0088] In an implementation of the first embodiment, step S12 may specifically include the following steps:
[0089] A. Execute the data processing operation corresponding to each data source node, and mark the execution status of each data source node as execution completed after execution.
[0090] The data processing operation to be performed by the data source node is mainly the operation of reading data from the data source. In the Spark program, use sparkSQL to read data.
[0091] Specifically, if the data source type is a Hive table, the HiveContext.sql function of Spark's Hive context variable can be used directly to execute the SQL statement for reading. The DataFrame that is read is converted into an rdd of type RDD[Row] (lowercase rdd denotes a specific RDD object), where each Row represents one row of the Hive table and the value of each field can be obtained through get(index). A map traversal over the rdd then replaces each row with a string formed from all its fields, spliced with the escape character "\001", so that the Hive table is converted into an rdd of type RDD[String].
[0092] If the data source type is an HDFS file, the sparkContext.textFile(URL) function can be used to read the HDFS file. If the fields of each line in the HDFS file are not separated by "\001", the original separator is replaced with "\001"; either way, the result of reading the HDFS file is an rdd of type RDD[String]. A sketch of both read paths follows.
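A sketch of the two read paths, assuming the Spark 1.x-style HiveContext the first embodiment describes, with "\u0001" standing in for the "\001" separator:

```scala
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.hive.HiveContext

val SEP = "\u0001"  // the "\001" splice character used throughout

// Hive table: read via SQL, then flatten each Row into one separated string.
def readHive(hiveContext: HiveContext, sql: String): RDD[String] =
  hiveContext.sql(sql).rdd.map(row => row.toSeq.mkString(SEP))

// HDFS file: read as text and normalize the original separator to "\001".
// Note: String.split treats origSep as a regular expression.
def readHdfs(sc: SparkContext, url: String, origSep: String): RDD[String] =
  sc.textFile(url).map(line => line.split(origSep, -1).mkString(SEP))
```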
[0093] After a data source node finishes reading its data, save the output rdd to resutRddMap, save the metadata corresponding to the rdd to resutRddMetaMap, and mark the node's execution status as executed.
[0094] The processing method is the same for every data source node. After the data source nodes are processed, processing of the action nodes starts, comprising steps B to D.
[0095] B. Select an executable action node.
[0096] An executable action node is an action node that has not yet executed its corresponding data processing operation and all of whose predecessor nodes are marked as executed. Since the execution status of every data source node has already been marked as executed, an action node connected only to data source nodes must be an executable action node.
[0097] Sometimes there are multiple executable action nodes among the action nodes that have not yet performed their corresponding data processing operations; one can be selected randomly, or according to a predetermined rule (for example, the one with the smallest number). The choice of executable action node does not affect the final execution result of the directed acyclic graph. Alternatively, in some implementations, the data processing operations corresponding to multiple executable action nodes may be executed concurrently.
[0098] Among them, the predecessor nodes of an action node can be accessed through preMap, as in the sketch below.
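A sketch of the step-B selection, assuming the preMap above and an executed-status predicate over node ids (findExecutable is a hypothetical name; taking the first match in a sorted sequence corresponds to a predetermined rule such as smallest number):

```scala
// An action node is executable when it has not run yet and every
// predecessor recorded for it in preMap is already marked executed.
def findExecutable(actionNodes: Seq[String],
                   preMap: collection.Map[String, Seq[String]],
                   executed: String => Boolean): Option[String] =
  actionNodes.find(id => !executed(id) && preMap(id).forall(executed))
```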
[0099] C. Execute the data processing operation corresponding to the executable action node, and mark the execution status of the executable action node as execution completed after the execution is completed.
[0100] The data processing operations of the action nodes are defined in the program package, and the data processing operation of an executable action node is realized by combining it with the action parameters configured in the configuration file. The input data of each action node is the data output by its predecessor nodes after execution, called the input source. By accessing resutRddMap and resutRddMetaMap, the rdd serving as the input source and its metadata can be obtained and then processed by the action node. After execution, save the output rdd to resutRddMap, save the corresponding metadata to resutRddMetaMap, and mark the execution status as completed.
[0101] In the specific implementation, you can create an ActionRunner object and pass in the action node as its construction parameter, and execute the data processing operation corresponding to the action node by running the ActionRunner.
[0102] D. Repeat steps B and C until the data processing operations corresponding to all action nodes have been executed.
[0103] Note that because the execution status of an action node is marked as completed in step C, when step B is executed again, the marked action nodes are excluded from the action nodes that have not yet performed their corresponding data processing operations; the execution status is also used when judging whether the data processing operations of an action node's predecessors are complete.
[0104] Below, with reference to Figure 3, each action node is introduced in detail:
[0105] After the three data source nodes finish executing, the only action nodes all of whose predecessor nodes have completed are ActionNode1 and ActionNode3. One of the two is selected at random to run first; assume ActionNode1 runs first. ActionNode1 is a conditional filter node. The conditional filter node filters the specified fields in the input source based on the specified conditions and outputs the processed data; its function is similar to where-condition filtering in SQL.
[0106] Specifically, the data processing operation corresponding to ActionNode1 includes the following steps:
[0107] (1) Obtain the rdd and metadata output by the predecessor node, that is, the rdd and metadata of DataNode1.
[0108] (2) Traverse the fields specified for ActionNode1 in the configuration file and parse out the filter condition corresponding to each field. The filter conditions fall into the following types:
[0109] Greater than (bigger): mainly compares numeric fields, or uniformly formatted strings, by size.
[0110] Less than (smaller): mainly compares numeric fields, or uniformly formatted strings, by size.
[0111] Equal (equal): mainly tests numeric fields, or uniformly formatted strings, for equality.
[0112] Contains (contain): mainly performs fuzzy matching on string fields.
[0113] Between (between): mainly performs continuous-range matching on numeric fields or uniformly formatted strings.
[0114] In (in): mainly performs discrete-range matching.
[0115] (3) For each specified field, the filter method of the rdd is used to obtain the rows satisfying that field's condition, and this temporary result rdd serves as the input for the next specified field. That is, whenever the specified field is not the first one, the temporary result rdd of the previous field is used as the input to filter (see the sketch after this list). When the traversal completes, the resulting rdd satisfies all the field filter conditions.
[0116] (4) If there are other operations, such as a SQL-like limit, the take method of the rdd can be used, and the result can be converted back to an rdd with the sc.makeRDD() method.
[0117] (5) At this point, the data processing operation corresponding to ActionNode1 is complete. Because conditional filtering only removes rows that fail the filter conditions and does not change the number or order of fields in the rdd, the metadata of ActionNode1 is the same as that of DataNode1. Save the final rdd to resutRddMap, save the corresponding metadata to resutRddMetaMap, and mark the execution status of ActionNode1 as completed.
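A sketch of the chained filtering in steps (2)-(3), assuming conditions parsed from the configuration file into a hypothetical Condition shape and the metadata passed as a plain Map:

```scala
import org.apache.spark.rdd.RDD

// Hypothetical shape of one parsed filter condition.
case class Condition(field: String, op: String, value: String)

// Steps (2)-(3): apply rdd.filter once per specified field; the temporary
// result of one field becomes the input of the next.
def conditionFilter(input: RDD[String],
                    meta: Map[String, Int],
                    conds: Seq[Condition]): RDD[String] =
  conds.foldLeft(input) { (rddTmp, c) =>
    rddTmp.filter { line =>
      val v = line.split("\u0001", -1)(meta(c.field))
      c.op match {
        case "equal"   => v == c.value
        case "bigger"  => v > c.value   // string order; assumes uniform formatting
        case "smaller" => v < c.value
        case "contain" => v.contains(c.value)
        case "in"      => c.value.split(",").contains(v)
        case _         => true          // other ops (e.g. between) omitted in this sketch
      }
    }
  }
```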
[0118] The spatio-temporal filtering node is used to filter the time field and/or location field in the input source based on specified conditions and output the processed data. The spatio-temporal filtering node can be regarded as a special case of the conditional filtering node, which will not be elaborated here.
[0119] Now the only action nodes all of whose predecessor nodes have completed execution are ActionNode2 and ActionNode3. One of the two is selected at random to run first; assume ActionNode2 runs first. ActionNode2 is a frequency statistics node, used to count the number of rows whose specified fields hold the same values in the input source, and to output the processed data. Its function is similar to filtering with a where condition and then counting with count in SQL.
[0120] Specifically, the data processing operation corresponding to ActionNode2 includes the following steps:
[0121] (1) Obtain the rdd and metadata output by the predecessor node, that is, the result rdd and metadata of ActionNode1.
[0122] (2) Frequency statistics use the specified fields as the key: only rows with the same key are counted together. If multiple fields are specified, all of them must match before two rows are considered the same.
[0123] (3) Traverse the specified fields, each traversal taking the result rdd of ActionNode1 as input. Split each line of the rdd by the separator "\001" and, using the metadata, keep only the current field as each line of a new rdd. After this loop, a group of rdds is obtained whose content is only the corresponding field; the number of rdds equals the number of specified fields.
[0124] (4) Traverse this group of rdds with an intermediate variable rddTmp: when the rdd corresponds to the first field, rddTmp=rdd; otherwise, rddTmp=rddTmp.zip(rdd).map(x=>x._1+"\001"+x._2). Finally an rdd containing only the specified fields is obtained.
[0125] (5) Performing groupBy(x=>x).mapValues(_.size).map(x=>x._1+"\001"+x._2) on the above rdd yields an rdd containing the specified fields and a frequency field (a sketch of steps (3)-(5) follows this list).
[0126] (6) If there are other operations, such as a SQL-like limit, the take method of the rdd can be used, and the result can be converted back to an rdd with the sc.makeRDD() method.
[0127] (7) At this point, ActionNode2 has finished running. Its fields have changed: the result rdd contains only the specified fields and the corresponding frequencies, so the metadata should consist of the specified fields, the frequency field, and their indexes. Save the final rdd to resutRddMap, save the corresponding metadata to resutRddMetaMap, and mark the execution status of ActionNode2 as completed.
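A sketch of steps (3)-(5) under the same assumptions (metadata as a plain Map, "\u0001" standing in for "\001"):

```scala
import org.apache.spark.rdd.RDD

val SEP = "\u0001"

// Steps (3)-(5): project each key field into its own rdd, zip them back
// into a composite key, then count identical keys.
def frequency(input: RDD[String],
              meta: Map[String, Int],
              keyFields: Seq[String]): RDD[String] = {
  // step (3): one rdd per specified field
  val fieldRdds = keyFields.map(f => input.map(_.split(SEP, -1)(meta(f))))
  // step (4): zip the per-field rdds into one composite key per line
  val keyRdd = fieldRdds.reduceLeft { (rddTmp, rdd) =>
    rddTmp.zip(rdd).map(x => x._1 + SEP + x._2)
  }
  // step (5): count occurrences of each key and append the frequency field
  keyRdd.groupBy(x => x).mapValues(_.size).map(x => x._1 + SEP + x._2)
}
```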
[0128] The only action node all of whose predecessor nodes have completed execution is ActionNode3, so ActionNode3 is executed. ActionNode3 is a field filter node, used to filter out the specified fields from the input source and output the processed data. Its function is similar to selecting fields in an SQL query.
[0129] Specifically, the data processing operation corresponding to ActionNode3 includes the following steps:
[0130] (1) Obtain the rdd and metadata output by the predecessor node, that is, the rdd and metadata of DataNode2.
[0131] (2) Traverse the specified fields, each traversal taking the rdd output by DataNode2 as input and filtering out one specified field at a time. After the traversal completes, a group of rdds containing only the specified fields is obtained, each rdd holding just the column of one field.
[0132] (3) Traverse this group of rdds with an intermediate variable rddTmp: when the rdd corresponds to the first field, rddTmp=rdd; otherwise, rddTmp=rddTmp.zip(rdd).map(x=>x._1+"\001"+x._2). Finally an rdd containing only the specified fields is obtained.
[0133] (4) At this point, ActionNode3 has finished running. The result rdd contains only the specified fields, so the metadata should be the specified fields and their indexes. Save the final rdd to resutRddMap, save the corresponding metadata to resutRddMetaMap, and mark the execution status of ActionNode3 as completed.
[0134] The only action node all of whose predecessor nodes have completed execution is ActionNode4, so ActionNode4 is executed. ActionNode4 is an intersection node, which inner-joins multiple input sources, taking the field(s) specified for each input source as the join key, and outputs the processed data. Its function is similar to the inner join operation in SQL.
[0135] Specifically, the data processing operation corresponding to ActionNode4 includes the following steps:
[0136] (1) Intersection is an operation between multiple rdds. First obtain the result rdds and metadata of ActionNode1 and ActionNode3. Every predecessor node of the intersection operation must have specified fields, and different predecessor nodes may specify different fields. In the intersection operation, only rows for which the values of all the specified fields of the multiple input sources are equal are retained.
[0137] (2) Traverse the result rdd of each predecessor node and, within it, traverse each specified field, filtering out the content of each field as a separate rdd. After this two-level traversal, x rddLists are obtained, each containing y rdds, where x is the number of predecessor nodes and y is the number of fields specified for each predecessor node.
[0138] (3) Traverse the rddLists, perform a zip operation on the rdds within the same rddList, and convert the result from Tuple(x1, x2) form to String form. Specifically, when the rdd is the first in its rddList, rddTmp=rdd; otherwise, rddTmp=rddTmp.zip(rdd).map(x=>x._1+"\001"+x._2). Finally a group of rdds containing only the specified fields is obtained.
[0139] (4) Traverse this group of rdds and zip each rdd with the result rdd of its corresponding predecessor node to form Tuple(joinkey, line). Finally a group of rdds of the uniform format RDD[Tuple[String, String]] is generated.
[0140] (5) Traverse the group of rdds generated above and perform join operations in turn, specifically rdd1.join(rdd2).mapValues(x=>x._1+"\001"+x._2). After the traversal completes, an rdd containing only the reserved fields is obtained.
[0141] (6) Restore the format of the above rdd to RDD[String] through rdd.map(_._2).
[0142] (7) At this point, ActionNode4 has finished running. The result rdd contains only the reserved fields, so the metadata should be the reserved fields and their indexes. Save the final rdd to resutRddMap, save the corresponding metadata to resutRddMetaMap, and mark the execution status of ActionNode4 as completed. (A sketch of the keying and join steps follows.)
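A sketch of the keying and inner join for two predecessors; for brevity it folds the per-field extraction and zip of steps (2)-(4) into a single map (keyBy and intersect are hypothetical names):

```scala
import org.apache.spark.rdd.RDD

val SEP = "\u0001"

// Step (4), condensed: key one input source by its specified join fields,
// keeping the whole line as the value — Tuple(joinkey, line).
def keyBy(input: RDD[String], meta: Map[String, Int],
          joinFields: Seq[String]): RDD[(String, String)] =
  input.map { line =>
    val cols = line.split(SEP, -1)
    (joinFields.map(f => cols(meta(f))).mkString(SEP), line)
  }

// Steps (5)-(6) for two predecessors: inner join on the key, splice the
// matched rows, then drop the key to restore RDD[String].
def intersect(a: RDD[(String, String)],
              b: RDD[(String, String)]): RDD[String] =
  a.join(b).mapValues(x => x._1 + SEP + x._2).map(_._2)
```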
[0143] The only action nodes all of whose predecessor nodes have completed execution are ActionNode5 and ActionNode6. One of the two is selected at random to run first; assume ActionNode6 runs first. ActionNode6 is a union node, which merges the contents of multiple input sources and outputs the processed data. Its function is similar to the union operation in SQL.
[0144] Specifically, the data processing operation corresponding to ActionNode6 includes the following steps:
[0145] (1) The union operation requires that the rdd formats of the multiple input sources be consistent, that is, each line of every result rdd must contain the same number of separators, so that the union proceeds smoothly. First obtain the result rdds and metadata of all predecessor nodes, that is, those of DataNode3 and ActionNode4.
[0146] (2) Traverse the result rdds: on the first iteration, rddTmp=rdd; otherwise, rddTmp=rddTmp.union(rdd). Finally a single rdd is obtained.
[0147] (3) At this point, ActionNode6 has finished running. The metadata of the result rdd is the metadata of the first predecessor node. Save the final rdd to resutRddMap, save the corresponding metadata to resutRddMetaMap, and mark the execution status of ActionNode6 as completed.
[0148] The only action node all of whose predecessor nodes have completed execution is ActionNode5, so ActionNode5 is executed. ActionNode5 is a field splicing node, which joins multiple input sources, taking the field(s) specified for each input source as the join key, and outputs the processed data. Its function is similar to the left outer join operation in SQL.
[0149] Specifically, the data processing operation corresponding to ActionNode5 includes the following steps:
[0150] (1) Field splicing is an operation between multiple rdds. First obtain the result rdds and metadata of ActionNode2 and ActionNode4. Every predecessor node of the field splicing operation must have specified fields, and different predecessor nodes may specify different fields. In the field splicing operation, rows are spliced only where the values of all the specified fields of the multiple input sources are equal.
[0151] (2) Traverse the result rdd of each predecessor node and, within it, traverse each specified field, filtering out the content of each field as a separate rdd. After this two-level traversal, x rddLists are obtained, each containing y rdds, where x is the number of predecessor nodes and y is the number of fields specified for each predecessor node.
[0152] (3) Traverse the rddLists, perform a zip operation on the rdds within the same rddList, and convert the result from Tuple(x1, x2) form to String form. Specifically, when the rdd is the first in its rddList, rddTmp=rdd; otherwise, rddTmp=rddTmp.zip(rdd).map(x=>x._1+"\001"+x._2). Finally a group of rdds containing only the specified fields is obtained.
[0153] (4) Traverse this group of rdds and zip each rdd with the result rdd of its corresponding predecessor node to form Tuple(joinkey, line). Finally a group of rdds of the uniform format RDD[Tuple[String, String]] is generated.
[0154] (5) Traverse the group of rdds generated above and perform leftOuterJoin operations in turn, specifically rdd1.leftOuterJoin(rdd2).mapValues(x=>x._1+"\001"+x._2). After the traversal completes, an rdd containing only the reserved fields is obtained.
[0155] (6) Restore the format of the above rdd to RDD[String] through rdd.map(_._2).
[0156] (7) At this point, ActionNode5 has finished running. The result rdd contains only the reserved fields, so the metadata should be the reserved fields and their indexes. Save the final rdd to resutRddMap, save the corresponding metadata to resutRddMetaMap, and mark the execution status of ActionNode5 as completed. (A sketch of the left outer join follows.)
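A sketch of the left outer join for two keyed predecessors. Note that Spark's leftOuterJoin delivers the right-hand row as an Option; filling an unmatched row with an empty string is an assumption of this sketch, not something the text specifies:

```scala
import org.apache.spark.rdd.RDD

val SEP = "\u0001"

// Left outer join for two keyed predecessors (Tuple(joinkey, line) each).
// Every left row is kept; the right side arrives as an Option, and an
// unmatched row is filled with "" here — an assumption of this sketch.
def spliceFields(left: RDD[(String, String)],
                 right: RDD[(String, String)]): RDD[String] =
  left.leftOuterJoin(right)
      .mapValues { case (l, r) => l + SEP + r.getOrElse("") }
      .map(_._2)
```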
[0157] The only action node all of whose predecessor nodes have completed execution is ActionNode7, so ActionNode7 is executed. ActionNode7 is a subtraction node, used to delete from the first input source the rows whose specified fields hold the same values as rows in the second input source, and to output the processed data.
[0158] Specifically, the data processing operation corresponding to ActionNode7 includes the following steps:
[0159] (1) The subtraction operation allows exactly two predecessor nodes. First obtain the result rdds and metadata of ActionNode5 and ActionNode6. Each predecessor node of the difference-set operation must have specified fields, and the two predecessor nodes may specify different fields. In the difference-set operation, a row of the first input source is removed only when the values of all its specified fields equal those of a row in the second input source.
[0160] (2) Traverse the result rdd of each predecessor node and, within it, traverse each specified field, filtering out the content of each field as a separate rdd. After this two-level traversal, x rddLists are obtained, each containing y rdds, where x is the number of predecessor nodes and y is the number of fields specified for each predecessor node.
[0161] (3) Traverse the rddLists, perform a zip operation on the rdds within the same rddList, and convert the result from Tuple(x1, x2) form to String form. Specifically, when the rdd is the first in its rddList, rddTmp=rdd; otherwise, rddTmp=rddTmp.zip(rdd).map(x=>x._1+"\001"+x._2). Finally two rdds containing only the specified fields are obtained.
[0162] (4) Zip each of these two rdds with the result rdd of its corresponding predecessor node to form Tuple(joinkey, line). Finally two rdds of the uniform format RDD[Tuple[String, String]] are generated.
[0163] (5) Perform the subtract operation on the two rdds generated above, specifically rdd1.subtract(rdd2), finally obtaining one rdd.
[0164] (6) Restore the format of the above rdd to RDD[String] through rdd.map(_._2).
[0165] (7) At this point, ActionNode7 has finished running. The metadata of the result rdd is the metadata of the first predecessor node. Save the final rdd to resutRddMap, save the corresponding metadata to resutRddMetaMap, and mark the execution status of ActionNode7 as completed.
[0166] The only action node all of whose predecessor nodes have completed execution is ActionNode8, so ActionNode8 is executed. ActionNode8 is a save node, used to save the input source as a Hive table or an HDFS file.
[0167] Specifically, the data processing operation corresponding to ActionNode8 includes the following steps:
[0168] (1) Obtain the rdd and metadata output by the predecessor node, that is, the result rdd and metadata of ActionNode7.
[0169] (2) Determine the specified storage type. If it is an HDFS file, store the result rdd directly as an ordinary file at the specified HDFS path; if it is a Hive table, store the result rdd under the database directory specified by Hive and create an external table mapped to it (a sketch follows below).
[0170] (3) Mark the execution status of ActionNode8 as executed. Because ActionNode8 is the final node, there is no need to store its result rdd and metadata.
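A sketch of both save branches. The external-table DDL is standard Hive syntax, but the exact statement the invention uses is not given, so treat it as an assumption:

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.hive.HiveContext

// Save node, step (2): plain HDFS file, or HDFS files plus an external
// Hive table mapped onto them. The DDL below is a sketch, not the patent's.
def save(rdd: RDD[String], hive: HiveContext, storeType: String,
         path: String, table: String, fields: Seq[String]): Unit =
  storeType match {
    case "hdfs" =>
      rdd.saveAsTextFile(path)   // ordinary file at the specified HDFS path
    case "hive" =>
      rdd.saveAsTextFile(path)   // write under the Hive database directory
      val cols = fields.map(f => s"`$f` string").mkString(", ")
      hive.sql(s"""CREATE EXTERNAL TABLE IF NOT EXISTS $table ($cols)
                   ROW FORMAT DELIMITED FIELDS TERMINATED BY '\\001'
                   LOCATION '$path'""")  // map the external table to the files
    case other =>
      throw new IllegalArgumentException(s"unknown storage type: $other")
  }
```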
[0171] In summary, with the calculation method provided by the first embodiment of the present invention, different business requirements only require modifying the configuration file and reconfiguring the combination of nodes and edges. The configuration process is flexible and convenient, no change to the code in the program package is needed, and the principle of writing code once and using it many times is realized. At the same time, the development efficiency of computing programs for different businesses is significantly improved and the workload of program developers is reduced. The method can be applied to, but is not limited to, Spark programs.

Example

[0172] Second embodiment
[0173] In the second embodiment, the calculation method provided by the embodiment of the present invention is further explained by comparing it with SQL statements.
[0174] The default database in Hive has two tables, student and sc. The fields of student are sno, sname, sage, and sex; the fields of sc are sno, cno, and score. Figure 4 shows a schematic diagram of the content of the student table provided by the second embodiment of the present invention, and Figure 5 shows a schematic diagram of the content of the sc (score) table.
[0175] Now there is a requirement: find all the scores of all female students. Using sparkSQL directly, that is, in a SQL-like way, the SQL statement would be:
[0176] select score, cno from sc where sno in (select sno from student where sex='female')
[0177] For ease of explanation, the SQL statement is transformed:
[0178] select a.score, a.cno from sc a join (select sno from student where sex='female') b on a.sno=b.sno
[0179] That is, the SQL is ultimately converted into a join query between tables.
[0180] If the requirements keep changing and vary in complexity, yet a single set of code must handle them all, then the SQL statements must be generated dynamically. Complex SQL statements automatically generated by a machine are of poor quality, so dynamically generating SQL is unreliable.
[0181] Therefore, the calculation method provided by the embodiment of the present invention can be used instead. In the above SQL statement there are two data source nodes, corresponding to the student table and the sc table; the select, from, join and other operations correspond to action nodes.
[0182] The directed acyclic graph corresponding to the above SQL statement is shown in Figure 6: DataNode1 and DataNode2 are data source nodes, ActionNode1 to ActionNode5 are action nodes, and the SQL fragment corresponding to each node is indicated in its box. The connecting lines with arrows indicate the flow of data. Unlike executing the SQL statement, which directly outputs the final result, executing according to the directed acyclic graph produces a result after each node's operation; the output of each node becomes the input of the node its arrow points to, and the final output is written to disk as an HDFS file or Hive table.
[0183] The keywords in SQL statements are fixed; different SQL statements merely splice them in different orders. The present invention is similar: the type of each node is fixed, while the number of nodes and their splicing order differ. The ever-changing SQL requirements are thus transformed into changing splicing requirements on the directed acyclic graph. For different business requirements, only the combination of nodes and edges in the configuration file, and the parameters of the nodes, need to be changed.
[0184] In the process of computing with RDD[String], the columns are spliced together, which is equivalent to obtaining the whole row at once. Figure 7 shows a schematic diagram of the rdd corresponding to the student table provided by the second embodiment of the present invention. Referring to Figure 7, the columns are spliced into a single string, with each column separated by "\001" (this character is invisible, so it is not shown in the figure). To obtain a field such as sage, the string must first be split by the separator and the field then read at the position where sage sits, index 2 (after sno and sname, numbering from 0). Therefore, each string needs metadata (field names and field indexes) to describe it; a worked example follows.
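A worked example of this metadata lookup, with hypothetical row values for the student table (the real contents of Figure 4 are not reproduced here):

```scala
// Fields of student in order: sno, sname, sage, sex (values are made up).
val metaMap = Map("sno" -> 0, "sname" -> 1, "sage" -> 2, "sex" -> 3)
val line = Seq("s01", "Alice", "20", "female").mkString("\u0001")

// To read sage: split on the separator, then index by its metadata position.
val sage = line.split("\u0001", -1)(metaMap("sage"))  // == "20"
```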
[0185] Figure 8 shows a schematic diagram of the rdd and metaMap output by each node in the business process provided by the second embodiment of the present invention. Referring to Figure 8, the execution process of the directed acyclic graph and the output generated by each node's data processing operation can be seen.
[0186] For the points not mentioned in the second embodiment of the present invention, reference may be made to the related description in the first embodiment, which will not be repeated here.
[0187] Third embodiment
[0188] Figure 9 shows a functional block diagram of a computing device 200 provided by the third embodiment of the present invention. Referring to Figure 9, the device includes a reading module 210, a construction module 220, and an execution module 230.
[0189] The reading module 210 is used to read and parse the configuration file, the content of which includes multiple nodes and at least one edge connecting the multiple nodes, where each node represents a data processing unit in the business process and each edge represents the direction of data flow between two nodes;
[0190] The construction module 220 is used to create the multiple nodes and to construct a directed acyclic graph representing the business process based on the multiple nodes and the at least one edge, where each node and the data processing operation corresponding to each node are defined in the pre-generated program package;
[0191] The execution module 230 is configured to execute the data processing operation corresponding to each node according to the data flow direction in the directed acyclic graph until the data processing operation corresponding to each node is executed.
[0192] The computing device 200 provided by the third embodiment of the present invention has the same implementation principle and technical effects as the foregoing method embodiments. For brevity, for parts not mentioned in this device embodiment, refer to the corresponding content in the foregoing method embodiments.

Example

[0193] Fourth embodiment
[0194] The fourth embodiment of the present invention provides a computer-readable storage medium on which computer program instructions are stored. When the computer program instructions are read and run by a processor, the steps of the calculation method provided by the embodiments of the present invention are executed. The computer-readable storage medium can be implemented as, but is not limited to, the memory 102 shown in Figure 1.


