A distributed deep forest optimization method based on adaptive partitioning of sub-forests
By combining a task-parallel sub-forest algorithm with Hadoop data blocks, this distributed deep forest optimization method with adaptive sub-forest partitioning improves the training efficiency of multi-granularity cascaded forests. It addresses the problems of fixed sub-forest partitioning granularity and high communication overhead, enabling efficient distributed training with improved robustness.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- EAST CHINA NORMAL UNIV
- Filing Date
- 2023-03-09
- Publication Date
- 2026-06-16
AI Technical Summary
Existing technologies suffer from problems such as low training efficiency of multi-granularity cascaded forests, fixed subforest partitioning granularity that cannot be adaptively adjusted, high communication transmission overhead, and high additional overhead for sampling operations in distributed environments.
We adopt an adaptive sub-forest distributed deep forest optimization method, which combines task-parallel sub-forest algorithm, Hadoop distributed file system data blocks, and two-stage pre-aggregation and system-level backup to optimize the data communication process, reduce communication complexity, and improve training speed and system robustness.
Adaptive subforest aggregation granularity adjustment was achieved, reducing communication overhead, improving the parallel training speed of the model, enhancing system robustness, and optimizing the distributed sampling operation process.
Figure CN116522749B_ABST
Description
Technical Field
[0001] This invention relates to the field of distributed machine learning technology, and in particular to a distributed deep forest optimization method based on adaptive subforest partitioning.
Background Technology
[0002] In recent years, deep learning has become the fundamental paradigm of machine learning. Deep neural networks (DNNs) have achieved tremendous success in applications such as computer vision, natural language processing, and data mining, bringing machine learning closer to the goal of artificial intelligence. Although DNNs perform exceptionally well in various machine learning tasks, they have many drawbacks, such as requiring large amounts of training data to achieve good inference performance; a large model parameter space, consuming significant computational resources during training, necessitating the use of GPUs or TPUs for acceleration; and the need for complex hyperparameter tuning. Therefore, a novel Deep Forest paradigm, called Multi-Grained Cascade Forest (gcForest), has been proposed as a deep learning model that can replace deep neural networks. Compared to deep neural networks, deep forests offer several advantages: thanks to ensemble learning, deep forests use cross-validation to evaluate the performance of cascaded forest layers, determining whether new cascaded forest layers need to be generated based on whether the validation results converge, thus adaptively adjusting the model depth during the training phase. They have fewer hyperparameters than deep neural networks, saving considerable time on parameter tuning, and use trees as the foundation of the model, resulting in higher interpretability. These advantages make it possible for deep forests to replace deep neural networks.
[0003] Multi-granularity cascaded forests, as a pioneering work in deep forest models, have achieved excellent prediction accuracy and model performance in scenarios such as image classification and facial recognition. However, with the increase in data volume and model complexity, non-distributed single-machine training methods are not efficient. For example, a host equipped with an Intel 2.3GHz Xeon E5-V3 quad-core processor and 16GB of memory takes 11 hours to complete the training of a multi-granularity cascaded forest.
[0004] To address the low training efficiency of multi-granularity cascaded forests, researchers proposed the distributed multi-granularity cascaded forest (ForestLayer), which introduces a sub-forest parallel algorithm. This algorithm divides each forest into multiple sub-forests and computes them in parallel across multiple machines, improving the training speed of deep forests. Although the sub-forest parallel algorithm alleviates the low training efficiency of non-distributed multi-granularity cascaded forests to some extent, its sub-forest partitioning granularity is fixed and cannot adapt to changes in the distributed environment. Determining the partitioning granularity, i.e., the number of partitions, thus becomes a new problem. To solve it, a distributed deep forest with adaptive sub-forest partitioning (BLB-gcForest) was proposed. BLB-gcForest combines the sub-forest parallel algorithm with the Bag of Little Bootstraps (BLB) method, selecting the sub-forest partitioning granularity adaptively and using tree-based parallelism to obtain finer-grained parallelism. Compared with distributed multi-granularity cascaded forests, BLB-gcForest therefore trains the cascaded forest with a finer-grained parallel scheme and higher parallel efficiency. However, because of its tree-based parallelism, the decision trees of a forest are computed on multiple worker nodes, and during class vector aggregation the data must be transferred from these worker nodes to a single node, which causes significant communication overhead. In addition, on distributed storage clusters such as the Hadoop Distributed File System (HDFS), the BLB method must scan the entire dataset, so sampling incurs additional overhead for remote data access.
[0005] Existing technologies suffer from problems such as excessively large subforest aggregation granularity and additional overhead of sampling operations in distributed environments.
Summary of the Invention
[0006] The purpose of this invention is to address the shortcomings of existing technologies by providing a distributed deep forest optimization method based on adaptive subforest partitioning. This method constructs a distributed deep forest model based on a task-parallel subforest algorithm, combines Hadoop Distributed File System data blocks with the adaptive subforest partitioning parallel algorithm, and performs distributed deep forest training. It designs data block-level presampling, two-stage pre-aggregation, and system-level backup to optimize the distributed deep forest training process, reduce communication complexity, decrease data transmission overhead, improve the parallel training speed of the model, optimize the data communication process, enhance system robustness, and significantly optimize the subforest aggregation method and distributed sampling operation process. It effectively solves problems such as excessively large subforest aggregation granularity and additional overhead of sampling operations in a distributed environment. The method is simple, effective, and has excellent application prospects.
[0007] The specific technical solution to achieve the purpose of this invention is: a distributed deep forest optimization method based on adaptive sub-forest partitioning, characterized in that the method specifically includes the following steps:
[0008] S1. Model the deep forest as a distributed deep forest model based on the task-parallel sub-forest algorithm; the deep forest consists of a multi-granularity scanning stage and a cascaded forest stage.
[0009] The cascaded forest can be represented as a set of multiple random forests composed of sub-forests, with each sub-forest using multiple servers for parallel computation as follows:
[0010] 1) Establish the cascaded forest stage, which is composed of cascaded forest layers. Each cascaded forest layer is composed of random forests; each random forest is composed of decision trees and is divided into sub-forests. From the number of decision trees in a random forest and the number of sub-forests, the number of decision trees contained in each sub-forest is calculated by the following formula:
[0011] [formula not recoverable from this text]
[0012] 2) Let the dataset, its corresponding labels, and the dataset size be given. Sampling the dataset without replacement yields a subsample set of a specified size; sampling that subsample set with replacement yields a resampled dataset of a specified size.
[0013] 3) Define an evaluation function for each decision tree; the result of the evaluation function is the class vector of that tree. A sub-forest computes a weighted average of the class vectors of its decision trees to obtain the class vector of the sub-forest. Similarly, a random forest sums and averages the class vectors of its sub-forests to obtain the class vector of the random forest. A cascaded forest layer concatenates the class vectors of its random forests to obtain the class vector of that cascaded forest layer. Following the operational steps of multi-granularity cascaded forests, the class vector of each cascaded forest layer is concatenated with the features of the original dataset to form the feature set of the next cascaded forest layer. When the class vectors of the cascaded forest layers converge, the class vectors of the random forests in the converged layer are summed and averaged to obtain the final class vector.
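The aggregation hierarchy described in step 3) can be illustrated with a minimal sketch for a single sample. It is not the patented implementation: the function names, the nested-list layout of a layer, and the per-tree weights are all assumptions introduced here, since the original formulas are not available in this text.

```python
from typing import List, Sequence, Tuple

Vector = List[float]
WeightedTree = Tuple[Vector, float]  # (tree class vector, hypothetical tree weight)


def weighted_average(vectors: Sequence[Vector], weights: Sequence[float]) -> Vector:
    """Weighted average of equal-length class vectors."""
    total = sum(weights)
    return [
        sum(w * v[k] for v, w in zip(vectors, weights)) / total
        for k in range(len(vectors[0]))
    ]


def mean(vectors: Sequence[Vector]) -> Vector:
    """Sum-and-average of equal-length class vectors."""
    return weighted_average(vectors, [1.0] * len(vectors))


def forest_class_vectors(
    layer: Sequence[Sequence[Sequence[WeightedTree]]],
) -> List[Vector]:
    """Tree vectors -> sub-forest vectors (weighted average) -> forest vectors (mean).

    `layer` is a list of random forests; each forest is a list of sub-forests;
    each sub-forest is a list of (class vector, weight) pairs, one per tree.
    """
    result = []
    for sub_forests in layer:
        sub_vectors = [
            weighted_average([v for v, _ in trees], [w for _, w in trees])
            for trees in sub_forests
        ]
        result.append(mean(sub_vectors))
    return result


def layer_class_vector(forest_vectors: Sequence[Vector]) -> Vector:
    """Concatenate the random forest class vectors of one cascaded forest layer."""
    return [x for vec in forest_vectors for x in vec]


def next_layer_features(original: Vector, layer_vector: Vector) -> Vector:
    """Concatenate the original features with the layer class vector."""
    return list(original) + list(layer_vector)
```

Once the layers have converged, the final class vector in this sketch would be `mean(forest_class_vectors(layer))` for the last layer.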
[0014] S2. Use a presampling algorithm to process the dataset results after multi-granularity scanning, generate Hadoop Distributed File System (HDFS) data blocks, and in the cascade forest stage, select local HDFS data blocks for sampling without replacement based on the server they are on, to obtain a subset of samples for each forest in the cascade forest.
[0015] The results after multi-granularity scanning are partitioned into a set of mutually exclusive subsets of the same size, where the number of subsets is the number of HDFS data blocks. Each subset is then, in turn, randomly shuffled and divided into smaller data blocks. Across different subsets, the smaller data blocks with the same index are merged into one data block, which gives the total set of data blocks. Each data block is used directly as a storage file in HDFS, and each data block is divided into h smaller data blocks. The presampling algorithm keeps the data distribution of each data block consistent with that of the original dataset while reducing the dataset size. After presampling, the cascaded forest retrieves a local HDFS data block, and its feature space and labels are passed into the cascaded forest layer. Each random forest in the cascaded forest layer uses the local data block to perform several rounds of sampling without replacement, each round producing one subsample set of a fixed size; the subsample sets generated in this way are transmitted to the worker nodes. The random forest is divided into multiple sub-forests, and each sub-forest samples its subsample set with replacement to obtain a dataset of size n, which is then passed into the sub-forest for computation.
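The block-level presampling described above (partition, shuffle, split, merge by index) can be sketched as follows. This is an illustrative reading rather than the patented algorithm: the function name `presample_blocks`, the use of in-memory lists instead of HDFS files, the choice to split each subset into as many pieces as there are blocks, and the dropping of any remainder records are all assumptions.

```python
import random
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def presample_blocks(records: Sequence[T], num_blocks: int, seed: int = 0) -> List[List[T]]:
    """Build `num_blocks` blocks that each draw evenly from every part of `records`.

    1. Partition the records into `num_blocks` disjoint, equal-size subsets.
    2. Shuffle each subset and split it into `num_blocks` smaller pieces.
    3. Merge the pieces that share the same index into one block.
    Records that do not divide evenly are dropped (a simplification).
    """
    rng = random.Random(seed)
    subset_size = len(records) // num_blocks
    piece_size = subset_size // num_blocks
    if piece_size == 0:
        raise ValueError("too few records for the requested number of blocks")

    blocks: List[List[T]] = [[] for _ in range(num_blocks)]
    for i in range(num_blocks):
        subset = list(records[i * subset_size:(i + 1) * subset_size])
        rng.shuffle(subset)
        for j in range(num_blocks):
            # Piece j of every subset ends up in block j.
            blocks[j].extend(subset[j * piece_size:(j + 1) * piece_size])
    return blocks
```

Because every block receives the same number of records from every subset, a worker that only reads its local block still sees a sample spread across the whole dataset, which is the property the paragraph above relies on.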
[0016] S3. Train the cascaded forest layers, and use pre-aggregation and backup techniques to process the intermediate vectors of the cascaded forest layers to optimize the data communication and transmission process.
[0017] In a given cascaded forest layer, each random forest is divided into multiple sub-forests. The feature set of the layer is sampled without replacement several times to obtain the corresponding subsample sets. Each sub-forest then, iteratively, samples its subsample set with replacement to obtain its training dataset. The decision trees of each sub-forest are divided into groups by a round-robin method, where the number of groups is the number of worker nodes. The number of decision trees in each group, m, is calculated by formula (b):
[0018] [formula (b) not recoverable from this text]
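A round-robin assignment of a sub-forest's decision trees to worker groups can be sketched as below. The function name and the use of integer tree identifiers are assumptions, and the sketch does not reproduce formula (b); it only shows that round-robin assignment yields group sizes that differ by at most one.

```python
from typing import List, Sequence


def round_robin_groups(tree_ids: Sequence[int], num_workers: int) -> List[List[int]]:
    """Assign trees to `num_workers` groups in round-robin order."""
    groups: List[List[int]] = [[] for _ in range(num_workers)]
    for position, tree_id in enumerate(tree_ids):
        groups[position % num_workers].append(tree_id)
    return groups
```

For example, ten trees on three worker nodes give groups of four, three, and three trees.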
[0019] The grouped decision trees are sent to the worker nodes, which compute in parallel: each worker node uses the evaluation function of its decision trees to evaluate the dataset and obtain the class vectors of those trees. These class vectors are then combined by a two-stage pre-aggregation. The first stage aggregates, in the local memory of each worker node, the class vectors computed on that node, yielding the pre-aggregated class vector of the sub-forest on that worker node. During the first stage, the class vector calculated by each decision tree is simultaneously backed up on the local disk of the worker node, together with an identifier indicating the cascaded forest layer, random forest, sub-forest, and decision tree to which the vector belongs. In the second stage, the pre-aggregated class vectors of the sub-forest from each node are transmitted to the same node for full aggregation, ultimately yielding the class vector of the sub-forest. After the pre-aggregation and backup mechanism has completed, the class vectors of the sub-forests are aggregated into the class vectors of their respective random forests. The random forest class vectors are concatenated to obtain the class vector of the current cascaded forest layer, which is passed to the next cascaded forest layer and concatenated with the features of the dataset to obtain the feature set of that next layer. When the class vectors of the cascaded forest layers converge, the class vectors of the random forests in the converged layer are summed and averaged to obtain the final result class vector.
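The communication saving of two-stage pre-aggregation can be illustrated with a small sketch. It assumes equally weighted trees (a plain mean) and represents each worker's partial result as a sum together with a tree count; both choices, and the function names, are assumptions made for illustration rather than details taken from the patent.

```python
from typing import List, Tuple

Vector = List[float]
Partial = Tuple[Vector, int]  # (sum of class vectors on one worker, tree count)


def local_pre_aggregate(tree_vectors: List[Vector]) -> Partial:
    """Stage 1: combine the class vectors held in one worker's memory."""
    dim = len(tree_vectors[0])
    summed = [sum(v[k] for v in tree_vectors) for k in range(dim)]
    return summed, len(tree_vectors)


def full_aggregate(partials: List[Partial]) -> Vector:
    """Stage 2: combine one partial result per worker on a single node."""
    dim = len(partials[0][0])
    total_trees = sum(count for _, count in partials)
    return [sum(s[k] for s, _ in partials) / total_trees for k in range(dim)]
```

In this sketch each worker sends one partial result over the network instead of one class vector per tree, and the final sub-forest class vector equals the mean computed directly over all trees.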
[0020] Compared with the prior art, the present invention has the following advantages and significant technical effects:
[0021] 1) This invention enables adaptive adjustment of the subforest aggregation granularity. It employs a two-stage subforest aggregation process, fully utilizing the local memory of worker nodes as a cache, thus reducing the amount of data exchanged over the network. It also exhibits good compatibility with existing distributed multi-granularity cascaded forests. Furthermore, considering the communication overhead of distributed sampling and based on the characteristics of the adaptive subforest partitioning algorithm, a block-level pre-sampling algorithm is designed. This ensures that the statistical attributes and data distribution of local data blocks are essentially consistent with the original dataset, allowing for direct use of local data blocks as sampling samples and avoiding the communication overhead associated with distributed sampling.
[0022] 2) The designed system-level backup mechanism backs up the intermediate data of the cascaded forest with minimal disk usage, enhancing the robustness of the distributed system. When errors occur in the distributed environment and a task must be rolled back, the backup information can be used to quickly locate where to restart the task. In addition, once the class vector of a cascaded forest layer has been backed up, if the next cascaded forest layer crashes while training on the received class vector, the preceding cascaded forest layers do not need to be retrained: the backed-up class vector is read directly, and training restarts from the layer that crashed.
Attached Figure Description
[0023] Figure 1 is a framework diagram of the present invention.
Detailed Implementation
[0024] The present invention will now be described in detail with reference to the accompanying drawings and embodiments. Obviously, the examples listed are only for explaining the present invention and are not intended to limit the scope of the invention.
[0025] Referring to Figure 1, the present invention discloses a distributed deep forest optimization method based on adaptive sub-forest partitioning, which specifically includes the following steps:
[0026] S1. Model the deep forest as a distributed deep forest model based on the task-parallel sub-forest algorithm; the deep forest consists of a multi-granularity scanning stage and a cascaded forest stage.
[0027] Cascaded forests are represented as a collection of random forests consisting of multiple sub-forests, with each sub-forest using multiple servers for parallel computation as follows.
[0028] 1) Establish the cascaded forest stage, which is composed of cascaded forest layers. Each cascaded forest layer is composed of random forests; each random forest is composed of decision trees and is divided into sub-forests. From the number of decision trees in a random forest and the number of sub-forests, the number of decision trees contained in each sub-forest can be calculated.
[0029] 2) Let the dataset, its corresponding labels, and the dataset size be given. Sampling the dataset without replacement yields a subsample set of a specified size; sampling that subsample set with replacement yields a resampled dataset of a specified size.
[0030] 3) Define an evaluation function for each decision tree; the result of the evaluation function is the class vector of that tree. A sub-forest computes a weighted average of the class vectors of its decision trees to obtain the class vector of the sub-forest. Similarly, a random forest sums and averages the class vectors of its sub-forests to obtain the class vector of the random forest. A cascaded forest layer concatenates the class vectors of its random forests to obtain the class vector of that cascaded forest layer. Following the operational steps of multi-granularity cascaded forests, the class vector of each cascaded forest layer is concatenated with the features of the original dataset to form the feature set of the next cascaded forest layer. When the class vectors of the cascaded forest layers converge, the class vectors of the random forests in the converged layer are summed and averaged to obtain the final class vector.
S2. Process the dataset results after multi-granularity scanning using a presampling algorithm to generate Hadoop Distributed File System (HDFS) data blocks. In the cascaded forest stage, select local HDFS data blocks for sampling without replacement based on the server on which they reside, to obtain a subset of samples for each forest in the cascaded forest.
[0031] The results after multi-granularity scanning are partitioned into a set of mutually exclusive subsets of the same size, where the number of subsets is the number of HDFS data blocks. Each subset is then, in turn, randomly shuffled and divided into smaller data blocks. Across different subsets, the smaller data blocks with the same index are merged into one data block, which gives the total set of data blocks. Each data block is used directly as a storage file in HDFS, and each data block is divided into h smaller data blocks. The presampling algorithm keeps the data distribution of each data block consistent with that of the original dataset while reducing the dataset size. After presampling, the cascaded forest retrieves a local HDFS data block, and its feature space and labels are passed into the cascaded forest layer. Each random forest in the cascaded forest layer uses the local data block to perform several rounds of sampling without replacement, each round producing one subsample set of a fixed size; the subsample sets generated in this way are transmitted to the worker nodes. The random forest is divided into multiple sub-forests, and each sub-forest samples its subsample set with replacement to obtain a dataset of size n, which is then passed into the sub-forest for computation.
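The two-level sampling in this step (subsample sets drawn without replacement from the local block, then resampled with replacement up to size n inside each sub-forest) can be sketched with the Python standard library. The function names and parameters are hypothetical, and the subsample size and number of subsample sets are left as inputs because their defining expressions are not available in this text.

```python
import random
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def subsample_without_replacement(
    block: Sequence[T], subsample_size: int, num_subsamples: int, rng: random.Random
) -> List[List[T]]:
    """Draw `num_subsamples` subsample sets from the local block.

    Within one subsample set no record is drawn twice.
    """
    return [rng.sample(list(block), subsample_size) for _ in range(num_subsamples)]


def resample_with_replacement(subsample: Sequence[T], n: int, rng: random.Random) -> List[T]:
    """Inflate one subsample set back to size `n` by sampling with replacement."""
    return rng.choices(list(subsample), k=n)
```

A sub-forest in this sketch would call `resample_with_replacement` on the subsample set it received and train its trees on the result, so only the small subsample set has to be moved between nodes.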
[0032] S3. Train the cascaded forest layers, and use pre-aggregation and backup techniques to process the intermediate vectors of the cascaded forest layers to optimize the data communication and transmission process.
[0033] In a given cascaded forest layer, each random forest is divided into multiple sub-forests. The feature set of the layer is sampled without replacement several times to obtain the corresponding subsample sets. Each sub-forest then, iteratively, samples its subsample set with replacement to obtain its training dataset. The decision trees of each sub-forest are divided into groups by a round-robin method, where the number of groups is the number of worker nodes. The number of decision trees in each group, m, is calculated by formula (b):
[0034] [formula (b) not recoverable from this text]
[0035] The grouped decision trees are sent to the worker nodes, which compute in parallel: each worker node uses the evaluation function of its decision trees to evaluate the dataset and obtain the class vectors of those trees. These class vectors are then combined by a two-stage pre-aggregation. The first stage aggregates, in the local memory of each worker node, the class vectors computed on that node, yielding the pre-aggregated class vector of the sub-forest on that worker node. During the first stage, the class vector calculated by each decision tree is simultaneously backed up on the local disk of the worker node, together with an identifier indicating the cascaded forest layer, random forest, sub-forest, and decision tree to which the vector belongs. In the second stage, the pre-aggregated class vectors of the sub-forest from each node are transmitted to the same node for full aggregation, ultimately yielding the class vector of the sub-forest. After the pre-aggregation and backup mechanism has completed, the class vectors of the sub-forests are aggregated into the class vectors of their respective random forests. The random forest class vectors are concatenated to obtain the class vector of the current cascaded forest layer, which is passed to the next cascaded forest layer and concatenated with the features of the dataset to obtain the feature set of that next layer. When the class vectors of the cascaded forest layers converge, the class vectors of the random forests in the converged layer are summed and averaged to obtain the final result class vector.
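The local-disk backup of tree class vectors, keyed by an identifier for layer, random forest, sub-forest, and decision tree, might look like the following sketch. The file naming scheme, the JSON format, and all function names are assumptions introduced for illustration; the patent text does not specify them.

```python
import json
import tempfile
from pathlib import Path
from typing import List, Optional, Sequence


def backup_path(root: str, layer: int, forest: int, sub_forest: int, tree: int) -> Path:
    """Hypothetical file name encoding the identifier of one class vector."""
    return Path(root) / f"layer{layer}-forest{forest}-sub{sub_forest}-tree{tree}.json"


def backup_class_vector(
    root: str, layer: int, forest: int, sub_forest: int, tree: int, vector: List[float]
) -> None:
    """Write one tree's class vector to the worker's local disk."""
    backup_path(root, layer, forest, sub_forest, tree).write_text(json.dumps(vector))


def restore_class_vector(
    root: str, layer: int, forest: int, sub_forest: int, tree: int
) -> Optional[List[float]]:
    """Read a backed-up class vector, or return None if it was never written."""
    path = backup_path(root, layer, forest, sub_forest, tree)
    return json.loads(path.read_text()) if path.exists() else None


def trees_to_recompute(
    root: str, layer: int, forest: int, sub_forest: int, tree_ids: Sequence[int]
) -> List[int]:
    """After a failure, list only the trees whose class vectors have no backup."""
    return [
        t for t in tree_ids
        if not backup_path(root, layer, forest, sub_forest, t).exists()
    ]


def demo_roundtrip():
    """Back up two of three trees, then ask what is restorable and what is missing."""
    with tempfile.TemporaryDirectory() as root:
        backup_class_vector(root, 1, 0, 0, 0, [0.9, 0.1])
        backup_class_vector(root, 1, 0, 0, 2, [0.4, 0.6])
        restored = restore_class_vector(root, 1, 0, 0, 0)
        missing = trees_to_recompute(root, 1, 0, 0, [0, 1, 2])
        return restored, missing
```

Here `demo_roundtrip()` shows the intended use after a rollback: vectors that were backed up are read from disk, and only the tree without a backup is scheduled again.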
[0036] The above is merely a further description of the present invention and is not intended to limit the scope of this patent. Any equivalent implementation of the present invention should be included within the scope of the claims of this patent.
Claims
1. A distributed deep forest optimization method based on adaptive sub-forest partitioning, characterized in that the method includes the following specific steps: S1. Construct a distributed deep forest model based on the task-parallel sub-forest algorithm, wherein the deep forest consists of a multi-granularity scanning stage and a cascaded forest stage; S2. Use a presampling algorithm to process the dataset results after multi-granularity scanning, generate Hadoop Distributed File System (HDFS) data blocks, and, in the cascaded forest stage, select local HDFS data blocks for sampling without replacement based on the server on which they reside, to obtain a subset of samples for each forest in the cascaded forest; S3. Train the cascaded forest layers in a distributed manner, using pre-aggregation and backup techniques to process the intermediate vectors of the cascaded forest layers, thus optimizing the data communication and transmission process.
2. The distributed deep forest optimization method based on adaptive sub-forest partitioning according to claim 1, characterized in that step S1, which constructs a distributed deep forest model based on the task-parallel sub-forest algorithm, represents the cascaded forest as a set of multiple random forests composed of sub-forests, with each sub-forest using multiple servers for parallel computation as follows: 1) establish the cascaded forest stage, which is composed of cascaded forest layers; each cascaded forest layer is composed of random forests; each random forest is divided into sub-forests, and the number of decision trees contained in each sub-forest is calculated from the number of decision trees in the random forest and the number of sub-forests [formula not recoverable from this text]; 2) let the dataset, its corresponding labels, and the dataset size be given; sampling the dataset without replacement yields a subsample set of a specified size, and sampling that subsample set with replacement yields a resampled dataset of a specified size; 3) define an evaluation function for each decision tree, the result of which is the class vector of that tree; a sub-forest computes a weighted average of the class vectors of its decision trees to obtain the class vector of the sub-forest; similarly, a random forest sums and averages the class vectors of its sub-forests to obtain the class vector of the random forest; a cascaded forest layer concatenates the class vectors of its random forests to obtain the class vector of that cascaded forest layer; following the operational steps of multi-granularity cascaded forests, the class vector of each cascaded forest layer is concatenated with the features of the original dataset to form the feature set of the next cascaded forest layer; when the class vectors of the cascaded forest layers converge, the class vectors of the random forests in the converged layer are summed and averaged to obtain the final class vector.
3. The distributed deep forest optimization method based on adaptive sub-forest partitioning according to claim 1, characterized in that step S2 uses a presampling algorithm to process the dataset results after multi-granularity scanning and generates Hadoop Distributed File System (HDFS) data blocks as follows: the results after multi-granularity scanning are partitioned into a set of mutually exclusive subsets of the same size, where the number of subsets is the number of HDFS data blocks; each subset is then, in turn, randomly shuffled and divided into smaller data blocks; across different subsets, the smaller data blocks with the same index are merged into one data block, which gives the total set of data blocks; each data block is used directly as a storage file in HDFS, and each data block is divided into h smaller data blocks. After processing with the presampling algorithm, the cascaded forest obtains a local HDFS data block, and its feature space and labels are passed into the cascaded forest layer; each random forest in the cascaded forest layer uses the local data block to perform several rounds of sampling without replacement, each round producing one subsample set of a fixed size, and the subsample sets generated in this way are transmitted to the worker nodes; the random forest is divided into multiple sub-forests, and each sub-forest samples its subsample set with replacement to obtain a dataset of size n, which is then passed into the sub-forest for computation.
4. The distributed deep forest optimization method based on adaptive sub-forest partitioning according to claim 1, characterized in that the training of the distributed cascaded forest layers in step S3 uses pre-aggregation and backup techniques to process the intermediate vectors of the distributed sub-forests and optimize the data communication and transmission process. Specifically, in a given cascaded forest layer, each random forest is divided into multiple sub-forests; the feature set of the layer is sampled without replacement several times to obtain the corresponding subsample sets; each sub-forest then, iteratively, samples its subsample set with replacement to obtain its training dataset; the decision trees of each sub-forest are divided into groups by a round-robin method, where the number of groups is the number of worker nodes, and the number of decision trees in each group, m, is calculated by formula (b) [formula (b) not recoverable from this text]; the grouped decision trees are sent to the worker nodes, which compute in parallel, each worker node using the evaluation function of its decision trees to evaluate the dataset and obtain the class vectors of those trees; these class vectors are then combined by a two-stage pre-aggregation. The first stage aggregates, in the local memory of each worker node, the class vectors computed on that node, yielding the pre-aggregated class vector of the sub-forest on that worker node; during the first stage, the class vector calculated by each decision tree is simultaneously backed up on the local disk of the worker node, together with an identifier indicating the cascaded forest layer, random forest, sub-forest, and decision tree to which the vector belongs. In the second stage, the pre-aggregated class vectors of the sub-forest from each node are transmitted to the same node for full aggregation, ultimately yielding the class vector of the sub-forest. After the pre-aggregation and backup mechanism has completed, the class vectors of the sub-forests are aggregated into the class vectors of their respective random forests; the random forest class vectors are concatenated to obtain the class vector of the current cascaded forest layer, which is passed to the next cascaded forest layer and concatenated with the features of the dataset to obtain the feature set of that next layer; when the class vectors of the cascaded forest layers converge, the class vectors of the random forests in the converged layer are summed and averaged to obtain the final result class vector.