Intelligent forestry big data processing method and system based on data lake

By constructing a multi-level geohashing code and separating metadata from binary objects, the problems of data skew and inefficiency in smart forestry big data processing are solved, achieving load balancing and efficient data processing.

CN122240737A · Pending · Publication Date: 2026-06-19 · BEIJING SHAOGUANG SHANGMEI TECHNOLOGY DEVELOPMENT CO LTD +1

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Applications (China)
Current Assignee / Owner
BEIJING SHAOGUANG SHANGMEI TECHNOLOGY DEVELOPMENT CO LTD
Filing Date
2026-04-16
Publication Date
2026-06-19

AI Technical Summary

Technical Problem

Existing technologies for processing big data in smart forestry suffer from data skew due to uneven data distribution, low efficiency in processing large binary objects during the shuffle process, and a lack of efficient streaming data transmission and collaborative scheduling mechanisms, which affect processing efficiency and throughput.

Method used

By constructing a spatial data density heatmap for forestry data, adopting a multi-level geohashing coding strategy, and adapting different precision partitioning schemes for different density areas, the processing of metadata and binary objects is separated during the Shuffle write stage, and streaming data transmission is carried out in combination with the pre-merging and backpressure mechanisms of external Shuffle service nodes.

Benefits of technology

Load balancing was achieved, reducing the computational overhead and memory consumption of sorting operations, improving data writing efficiency, ensuring the stability and high throughput of the data processing link, and significantly shortening the processing time.



Abstract

This invention belongs to the field of data processing, specifically a method and system for smart forestry big data processing based on a data lake. It constructs a spatial data density heatmap, derives a multi-level geohash encoding strategy that adapts encodings of different precision to regions of different density, and combines a variable-length geohash partition prefix with the original data key to form a composite partition key used for data partitioning. For write tasks, a sortable first metadata buffer and an unsorted second binary object buffer are allocated to store composite partition key metadata and forestry binary objects, respectively. When either buffer reaches a threshold, the data in the second buffer is first written to an overflow file and the corresponding metadata is updated. Then, the metadata in the first buffer is sorted according to the composite partition key and written to a separate overflow file. On an external Shuffle service node, a pre-merge process is initiated for each target partition, and the transfer rate and block size of the metadata output stream are dynamically adjusted. This invention can improve the Shuffle efficiency and stability during forestry big data processing.

Description

Technical Field

[0001] This application belongs to the field of data processing, and in particular relates to a smart forestry big data processing method and system based on a data lake.

Background Technology

[0002] With the rapid development of precision forestry and smart forestry, forestry management and scientific research are entering the era of big data. Data lakes, as a new data architecture, can store and process massive amounts of multi-source, heterogeneous forestry data, such as high-resolution remote sensing imagery, lidar point clouds, drone inspection videos, and IoT sensor time-series data. These data generally possess distinct spatial attributes and enormous volume. When using mainstream distributed computing frameworks such as Apache Spark to analyze and process this massive amount of forestry data, the shuffle process, that is, the repartitioning and exchange of data between different computational stages, is crucial to overall job performance. However, the geographical distribution of forestry data is extremely uneven; for example, there are huge differences in data density between forest areas and non-forest areas, and between key monitoring areas and ordinary areas. Using conventional hash partitioning strategies can easily lead to severe data skew, causing some computational tasks to become overloaded. Furthermore, forestry data often contains large binary objects (LOBs). Frequent serialization, deserialization, sorting, and disk I/O operations on records containing these large objects during the shuffle process incur significant performance overhead, severely impacting processing efficiency.

[0003] Spatial indexing, such as geohashing, spatially partitions data and aggregates data from geographically similar locations into the same partition, improving data locality. Additionally, an architecture employing an external Shuffle service decouples the storage and management of Shuffle data from the compute nodes, enhancing the resilience and stability of the computing cluster. While these technologies optimize processing performance to some extent, they cannot dynamically adapt to the drastic spatial density variations of forestry data. This results in skew in data-dense areas due to overly coarse partitioning granularity, or inefficiency in sparse areas due to overly fine granularity, which produces numerous small files and tasks. Furthermore, in the Shuffle I/O path, existing implementations do not effectively separate metadata from large binary payloads, so heavy binary objects are carried through the sorting process, wasting computational and I/O resources. The lack of efficient streaming data transfer and collaborative scheduling mechanisms between downstream tasks and the external Shuffle service makes the blocking pull mode unsuitable for large-scale data merging scenarios, easily causing memory pressure and network congestion in downstream tasks and limiting end-to-end processing throughput.

Summary of the Invention

[0004] To improve the processing efficiency of big data in smart forestry, this invention proposes a data lake-based method for processing big data in smart forestry, comprising the following steps:

[0005] During the Shuffle write phase, a forestry data spatial density heatmap is constructed by sampling or analyzing metadata. Based on the heatmap, a multi-level geohash coding strategy is formed to adapt geohash codes of different precision to different density geographic areas. The generated variable-length geohash partition prefix is combined with the original data key to form a composite partition key, and the data is partitioned based on the composite partition key.

[0006] A first metadata buffer and a second binary object buffer are allocated for the write task. Metadata containing composite partition keys is stored in the first metadata buffer that needs to be sorted, and the corresponding forestry binary objects are stored in the second binary object buffer that is not sorted. When either buffer reaches the overflow threshold, the binary objects in the second buffer are sequentially written to the second overflow file. The file identifier, position and length information are obtained and updated to the corresponding metadata entries in the first buffer. Then, the metadata in the first buffer is sorted based on the composite partition key and written to the first overflow file.

[0007] On the external Shuffle service node, a pre-merge process is initiated for each target partition, continuously receiving the first overflow file data blocks from the upstream tasks and multi-way merging them to generate a globally ordered metadata output stream. The downstream merging task pulls the metadata output stream through a streaming channel with a backpressure mechanism, asynchronously pulls forestry binary objects based on the file identifier, position and length information in the stream, and feeds back the consumption rate and buffer status to the external Shuffle service node so that it can dynamically adjust the transmission rate and block size of the metadata output stream.

[0008] In another aspect, the present invention proposes a smart forestry big data processing system based on a data lake, the system comprising the following units:

[0009] The partitioning unit is used in the Shuffle write phase to construct a spatial data density heatmap of forestry data by sampling or analyzing metadata. Based on the heatmap, a multi-level geohash coding strategy is formed to adapt geohash codes of different precision for different density geographic areas. The generated variable-length geohash partitioning prefix is combined with the original data key to form a composite partitioning key, and the data is partitioned based on the composite partitioning key.

[0010] The overflow processing unit is used to allocate a first metadata buffer and a second binary object buffer for the write task. It stores the metadata containing the composite partition key into the first metadata buffer that needs to be sorted, and stores the corresponding forestry binary object into the unsorted second binary object buffer. When either buffer reaches the overflow threshold, the binary objects in the second buffer are written sequentially to the second overflow file. The file identifier, position and length information are obtained and updated to the corresponding metadata entry in the first buffer. Then, the metadata in the first buffer is sorted based on the composite partition key and written to the first overflow file.

[0011] The data output unit is used to initiate a pre-merge process for each target partition on the external Shuffle service node. It continuously receives the first overflow file data blocks from the upstream tasks and multi-way merges them to generate a globally ordered metadata output stream. The downstream merging task pulls the metadata output stream through a streaming channel with a backpressure mechanism, asynchronously pulls forestry binary objects according to the file identifier, position and length information in it, and feeds back the consumption rate and buffer status to the external Shuffle service node so that it can dynamically adjust the transmission rate and block size of the metadata output stream.

[0012] Furthermore, the present invention also proposes a smart forestry big data processing program based on a data lake, which, when executed, implements the method described in the first aspect.

[0013] Compared with the prior art, the present invention has at least the following advantages:

[0014] 1) By constructing a spatial data density heatmap of forestry data and forming a multi-level geohash coding strategy, different precision partitioning schemes are adapted for different geographical regions, which solves the inherent uneven distribution problem of forestry spatial data, significantly alleviates data skew, and achieves load balancing among computing nodes.

[0015] 2) Separating the metadata that needs to be sorted from the large-volume forestry binary objects that do not need to be sorted, and sorting only the lightweight metadata, greatly reduces the computational overhead and I/O burden caused by sorting operations, reduces memory consumption, and improves the efficiency of data writing.

[0016] 3) By pre-merging the globally ordered metadata stream on the external Shuffle service node and having downstream tasks pull the data asynchronously in a streaming manner, large-scale memory buffering in downstream tasks is avoided, computation and data retrieval are effectively overlapped, and data transmission is smoothed through a feedback mechanism, ensuring the stability and high throughput of the data processing link.

Attached Figure Description

[0017] Figure 1 is a flowchart of the first embodiment;

[0018] Figure 2 is a diagram illustrating buffer occupancy and block division;

[0019] Figure 3 is a diagram illustrating the performance comparison of different solutions.

Detailed Implementation

[0020] The technical solutions of the embodiments of this application will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only a part of the embodiments of this application, and not all of them. All other embodiments obtained by those skilled in the art based on the embodiments of this application without creative effort are within the scope of protection of this application.

[0021] In the first embodiment, as shown in the flowchart of Figure 1, a smart forestry big data processing method based on a data lake comprises the following steps:

[0022] Step 1: In the Shuffle write phase, a forestry data spatial density heatmap is constructed by sampling or analyzing metadata. Based on the heatmap, a multi-level geohash coding strategy is formed to adapt geohash codes of different precision to different density geographic areas. The generated variable-length geohash partition prefix is combined with the original data key to form a composite partition key, and the data is partitioned based on the composite partition key.

[0023] The data lake contains a large amount of forestry data, and big data frameworks such as Spark read data from it. The geographic coordinates of the forestry data are sampled by calling the Apache Spark framework's `RDD.sample` method or by using a reservoir sampling algorithm. The GeoTrellis library or the JTS Topology Suite library is then used to divide the geographic extent into a uniform grid, count the number of sample points within each grid cell, and generate a density heatmap in the form of a two-dimensional integer array. The Jenks natural breaks method or equal-interval segmentation is used to divide data density into three or more levels, such as high, medium, and low. A different geohash encoding precision is set for each density level; for example, 7 characters for high-density areas, 6 characters for medium-density areas, and 5 characters for low-density areas, forming a mapping table from density level to geohash precision. Each forestry data record is traversed, and based on the density level of the grid cell containing its geographic coordinates, the corresponding precision is looked up in the mapping table and a variable-length geohash code is generated using the geotools or ch.hsr.geohash libraries. The generated geohash code is concatenated with the original data key using an underscore to form a composite partition key. Finally, a custom partitioner inheriting from org.apache.spark.Partitioner is implemented. In its getPartition method, a hash value is calculated by calling the Java hashCode method on the composite partition key, and the result is taken modulo the total number of partitions to determine the target partition ID of the data record.
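The final partitioning step can be sketched as follows. This is a minimal Python illustration, not the Spark partitioner itself: `java_string_hash` emulates Java's `String.hashCode` for keys made of ordinary (BMP) characters, and `get_partition` is a hypothetical stand-in for the `getPartition` method. Reducing the hash to a non-negative value is an assumption added here, since a partition ID must lie between 0 and the partition count minus one.

```python
def java_string_hash(s: str) -> int:
    """Emulate Java's String.hashCode (31-based polynomial, signed 32-bit)."""
    h = 0
    for ch in s:
        h = (31 * h + ord(ch)) & 0xFFFFFFFF  # keep the low 32 bits
    # Reinterpret as a signed 32-bit integer, as Java would.
    return h - (1 << 32) if h >= (1 << 31) else h


def get_partition(composite_key: str, num_partitions: int) -> int:
    """Map a composite partition key to a partition ID in [0, num_partitions)."""
    # Python's % is non-negative for a positive modulus, so negative hashes
    # need no extra handling here.
    return java_string_hash(composite_key) % num_partitions
```

The same composite key always maps to the same partition, so all records sharing a key are routed to one downstream task.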

[0024] Step 2: Allocate a first metadata buffer and a second binary object buffer for the write task. Store the metadata containing the composite partition key into the first metadata buffer that needs to be sorted, and store the corresponding forestry binary objects into the unsorted second binary object buffer. When either buffer reaches the overflow threshold, write the binary objects in the second buffer sequentially into the second overflow file, obtain their file identifier, position and length information and update the corresponding metadata entries in the first buffer. Then, sort the metadata in the first buffer based on the composite partition key and write it into the first overflow file.

[0025] For each write task, two buffers are allocated in off-heap memory of the JVM. The first buffer stores metadata object instances, and the second buffer stores binary object bytes. Each metadata object contains fields such as the composite partition key, the partition ID, and the offset and length of the binary object in the second buffer; the metadata objects are held in an ArrayList, which serves as the first buffer. Forestry binary objects are appended directly to the second buffer, which is backed by a ByteBuffer. After each write, the number of metadata entries in the first buffer is checked against a preset record threshold, and the number of bytes written to the second buffer is checked against a preset byte threshold. If either threshold is exceeded, an overflow is triggered: the entire contents of the second buffer (i.e., the ByteBuffer) are written sequentially to a temporary binary overflow file with a write call, and the file path, the starting file pointer position of the written data, and the total number of bytes written are recorded. The ArrayList in the first buffer is then traversed, and the offset and length information in each metadata object is updated to its absolute position in the overflow file. Next, a custom Comparator based on the composite partition key is passed in to sort the metadata objects in the ArrayList in place. The sorted ArrayList is serialized using the Kryo serialization framework and written to a separate metadata overflow file via FileChannel.

[0026] Step 3: On the external Shuffle service node, a pre-merge process is started for each target partition. It continuously receives the first overflow file data blocks from the upstream tasks and multi-way merges them to generate a globally ordered metadata output stream. The downstream merge task pulls the metadata output stream through a streaming channel with a backpressure mechanism. It asynchronously pulls the forestry binary objects according to the file identifier, position and length information in the stream, and feeds back the consumption rate and buffer status to the external Shuffle service node so that it can dynamically adjust the transmission rate and block size of the metadata output stream.

[0027] External Shuffle service nodes use the Netty framework for network communication and maintain a pre-merge service thread for each partition. This thread uses a PriorityQueue as the core data structure to implement the K-way merge algorithm: each upstream task's metadata overflow file corresponds to an input stream, and the queue stores the current minimum metadata item of each input stream. The merge thread continuously retrieves the globally minimum metadata item from the head of the priority queue, serializes it, and writes it to a network output buffer. Downstream tasks connect to the service nodes as Netty clients and receive metadata streams through a custom channel handler. Basic backpressure is implemented with the Netty framework's built-in flow control by monitoring changes in the channel's isWritable state: when slow client reception causes the channel to become unwritable, the server automatically pauses sending. After receiving the metadata, the downstream task uses an independent thread pool and AsynchronousFileChannel to issue an asynchronous read request for the specified binary file region, based on the file identifier, position, and length information. Once read, the binary object is placed into an in-memory queue for subsequent processing. At the same time, the downstream task periodically calculates its own processing rate and the occupancy of its input queue, and sends a custom control message containing the current buffer occupancy percentage to the service node through the Netty channel. After receiving this message, the service node's channel handler calls the setWriteBufferWaterMark method to dynamically adjust the high and low watermarks, or directly controls the rate at which data is read from the priority queue, thereby matching the downstream task's consumption capacity.

[0028] In a preferred embodiment, the step of constructing a forestry data spatial density heatmap by sampling or analyzing metadata, and forming a multi-level geohash coding strategy based on the heatmap, specifically involves:

[0029] Set a two-dimensional grid of preset resolution covering the geographic range of the forestry data and initialize the count values of all grid cells. Extract a preset proportion of the forestry data as a sample set and traverse the data points to update the count values of the corresponding grid cells. Convert the count values into density values based on the physical area of the grid cells and normalize them to form a density heatmap. Divide the geographic region into multiple density levels according to preset density thresholds, and then adapt geohash codes of different precision to each region according to its density level. The higher the density of the region, the higher the precision of the geohash code.

[0030] A two-dimensional grid, such as a 1024×1024 grid, covering the entire geographic area of the forestry data is defined, and a counter with a count value of 0 is initialized for each grid cell. Then, 1% to 5% of the total forestry data (in the TB range) is randomly sampled as a sample set. Each data record containing geographic coordinates in the sample set is iterated through, mapping its latitude and longitude to the corresponding cell in the aforementioned two-dimensional grid and incrementing the count value of that cell. After the iteration is complete, the count values are converted into data point density using the actual geographic area of the grid cells. Min-max normalization is performed on the density values of all grid cells, mapping them to the [0,1] interval, thereby generating a visualized density heatmap.
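The counting and normalization described above can be sketched in a few lines of Python. The function name `density_heatmap`, its parameters, and the simplification that every cell has the same physical area (`cell_area_km2`) are assumptions made for illustration; a real implementation would compute the area per cell and handle points outside the bounding box.

```python
def density_heatmap(points, bounds, rows, cols, cell_area_km2=1.0):
    """Build a min-max normalized density grid from sampled (lat, lon) points.

    bounds is (min_lat, min_lon, max_lat, max_lon); all points are assumed
    to lie inside it. Every cell is assumed to cover cell_area_km2.
    """
    min_lat, min_lon, max_lat, max_lon = bounds
    counts = [[0] * cols for _ in range(rows)]
    for lat, lon in points:
        # Map the coordinate to a cell index, clamping the upper edge.
        r = min(int((lat - min_lat) / (max_lat - min_lat) * rows), rows - 1)
        c = min(int((lon - min_lon) / (max_lon - min_lon) * cols), cols - 1)
        counts[r][c] += 1

    # Convert counts to densities (points per square kilometre).
    density = [[n / cell_area_km2 for n in row] for row in counts]

    # Min-max normalize to [0, 1]; guard against a completely flat grid.
    flat = [d for row in density for d in row]
    lo, hi = min(flat), max(flat)
    span = (hi - lo) or 1.0
    return [[(d - lo) / span for d in row] for row in density]
```

For example, three sample points on a 2×2 grid over the unit square, two in one cell and one in another, yield normalized densities of 1.0 and 0.5 for those cells and 0.0 elsewhere.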

[0031] Based on the normalized density values, multiple thresholds are set to divide the geographic region into at least three density levels, such as: low-density areas (density values 0-0.3), medium-density areas (0.3-0.7), and high-density areas (0.7-1.0). Geohash codes of different precision are applied to regions of different levels: for low-density areas, a geohash code of length 4 is used, covering an area of approximately 39km × 20km; for medium-density areas, a code of length 5 is used, covering approximately 5km × 5km; for high-density areas, a code of length 6, covering approximately 1.2km × 0.6km, or a longer code is used. This results in a mapping strategy from geographic coordinates to variable-length geohash code precision, ensuring finer-grained partitioning in data-dense areas to mitigate data skew.
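The example thresholds above amount to a small lookup from normalized density to code length. The sketch below uses the stated values (0.3 and 0.7, lengths 4, 5, and 6); the function name `geohash_precision` is hypothetical, and assigning the boundary values 0.3 and 0.7 to the higher level is an assumption, since the stated ranges share their endpoints.

```python
def geohash_precision(density: float) -> int:
    """Return the geohash code length for a normalized density in [0, 1]."""
    if not 0.0 <= density <= 1.0:
        raise ValueError("density must be normalized to [0, 1]")
    if density < 0.3:
        return 4  # low density: coarse cells
    if density < 0.7:
        return 5  # medium density
    return 6      # high density: fine cells
```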

[0032] In a preferred embodiment, combining the generated variable-length geohash partition prefix with the original data key to form a composite partition key specifically involves:

[0033] For each forestry data record, its original data key is extracted, the geographic coordinates of the data record are obtained, and the encoding precision is determined according to the multi-level geohash encoding strategy. A geohash encoding string with the specified precision is generated for the geographic coordinates as a partition prefix. The partition prefix is concatenated with the original data key using a preset delimiter to generate a composite partition key.

[0034] Take an input forestry data record such as {"tree_id":"T00123", "species":"Pinus sylvestris", "location":{"lat":xx.8584,"lon":y.2945}, "biomass_data":<binary_object>}. Its original data key is extracted, for example "T00123". Its geographic coordinates {"lat":xx.8584, "lon":y.2945} are obtained and looked up in the pre-built multi-level geohash encoding strategy. Assuming the coordinates fall into a high-density area, the strategy requires a geohash code of length 7. A geohash library such as GeoHash-Java is called to encode the coordinates, yielding the partition prefix "u09tvqj". A character that does not appear in either the original data key or the geohash code is chosen as the separator, such as an underscore _ or a vertical bar |. The partition prefix, separator, and original data key are concatenated to generate the composite partition key, such as "u09tvqj_T00123". The composite partition key preserves both spatial proximity and uniqueness.
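To make the prefix generation concrete, the following is a self-contained Python sketch of standard geohash encoding (bit interleaving starting with longitude, base-32 alphabet) plus the key concatenation. It stands in for the geohash library mentioned above rather than reproducing it; the names `geohash_encode` and `composite_key` are hypothetical, and the coordinates in the usage note are a commonly cited geohash test point, not the record from the text.

```python
_BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"  # geohash alphabet (no a, i, l, o)


def geohash_encode(lat: float, lon: float, precision: int) -> str:
    """Encode a coordinate as a geohash string of `precision` characters."""
    lat_rng = [-90.0, 90.0]
    lon_rng = [-180.0, 180.0]
    chars = []
    value, bits = 0, 0
    use_lon = True  # bits alternate: longitude first, then latitude
    while len(chars) < precision:
        rng, coord = (lon_rng, lon) if use_lon else (lat_rng, lat)
        mid = (rng[0] + rng[1]) / 2
        if coord >= mid:
            value = (value << 1) | 1
            rng[0] = mid  # keep the upper half
        else:
            value <<= 1
            rng[1] = mid  # keep the lower half
        use_lon = not use_lon
        bits += 1
        if bits == 5:  # every 5 bits become one base-32 character
            chars.append(_BASE32[value])
            value, bits = 0, 0
    return "".join(chars)


def composite_key(prefix: str, data_key: str, sep: str = "_") -> str:
    """Join the geohash partition prefix and the original data key."""
    if sep in data_key:
        raise ValueError("separator must not occur in the original data key")
    return f"{prefix}{sep}{data_key}"
```

As a usage example, `geohash_encode(57.64911, 10.40744, 5)` gives `"u4pru"`, and `composite_key("u4pru", "T00123")` gives `"u4pru_T00123"`. Because a shorter geohash is a prefix of any longer geohash for the same point, keys from neighbouring locations sort close together.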

[0035] In a preferred embodiment, storing the metadata containing the composite partition key into a first metadata buffer that needs to be sorted, and storing the corresponding forestry binary object into a second binary object buffer that is not sorted, specifically involves:

[0036] Each Shuffle write task initializes a first metadata buffer and a second binary object buffer of preset capacity. The composite partition key and other metadata of the incoming forestry data are serialized and stored as entries in the first metadata buffer. The forestry binary objects in the data records are stored in the second binary object buffer. In each entry of the first metadata buffer, a field is preset to record the corresponding binary object file identifier, location and length information.

[0037] At the start of each write task, two memory buffers are allocated. The first metadata buffer is a sortable, byte-based buffer, such as an ArrayList<byte[]> or a custom memory page manager, with a default capacity typically ranging from 32MB to 128MB, preferably 64MB. The second binary object buffer is a simple ByteArrayOutputStream with a larger capacity, typically 128MB to 512MB, preferably 256MB. For each incoming data record, its metadata (including the composite partition key generated in the previous step, the partition ID, the length of the original value, etc.) is serialized into a compact byte array. The byte array also contains three initially empty reserved fields for later recording the identifier, offset, and length of the binary object in the overflow file; each field is 8 bytes in size, totaling 24 bytes. The serialized metadata entry is added to the first buffer. The forestry binary objects contained in the record, such as high-resolution image tiles and LiDAR point cloud data blocks, are appended to the end of the second buffer.

[0038] In a preferred embodiment, when any buffer reaches the overflow threshold, the binary objects in the second buffer are sequentially written to the second overflow file, their file identifiers, positions, and length information are obtained and updated to the corresponding metadata entries in the first buffer, and then the metadata in the first buffer is sorted based on the composite partition key and written to the first overflow file. Specifically:

[0039] The system continuously monitors the used capacity of the two buffers. When the used capacity of either buffer reaches a preset overflow threshold, it pauses the reception of new data, creates a first overflow file and a second overflow file, writes all binary objects in the second binary object buffer into the second overflow file in the order of reception, records the file identifier, starting offset and byte length of each binary object and updates the corresponding metadata entry in the first metadata buffer. After all information is updated, it sorts all entries in the first metadata buffer based on their composite partition key, and finally writes all sorted entries into the first overflow file.

[0040] The system continuously monitors the fill rate of the two buffers, with a preset overflow threshold of 80% of the buffer capacity. For example, an overflow operation is triggered when the used space of the 64MB first buffer reaches 51.2MB, or the used space of the 256MB second buffer reaches 204.8MB. The task pauses processing new data. The system creates two local overflow files, such as spill_task01_0.meta and spill_task01_0.data. All binary object data in the second buffer is sequentially written to the .data file in the original order in which they were received. During the writing process, the starting offset and byte length of each binary object in the file are recorded. File identifier, offset, length, and other information are filled back into the reserved fields of the corresponding metadata entries in the first buffer. After all metadata entries are updated, a memory sort is performed on all metadata entries in the first buffer, sorted lexicographically by the composite partition key in each entry. The sorted metadata entries are serialized and sequentially written to the .meta file, and both buffers are cleared to prepare for receiving the next batch of data.
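The overflow sequence can be illustrated with a compact Python sketch. It follows the order described above (write blobs sequentially, record their locations, sort only the metadata, write the metadata file), but the helper names `should_spill` and `spill`, the JSON-lines metadata format, and the use of the data file name as the file identifier are assumptions chosen to keep the example short.

```python
import json
from pathlib import Path


def should_spill(used_bytes: int, capacity_bytes: int, threshold_pct: int = 80) -> bool:
    """True once a buffer's fill level reaches the overflow threshold."""
    return used_bytes * 100 >= capacity_bytes * threshold_pct


def spill(records, directory, task_id: str, seq: int):
    """Spill one batch; `records` is a list of (composite_key, blob) in arrival order."""
    directory = Path(directory)
    data_path = directory / f"spill_{task_id}_{seq}.data"
    meta_path = directory / f"spill_{task_id}_{seq}.meta"

    # 1) Write blobs sequentially and unsorted, recording where each one lands.
    entries = []
    offset = 0
    with open(data_path, "wb") as data_file:
        for composite_key, blob in records:
            data_file.write(blob)
            entries.append({"key": composite_key, "file": data_path.name,
                            "offset": offset, "length": len(blob)})
            offset += len(blob)

    # 2) Sort only the lightweight metadata, lexicographically by composite key.
    entries.sort(key=lambda e: e["key"])

    # 3) Write the sorted metadata to its own file, one entry per line.
    with open(meta_path, "w", encoding="utf-8") as meta_file:
        for entry in entries:
            meta_file.write(json.dumps(entry) + "\n")
    return meta_path, data_path
```

The binary objects are never moved during sorting: each sorted metadata entry simply points back to the byte range of its blob in the `.data` file.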

[0041] In a preferred embodiment, the continuous receiving and multi-way merging of the first overflow file data blocks from the upstream tasks to generate a globally ordered metadata output stream specifically involves:

[0042] Create N input streams for N upstream tasks and initialize a min-heap data structure sorted by composite partition key. Read the first metadata entry from each input stream and insert it into the min-heap along with its source stream identifier. Loop through the min-heap to retrieve the metadata entry with the smallest composite partition key from the top of the min-heap and write it into the globally ordered metadata output stream. Then read the next metadata entry from the source input stream of the retrieved entry and insert it into the min-heap until the min-heap is empty and all input streams have been exhausted.

[0043] Suppose an external Shuffle service node needs to merge metadata overflow .meta files from 10 upstream write tasks. The node creates an input stream for each of the 10 files. A min-heap of size 10 is initialized, with element comparison logic based on the lexicographical order of the composite partition keys in the metadata entries. When the merge process starts, it reads the first metadata entry from each input stream and packages it into an object along with the source stream identifier (e.g., indices 0-9), inserting the 10 objects into the min-heap. A loop then begins: the process extracts the metadata entry with the global minimum composite partition key from the top of the min-heap and writes it to an output stream, which becomes the globally ordered metadata stream. The process reads the next metadata entry from the corresponding input stream based on the source stream identifier accompanying the extracted entry. If the input stream is not exhausted, the newly read entry, along with its source identifier, is inserted back into the min-heap. This loop continues until all input streams have been read and the min-heap is empty, resulting in a globally ordered output stream of all upstream metadata.
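The merge loop described above maps directly onto a min-heap. The sketch below uses Python's `heapq`; it assumes each input stream is an iterable of metadata entries (dicts with a `"key"` field) that is already sorted by composite partition key, and the name `k_way_merge` is hypothetical.

```python
import heapq


def k_way_merge(streams):
    """Yield metadata entries from pre-sorted streams in global key order."""
    iterators = [iter(s) for s in streams]
    heap = []
    # Seed the heap with the first entry of every non-empty stream.
    for idx, it in enumerate(iterators):
        first = next(it, None)
        if first is not None:
            heap.append((first["key"], idx, first))
    heapq.heapify(heap)

    while heap:
        # Pop the entry with the smallest composite key across all streams.
        _, idx, entry = heapq.heappop(heap)
        yield entry
        # Refill from the same source stream, if it has more entries.
        nxt = next(iterators[idx], None)
        if nxt is not None:
            heapq.heappush(heap, (nxt["key"], idx, nxt))
```

The heap never holds more than one entry per input stream, so memory use grows with the number of upstream tasks rather than with the total number of metadata entries. The stream index in each heap tuple breaks ties between equal keys.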

[0044] In a preferred embodiment, the step of feeding back the consumption rate and buffer status to the external Shuffle service node, so that it can dynamically adjust the transmission rate and block size of the metadata output stream, specifically involves:

[0045] Downstream tasks maintain a local receive buffer and periodically send status information containing their data consumption rate and local buffer occupancy to external Shuffle service nodes. External Shuffle service nodes dynamically adjust the upper limit of data transmission rate based on the received data consumption rate and adjust the size of the next data block according to the received buffer occupancy rate. When the occupancy rate is low, the block size is increased, and when the occupancy rate is high, the block size is decreased.

[0046] The downstream merging task maintains a 16MB local receive buffer for the metadata stream. The task embeds a monitoring thread that, every 500ms, calculates the number of metadata records processed in the past 500ms (the consumption rate) and checks the current receive buffer occupancy percentage. The two metrics, such as {"rate": 80000 records/sec, "buffer_fill_ratio": 0.25}, are sent to the upstream external Shuffle service node via a heartbeat mechanism or control channel. Upon receiving the status information, the external Shuffle service node makes two adjustments. First, based on the 80000 records/sec consumption rate, it uses a token bucket algorithm to set its own metadata sending rate limit slightly higher, for example to 88000 records/sec, so that data keeps flowing without being pushed blindly. Second, it adjusts the data block size based on the buffer occupancy of 0.25. Referring to Figure 2, which shows the relationship between buffer occupancy and data block size, one embodiment uses the following adjustment strategy: if the occupancy is below 30%, the data block size for the next transmission is increased from the default 256KB to 512KB to reduce the number of network transmissions; if the occupancy is between 30% and 80%, the 256KB size is maintained; if the occupancy is above 80%, the data block size is reduced to 64KB to reduce the instantaneous processing pressure on downstream tasks and prevent buffer overflow.
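The adjustment rules in this example reduce to two small functions, sketched below in Python. The block sizes and occupancy thresholds are taken from the example; the function names and the 10% headroom (inferred from the 80000 to 88000 records/sec figures) are assumptions, and the token bucket itself is not modeled.

```python
KB = 1024


def next_block_size(buffer_fill_ratio: float) -> int:
    """Choose the next data block size from the downstream buffer occupancy."""
    if buffer_fill_ratio < 0.30:
        return 512 * KB  # downstream has room: send larger blocks
    if buffer_fill_ratio <= 0.80:
        return 256 * KB  # default block size
    return 64 * KB       # downstream is nearly full: send smaller blocks


def next_rate_limit(consumption_rate: int, headroom_pct: int = 10) -> int:
    """Set the sending rate limit slightly above the reported consumption rate."""
    return consumption_rate + consumption_rate * headroom_pct // 100
```

With the reported status `rate = 80000` and `buffer_fill_ratio = 0.25`, these functions return a limit of 88000 records/sec and a 512KB block size, matching the example.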

[0047] The experiment was conducted on a distributed computing cluster with 10 worker nodes, each equipped with a 32-core CPU, 128GB of memory, and a 10 Gigabit Ethernet card. The test dataset consisted of 5TB of mixed forestry remote sensing imagery and lidar point cloud data, with highly uneven geographic distribution. The control group employed a standard hash partitioning strategy, performing uniform Shuffle write, sorting, and read operations on complete data records containing geographic coordinates and binary objects. The experimental group deployed the full solution, using multi-level geohashing encoding based on data density to generate composite partition keys, and enabling metadata and binary object separation buffering, external merging services, and adaptive transmission mechanisms during the Shuffle phase.

[0048] Figure 3 shows the data recorded during the experiment. The total execution time of the job was 3250 seconds in the control group and 1890 seconds in the experimental group. Data skew, calculated as the ratio of the maximum partition size to the average partition size, was 15.8 in the control group and 2.1 in the experimental group. During the Shuffle phase, the average write operation time decreased from 1280 seconds to 650 seconds, and the average read operation time decreased from 950 seconds to 520 seconds. In terms of resource utilization, the peak memory usage of the executor node decreased from 25.6GB to 18.2GB, and the percentage of garbage collection pauses caused by memory pressure decreased from 18% to 7%. The comparative data in Figure 3 indicate that multi-level geohash encoding, through fine-grained partitioning of data hotspots, alleviates data skew and balances the load on downstream tasks, which is key to the nearly 42% reduction in total processing time. At the same time, separating the processing of metadata from that of large binary objects reduces the overhead of sorting operations by an order of magnitude, because only the small metadata entries are sorted; this relieves the CPU and memory burden and is directly reflected in the improved Shuffle write time and garbage collection figures. The combined effect of external merging and adaptive transmission optimizes network data flow, avoids data read bottlenecks, and keeps the Shuffle process smooth and efficient.
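The relative improvements can be recomputed from the figures reported above. The short calculation below uses only the values stated in this paragraph; the helper name `reduction` and the dictionary labels are illustrative.

```python
def reduction(before: float, after: float) -> float:
    """Relative reduction, as a percentage of the control-group value."""
    return (before - after) / before * 100.0


# (control group, experimental group), as reported in the experiment.
reported = {
    "total job time (s)": (3250, 1890),
    "average Shuffle write time (s)": (1280, 650),
    "average Shuffle read time (s)": (950, 520),
    "peak executor memory (GB)": (25.6, 18.2),
    "skew ratio (max / average partition size)": (15.8, 2.1),
}

for name, (control, experimental) in reported.items():
    print(f"{name}: {reduction(control, experimental):.1f}% lower")
```

The total job time works out to a reduction of about 41.8%, consistent with the "nearly 42%" stated in the text.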

[0049] In a second embodiment, the present invention also proposes a smart forestry big data processing system based on a data lake, the system comprising the following units:

[0050] The partitioning unit is used in the Shuffle write phase to construct a spatial data density heatmap of forestry data by sampling or analyzing metadata. Based on the heatmap, a multi-level geohash coding strategy is formed to adapt geohash codes of different precision to geographic areas of different density. The generated variable-length geohash partition prefix is combined with the original data key to form a composite partition key, and the data is partitioned based on the composite partition key.
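The formation of a composite partition key can be sketched as follows. This is an illustrative sketch only: the encoder follows the publicly documented base-32 geohash scheme, while the density thresholds (0.3 and 0.7), the precisions (4, 5 and 6 characters), the "_" delimiter, and all function names are assumptions chosen for the example. The normalized density is passed in directly rather than looked up from a heatmap.

```python
_BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"


def geohash_encode(lat: float, lon: float, precision: int) -> str:
    """Encode a coordinate as a geohash string of `precision` characters."""
    lat_lo, lat_hi = -90.0, 90.0
    lon_lo, lon_hi = -180.0, 180.0
    chars = []
    bits, value = 0, 0
    use_lon = True  # geohash interleaves bits, starting with longitude
    while len(chars) < precision:
        if use_lon:
            mid = (lon_lo + lon_hi) / 2
            if lon >= mid:
                value, lon_lo = (value << 1) | 1, mid
            else:
                value, lon_hi = value << 1, mid
        else:
            mid = (lat_lo + lat_hi) / 2
            if lat >= mid:
                value, lat_lo = (value << 1) | 1, mid
            else:
                value, lat_hi = value << 1, mid
        use_lon = not use_lon
        bits += 1
        if bits == 5:  # every 5 bits form one base-32 character
            chars.append(_BASE32[value])
            bits, value = 0, 0
    return "".join(chars)


def precision_for_density(density: float) -> int:
    """Map a normalized density in [0, 1] to a geohash precision (assumed levels)."""
    if density >= 0.7:
        return 6  # dense hotspot: finer cells, more partitions
    if density >= 0.3:
        return 5
    return 4      # sparse area: coarse cells


def composite_key(original_key: str, lat: float, lon: float,
                  density: float, delimiter: str = "_") -> str:
    """Concatenate the variable-length geohash prefix with the original key."""
    prefix = geohash_encode(lat, lon, precision_for_density(density))
    return f"{prefix}{delimiter}{original_key}"


print(composite_key("tile-001", 57.64911, 10.40744, density=0.9))
print(composite_key("tile-001", 57.64911, 10.40744, density=0.1))
```

Because a shorter geohash is always a prefix of a longer one for the same point, records in a sparse area share a coarse prefix, while records in a hotspot are spread across several finer prefixes.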

[0051] The overflow processing unit is used to allocate a first metadata buffer and a second binary object buffer for the write task. It stores the metadata containing the composite partition key into the first metadata buffer that needs to be sorted, and stores the corresponding forestry binary object into the unsorted second binary object buffer. When either buffer reaches the overflow threshold, the binary objects in the second buffer are written sequentially to the second overflow file. The file identifier, position and length information are obtained and updated to the corresponding metadata entry in the first buffer. Then, the metadata in the first buffer is sorted based on the composite partition key and written to the first overflow file.
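The two-step overflow write can be sketched as follows. This is a simplified illustration, not the implementation of the invention: the `MetaEntry` layout, the JSON-lines metadata format, and the `spill` function name are assumptions, in-memory streams stand in for the two overflow files, and threshold monitoring is omitted.

```python
import io
import json
from dataclasses import dataclass
from typing import BinaryIO, List


@dataclass
class MetaEntry:
    key: str            # composite partition key
    file_id: str = ""   # location fields are filled in at overflow time
    offset: int = -1
    length: int = -1


def spill(entries: List[MetaEntry], blobs: List[bytes], file_id: str,
          blob_out: BinaryIO, meta_out: BinaryIO) -> None:
    """Write blobs in arrival order, then write the metadata sorted by key.

    entries[i] is assumed to describe blobs[i].
    """
    # Step 1: write the binary objects sequentially, unsorted, and record
    # where each one landed in the second overflow file.
    for entry, blob in zip(entries, blobs):
        entry.file_id = file_id
        entry.offset = blob_out.tell()
        entry.length = len(blob)
        blob_out.write(blob)
    # Step 2: sort only the small metadata entries and write them to the
    # first overflow file.
    for entry in sorted(entries, key=lambda e: e.key):
        row = [entry.key, entry.file_id, entry.offset, entry.length]
        meta_out.write(json.dumps(row).encode("utf-8") + b"\n")


blob_file, meta_file = io.BytesIO(), io.BytesIO()
entries = [MetaEntry("u4pruy_b"), MetaEntry("u4pr_a")]
spill(entries, [b"LIDAR-POINTS", b"IMG"], "spill-0.bin", blob_file, meta_file)
print(meta_file.getvalue().decode("utf-8"))
```

Only the metadata entries take part in the sort; each binary object is written once and is later located through its recorded offset and length.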

[0052] The data output unit is used to initiate a pre-merge process for each target partition on the external Shuffle service node. The process continuously receives the first overflow file data blocks from the upstream tasks and performs a multi-way merge on them to generate a globally ordered metadata output stream. The downstream merging task pulls the metadata output stream through a streaming channel with a backpressure mechanism, asynchronously pulls forestry binary objects according to the file identifier, position and length information carried in the stream, and feeds back its consumption rate and buffer status to the external Shuffle service node so that the node can dynamically adjust the transmission rate and block size of the metadata output stream.
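The multi-way merge performed by the pre-merge process can be sketched with a min-heap. This is a minimal sketch under stated assumptions: the `Entry` tuple layout and the function name `merge_sorted_streams` are hypothetical, each input stream is assumed to be already sorted by composite partition key, and neither network transfer nor backpressure is modeled.

```python
import heapq
from typing import Iterable, Iterator, List, Tuple

# (composite partition key, file identifier, offset, length)
Entry = Tuple[str, str, int, int]


def merge_sorted_streams(streams: List[Iterable[Entry]]) -> Iterator[Entry]:
    """Yield entries from N key-sorted streams in global key order."""
    iterators = [iter(stream) for stream in streams]
    heap = []
    # Seed the heap with the first entry of every non-empty stream.
    for stream_id, it in enumerate(iterators):
        first = next(it, None)
        if first is not None:
            heap.append((first[0], stream_id, first))
    heapq.heapify(heap)
    while heap:
        # Pop the smallest key, emit it, then refill from the same stream.
        _, stream_id, entry = heapq.heappop(heap)
        yield entry
        following = next(iterators[stream_id], None)
        if following is not None:
            heapq.heappush(heap, (following[0], stream_id, following))


task_a = [("u4pr_a", "a.bin", 0, 3), ("u4pruy_c", "a.bin", 3, 5)]
task_b = [("u4pr_b", "b.bin", 0, 4), ("u4pruz_d", "b.bin", 4, 2)]
for merged in merge_sorted_streams([task_a, task_b]):
    print(merged)
```

The heap holds at most one entry per upstream task, so memory use grows with the number of input streams rather than with the number of records. In Python, the standard library function `heapq.merge` provides equivalent behavior for already sorted inputs.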

[0053] In a third embodiment, the present invention also proposes a smart forestry big data processing program based on a data lake, which, when executed by a processor, implements the method described in the first embodiment.

[0054] It should be noted that, for the sake of simplicity, the foregoing method embodiments are all described as a series of actions. However, those skilled in the art should understand that the embodiments in this specification are not limited to the described order of actions, because according to the embodiments in this specification, some steps can be performed in other orders or simultaneously. Furthermore, those skilled in the art should also understand that the embodiments described in this specification are all preferred embodiments, and the actions and modules involved are not necessarily essential to the embodiments in this specification.

[0055] In the above embodiments, the descriptions of each embodiment have different focuses. For parts not described in detail in a certain embodiment, please refer to the relevant descriptions of other embodiments.

[0056] The preferred embodiments disclosed above are merely illustrative of this specification. The optional embodiments do not exhaustively describe all details, nor do they limit the invention to the specific implementations described. Clearly, many modifications and variations can be made based on the embodiments described herein. These embodiments are selected and specifically described in this specification to better explain the principles and practical applications of the embodiments, thereby enabling those skilled in the art to better understand and utilize this specification. This specification is limited only by the claims and their full scope and equivalents.

Claims

1. A smart forestry big data processing method based on a data lake, characterized in that it includes the following steps: During the Shuffle write phase, a spatial data density heatmap of forestry data is constructed by sampling or analyzing metadata. Based on the heatmap, a multi-level geohash coding strategy is formed to adapt geohash codes of different precision to geographic areas of different density. The generated variable-length geohash partition prefix is combined with the original data key to form a composite partition key, and the data is partitioned based on the composite partition key. A first metadata buffer and a second binary object buffer are allocated for the write task. Metadata containing the composite partition key is stored in the first metadata buffer, which is to be sorted, and the corresponding forestry binary objects are stored in the second binary object buffer, which is not sorted. When either buffer reaches the overflow threshold, the binary objects in the second buffer are sequentially written to the second overflow file. Their file identifier, position and length information are obtained and updated in the corresponding metadata entries in the first buffer. Then, the metadata in the first buffer is sorted based on the composite partition key and written to the first overflow file. On the external Shuffle service node, a pre-merge process is started for each target partition, which continuously receives the first overflow file data blocks from the upstream tasks and performs a multi-way merge on them to generate a globally ordered metadata output stream. The downstream merging task pulls the metadata output stream through a streaming channel with a backpressure mechanism. It asynchronously pulls forestry binary objects based on the file identifier, position and length information in the stream, and feeds back the consumption rate and buffer status to the external Shuffle service node so that the node can dynamically adjust the transmission rate and block size of the metadata output stream.

2. The method according to claim 1, characterized in that the process of constructing a spatial data density heatmap of forestry data by sampling or analyzing metadata, and forming a multi-level geohash coding strategy based on the heatmap, is specifically as follows: A two-dimensional grid of preset resolution covering the geographic range of the forestry data is set up, and the count values of all grid cells are initialized. A preset proportion of the forestry data is extracted as a sample set, and its data points are traversed to update the count values of the corresponding grid cells. The count values are converted into density values based on the physical area of the grid cells and normalized to form the density heatmap. The geographic region is divided into multiple density levels according to preset density thresholds, and geohash codes of different precision are adapted to each region according to its density level: the higher the density of a region, the higher the precision of its geohash code.

3. The method according to claim 1, characterized in that the process of combining the generated variable-length geohash partition prefix with the original data key to form a composite partition key is as follows: For each forestry data record, its original data key is extracted, the geographic coordinates of the data record are obtained, and the encoding precision is determined according to the multi-level geohash coding strategy. A geohash string of the determined precision is generated for the geographic coordinates as the partition prefix. The partition prefix is concatenated with the original data key using a preset delimiter to generate the composite partition key.

4. The method according to claim 1, characterized in that the step of storing metadata containing the composite partition key in the first metadata buffer, which is to be sorted, and storing the corresponding forestry binary objects in the second binary object buffer, which is not sorted, specifically involves: Each Shuffle write task initializes a first metadata buffer and a second binary object buffer of preset capacity. The composite partition key and other metadata of the incoming forestry data are serialized and stored as entries in the first metadata buffer. The forestry binary objects in the data records are stored in the second binary object buffer. In each entry of the first metadata buffer, fields are reserved to record the file identifier, position and length information of the corresponding binary object.

5. The method according to claim 1, characterized in that the step of, when either buffer reaches the overflow threshold, sequentially writing the binary objects in the second buffer to the second overflow file, obtaining their file identifier, position and length information and updating it in the corresponding metadata entries of the first buffer, and then sorting the metadata in the first buffer based on the composite partition key and writing it to the first overflow file, is specifically as follows: The used capacity of the two buffers is continuously monitored. When the used capacity of either buffer reaches a preset overflow threshold, the reception of new data is paused and a first overflow file and a second overflow file are created. All binary objects in the second binary object buffer are written to the second overflow file in the order of reception, and the file identifier, starting offset and byte length of each binary object are recorded and updated in the corresponding metadata entry in the first metadata buffer. After all entries have been updated, all entries in the first metadata buffer are sorted by their composite partition key, and the sorted entries are finally written to the first overflow file.

6. The method according to claim 1, characterized in that the process of continuously receiving the first overflow file data blocks from the upstream tasks and performing a multi-way merge on them to generate a globally ordered metadata output stream is as follows: N input streams are created for the N upstream tasks, and a min-heap ordered by composite partition key is initialized. The first metadata entry is read from each input stream and inserted into the min-heap together with its source stream identifier. The metadata entry with the smallest composite partition key is then repeatedly removed from the top of the min-heap and written to the globally ordered metadata output stream, after which the next metadata entry is read from the source input stream of the removed entry and inserted into the min-heap. This continues until the min-heap is empty and all input streams have been exhausted.

7. The method according to claim 1, characterized in that the step of feeding back the consumption rate and buffer status to the external Shuffle service node, so that the node can dynamically adjust the transmission rate and block size of the metadata output stream, specifically involves: The downstream task maintains a local receive buffer and periodically sends status information containing its data consumption rate and local buffer occupancy to the external Shuffle service node. The external Shuffle service node dynamically adjusts the upper limit of the data transmission rate based on the received data consumption rate and adjusts the size of the next data block according to the received buffer occupancy. When the occupancy is low, the block size is increased, and when the occupancy is high, the block size is decreased.

8. A smart forestry big data processing system based on a data lake, characterized in that the system includes the following units: The partitioning unit is used in the Shuffle write phase to construct a spatial data density heatmap of forestry data by sampling or analyzing metadata. Based on the heatmap, a multi-level geohash coding strategy is formed to adapt geohash codes of different precision to geographic areas of different density. The generated variable-length geohash partition prefix is combined with the original data key to form a composite partition key, and the data is partitioned based on the composite partition key. The overflow processing unit is used to allocate a first metadata buffer and a second binary object buffer for the write task. It stores the metadata containing the composite partition key in the first metadata buffer, which is to be sorted, and stores the corresponding forestry binary objects in the second binary object buffer, which is not sorted. When either buffer reaches the overflow threshold, the binary objects in the second buffer are written sequentially to the second overflow file. Their file identifier, position and length information are obtained and updated in the corresponding metadata entries in the first buffer. Then, the metadata in the first buffer is sorted based on the composite partition key and written to the first overflow file. The data output unit is used to start a pre-merge process for each target partition on an external Shuffle service node, which continuously receives the first overflow file data blocks from the upstream tasks and performs a multi-way merge on them to generate a globally ordered metadata output stream. The downstream merging task pulls the metadata output stream through a streaming channel with a backpressure mechanism. It asynchronously pulls forestry binary objects based on the file identifier, position and length information in the stream, and feeds back the consumption rate and buffer status to the external Shuffle service node so that the node can dynamically adjust the transmission rate and block size of the metadata output stream.

9. The system according to claim 8, characterized in that the process of constructing a spatial data density heatmap of forestry data by sampling or analyzing metadata, and forming a multi-level geohash coding strategy based on the heatmap, is specifically as follows: A two-dimensional grid of preset resolution covering the geographic range of the forestry data is set up, and the count values of all grid cells are initialized. A preset proportion of the forestry data is extracted as a sample set, and its data points are traversed to update the count values of the corresponding grid cells. The count values are converted into density values based on the physical area of the grid cells and normalized to form the density heatmap. The geographic region is divided into multiple density levels according to preset density thresholds, and geohash codes of different precision are adapted to each region according to its density level: the higher the density of a region, the higher the precision of its geohash code.

10. The system according to claim 8, characterized in that the process of combining the generated variable-length geohash partition prefix with the original data key to form a composite partition key is as follows: For each forestry data record, its original data key is extracted, the geographic coordinates of the data record are obtained, and the encoding precision is determined according to the multi-level geohash coding strategy. A geohash string of the determined precision is generated for the geographic coordinates as the partition prefix. The partition prefix is concatenated with the original data key using a preset delimiter to generate the composite partition key.