Distributed-based data processing method, device, apparatus and system
By dividing the source data into tile grids and performing extended copy processing in a distributed computing framework, the problems of slow processing speed and low accuracy of third-party source data in existing technologies are solved, achieving efficient and accurate data processing and map updates.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- NAVINFO
- Filing Date
- 2021-12-06
- Publication Date
- 2026-06-26
Figure CN116226299B_ABST
Abstract
Description
Technical Field
[0001] This application relates to the field of map technology, and in particular to a distributed data processing method, apparatus, device and system.
Background Technology
[0002] In the production of navigation map data, producing point of information (POI) address data is crucial because these addresses cover large areas such as residential communities and compounds. Traditional production relies primarily on manual street-by-street field collection of POI data in each area, followed by manual cleaning, verification, and entry into the database to keep the parent database fresh.
[0003] However, both the manual collection stage and the manual verification and storage stage consume a lot of time and manpower. Moreover, because the operation takes so long, the collected point address data is no longer fresh by the time it is stored. With the rapid development of computer technology, generating and storing massive amounts of data has become possible. Using third-party source data related to map data in place of manual field collection has the following advantages: (1) it greatly reduces manpower costs; (2) it greatly reduces time costs; (3) it yields more data.
[0004] Currently, the processing of third-party source data mainly relies on database script operations. While this can alleviate the difficulty of manual processing to some extent, it still struggles to guarantee speed, complexity, and accuracy when dealing with tens of millions of data points. Therefore, existing technologies cannot process source data conveniently and accurately.
Summary of the Invention
[0005] This application provides a distributed data processing method, apparatus, device, and system to overcome the problem that existing technologies cannot conveniently and accurately process source data.
[0006] In a first aspect, embodiments of this application provide a distributed data processing method, including:
[0007] Acquire source data related to the map and divide the source data into tile grids within different tiles;
[0008] Based on each tile, a first target tile is determined, and the first adjacent edge data corresponding to each first adjacent tile adjacent to the first target tile is determined from the source data in the first target tile. The first adjacent edge data is then expanded into the corresponding first adjacent tile to obtain each tile to be deduplicated.
[0009] The target data is obtained through distributed computing and is used to update the map.
[0010] In a second aspect, embodiments of this application provide a distributed data processing apparatus, comprising:
[0011] The first data processing module is used to acquire source data related to the map and divide the source data into tile grids within different tiles;
[0012] The second data processing module is used to determine the first target tile based on each tile, and to determine the first adjacent edge data corresponding to each first adjacent tile adjacent to the first target tile from the source data in the first target tile, and to expand the first adjacent edge data into the corresponding first adjacent tile to obtain each tile to be deduplicated.
[0013] The third data processing module is used to obtain target data through distributed computing, and the target data is used to update the map.
[0014] In a third aspect, embodiments of this application provide a distributed data processing device, comprising: at least one processor and a memory;
[0015] The memory stores computer-executable instructions;
[0016] The at least one processor executes the computer-executable instructions stored in the memory, causing the at least one processor to perform the distributed data processing method described in the first aspect and various possible designs of the first aspect.
[0017] In a fourth aspect, embodiments of this application provide a map update system, including: a source data collection terminal and a source data processing cloud platform; wherein the source data processing cloud platform includes a processing device, a map update module, and a map distribution module;
[0018] The source data collection terminal is used to acquire map-related source data and transmit the source data to the processing device;
[0019] The processing device is used to divide the source data into tile grids within different tiles; determine a first target tile based on each tile, and determine the first adjacent edge data corresponding to each first adjacent tile adjacent to the first target tile from the source data within the first target tile; expand the first adjacent edge data into the corresponding first adjacent tiles to obtain each tile to be deduplicated; obtain target data through distributed computing, and transmit the target data to the map update module;
[0020] The map update module is used to update the map according to the target data and transmit the updated map to the map distribution module;
[0021] The map distribution module is used to distribute the updated map to the corresponding terminal devices.
[0022] The distributed data processing method, apparatus, device, and system provided in this embodiment first acquire map-related source data and divide the source data into tile grids within different tiles. Then, based on each tile, a first target tile is determined, and the first adjacent edge data corresponding to each first adjacent tile adjacent to the first target tile is determined from the source data within the first target tile. The first adjacent edge data is then expanded into the corresponding first adjacent tiles to obtain each tile to be deduplicated. Finally, target data is obtained through distributed computing to update the map. By dividing the data into different tile grids and expanding and copying data between adjacent tiles, a distributed framework is used for deduplication comparison, which greatly improves the processing efficiency and accuracy for massive amounts of data.
Attached Figure Description
[0023] To more clearly illustrate the technical solutions in the embodiments of this application or the prior art, the drawings used in the description of the embodiments or the prior art will be briefly introduced below. Obviously, the drawings described below are some embodiments of this application. For those skilled in the art, other drawings can be obtained based on these drawings without creative effort.
[0024] Figure 1 is a schematic diagram of a scenario of the distributed data processing method provided in an embodiment of this application;
[0025] Figure 2 is a flowchart of a distributed data processing method provided in an embodiment of this application;
[0026] Figure 3 is a schematic diagram of the tile and boundary definitions provided in an embodiment of this application;
[0027] Figure 4 is a schematic diagram of the source data deduplication process provided in an embodiment of this application;
[0028] Figure 5 is a schematic diagram of the gridded road network of the parent database provided in an embodiment of this application;
[0029] Figure 6 is a schematic diagram of the data matching process provided in an embodiment of this application;
[0030] Figure 7 is a flowchart of a distributed data processing method provided in another embodiment of this application;
[0031] Figure 8 is a schematic structural diagram of a distributed data processing device provided in an embodiment of this application;
[0032] Figure 9 is a schematic structural diagram of a distributed data processing device provided in an embodiment of this application.
Detailed Implementation
[0033] To make the objectives, technical solutions, and advantages of the embodiments of this application clearer, the technical solutions of the embodiments of this application will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some embodiments of this application, not all embodiments. Based on the embodiments of this application, all other embodiments obtained by those skilled in the art without creative effort are within the scope of protection of this application.
[0034] The terms “first,” “second,” “third,” “fourth,” etc. (if present) in the specification, claims, and accompanying drawings of this application are used to distinguish similar objects and are not necessarily used to describe a particular order or sequence. It should be understood that such data can be interchanged where appropriate so that embodiments of the application described herein can be implemented, for example, in orders other than those illustrated or described herein. Furthermore, the terms “comprising” and “having,” and any variations thereof, are intended to cover a non-exclusive inclusion; for example, a process, method, system, product, or apparatus that comprises a series of steps or units is not necessarily limited to those steps or units explicitly listed, but may include other steps or units not explicitly listed or inherent to such processes, methods, products, or apparatus.
[0035] Currently, the processing of third-party source data mainly relies on database script operations. While this can alleviate the difficulty of manual processing to some extent, it still struggles to guarantee speed, complexity, and accuracy when dealing with tens of millions of data points. Therefore, existing technologies cannot process source data conveniently and accurately.
[0036] To address the problems of existing technologies, this application's technical concept is to use the Hadoop distributed computing framework for distributed computing, thereby significantly improving processing efficiency. To make the source data suitable for distributed computing, the source data is divided into different tile grids, and the data from these grids is distributed to multiple servers. Then, on each server, deduplication and differencing are performed on each tile within a defined range. Furthermore, to ensure accuracy, when comparing adjacent tile edges, the data at the boundary is copied to adjacent tiles by expanding the tile range. After comparing each tile individually, overall processing is performed. Then, by overlaying the tile grids of the source data (mainly containing address data) and the tile grids of the original road data (i.e., the parent database data), road information within a defined range surrounding the source data is obtained, thereby achieving map updates.
[0037] In practical applications, refer to Figure 1, which is a schematic diagram of a scenario of the distributed data processing method provided in an embodiment of this application. The executing entity of this application can be a data processing device (e.g., a distributed data processing device). Here, the data processing device 10 can be a server cluster, including a main server 101 and multiple auxiliary servers 102 (two auxiliary servers are taken as an example). The main server 101 obtains map-related source data, such as address data and coordinates, from a third party; each auxiliary server performs deduplication and differencing on the data within each tile.
[0038] Specifically, the main server obtains map-related source data from a third party and preprocesses it. Then, the preprocessed full data is randomly distributed to multiple auxiliary servers. Each auxiliary server uses the Mercator calculation model to divide the data into different tile grids based on the coordinates of the source data. Then, within a defined range, each tile on each auxiliary server undergoes deduplication and differencing. The allocation rule can be based on server resource availability; the same tile is processed by the same server. If that server becomes idle, it can continue processing other tiles, allowing one server to handle multiple tiles. As long as a server has available resources and there are still tasks to be completed, it will continue processing.
[0039] For example, auxiliary server A is responsible for processing the data of tile A, and auxiliary server B is responsible for processing the data of tile B. However, because the distribution is random, the tile A data held on server A may not be all of the data in tile A; some of it may be on server B. Therefore, when server A processes tile A, it needs to retrieve the tile A data allocated to server B so that it can process the data of the entire tile A.
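The regrouping described above can be sketched as follows. This is only an illustration under stated assumptions: the function names `shuffle_by_tile` and `assign_tiles` are hypothetical, the records are assumed to already carry a tile key, and the round-robin assignment merely stands in for the resource-based allocation rule mentioned in the text. In a Hadoop job the framework's shuffle phase would normally perform this grouping by key.

```python
from collections import defaultdict


def shuffle_by_tile(partitions):
    """Regroup randomly distributed records so that each tile's data is together.

    `partitions` is a list with one entry per auxiliary server; each entry is a
    list of (tile_key, record) pairs held by that server after random distribution.
    """
    by_tile = defaultdict(list)
    for partition in partitions:
        for tile_key, record in partition:
            by_tile[tile_key].append(record)
    return dict(by_tile)


def assign_tiles(tile_keys, n_servers):
    """Assign every tile to exactly one server.

    Round-robin is an assumption made for this sketch; the text only says the
    rule can be based on server resource availability.
    """
    return {key: i % n_servers for i, key in enumerate(sorted(tile_keys))}


# Server 0 holds part of tile A and all of tile B; server 1 holds the rest of tile A.
partitions = [[("A", "rec1"), ("B", "rec2")], [("A", "rec3")]]
grouped = shuffle_by_tile(partitions)
owners = assign_tiles(grouped.keys(), n_servers=2)
```

After the regrouping, `grouped["A"]` contains the records from both servers, so whichever server owns tile A sees the whole tile.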
[0040] By using distributed computing, the data of each tile grid is deduplicated to obtain new source data. Then, the new source data is differentially matched with the parent database data to obtain the difference result, which is the target data, used to update the map. Due to the use of distributed computing, the processing efficiency is greatly improved, and it replaces manual processing and script operations, thus ensuring accuracy.
[0041] The technical solutions of this application will be described in detail below with specific embodiments. The following specific embodiments can be combined with each other, and the same or similar concepts or processes may not be described again in some embodiments.
[0042] Figure 2 is a flowchart of a distributed data processing method provided in an embodiment of this application. The method may include:
[0043] S201. Obtain source data related to the map and divide the source data into tile grids within different tiles.
[0044] The source data here can include address data and the location coordinates of the address (i.e., the coordinates of the source data). The source data can be provided by a third party. After the main server in the data processing equipment obtains the source data from the third party, it can first preprocess the source data and then randomly distribute the preprocessed source data to multiple auxiliary servers. Then, according to the coordinates of the source data, tiles are divided, and the source data is distributed in the tile grid.
[0045] S202. Based on each tile, determine the first target tile, and determine the first adjacent edge data corresponding to each first adjacent tile adjacent to the first target tile from the source data in the first target tile. Expand the first adjacent edge data into the corresponding first adjacent tile to obtain each tile to be deduplicated.
[0046] The first target tile here can be a tile that is adjacent to a preset number of first adjacent tiles.
[0047] The data in each tile to be deduplicated includes the data in the first adjacent tile and the first adjacent edge data.
[0048] In this embodiment, each tile can be used as a first target tile. For each first target tile, the data at the edges shared with adjacent tiles needs to be processed: first, the data lying near the adjacent tiles is identified from the source data within the first target tile; then, by expanding the tile buffer (i.e., the range), the data at the boundary (i.e., adjacent edge data, such as the first adjacent edge data) is copied to the adjacent tiles. Figure 3 shows the tile and boundary definitions. If deduplication only involves comparing one type of data with itself, then only the tile within the dashed box in Figure 3 needs to be considered, with its buffer (i.e., range) expanded in four directions. The tile within the dashed box is the tile in the center of Figure 3.
[0049] S203. Obtain the target data through distributed computing.
[0050] The target data can be obtained through the following steps:
[0051] Step a1: Through distributed computing, deduplication is performed on the data of the tile grids in each tile to be deduplicated to obtain new source data.
[0052] In this data processing device, each auxiliary server uses the Mercator computation model to divide the data into different tile grids based on the coordinates of the source data. Then, each auxiliary server performs deduplication and differencing within a limited range on each tile to obtain the new source data.
[0053] Step a2: Based on the source data, obtain parent database data within a first preset range, and perform differential deduplication on the new source data and the parent database data to obtain target data, which is used to update the map.
[0054] The target data includes address data, and the method may further include: updating the address data on the map based on the target data.
[0055] In this embodiment, the first preset range can be within 5 meters. Specifically, the parent database road data (i.e., parent database data) within 5 meters is obtained based on the coordinates of the source data. Then, the superimposed data of the parent database data and the source data is subjected to differential deduplication operation using the methods described in S202 to S203 above. Finally, the target data is obtained through similarity matching to update the map.
[0056] The distributed data processing method provided in this application acquires map-related source data and divides it into tile grids within different tile areas. Then, based on each tile, a first target tile is determined. From the source data within the first target tile, the first adjacent edge data corresponding to each first adjacent tile adjacent to the first target tile is determined. This first adjacent edge data is then expanded into the corresponding first adjacent tiles to obtain each tile to be deduplicated. The first target tile is the tile adjacent to a preset number of first adjacent tiles. Finally, through distributed computation, target data is obtained to update the map. By dividing the data into different tile grids and expanding and copying data between adjacent tiles, a distributed framework is used for deduplication comparison, significantly improving the processing efficiency and accuracy for massive amounts of data.
[0057] In one possible design, this embodiment, based on the above embodiments, provides a detailed description of the distributed data processing method. The source data includes address data and coordinates; after acquiring the map-related source data, the distributed data processing method can further be implemented through the following steps:
[0058] Step a11: Extract the address data and coordinates from the source data.
[0059] Step a12: Based on the address data and coordinates, classify and split the source data to obtain the point house number type, the point house number address name, and the point house number building number.
[0060] The point house number type, address name, and building number are used to support the deduplication operation.
[0061] In this embodiment, the source data mainly comes from the unmatched source data in POI popularity mining and other third-party source data, and fields such as name, address, coordinates, and type are extracted. The specific preprocessing process is as follows:
[0062] (1) Filter data with types 120201 and 120202; (2) Calculate administrative division information based on coordinates; (3) Convert full-width characters to half-width characters; (4) Delete invalid character information such as "boarding point" and "alighting point" in the name; (5) Delete provincial, municipal, and district information in the name according to the administrative division information.
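Preprocessing steps (3) to (5) above can be sketched as follows. This is a minimal illustration, not the method's actual code: the function names are hypothetical, the invalid tokens and administrative names are passed in by the caller, and the full-width conversion relies only on the Unicode layout (full-width ASCII variants U+FF01 to U+FF5E sit exactly 0xFEE0 above their half-width forms, and U+3000 is the ideographic space).

```python
def to_half_width(text):
    """Convert full-width ASCII variants and the ideographic space to half-width."""
    out = []
    for ch in text:
        code = ord(ch)
        if code == 0x3000:                 # ideographic space -> ASCII space
            out.append(" ")
        elif 0xFF01 <= code <= 0xFF5E:     # full-width '!'..'~' -> ASCII '!'..'~'
            out.append(chr(code - 0xFEE0))
        else:
            out.append(ch)
    return "".join(out)


def clean_name(name, invalid_tokens, admin_names):
    """Normalize a name, then delete invalid tokens and administrative names.

    `invalid_tokens` (e.g. "boarding point") and `admin_names` (province, city,
    district names derived from the coordinates) are supplied by the caller;
    both lists are assumptions of this sketch.
    """
    name = to_half_width(name)
    for token in list(invalid_tokens) + list(admin_names):
        name = name.replace(token, "")
    return name.strip()


half = to_half_width("ＡＢＣ１２３")
cleaned = clean_name("Beijing Sunny Garden boarding point",
                     invalid_tokens=["boarding point", "alighting point"],
                     admin_names=["Beijing"])
```

Normalizing the width first means the later token and regular-expression matching only has to handle one form of each character.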
[0063] The specific process of preprocessing the classification and splitting of the source data is as follows:
[0064] Classification: Combine the custom point house number type division rules and use regular expressions to calculate the category of the name. The definition of the point house number type is shown in Table 1 below:
[0065] Table 1 Definition of Point House Number Type
[0066]
[0067]
[0068] Regular expression for name splitting: "[\\-~-A-Za-z0-9]+(building|residential building|unit|number|phase|building|block|seat)+.*". The name is split into two parts: M_DPR_NAME (i.e., the point house number address name) and M_DP_NAME (i.e., the point house number building number). The specific splitting rules are as follows:
[0069] 1) If a substring can be matched, determine the splitting position (i.e., split_idx) according to the starting position of any pattern of the matched substring, and take the string before the split_idx position as M_DPR_NAME;
[0070] 2) Match M_DPR_NAME against the regular expression "(first|A|B)+$". If any substring can be matched, update split_idx according to the starting position of that substring, and take the string before the split_idx position as the updated M_DPR_NAME;
[0071] 3) Take the string after the split_idx position as M_DP_NAME. If M_DP_NAME starts with "first", delete the leading "first".
[0072] 4) If no substring is matched, the name is used as M_DPR_NAME, and M_DP_NAME is empty. See Table 2.
[0073] Table 2 Examples of Name Splitting
[0074]
[0075] In one possible design, this embodiment provides a detailed description of S201 based on the above embodiment. Dividing the source data into tile grids within different tiles can be achieved through the following steps:
[0076] Step b1: Based on the coordinates of the source data, calculate the Mercator coordinates of the source data using the Mercator projection, and calculate the coordinates of the source data within the tile.
[0077] Step b2: Use the Mercator coordinates of the source data as the tile number, and use the coordinates of the source data within the tile as the tile grid number within the tile.
[0078] Step b3: Map the source data to different tile grids according to the tile number and the tile grid number within the tile.
[0079] In this embodiment, records with completely identical M_DPR_NAME and M_DP_NAME within a 2-kilometer range are treated as duplicates, and only one of them, chosen at random, is retained. Since the amount of source data is large, reaching millions of records, conventional range-limited deduplication would be extremely time-consuming. Therefore, a tile-splitting method is used for distributed deduplication. During deduplication, the main server can first randomly distribute the entire dataset to the auxiliary servers, and each auxiliary server then performs tile partitioning and data processing on the source data. The specific process of tile partitioning can be as follows:
[0080] Based on the source data coordinates (i.e., the coordinates of the source data, such as the coordinates corresponding to the house number data), the Mercator coordinates (i.e., keyTile, in the form keyTileX_keyTileY) are calculated using the Mercator projection (or Mercator calculation model) and used as the tile number, and the coordinates within the tile (i.e., tileX and tileY) are calculated and used as the grid number within the tile (i.e., the tile grid number).
[0081] For example, using a Mercator computational model with a level of 12 and a tile size of 1024, each tile can be divided into 1024 grids along each axis, with each grid having a side length of about 10 m. A 2 km range therefore corresponds to 200 grids. By calculating the tile number and grid number where the source data coordinates are located, each data point can be mapped to a tile grid.
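A minimal sketch of steps b1 to b3 follows. The text does not give the exact projection formulas, so this sketch assumes the widely used Web Mercator XYZ tiling scheme; the function name `tile_and_grid` and the clamping at the map edge are likewise assumptions.

```python
import math

LEVEL = 12             # Mercator level from the example
GRIDS_PER_TILE = 1024  # grids along each tile axis


def tile_and_grid(lon, lat, level=LEVEL, grids=GRIDS_PER_TILE):
    """Return (keyTile, tileX, tileY) for a WGS84 longitude/latitude in degrees.

    Assumes the standard Web Mercator tiling formulas, valid for latitudes
    within roughly +/-85.05 degrees.
    """
    n = 2 ** level  # number of tiles along each axis at this level
    x = (lon + 180.0) / 360.0 * n
    y = (1.0 - math.asinh(math.tan(math.radians(lat))) / math.pi) / 2.0 * n
    # Clamp so that points on the far edge stay inside the last tile.
    key_x = min(max(int(math.floor(x)), 0), n - 1)
    key_y = min(max(int(math.floor(y)), 0), n - 1)
    tile_x = min(int((x - key_x) * grids), grids - 1)
    tile_y = min(int((y - key_y) * grids), grids - 1)
    return f"{key_x}_{key_y}", tile_x, tile_y


origin = tile_and_grid(0.0, 0.0)
key, tx, ty = tile_and_grid(116.4, 39.9)
```

Under this assumption, a level-12 tile spans about 40075 km / 4096 ≈ 9.8 km at the equator, so each of the 1024 grids per axis is roughly 9.6 m wide there, which is consistent with the approximate 10 m grid and the 200-grid equivalent of 2 km quoted above. The ground size of a grid shrinks at higher latitudes.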
[0082] In one possible design, this embodiment, based on the above embodiments, explains in detail how the first adjacent edge data (i.e., the data to be expanded) is determined. This can be achieved through the following steps:
[0083] Step c1: Determine each first adjacent tile from the tiles adjacent to the first target tile.
[0084] Step c2: For each piece of data within the first target tile, determine whether the data is expanded data by comparing the grid number where the data is located with a preset threshold, taking into account the tile number of the first target tile.
[0085] Step c3: If the data is expanded data, determine the first adjacent tile into which it is to be expanded; the expanded data is the first adjacent edge data.
[0086] In this embodiment, determining each first adjacent tile from the tiles adjacent to the first target tile can involve finding tiles containing a class of data adjacent to the first target tile. For cases where adjacent tiles need to be compared and deduplicated within a limited range, the data at the boundary is copied to the adjacent tiles by expanding the tile buffer.
[0087] For example, as shown in Figure 3, since this deduplication only compares one type of data with itself, the tile within the dashed box in Figure 3 only needs to expand its buffer (i.e., range) in four directions: the tiles above it, at its upper left, to its left, and at its lower left are the first adjacent tiles. In Figure 3, the tile numbered keyTileX_keyTileY-1 is adjacent above, the tile numbered keyTileX-1_keyTileY-1 is adjacent at the upper left corner, the tile numbered keyTileX-1_keyTileY is adjacent on the left, and the tile numbered keyTileX-1_keyTileY+1 is adjacent at the lower left corner.
[0088] Specifically, 1) If the tile number keyTileX of the tile where the data is located is greater than 0, and the grid number (i.e., the tile grid number) tileX <= 200, that is, the data lies in the area to the left of the left vertical line in the middle tile (the buffer area), then the data in this area needs to be copied (i.e., expanded) to the adjacent tile on the left for deduplication. The number of grids in a tile, 1024, is added to tileX to give the grid number in the buffer of the left adjacent tile, which extends that tile's range to the right.
[0089] 2) If the tile number keyTileY of the tile where the data is located is greater than 0, and the grid number tileY <= 200 (i.e., the data lies in the area above the upper horizontal line in the middle tile, which serves as the buffer area), then the data in this area needs to be copied to the adjacent tile above the middle tile for deduplication. The number of grids in a tile, 1024, is added to tileY to give the grid number in the buffer of the upper adjacent tile, which extends that tile's range downward.
[0090] 3) If the tile number keyTileX of the tile where the data is located is greater than 0, and the grid numbers satisfy tileX <= 200 and tileY >= 1024 - 200, that is, the data lies in the lower left corner area where the left vertical line and the lower horizontal line intersect in the middle tile (the buffer area), then the data in this area needs to be copied to the adjacent tile at the lower left corner for deduplication. The number of grids in a tile, 1024, is added to tileX and subtracted from tileY to give the grid number in the buffer of the lower-left adjacent tile, which extends that tile's range toward the upper right.
[0091] 4) If the tile numbers keyTileX and keyTileY are both greater than 0, and the grid numbers satisfy tileX <= 200 and tileY <= 200 (i.e., the data lies in the upper left corner area where the left vertical line intersects the upper horizontal line in the middle tile, which serves as the buffer area), then the data in this area needs to be copied to the adjacent tile at the upper left corner for deduplication. The number of grids in a tile, 1024, is added to both tileX and tileY to give the grid number in the buffer of the upper-left adjacent tile, which extends that tile's range toward the lower right.
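The four expansion rules above can be sketched as a single function. The conditions and offsets follow the text as written; the function name, the tuple layout of the result, and the constant names are assumptions of this sketch.

```python
GRIDS = 1024   # grids along each tile axis
BUFFER = 200   # buffer width in grids (about 2 km)


def expand_to_neighbors(key_x, key_y, tile_x, tile_y):
    """Return [((neighborKeyX, neighborKeyY), (newTileX, newTileY)), ...].

    Each entry says into which adjacent tile the record is copied and which
    grid number it receives there, following rules 1) to 4) in the text.
    """
    copies = []
    # 1) Left buffer: copy to the left neighbor, shifted right by one tile width.
    if key_x > 0 and tile_x <= BUFFER:
        copies.append(((key_x - 1, key_y), (tile_x + GRIDS, tile_y)))
    # 2) Upper buffer: copy to the neighbor above, shifted down by one tile height.
    if key_y > 0 and tile_y <= BUFFER:
        copies.append(((key_x, key_y - 1), (tile_x, tile_y + GRIDS)))
    # 3) Lower-left corner: copy to the lower-left neighbor.
    if key_x > 0 and tile_x <= BUFFER and tile_y >= GRIDS - BUFFER:
        copies.append(((key_x - 1, key_y + 1), (tile_x + GRIDS, tile_y - GRIDS)))
    # 4) Upper-left corner: copy to the upper-left neighbor.
    if key_x > 0 and key_y > 0 and tile_x <= BUFFER and tile_y <= BUFFER:
        copies.append(((key_x - 1, key_y - 1), (tile_x + GRIDS, tile_y + GRIDS)))
    return copies


corner = expand_to_neighbors(5, 5, 100, 100)   # near the upper-left corner
interior = expand_to_neighbors(5, 5, 500, 500)  # far from every buffer
lower_left = expand_to_neighbors(5, 5, 100, 900)
```

Because a copied record keeps a grid number expressed in the receiving tile's own coordinate system (values above 1023 or below 0), the later comparison of grid-number differences works unchanged across the tile boundary.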
[0092] In one possible design, this embodiment, based on the above embodiment, provides a detailed explanation of how to remove duplicates. By using distributed computing to remove duplicate data from the tile grids within each tile to be deduplicated, and obtaining new source data, this can be achieved through the following steps:
[0093] Step d1: For each tile to be deduplicated, perform the following steps: If, within the tile to be deduplicated, two data items have the same name and the difference between the grid numbers of the two data items is less than or equal to the first preset value, then the two data items are determined to be duplicated, and the two data items are marked as duplicates.
[0094] Step d2: Perform overall deduplication on each tile to be deduplicated according to the markings to obtain new source data.
[0095] In this embodiment, deduplication and marking are performed within each tile. If two data entries have identical names and the differences in their grid numbers (i.e., diffTileX and diffTileY) do not exceed 200, the entries are considered duplicates. The specific deduplication and marking rules are as follows (see Figure 4, which illustrates the source data deduplication process, where nobuf-list represents the list of data outside the buffer area, buf-list represents the list of data within the buffer area, and buf_re-list represents the list of duplicated buf entries):
[0096] a. If data within the buffer area (i.e., buf) is duplicated with data outside the buffer area (i.e., nobuf), then mark the duplicate data buf within the buffer area as buf_re;
[0097] b. If the data nobuf outside the buffer area is not repeated in the non-buffer area, but is repeated with the data buf in the buffer area, then mark the data outside the buffer area as not repeated, and mark the data buf in the buffer area as repeated as buf_re;
[0098] c. If the data buf within the buffer area is not repeated within the buffer area, then it is marked as non-repeating data within the buffer area;
[0099] d. If data outside the buffer area is repeated within the buffer area, it is marked as repeated data outside the buffer area;
[0100] e. Retain data marked as non-repeating in the non-buffer area and data marked as repeating in the buffer area.
[0101] Then, for the data retained by the deduplication markers in all tiles, deduplication is performed based on the markers as a whole. Combined with the unique identifier (UUID) of the data marked as duplicates within the buffer area, the data with the UUID is deleted from the full dataset, thus completing the deduplication process. For example, for the marked data in all tiles and the buffer area, the IDs of these data are found, and data with these IDs are deleted from the full dataset.
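Steps d1 and d2 can be sketched as follows. This is a simplified reading of rules a to e, not the exact procedure: it keeps one record per duplicate group among the non-buffer data, marks buffer records that duplicate a kept record as buf_re, and finally deletes the buf_re UUIDs from the union of kept records. The `Record` type, the single `name` field (standing in for the M_DPR_NAME and M_DP_NAME pair), and the function names are assumptions.

```python
from dataclasses import dataclass

MAX_DIFF = 200  # first preset value: maximum grid-number difference


@dataclass(frozen=True)
class Record:
    uuid: str
    name: str        # stands in for the (M_DPR_NAME, M_DP_NAME) pair
    tile_x: int
    tile_y: int
    in_buffer: bool  # True if the record was copied in from an adjacent tile


def is_duplicate(a, b):
    """Same name and grid numbers within MAX_DIFF on both axes."""
    return (a.name == b.name
            and abs(a.tile_x - b.tile_x) <= MAX_DIFF
            and abs(a.tile_y - b.tile_y) <= MAX_DIFF)


def mark_tile(records):
    """Step d1 for one tile: return (kept non-buffer UUIDs, buf_re UUIDs)."""
    kept = []
    for r in records:
        if not r.in_buffer and not any(is_duplicate(r, k) for k in kept):
            kept.append(r)
    buf_re = [r.uuid for r in records
              if r.in_buffer and any(is_duplicate(r, k) for k in kept)]
    return [k.uuid for k in kept], buf_re


def merge_tiles(tile_results):
    """Step d2: union of kept UUIDs minus every UUID marked buf_re in any tile."""
    kept, buf_re = set(), set()
    for tile_kept, tile_buf_re in tile_results:
        kept.update(tile_kept)
        buf_re.update(tile_buf_re)
    return kept - buf_re


# Record p sits near the left edge of tile A and is copied into tile B's buffer,
# where it duplicates record q.
tile_b = [Record("q", "X", 1000, 500, False), Record("p", "X", 1074, 500, True)]
tile_a = [Record("p", "X", 50, 500, False)]
survivors = merge_tiles([mark_tile(tile_b), mark_tile(tile_a)])
```

In this example only `q` survives: `p` is kept inside its own tile but is removed globally because its copy was marked buf_re in the neighboring tile. The pairwise comparison is quadratic per tile, which is acceptable only because each tile holds a bounded share of the data.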
[0102] In one possible design, S204 can be achieved through the following steps:
[0103] Step e1: Based on the coordinates of the new source data and the coordinates of the parent database data, determine the tile number and grid number where the new source data and the parent database data are located, respectively, so as to map the new source data and the parent database data to the tile grids within different tiles.
[0104] Step e2: Using either the new source data or the parent database data as data for defining the scope expansion, determine the second target tile from each tile mapped to the new source data and the parent database data.
[0105] Step e3: Obtain the second adjacent edge data corresponding to the data used for expanding the defined range from the second target tile, and expand the second adjacent edge data into the corresponding second adjacent tile, where the second adjacent tile is the tile adjacent to the second target tile.
[0106] Step e4: Through distributed computing, within a second preset range, differential matching is performed on the data of the tile grid within each tile to which the new source data and the parent database data are mapped, to obtain target data. The target data is used to represent data that differs from the parent database data.
[0107] In this embodiment, the parent database road data (i.e., parent database data) within a first preset range (e.g., within 5 meters) is first obtained based on the source data coordinates. The specific steps are as follows:
[0108] 1) Figure 5 is a schematic diagram of the gridded road data in the parent database. The parent database road data is gridded with the same Mercator computation model, at level 12 with a tile size of 1024. Road coordinate points are assigned to tile grids to generate grid strings. Because the grid strings are sparse, the Bresenham algorithm is used to interpolate between consecutive grids in sequence, and the grids are then expanded based on attributes such as road level and number of lanes. As a result, a single tile grid may be covered by multiple roads, so each grid records information on every road that falls on it. In Figure 5, which shows the mapping of road coordinates to a tile grid, curves represent road lines, circles (or points) represent road coordinate points, grid I represents a grid containing a road coordinate point, and grid II represents an interpolated grid.
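The interpolation of a sparse grid string can be sketched with the integer form of the Bresenham line algorithm. The sketch assumes that each grid is addressed by a pair of global integer grid coordinates, so that a road crossing a tile boundary needs no special handling; the names `bresenham`, `densify`, and `index_roads` are illustrative only. The subsequent expansion by road level and number of lanes is not shown, because the expansion rule is not given here.

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple

Cell = Tuple[int, int]


def bresenham(a: Cell, b: Cell) -> List[Cell]:
    """Return every grid cell on the line from a to b, endpoints included."""
    x0, y0 = a
    x1, y1 = b
    dx, dy = abs(x1 - x0), -abs(y1 - y0)
    sx = 1 if x0 < x1 else -1
    sy = 1 if y0 < y1 else -1
    err = dx + dy
    cells = []
    while True:
        cells.append((x0, y0))
        if x0 == x1 and y0 == y1:
            return cells
        e2 = 2 * err
        if e2 >= dy:  # step along x
            err += dy
            x0 += sx
        if e2 <= dx:  # step along y
            err += dx
            y0 += sy


def densify(grid_string: List[Cell]) -> List[Cell]:
    """Fill the gaps between consecutive cells of a sparse grid string."""
    if not grid_string:
        return []
    dense = [grid_string[0]]
    for a, b in zip(grid_string, grid_string[1:]):
        dense.extend(bresenham(a, b)[1:])  # skip the shared start cell
    return dense


def index_roads(roads: Dict[str, List[Cell]]) -> Dict[Cell, Set[str]]:
    """Record, for each grid cell, every road that passes through it."""
    index: Dict[Cell, Set[str]] = defaultdict(set)
    for road_id, grid_string in roads.items():
        for cell in densify(grid_string):
            index[cell].add(road_id)
    return index
```

Storing a set of road identifiers per cell mirrors the statement that one grid may be covered by several roads.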
[0109] 2) Based on the tile and grid number where the source data is located, obtain the road information in the corresponding tile grid, calculate the shortest distance between the source data coordinate point and the road coordinate string, and if the distance is less than 5 meters, record the ID of the corresponding road.
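The distance test in this step can be sketched as a point-to-polyline computation. The sketch assumes that the point and the road coordinate string have already been projected into a planar coordinate system measured in meters and that each road has at least one vertex; the function names and the `candidate_roads` mapping (road ID to vertex list, taken from the matching tile grid) are hypothetical.

```python
import math
from typing import Dict, List, Sequence, Tuple

Point = Tuple[float, float]


def point_segment_distance(p: Point, a: Point, b: Point) -> float:
    """Shortest distance from point p to the segment a-b."""
    px, py = p
    ax, ay = a
    bx, by = b
    dx, dy = bx - ax, by - ay
    seg_len_sq = dx * dx + dy * dy
    if seg_len_sq == 0.0:  # degenerate segment: a and b coincide
        return math.hypot(px - ax, py - ay)
    # Project p onto the segment and clamp the projection to its ends.
    t = ((px - ax) * dx + (py - ay) * dy) / seg_len_sq
    t = max(0.0, min(1.0, t))
    return math.hypot(px - (ax + t * dx), py - (ay + t * dy))


def point_polyline_distance(p: Point, line: Sequence[Point]) -> float:
    """Shortest distance from p to a non-empty road coordinate string."""
    if len(line) == 1:
        return point_segment_distance(p, line[0], line[0])
    return min(point_segment_distance(p, a, b) for a, b in zip(line, line[1:]))


def roads_within(p: Point, candidate_roads: Dict[str, Sequence[Point]],
                 limit_m: float = 5.0) -> List[str]:
    """Return the IDs of candidate roads closer than limit_m to p."""
    return [road_id for road_id, line in candidate_roads.items()
            if point_polyline_distance(p, line) < limit_m]
```

Only the roads recorded in the source point's tile grid need to be passed in, which is what keeps the exact distance computation cheap.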
[0110] Before performing differential matching between the new source data and the parent database data, preprocessing can be performed:
[0111] (1) Obtain the PID, DPR_NAME (the address name of the point house number in the parent database), DP_NAME (the building number of the point house number in the parent database), TYPE (the type of the point house number in the parent database), and coordinate information from the point house number table (IX_POINTADDRESS) in the parent database. For records whose DPR_NAME is empty, take the DPR_NAME field from the corresponding record in the point house number sub-table of the parent database and use it instead; (2) Process the M_DPR_NAME and M_DP_NAME of the source data and the DPR_NAME and DP_NAME of the parent database data as follows:
[0112] 1) Convert full-width characters to half-width characters; 2) If the end is "No." or "Building", remove "No." and "Building"; 3) If the number starts with "0", remove the leading "0"; 4) Delete the words "Community" and "Residential Community"; 5) If DP_NAME starts with the character "Di", delete this character; 6) Convert Chinese numerals to Arabic numerals.
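Rules 1) to 5) can be sketched as a single normalization function. The tokens below ("No.", "Building", "Community", "Residential Community", "Di") are the translated stand-ins used in this text, not the original-language strings, so they are placeholders to be replaced in practice. Rule 6) is omitted from the sketch, and removing leading zeros from every digit run is one possible reading of rule 3).

```python
import re
import unicodedata

# Placeholder tokens taken from the translated text; assumptions, not real data.
SUFFIXES = ("No.", "Building")
REMOVED_WORDS = ("Residential Community", "Community")  # longer phrase first
ORDINAL_PREFIX = "Di"


def normalize_name(text: str, is_dp_name: bool = False) -> str:
    """Apply rules 1) to 5) to an address name or building number."""
    # Rule 1: NFKC folds full-width letters, digits and punctuation to half-width.
    text = unicodedata.normalize("NFKC", text).strip()
    # Rule 4: delete the community words wherever they occur.
    for word in REMOVED_WORDS:
        text = text.replace(word, "")
    # Rule 2: remove one trailing suffix.
    for suffix in SUFFIXES:
        if text.endswith(suffix):
            text = text[: -len(suffix)]
            break
    # Rule 5: only DP_NAME loses the ordinal prefix.
    if is_dp_name and text.startswith(ORDINAL_PREFIX):
        text = text[len(ORDINAL_PREFIX):]
    # Rule 3: strip leading zeros from each run of digits ("007" -> "7").
    text = re.sub(r"(?<!\d)0+(?=\d)", "", text)
    return text.strip()
```

Applying the same function to both the source fields and the parent database fields is what makes the later string comparison meaningful.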
[0113] The specific process of differential matching is as follows:
[0114] Within a range of 2 kilometers, the concatenation M_DPR_NAME + M_DP_NAME of the deduplicated source data (i.e., the new source data) is compared with the concatenation DPR_NAME + DP_NAME of the parent database data for differential matching. Since the source data reaches the million level and the parent database point house number data reaches the ten-million level, matching within a limited range with traditional methods is very time-consuming and resource-intensive. Here, a Mercator computation model with a level of 12 and a tile size of 1024 is used: the data is partitioned by computing tile and grid numbers, and the matching is then computed in a distributed manner.
[0115] The specific steps are as follows:
[0116] (1) Calculate, from the coordinates, the tile and grid numbers where the source data and the parent database point house number data are located. Where matching within a limited range is required across adjacent tile boundaries, copy the data at the boundaries to the adjacent tiles by expanding the tile buffer. The buffer is expanded differently here than in deduplication: because two types of data are compared, only one type of data needs to be expanded, but in 8 directions (as shown in Figure 3). Here, the parent database address data is selected for buffer expansion. The expansion methods for the left, top, lower-left, and upper-left tiles were explained in the source data deduplication process and are not repeated here; please refer to the source data deduplication method. Only the remaining four directions are listed below, namely the right, bottom, lower-right, and upper-right tiles (in Figure 3: the tile below, keyTileX_keyTileY+1; the upper-right tile, keyTileX+1_keyTileY-1; the right tile, keyTileX+1_keyTileY; and the lower-right tile, keyTileX+1_keyTileY+1).
[0117] 1) If the grid number tileX of the data (taking the parent database address data as an example) is greater than or equal to 1024 - 200, the data lies to the right of the right vertical line in the middle tile (the buffer area) and must be copied to the adjacent tile on the right for matching. Subtract the number of grids in a tile (1024) from tileX to obtain the grid number of the copy in the right-hand tile, which falls in the buffer extending leftward from that tile.
[0118] 2) If the grid number tileY of the data is greater than or equal to 1024 - 200, the data lies below the lower horizontal line in the middle tile (the buffer area) and must be copied to the adjacent tile below for matching. Subtract the number of grids in a tile (1024) from tileY to obtain the grid number of the copy in the tile below, which falls in the buffer extending upward from that tile.
[0119] 3) If the tile number keyTileY of the data is greater than 0, tileX >= 1024 - 200, and tileY <= 200, the data lies in the upper-right corner area where the upper horizontal line and the right vertical line intersect in the middle tile (the buffer area) and must be copied to the adjacent upper-right tile for matching. Subtract the number of grids in a tile (1024) from tileX and add it to tileY to obtain the grid number of the copy in the upper-right tile, which falls in the buffer extending from that tile's lower-left corner.
[0120] 4) If tileX >= 1024 - 200 and tileY >= 1024 - 200, the data lies in the lower-right corner area where the lower horizontal line and the right vertical line intersect in the middle tile (the buffer area) and must be copied to the adjacent lower-right tile for matching. Subtract the number of grids in a tile (1024) from both tileX and tileY to obtain the grid number of the copy in the lower-right tile, which falls in the buffer extending from that tile's upper-left corner.
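The four rules above can be sketched as one function that returns the buffered copies of a record. It uses the constants 1024 and 200 and the "keyTileX_keyTileY" key form from the text; the function name and the returned tuple layout are illustrative assumptions. The left, top, lower-left, and upper-left directions are described with deduplication and are not covered here.

```python
from typing import List, Tuple

GRIDS = 1024   # grids along one side of a tile
BUFFER = 200   # buffer width in grids


def expand_right_and_down(key_tile_x: int, key_tile_y: int,
                          tile_x: int, tile_y: int) -> List[Tuple[str, int, int]]:
    """Return (target tile key, tileX, tileY) for each buffered copy of a record."""
    right = tile_x >= GRIDS - BUFFER
    bottom = tile_y >= GRIDS - BUFFER
    top = tile_y <= BUFFER
    copies = []
    if right:                               # rule 1: right neighbor
        copies.append((f"{key_tile_x + 1}_{key_tile_y}", tile_x - GRIDS, tile_y))
    if bottom:                              # rule 2: neighbor below
        copies.append((f"{key_tile_x}_{key_tile_y + 1}", tile_x, tile_y - GRIDS))
    if key_tile_y > 0 and right and top:    # rule 3: upper-right neighbor
        copies.append((f"{key_tile_x + 1}_{key_tile_y - 1}",
                       tile_x - GRIDS, tile_y + GRIDS))
    if right and bottom:                    # rule 4: lower-right neighbor
        copies.append((f"{key_tile_x + 1}_{key_tile_y + 1}",
                       tile_x - GRIDS, tile_y - GRIDS))
    return copies
```

For example, a record at grid (1000, 1010) of tile 10_5 is copied to tiles 11_5, 10_6, and 11_6 with grid numbers (-24, 1010), (1000, -14), and (-24, -14). Because the copies keep grid numbers relative to the receiving tile, the later range check remains a plain subtraction.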
[0121] (2) Within a defined range within the tile, compare the source data and the parent database data, calculate the similarity score, and provide a matching description. The matching rules are as follows (see the flowchart of the data matching process shown in Figure 6):
[0122] 1) If the difference between the grid numbers of the source data and the parent database data, diffTileX and diffTileY, does not exceed 200, then compare the source data's M_DPR_NAME+M_DP_NAME (abbreviated as m_name) with the parent database data's DPR_NAME+DP_NAME (abbreviated as name);
[0123] 2) If both m_name and name are not empty and are exactly the same, the similarity score is 1, and the matching description is "complete match";
[0124] 3) If m_name and name are both non-empty and not completely identical, calculate the string similarity using the Levenshtein edit distance and use it as the similarity score;
[0125] 4) Extract the numeric sequences from m_name and name respectively. If one of m_name and name contains the other as a character string and the numeric sequences are exactly the same, the matching description is "inclusion";
[0126] 5) If numeric sequences can be extracted from both m_name and name, and one numeric sequence contains the other or the two are completely identical, the matching description is "approximate match";
[0127] 6) In other cases, the matching description is "no match".
[0128] If one source house number is compared with multiple parent database house numbers, the parent database house number with the highest similarity score is selected as the matching data.
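Rules 2) to 6) and the best-candidate selection can be sketched as follows. Several points are assumptions: the similarity score is taken as 1 minus the edit distance divided by the longer string length, the numeric sequence is taken as the concatenation of all digits, an empty name yields a score of 0, and the rules are tested in the order listed. Rule 1), the grid-number range check, is assumed to have been applied before the candidates are passed in.

```python
import re
from typing import List, Optional, Tuple


def levenshtein(a: str, b: str) -> int:
    """Edit distance via the standard two-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def similarity(a: str, b: str) -> float:
    """Assumed normalization: 1 - distance / length of the longer string."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


def digits(text: str) -> str:
    """Assumed form of the numeric sequence: all digits, concatenated."""
    return "".join(re.findall(r"\d+", text))


def match(m_name: str, name: str) -> Tuple[float, str]:
    """Return (similarity score, matching description) for one pair."""
    if not m_name or not name:
        return 0.0, "no match"
    if m_name == name:
        return 1.0, "complete match"
    score = similarity(m_name, name)
    m_num, num = digits(m_name), digits(name)
    if (m_name in name or name in m_name) and m_num == num:
        return score, "inclusion"
    if m_num and num and (m_num in num or num in m_num):
        return score, "approximate match"
    return score, "no match"


def best_match(m_name: str, candidates: List[str]) -> Optional[Tuple[str, float, str]]:
    """Pick the candidate with the highest similarity score, if any."""
    scored = [(name, *match(m_name, name)) for name in candidates]
    return max(scored, key=lambda row: row[1], default=None)
```

Returning the description together with the score lets the later difference selection filter on both.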
[0129] (3) Select data from the final matching results that have no roads within 5 meters, are described as "no match", or are described as "approximate match" with a similarity score of less than 0.7 as the difference data between the source data and the parent database address data. Update the map based on the difference data.
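The selection rule can be written as a small predicate. The sketch reads the three conditions as alternatives, any one of which marks a record as difference data; the function name and argument layout are hypothetical.

```python
from typing import Sequence


def is_difference(road_ids: Sequence[str], description: str, score: float,
                  threshold: float = 0.7) -> bool:
    """Decide whether a matched source record counts as difference data."""
    if not road_ids:                      # no road within 5 meters
        return True
    if description == "no match":
        return True
    return description == "approximate match" and score < threshold
```

For example, a record with one nearby road, description "approximate match", and score 0.65 is kept as difference data, whereas the same record with score 0.8 is not.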
[0130] Figure 7 is a flowchart illustrating a distributed data processing method provided in another embodiment of this application. In this embodiment, the source data is obtained, preprocessed, classified and split, and then deduplicated. The parent database data is then obtained and preprocessed for matching. Finally, differential matching is performed between the deduplicated source data and the parent database data to obtain the difference data.
[0131] This application first employs a Mercator tile grid to break down massive source data into smaller parts, enabling distributed deduplication and differentiation. By comparing within a limited range, it abandons the traditional method of calculating distance using coordinates, directly defining a rectangular range through the difference in grid numbers. Then, using the Mercator tile grid, roads are mapped to multiple continuous grids, and the buffer is expanded according to road attributes. By overlaying grids, it becomes possible to find lines around specified points on a distributed framework.
[0132] Specifically, this application uses a Mercator computation model with a level of 12 and a tile size of 1024. Each tile is divided into 1024 grids along each side, and each grid has a side length of about 10 m, so a 2 km range corresponds to 200 grids. By calculating the tile and grid number of the source data coordinates, each data point is mapped to a tile grid. During distributed comparison, data is first compared tile by tile within the defined range and then processed as a whole. When adjacent tiles must be compared within the defined range, the data at the boundary is copied to the adjacent tiles by expanding the tile buffer. Deduplication compares one type of data with itself and only requires expansion in four directions (left, top, lower left, and upper left). Differentiation involves two types of data and requires one type of data to be expanded in eight directions. The range comparison of existing technologies is converted into a comparison of grid numbers, which turns coordinate distance calculation into simple integer subtraction. Road coordinates are assigned to tile grids to generate grid strings. Because the grid strings are sparse, the Bresenham algorithm is used to interpolate between consecutive grids of each grid string in sequence, and the grids are then expanded based on attributes such as road level and number of lanes. As a result, a single tile grid may be covered by multiple roads, so each grid records information about every road within it. By comparing the tile grid numbers of the source data with those of the roads, the road information surrounding the source data is easily obtained.
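The grid arithmetic in this paragraph can be checked with a short calculation. The sketch assumes the common Web Mercator tiling, in which level 12 splits the world into 2^12 tiles per axis and the ground size of a grid shrinks with the cosine of the latitude; under that assumption the equatorial grid is about 9.55 m, which the text rounds to a nominal 10 m. The function names are illustrative.

```python
import math

EARTH_CIRCUMFERENCE_M = 2 * math.pi * 6378137.0  # WGS84 equatorial radius


def grid_size_m(level: int = 12, grids_per_tile: int = 1024,
                lat_deg: float = 0.0) -> float:
    """Ground length of one grid side under the Web Mercator assumption."""
    cells_per_axis = 2 ** level * grids_per_tile
    return EARTH_CIRCUMFERENCE_M * math.cos(math.radians(lat_deg)) / cells_per_axis


def within_range(ax: int, ay: int, bx: int, by: int, max_cells: int = 200) -> bool:
    """Rectangular range test: integer subtraction instead of a distance formula."""
    return abs(ax - bx) <= max_cells and abs(ay - by) <= max_cells
```

With the nominal 10 m grid, 2000 m / 10 m = 200 grids, which is the bound used by `within_range`. Two records at grids (100, 100) and (300, 50) pass the test, while (100, 100) and (301, 100) do not.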
[0133] Therefore, this application significantly improves the processing efficiency for massive datasets by dividing data into different tile grids and copying data between adjacent tiles using an expanded buffer, thereby employing a distributed framework for deduplication comparison. By refining the data processing rules, the difference results are made more accurate. Using structured data processing code instead of traditional processing scripts simplifies subsequent problem tracking, code improvement, and maintenance.
[0134] To implement the aforementioned distributed data processing method, this embodiment provides a distributed data processing apparatus. Figure 8 is a schematic diagram of the structure of a distributed data processing device provided in an embodiment of this application. As shown in Figure 8, the distributed data processing device 80 includes: a first data processing module 801, used to acquire source data related to the map and divide the source data into tile grids within different tiles; a second data processing module 802, used to determine a first target tile based on each tile, and to determine the first adjacent edge data corresponding to each first adjacent tile adjacent to the first target tile from the source data within the first target tile, and to expand the first adjacent edge data into the corresponding first adjacent tiles to obtain each tile to be deduplicated; and a third data processing module 803, used to obtain target data through distributed computing, the target data being used to update the map.
[0135] In this embodiment, a first data processing module 801, a second data processing module 802, and a third data processing module 803 are configured to acquire source data related to the map, divide the source data into tile grids within different tiles, determine a first target tile based on each tile, and identify the first adjacent edge data corresponding to each first adjacent tile adjacent to the first target tile from the source data within the first target tile. The first adjacent edge data is then expanded into the corresponding first adjacent tiles to obtain each tile to be deduplicated. Finally, through distributed computing, the target data is obtained to update the map. By dividing the data into different tile grids and expanding and copying data between adjacent tiles, a distributed framework is used for deduplication comparison, greatly improving the processing efficiency and accuracy for massive amounts of data.
[0136] The apparatus provided in this embodiment can be used to execute the technical solutions of the above method embodiments. Its implementation principle and technical effects are similar, and will not be described again here.
[0137] In one possible design, the third data processing module is specifically used to deduplicate the data of the tile grids within each of the tiles to be deduplicated using distributed computing to obtain new source data; the device further includes a fourth data processing module 804, which is used to obtain parent database data within a first preset range based on the source data, and to perform differential deduplication between the new source data and the parent database data.
[0138] In one possible design, the first data processing module is specifically used for:
[0139] Based on the coordinates of the source data, the Mercator coordinates of the source data are calculated using the Mercator projection formula, and the coordinates of the source data within the tile are also calculated.
[0140] The Mercator coordinates of the source data are used as tile numbers, and the coordinates of the source data within the tile are used as tile grid numbers within the tile.
[0141] Based on the tile number and the tile grid number within the tile, the source data is mapped to different tile grids.
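The mapping from coordinates to a tile number and a grid number can be sketched as follows. The exact projection formula is not restated here, so the sketch assumes the widely used Web Mercator tiling formula with longitude and latitude in degrees; the function name `locate` and the clamping at the edge of the map are illustrative choices.

```python
import math
from typing import Tuple


def locate(lon_deg: float, lat_deg: float, level: int = 12,
           grids: int = 1024) -> Tuple[int, int, int, int]:
    """Return (keyTileX, keyTileY, tileX, tileY) for a longitude/latitude pair."""
    n = 2 ** level
    # Fractional tile coordinates under the assumed Web Mercator tiling.
    x = (lon_deg + 180.0) / 360.0 * n
    y = (1.0 - math.asinh(math.tan(math.radians(lat_deg))) / math.pi) / 2.0 * n
    # Work in global grid units, clamped so edge coordinates stay on the map.
    last = n * grids - 1
    gx = min(max(math.floor(x * grids), 0), last)
    gy = min(max(math.floor(y * grids), 0), last)
    # Integer part is the tile number, remainder is the grid number in the tile.
    key_tile_x, tile_x = divmod(gx, grids)
    key_tile_y, tile_y = divmod(gy, grids)
    return key_tile_x, key_tile_y, tile_x, tile_y
```

For example, `locate(0.0, 0.0)` returns `(2048, 2048, 0, 0)`: the point at zero longitude and latitude sits at the top-left grid of the central tile at level 12.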
[0142] In one possible design, the second data processing module is specifically used for:
[0143] Each first adjacent tile is determined from the tiles adjacent to the first target tile;
[0144] For each piece of data within the first target tile, based on the tile number of the first target tile and the grid number where the data is located, whether the data is expansion data is determined by comparison with a preset threshold.
[0145] If the data is expansion data, the first adjacent tile into which the expansion data is to be expanded is determined according to the position of the expansion data, and the expansion data is the first adjacent edge data.
[0146] In one possible design, the third data processing module is specifically used for:
[0147] For each tile to be deduplicated, the following steps are performed: within the tile to be deduplicated, if two data items within a second preset range have the same name, and the differences between their grid numbers are less than or equal to a first preset value, the two data items are determined to be duplicates and are marked for deduplication.
[0148] Based on the markers, perform overall deduplication on each tile to be deduplicated to obtain new source data.
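The per-tile marking step can be sketched as below. It assumes each record is a dictionary with `uuid`, `name`, `tile_x`, and `tile_y` keys (hypothetical field names) and that the first preset value is supplied by the caller. The sketch only marks both members of a duplicate pair; which copy is finally kept follows the retention rules of the method and is not decided here.

```python
from collections import defaultdict
from typing import Dict, List, Set


def mark_duplicates(records: List[dict], max_diff: int) -> Set[str]:
    """Return UUIDs of same-name records whose grid numbers differ by at most max_diff."""
    by_name: Dict[str, List[dict]] = defaultdict(list)
    for rec in records:
        by_name[rec["name"]].append(rec)

    marked: Set[str] = set()
    for same_name in by_name.values():
        # Only records that share a name are compared pairwise.
        for i, a in enumerate(same_name):
            for b in same_name[i + 1:]:
                if (abs(a["tile_x"] - b["tile_x"]) <= max_diff
                        and abs(a["tile_y"] - b["tile_y"]) <= max_diff):
                    marked.update((a["uuid"], b["uuid"]))
    return marked
```

Each tile can run this function independently, which is what allows the work to be spread across a distributed framework before the markers are merged.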
[0149] In one possible design, the fourth data processing module is specifically used for:
[0150] Based on the coordinates of the new source data and the coordinates of the parent database data, the tile number and grid number of the new source data and the parent database data are determined respectively, so as to map the new source data and the parent database data to the tile grids in different tiles;
[0151] Using either the new source data or the parent database data as data for expanding the scope, a second target tile is determined from each tile mapped to the new source data and the parent database data.
[0152] Obtain the second adjacent edge data corresponding to the data used for expanding the defined range from the second target tile, and expand the second adjacent edge data into the corresponding second adjacent tile, wherein the second adjacent tile is the tile adjacent to the second target tile;
[0153] Through distributed computing, within a second preset range, differential matching is performed on the data of the tile grids within each tile to which the new source data and the parent database data are mapped, to obtain target data, which is used to represent data that differs from the parent database data.
[0154] In one possible design, the source data includes point address data and coordinates; the device may further include: a fifth data processing module; the fifth data processing module is used to: after acquiring the map-related source data, extract the point address data and coordinates from the source data; classify and split the source data according to the point address data and coordinates to obtain point address type, point address name, and point address building number; the point address type, the point address name, and the point address building number are used to support deduplication operations.
[0155] In one possible design, the target data includes address data, and the device may further include: a map update module; the map update module is used to update the address data on the map according to the target data.
[0156] To implement the distributed data processing method, this embodiment provides a map update system, including: a source data collection terminal and a source data processing cloud platform.
[0157] The source data processing cloud platform includes a processing device, a map update module, and a map distribution module.
[0158] The source data collection terminal is used to acquire map-related source data and transmit the source data to the processing device;
[0159] The processing device is used to divide the source data into tile grids within different tiles; determine a first target tile based on each tile, and determine the first adjacent edge data corresponding to each first adjacent tile adjacent to the first target tile from the source data within the first target tile; expand the first adjacent edge data into the corresponding first adjacent tiles to obtain each tile to be deduplicated; obtain target data through distributed computing, and transmit the target data to the map update module;
[0160] The map update module is used to update the map according to the target data and transmit the updated map to the map distribution module;
[0161] The map distribution module is used to distribute the updated map to the corresponding terminal devices.
[0162] The specific implementation of the processing device can refer to the technical solution of the above-described method embodiment with a distributed data processing device as the execution subject.
[0163] The system provided in this embodiment can be used to execute the technical solutions of the above method embodiments. Its implementation principle and technical effect are similar, and will not be described again here.
[0164] To implement the aforementioned distributed data processing method, this embodiment provides a distributed data processing device. Figure 9 is a schematic diagram of the structure of a distributed data processing device provided in an embodiment of this application. As shown in Figure 9, the distributed data processing device of this embodiment includes a processor 901 and a memory 902, wherein the memory 902 is used to store computer-executable instructions, and the processor 901 is used to execute the computer-executable instructions stored in the memory to implement the various steps performed in the above embodiment. For details, please refer to the relevant descriptions in the foregoing method embodiments.
[0165] This application also provides a computer-readable storage medium storing computer-executable instructions. When a processor executes the computer-executable instructions, it implements the distributed data processing method described above.
[0166] This application also provides a computer program product, including a computer program that, when executed by a processor, implements the distributed data processing method described above.
[0167] In the several embodiments provided in this application, it should be understood that the disclosed devices and methods can be implemented in other ways. For example, the device embodiments described above are merely illustrative; for instance, the division of modules is only a logical functional division, and in actual implementation, there may be other division methods. For example, multiple modules may be combined or integrated into another system, or some features may be ignored or not executed. Furthermore, the coupling or direct coupling or communication connection shown or discussed may be indirect coupling or communication connection through some interfaces, devices, or modules, and may be electrical, mechanical, or other forms. Additionally, the functional modules in the various embodiments of this application may be integrated into one processing unit, or each module may exist physically separately, or two or more modules may be integrated into one unit. The above-mentioned modular units can be implemented in hardware or in the form of hardware plus software functional units.
[0168] The integrated modules implemented as software functional modules described above can be stored in a computer-readable storage medium. These software functional modules, stored in a storage medium, include several instructions to cause a computer device (which may be a personal computer, server, or network device, etc.) or processor to execute some steps of the methods described in the various embodiments of this application. It should be understood that the processor may be a Central Processing Unit (CPU), or other general-purpose processors, digital signal processors (DSPs), application-specific integrated circuits (ASICs), etc. A general-purpose processor may be a microprocessor or any conventional processor. The steps of the methods disclosed in this invention can be directly implemented by a hardware processor, or implemented by a combination of hardware and software modules within the processor.
[0169] The memory may include high-speed RAM, and may also include non-volatile memory (NVM), such as at least one disk drive, and may also be a USB flash drive, external hard drive, read-only memory, disk, or optical disc. The bus may be an Industry Standard Architecture (ISA) bus, a Peripheral Component Interconnect (PCI) bus, or an Extended Industry Standard Architecture (EISA) bus. Buses can be categorized as address buses, data buses, control buses, etc. For ease of illustration, the buses in the accompanying drawings are not limited to a single bus or a single type of bus. The aforementioned storage medium can be implemented by any type of volatile or non-volatile storage device or a combination thereof, such as static random access memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic storage, flash memory, disk, or optical disc. The storage medium can be any available medium accessible to general-purpose or special-purpose computers.
[0170] An exemplary storage medium is coupled to a processor, enabling the processor to read information from and write information to the storage medium. Alternatively, the storage medium can be an integral part of the processor. Both the processor and the storage medium can reside in an Application Specific Integrated Circuit (ASIC). Alternatively, the processor and storage medium can exist as discrete components in an electronic device or host device.
[0171] Those skilled in the art will understand that all or part of the steps of the above-described method embodiments can be implemented by hardware related to program instructions. The aforementioned program can be stored in a computer-readable storage medium. When executed, the program performs the steps of the above-described method embodiments; and the aforementioned storage medium includes various media capable of storing program code, such as ROM, RAM, magnetic disks, or optical disks.
[0172] Finally, it should be noted that the above embodiments are only used to illustrate the technical solutions of this application, and are not intended to limit them. Although this application has been described in detail with reference to the foregoing embodiments, those skilled in the art should understand that modifications can still be made to the technical solutions described in the foregoing embodiments, or equivalent substitutions can be made to some or all of the technical features therein. Such modifications or substitutions do not cause the essence of the corresponding technical solutions to deviate from the scope of the technical solutions of the embodiments of this application.
Claims
1. A distributed data processing method, characterized in that it comprises: acquiring source data related to a map and dividing the source data into tile grids within different tiles; determining a first target tile based on each tile, and determining, from the source data within the first target tile, first adjacent edge data located in the edge region of the first target tile and adjacent to at least one first adjacent tile; expanding the first adjacent edge data into the corresponding first adjacent tile to obtain each tile to be deduplicated, wherein the tile to be deduplicated includes its own source data and the first adjacent edge data expanded from the first target tile; and performing, through distributed computing, data deduplication independently in each of the tiles to be deduplicated to obtain target data, the target data being used to update the map.
2. The method according to claim 1, characterized in that, Through distributed computing, data deduplication is performed independently within each of the tiles to be deduplicated, resulting in target data, including: Through distributed computing, the data of the tile grids in each tile to be deduplicated is deduplicated to obtain new source data; Based on the source data, obtain parent database data within a first preset range, and perform differential deduplication on the new source data and the parent database data to obtain target data.
3. The method according to claim 1, characterized in that determining, from the source data within the first target tile, the first adjacent edge data located in the edge region of the first target tile and adjacent to at least one first adjacent tile comprises: determining each first adjacent tile from the tiles adjacent to the first target tile; for each piece of data within the first target tile, determining, based on the tile number of the first target tile and the grid number where the data is located, whether the data is expansion data by comparison with a preset threshold; and if the data is expansion data, determining, according to the position of the expansion data, the first adjacent tile into which the expansion data is to be expanded, the expansion data being the first adjacent edge data.
4. The method according to claim 2, characterized in that deduplicating, through distributed computing, the data of the tile grids within each tile to be deduplicated to obtain new source data comprises: for each tile to be deduplicated, performing the following steps: within the tile to be deduplicated, if two data items within a second preset range have the same name, and the differences between their grid numbers are less than or equal to a first preset value, determining that the two data items are duplicates and marking them for deduplication; and performing, based on the markers, overall deduplication on each tile to be deduplicated to obtain the new source data.
5. The method according to claim 2, characterized in that, The target data is obtained by performing differential deduplication on the new source data and the parent database data, including: Based on the coordinates of the new source data and the coordinates of the parent database data, the tile number and grid number of the new source data and the parent database data are determined respectively; Using either the new source data or the parent database data as data for expanding the scope, a second target tile is determined from each tile mapped to the new source data and the parent database data. Obtain the second adjacent edge data corresponding to the data used for expanding the defined range from the second target tile, and expand the second adjacent edge data into the corresponding second adjacent tile, wherein the second adjacent tile is the tile adjacent to the second target tile; Target data is obtained through distributed computing. The target data is used to represent data that differs from the parent database data.
6. The method according to any one of claims 1-4, characterized in that, The source data includes address data and coordinates; after acquiring the map-related source data, the method further includes: Extract the address data and coordinates from the source data; Based on the address data and coordinates, the source data is classified and split to obtain address type, address name, and building address.
7. The method according to claim 6, characterized in that, The target data includes address number data, and the method further includes: Update the address data on the map based on the target data.
8. A distributed data processing device, characterized in that it comprises: a first data processing module, used to acquire source data related to a map and divide the source data into tile grids within different tiles; a second data processing module, used to determine a first target tile based on each tile, to determine, from the source data in the first target tile, first adjacent edge data located in the edge region of the first target tile and adjacent to at least one first adjacent tile, and to expand the first adjacent edge data into the corresponding first adjacent tile to obtain each tile to be deduplicated, wherein the tile to be deduplicated includes its own source data and the first adjacent edge data expanded from the first target tile; and a third data processing module, used to independently perform data deduplication processing in each of the tiles to be deduplicated through distributed computing to obtain target data, the target data being used to update the map.
9. A distributed data processing device, characterized in that it comprises: at least one processor and a memory; the memory stores computer-executable instructions; and the at least one processor executes the computer-executable instructions stored in the memory, causing the at least one processor to perform the distributed data processing method as described in any one of claims 1 to 7.
10. A map updating system, characterized in that it comprises: a source data collection terminal and a source data processing cloud platform, wherein the source data processing cloud platform comprises a processing device, a map update module, and a map distribution module; the source data collection terminal is used to acquire map-related source data and transmit the source data to the processing device; the processing device is used to divide the source data into tile grids within different tiles; determine a first target tile based on each tile, and determine, from the source data within the first target tile, first adjacent edge data located in the edge region of the first target tile and adjacent to at least one first adjacent tile; expand the first adjacent edge data into the corresponding first adjacent tile to obtain each tile to be deduplicated, wherein each tile to be deduplicated includes its own source data and the first adjacent edge data expanded from the first target tile; perform data deduplication processing independently within each tile to be deduplicated through distributed computing to obtain target data; and transmit the target data to the map update module; the map update module is used to update the map according to the target data and transmit the updated map to the map distribution module; and the map distribution module is used to distribute the updated map to the corresponding terminal devices.