Method, device, and program product for constructing an index of astronomical data based on an R-tree
By using an R-tree-based astronomical data indexing method, combined with HEALPix and LRU caching strategies, the problems of high computational overhead and insufficient index reusability in astronomical data retrieval are solved, achieving efficient and stable astronomical data retrieval and index reuse.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- YUNNAN OBSERVATORY CHINESE ACADEMY OF SCIENCES
- Filing Date
- 2025-10-15
- Publication Date
- 2026-06-23
Figure CN121301347B_ABST
Abstract
Description
Technical Field
[0001] This application relates to the field of information retrieval, specifically to astronomical data retrieval, and more specifically to a method, apparatus, device, medium, and program product for constructing an astronomical data index based on an R-tree.
Background Technology
[0002] With the development of science and technology, the volume of multi-band astronomical data has grown dramatically, with observations conducted across the gamma-ray, X-ray, ultraviolet, optical, infrared, and radio bands. The total amount of accumulated astronomical observation data has reached the petabyte (PB) level. Astronomy is an observational science, and its discoveries rely heavily on the study and analysis of observational data. To retrieve the desired data, a user currently enters the coordinates (ra0, dec0) of the query center and the query angular radius (...) into the database, and the database returns the objects that fall within the query range (the red circle in Figure 4). Depending on the query target, astronomical retrieval is divided into cone retrieval and nearest neighbor retrieval. Cone retrieval returns all celestial objects within the query range; nearest neighbor retrieval returns the one (or several) celestial objects closest to the coordinates entered by the user.
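The two retrieval types described above can be illustrated with a brute-force sketch that computes the angular distance for every record. It is only an illustration of the definitions, not the method of this application; the function names and the three sample rows are hypothetical, and the haversine form of the great-circle distance is one common choice among several.

```python
import math

def angular_distance_deg(ra1, dec1, ra2, dec2):
    """Great-circle separation in degrees (haversine form)."""
    ra1, dec1, ra2, dec2 = map(math.radians, (ra1, dec1, ra2, dec2))
    h = (math.sin((dec2 - dec1) / 2) ** 2
         + math.cos(dec1) * math.cos(dec2) * math.sin((ra2 - ra1) / 2) ** 2)
    return math.degrees(2 * math.asin(min(1.0, math.sqrt(h))))

def cone_search(rows, ra0, dec0, radius_deg):
    """Cone retrieval: every row within radius_deg of (ra0, dec0)."""
    return [r for r in rows
            if angular_distance_deg(ra0, dec0, r["ra_deg"], r["dec_deg"]) <= radius_deg]

def nearest_neighbors(rows, ra0, dec0, k=1):
    """Nearest neighbor retrieval: the k rows closest to (ra0, dec0)."""
    return sorted(
        rows,
        key=lambda r: angular_distance_deg(ra0, dec0, r["ra_deg"], r["dec_deg"]),
    )[:k]

# Hypothetical sample rows, all on the celestial equator.
rows = [
    {"star_id": "A", "ra_deg": 10.0, "dec_deg": 0.0},
    {"star_id": "B", "ra_deg": 10.5, "dec_deg": 0.0},
    {"star_id": "C", "ra_deg": 12.0, "dec_deg": 0.0},
]
in_cone = cone_search(rows, 10.0, 0.0, 1.0)
closest = nearest_neighbors(rows, 10.4, 0.0, k=1)
```

Both functions touch every row, which is the per-record cost that the index structures discussed in this document aim to reduce.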
[0003] Although the above methods are widely used and relatively mature in engineering, there are still several key issues that restrict retrieval efficiency and stability when facing ultra-large-scale star catalogs: (1) After obtaining the data of pixel blocks through initial screening, the angular distance still needs to be calculated for each record, resulting in high computational overhead. (2) There is a contradiction between finer partitioning and preprocessing / storage costs. For example, improving partitioning precision (higher order) can reduce redundant targets within a single pixel, thereby reducing the total amount of subsequent precise judgment and reducing subsequent computational load; however, higher partitioning precision will bring more block files, higher metadata overhead, and longer preprocessing time (data splitting, index building, file management), and may increase storage fragmentation and file system burden, so it cannot bring net benefits in all scenarios. (3) There is a lack of efficient batch spatial filtering mechanisms. Many traditional practices still use linear (line-by-line) judgment after initial screening, lacking a secondary filtering structure that can quickly eliminate or confirm a large number of candidate records based on spatial features.
[0004] An astronomical data indexing method is disclosed in the prior art; its main technical concept is shown in Figure 4. Based on the HEALPix index, the method classifies the HEALPix pixel blocks covered by the query range (pixels 1-16 in Figure 4): pixels 7, 6, 10, and 11 lie entirely within the query range, so their targets are taken directly as query results without any angular distance calculation; the remaining pixels fall partly outside the query range (the red circle), so the angular distance to (ra0, dec0) is calculated only for the targets in these pixels to determine whether each is within the query range. Through this classification, the method reduces the number of HEALPix pixels that require angular distance calculation, returns query results faster, and partially solves the spatial filtering problem. However, because the angular distance from the targets within pixels 7, 6, 10, and 11 to (ra0, dec0) is not calculated, nearest neighbor retrieval is not possible: researchers cannot determine which data are closest to their target by angular distance. When the closest data must be retrieved, the angular distance computation is the same as in traditional methods, so the computational cost is not reduced. Furthermore, the classification must be repeated for every query; it is not reusable and remains time-consuming across multiple queries. Additionally, the lack of accurate distance calculation limits its application scenarios.
Summary of the Invention
[0005] In view of the above problems, this application provides a method, apparatus, device, medium and program product for constructing an R-tree-based astronomical data index that improves the efficiency of astronomical data retrieval and is reusable.
[0006] According to the first aspect of this application, a method for constructing an astronomical data index based on an R-tree is provided, comprising: obtaining an astronomical star catalog, the astronomical star catalog including multiple data rows, the fields of the astronomical star catalog including spatial location; dividing the celestial coordinate system into multiple pixel blocks with different spatial codes based on a predetermined celestial region division protocol; mapping the data rows to the corresponding pixel blocks based on the spatial location of the data rows to obtain the spatial code corresponding to the data rows; dividing the data rows with the same spatial code in the astronomical star catalog into a data block to obtain multiple data blocks; constructing an R-tree index for each data block in the multiple data blocks based on the spatial location of the data rows to obtain the block index of the data block; and obtaining the astronomical data index by mapping each spatial code to the block index one by one.
[0007] According to an embodiment of this application, the R-tree index includes a root node, internal nodes, and leaf nodes, with at least one internal node between the root node and the leaf nodes; the keys of the root node and internal nodes of the block index are the minimum bounding space of the data block, and the values are the minimum bounding space of each child node stored in the node; the keys of the leaf nodes of the block index are the minimum bounding space of the data row of the data entry stored in the leaf node, and the values are the data row or a pointer to the data row.
[0008] According to an embodiment of this application, the node capacity of internal nodes and leaf nodes is within a set range when constructing the R-tree index.
[0009] According to an embodiment of this application, the predetermined sky area division protocol is a multi-level equal area division method, and the division order is determined based on the pixel base number in each dimension of the multi-level equal area division method.
[0010] According to an embodiment of this application, the astronomical data index is obtained by mapping each spatial code to a block index one-to-one, including: constructing a numerical index for the spatial code to obtain a first index, wherein the leaf node key of the first index is the spatial code; and setting the leaf node value of the first index to a block index or a pointer to the block index to obtain the astronomical data index.
[0011] According to an embodiment of this application, the node set of the astronomical data index is partitioned and stored in multiple physical files, and the mapping relationship between the partitioned storage node set and the physical files is stored in a mapping table. The mapping table records the mapping relationship between each node identifier and its storage location descriptor.
[0012] According to a second aspect of this application, a method for retrieving astronomical data based on an R-tree is provided, comprising: obtaining a target star catalog and a query range; filtering based on the query range and a predetermined sky region division protocol to obtain a spatial encoding set of pixel blocks that are fully or partially covered by the query range; performing a first filtering based on the spatial encoding set and an astronomical data index of the target star catalog to obtain a block index set, wherein the astronomical data index is constructed according to the method for constructing an astronomical data index based on an R-tree provided in the embodiments of this application; performing a second filtering on each block index in the block index set to obtain data rows whose spatial locations are within the query range to obtain a data subset, and merging the data subsets retrieved from each block index to obtain a retrieval result.
[0013] According to an embodiment of this application, when the astronomical data index includes multiple physical files stored in partitions, one or more physical files corresponding to the spatial coding set are loaded into the cache.
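The abstract names LRU (Least Recently Used) as the caching strategy for loaded index files. The following is a minimal sketch of such a cache, assuming a hypothetical `loader` callable that reads one block index file per spatial code; the class name, the capacity, and the `disk_loads` counter are illustrative and do not come from this application.

```python
from collections import OrderedDict

class LRUIndexCache:
    """Keeps the most recently used block indexes in memory."""

    def __init__(self, loader, capacity=2):
        self._loader = loader          # hypothetical: reads one index file from disk
        self._capacity = capacity
        self._entries = OrderedDict()  # spatial code -> loaded block index
        self.disk_loads = 0            # counts cache misses, for illustration

    def get(self, spatial_code):
        if spatial_code in self._entries:
            # Cache hit: mark as most recently used.
            self._entries.move_to_end(spatial_code)
            return self._entries[spatial_code]
        # Cache miss: load from disk, then evict the least recently used entry.
        index = self._loader(spatial_code)
        self.disk_loads += 1
        self._entries[spatial_code] = index
        if len(self._entries) > self._capacity:
            self._entries.popitem(last=False)
        return index

def fake_loader(code):
    # Stand-in for reading an index file; returns a placeholder string.
    return f"block-index-{code}"

cache = LRUIndexCache(fake_loader, capacity=2)
for code in (7, 6, 7, 10, 7, 6):
    cache.get(code)
```

In this access sequence, the two repeated requests for code 7 are served from memory, so six lookups cost four loads. Batch tasks that query adjacent sky areas tend to revisit the same codes, which is where such a cache helps.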
[0014] A third aspect of this application provides an apparatus for constructing an astronomical data index based on an R-tree, comprising: an acquisition module for acquiring an astronomical star catalog, the star catalog including multiple data rows, and the fields of the star catalog including spatial location; a data block partitioning module for dividing the celestial coordinate system into multiple pixel blocks with different spatial codes based on a predetermined celestial region partitioning protocol, mapping data rows to corresponding pixel blocks based on the spatial location of the data rows to obtain the spatial code corresponding to the data rows, and partitioning data rows with the same spatial code in the astronomical star catalog into a data block to obtain multiple data blocks; a data block index construction module for constructing an R-tree index for each data block in the multiple data blocks based on the spatial location of the data rows to obtain the block index of the data block; and an astronomical data index construction module for mapping each spatial code to the block index one-to-one to obtain the astronomical data index.
[0015] A fourth aspect of this application provides an electronic device comprising: one or more processors; and a memory for storing one or more computer programs, wherein the one or more processors execute the one or more computer programs to implement the steps of the method described above.
[0016] A fifth aspect of this application also provides a computer-readable storage medium having a computer program or instructions stored thereon, which, when executed by a processor, implement the steps of the above-described method.
[0017] A sixth aspect of this application also provides a computer program product, including a computer program or instructions that, when executed by a processor, implement the steps of the above-described method.
[0018] The beneficial effects of the embodiments of this application mainly include:
[0019] (1) This application uses a fixed-level sky area division protocol (such as HEALPix grid) to spatially divide and initially filter the search range, retaining only a subset of celestial data of pixels that fall within the query range; then the subset of data is filtered again based on the R-tree index; thus quickly filtering out data in pixel blocks that are not within the query range and speeding up the search.
[0020] (2) This application changes the value stored in the index built on the sky area division protocol in the prior art from the data object to the R-tree index, and combines the two to build a global index, which improves the resolution of retrieval. Furthermore, since the index can be reused, the total query time for multiple queries is significantly reduced.
[0021] (3) In the secondary filtering process, this application adopts the LRU (Least Recently Used) caching strategy for the loaded R-tree index files and keeps the most recently accessed index files in memory; this reduces the number of disk I/O operations and improves the overall throughput of the retrieval process, especially in batch tasks that search multiple adjacent areas.
Attached Figure Description
[0022] The above-mentioned contents, other objects, features and advantages of this application will become clearer from the following description of embodiments with reference to the accompanying drawings, in which:
[0023] Figure 1 schematically shows a flowchart of a method for constructing an R-tree-based astronomical data index according to an embodiment of this application;
[0024] Figure 2 schematically shows a diagram of an apparatus for constructing an R-tree-based astronomical data index according to an embodiment of this application;
[0025] Figure 3 schematically shows a block diagram of an electronic device suitable for constructing an R-tree-based astronomical data index according to an embodiment of this application;
[0026] Figure 4 shows a schematic diagram of a filtering method for indexing astronomical data in the prior art;
[0027] Figure 5 shows a schematic diagram of the principle of the HEALPix celestial sphere division algorithm in the prior art;
[0028] Figure 6 shows a typical query flow for HEALPix-based cone retrieval in the prior art;
[0029] Figure 7 shows a schematic diagram of the principle of the R-tree index used in the embodiments of this application;
[0030] Figure 8 shows a schematic diagram of the data structure of an astronomical data index constructed according to an embodiment of this application;
[0031] Figure 9 shows a query process based on the constructed astronomical data index used in an embodiment of this application.
Detailed Implementation
[0032] The embodiments of this application will now be described with reference to the accompanying drawings. However, it should be understood that these descriptions are exemplary only and are not intended to limit the scope of this application. In the following detailed description, numerous specific details are set forth to provide a thorough understanding of the embodiments of this application for ease of explanation. However, it will be apparent that one or more embodiments may be implemented without these specific details. Furthermore, descriptions of well-known structures and technologies are omitted in the following description to avoid unnecessarily obscuring the concepts of this application.
[0033] The terminology used herein is for the purpose of describing particular embodiments only and is not intended to limit the scope of this application. The terms “comprising,” “including,” etc., as used herein indicate the presence of the stated features, steps, operations, and / or components, but do not exclude the presence or addition of one or more other features, steps, operations, or components.
[0034] All terms used herein (including technical and scientific terms) have the meanings commonly understood by those skilled in the art, unless otherwise defined. It should be noted that the terms used herein are to be interpreted in a manner consistent with the context of this specification, and not in an idealized or overly rigid way.
[0035] When using expressions such as "at least one of A, B and C", they should generally be interpreted in accordance with the meaning that is commonly understood by those skilled in the art (e.g., "a system having at least one of A, B and C" should include, but is not limited to, a system having A alone, a system having B alone, a system having C alone, a system having A and B, a system having A and C, a system having B and C, and / or a system having A, B and C, etc.).
[0036] Figure 1 shows a flowchart of a method for constructing an R-tree-based astronomical data index according to an embodiment of this application.
[0037] As shown in Figure 1, the method for constructing an R-tree-based astronomical data index in this embodiment includes operations S110 to S140.
[0038] In operation S110, an astronomical star catalog is obtained. The astronomical star catalog includes multiple data rows, and each data row includes a spatial location.
[0039] Table 1 Examples of common star catalog data
[0040]
star_id | ra_deg | dec_deg | mag_g | parallax_mas | class | redshift | ……
SDSS J001 | 12.345678 | 12.34567 | 16.54 | 16.78 | 15.99 | 8.76 | ……
SDSS J002 | 34.567891 | 27.567891 | 17.33 | 18.75 | 16.11 | 2.88 | ……
[0041] According to embodiments of this application, an astronomical star catalog is a table containing millions or tens of millions of data rows, each recording the data of one celestial object. Table 1 illustrates a common catalog from the SDSS sky survey project; the values in its data rows are simulated and serve only to illustrate what catalog data looks like. Table 1 shows only two data rows, whereas a real star catalog can have tens of millions. The column fields fall mainly into three categories: identifier, location, and acquired physical data. For example, the star_id column in Table 1 is the unique identifier of a celestial object. It is usually an ID generated internally by the sky survey project to point precisely to a specific celestial object, so different sky survey projects may use different names for the same object. The `ra_deg` (Right Ascension) and `dec_deg` (Declination) columns give the coordinates of the object on the celestial sphere, i.e., its spatial position. The celestial sphere is an imaginary sphere centered on the Earth, onto which all celestial bodies are projected. To allow cross-validation of data, different sky survey projects often record positions in the same celestial coordinate system; that is, the coordinates of the same celestial body are consistent across the star catalogs of different projects or differ by only a fraction of a degree. The `mag_g`, `parallax_mas`, `class`, and `redshift` columns contain physical values related to the object.
[0042] According to embodiments of this application, the spatial location can also be expressed as three-dimensional Cartesian coordinates (x, y, z) on the unit sphere, converted from the two-dimensional right ascension and declination coordinates.
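The conversion mentioned in [0042] can be sketched as follows, using the standard spherical-to-Cartesian relations. The axis convention (x toward RA = 0°, Dec = 0°; z toward the north celestial pole) and the function names are assumptions made for illustration; the source does not state which convention it uses.

```python
import math

def radec_to_unit_vector(ra_deg, dec_deg):
    """Map (RA, Dec) in degrees to a point (x, y, z) on the unit sphere."""
    ra, dec = math.radians(ra_deg), math.radians(dec_deg)
    return (math.cos(dec) * math.cos(ra),
            math.cos(dec) * math.sin(ra),
            math.sin(dec))

def unit_vector_to_radec(x, y, z):
    """Inverse mapping; RA is returned in [0, 360) degrees."""
    ra = math.degrees(math.atan2(y, x)) % 360.0
    dec = math.degrees(math.asin(max(-1.0, min(1.0, z))))
    return ra, dec

# Coordinates of the first simulated row in Table 1.
xyz = radec_to_unit_vector(12.345678, 12.34567)
ra_back, dec_back = unit_vector_to_radec(*xyz)
```

A three-dimensional representation avoids the discontinuity of right ascension at 0°/360° when bounding boxes are built around the points.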
[0043] According to embodiments of this application, astronomical star catalogs are obtained by downloading them from the websites of multiple sky survey projects. Due to the massive amount of data in these catalogs, the downloaded catalogs are typically stored on data servers in laboratories or workstations. They can be imported into a database management system or an astronomical star catalog file management system for management. When users import downloaded catalogs into their local databases (such as MySQL) or file management systems, preprocessing is generally required before retrieval to simplify the search process. This includes standardizing the format of star catalogs from different sky survey projects, catalog segmentation, and establishing reasonable indexes to improve search speed. To address the problems of low astronomical data retrieval efficiency and unbalanced spatial indexes in existing methods, this application proposes a new index structure establishment method based on existing catalog segmentation and indexing methods.
[0044] According to an embodiment of this application, the astronomical star catalog is obtained from a downloaded and stored database management system, and the astronomical star catalog is identified by its name.
[0045] According to an embodiment of this application, the astronomical star catalog is obtained from a downloaded and stored astronomical star catalog file management system, and the astronomical star catalog is identified by its name.
[0046] In operation S120, the celestial coordinate system is divided into multiple pixel blocks with different spatial codes based on a predetermined sky area division protocol. The data row is mapped to the corresponding pixel block based on the spatial position of the data row to obtain the spatial code corresponding to the data row. Data rows with the same spatial code in the astronomical star catalog are divided into a data block to obtain multiple data blocks.
[0047] According to embodiments of this application, the predetermined celestial region partitioning protocol refers to the partitioning method or algorithm used when partitioning the index using a pseudo-two-dimensional spherical index algorithm. The predetermined celestial region partitioning protocol can be any of the following, such as: Zones Algorithm, Hierarchical Triangular Mesh (HTM), Quad Tree Cube (Q3C), Hierarchical Equal Area iso-Latitude Pixelization (HEALPix), etc. These celestial region partitioning protocols (pseudo-two-dimensional spherical index algorithms) divide the celestial sphere into multiple sub-regions, and then assign each region an ID number (i.e., spatial encoding). Celestial objects in the star catalog are stored in the database using the ID number of their respective regions as the primary key. During retrieval, the coverage area ID is calculated based on the coordinates of the query center and the query radius. All celestial objects within that ID region are read, and then the angular distance from each object to the query center is calculated. When using a pseudo-two-dimensional spherical index algorithm to perform coordinate-based retrieval of a star catalog, the database stores celestial information (right ascension, declination) and corresponding ID numbers. During retrieval, it's necessary to find all celestial bodies with the same ID number as the target, calculate their angular distances to the query center, and filter them to obtain the final, satisfactory result. Depending on the chosen hierarchical level, thousands of celestial bodies may share the same region ID. When a database is not used, and all celestial bodies with the same ID are directly stored in the same file, retrieval only requires reading the corresponding file to obtain all celestial bodies under that ID number, thus achieving the indexing and retrieval functions of the star catalog.
[0048] According to embodiments of this application, the HEALPix-based partitioning method divides the celestial sphere into multiple sub-regions of equal area, referred to as "pixels," according to a predefined level (order). Each pixel has a corresponding number. The higher the order, the more pixels there are, the smaller the area of each pixel, and the higher the resolution of the celestial sphere partitioning. The partitioning principle is shown in Figure 5. First, the entire celestial sphere is divided into 12 quadrilateral base pixels of equal area. Each pixel is then recursively divided in a 2×2 pattern into four self-similar smaller pixels, producing progressively finer grids at higher orders. Each grid cell has an index number (spatial code). The order of division can be set as needed. At order 0, the entire celestial sphere is divided into 12 grid cells, each with an area of approximately 3437.75 square degrees; at order 1, into 48 grid cells of approximately 859.44 square degrees each; at order 2, into 192 grid cells of approximately 214.86 square degrees each; and at order k, the celestial sphere is divided into... The larger the value of k, the more pixels there are, meaning finer segmentation and higher resolution. HEALPix supports two numbering modes: RING and NESTED. In RING mode, pixels at the same latitude form a ring, the rings run from the North Pole to the South Pole, and pixels are numbered sequentially within each ring. In NESTED mode, the region numbered n in the i-th layer is further divided into four parts, resulting in four regions in the (i+1)-th layer, with a numbering range of... .
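The pixel counts and areas quoted for orders 0 to 2 can be checked with a short sketch. It relies on the standard HEALPix relations N_side = 2^k and N_pix = 12 · N_side², and on the standard NESTED convention that pixel n at one level has children 4n to 4n+3 at the next level. These relations are general HEALPix facts supplied here for illustration; they are not a reconstruction of the expressions elided in the paragraph above, and the function names are hypothetical.

```python
import math

# Area of the full sky in square degrees: 4*pi steradians * (180/pi)^2.
FULL_SKY_SQ_DEG = 4 * math.pi * (180 / math.pi) ** 2

def healpix_npix(order):
    """Number of pixels at a given order (N_side = 2**order)."""
    nside = 2 ** order
    return 12 * nside * nside

def healpix_pixel_area_sq_deg(order):
    """Area of one pixel in square degrees; all pixels have equal area."""
    return FULL_SKY_SQ_DEG / healpix_npix(order)

def nested_children(n):
    """NESTED numbering: the four sub-pixels of pixel n at the next order."""
    return [4 * n + j for j in range(4)]

def nested_parent(n):
    """NESTED numbering: the pixel at the previous order that contains n."""
    return n // 4

for order in range(3):
    print(order, healpix_npix(order), round(healpix_pixel_area_sq_deg(order), 2))
```

Because each step up in order multiplies the pixel count by four, the number of block files grows quickly, which is the storage and preprocessing trade-off noted in the background section.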
[0049] Other celestial sphere partitioning algorithms operate on similar principles: they recursively divide the celestial sphere according to a selected order until individual pixel blocks reach the desired resolution, and differ only in the shapes used for partitioning. All of these methods reduce the number of objects compared directly by dividing the spherical surface into blocks, turning a spatial query into filtering on a set of pixel (or patch) numbers. For example, the hierarchical triangular mesh (HTM) method treats the celestial sphere as a unit sphere and approximates it with a regular octahedron. The eight triangular faces of the octahedron form the Level 0 triangular mesh. These eight initial triangles are called "spherical triangles"; their vertices lie on the coordinate axes, and they cover the eight octants of the celestial sphere. Starting from Level 0, each triangle at each level is recursively subdivided into four smaller sub-triangles of nearly equal area. The higher the level (order) of partitioning, the more regions (pixels) there are. Q3C partitions as follows. Initial partitioning: a square face is divided into 2×2 = 4 equal sub-cells. Recursive subdivision: each sub-cell can be further subdivided into four smaller sub-cells in a 2×2 pattern. Subdivision depth: the subdivision can be repeated recursively many times (e.g., 20 or 30 times), and each subdivision reduces the area of a region to 1/4 of its previous size. The subdivision depth determines the precision of the index; the greater the depth, the smaller the region corresponding to each cell.
[0050] For example, mapping the data rows to pixel blocks and dividing rows with the same spatial code into data blocks can be done in a database as follows. Based on the astronomical data query statement, a B+ tree index is created with the spatial code ID as the key and, as the value, the right ascension RA and declination Dec of the celestial coordinates together with any other fields to be included in the index. That is, the key of a leaf node of the B+ tree index is the spatial code ID, and the value is the data row (right ascension RA, declination Dec, and other physical fields). Since many data rows share the same spatial code ID, the value of a leaf node is in fact all the data rows corresponding to that spatial code ID. These rows constitute the data block, which is equivalent to "GROUP BY spatial code ID" in an SQL statement.
[0051] For example, the spatial code corresponding to the data row is obtained by mapping the data row to the corresponding pixel block based on the spatial location of the data row. Data rows with the same spatial code in the astronomical catalog are divided into a data block to obtain multiple data blocks. Specifically, in the file management system, the spatial code ID corresponding to the data row is calculated based on the right ascension RA and declination Dec of the catalog data. Then, the catalog data is grouped based on the spatial code to obtain groups (i.e., data blocks) with different spatial code IDs.
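The grouping step in a file management system can be sketched as below. To keep the example free of external libraries, it uses a Zones-style code (the index of a declination strip) as a stand-in for the HEALPix spatial code; the 10-degree zone height, the function names, and the third sample row are assumptions made for illustration only.

```python
import math
from collections import defaultdict

ZONE_HEIGHT_DEG = 10.0  # hypothetical zone height, chosen only for the example

def zone_code(ra_deg, dec_deg, zone_height_deg=ZONE_HEIGHT_DEG):
    """Zones-style stand-in for a spatial code: index of the declination strip."""
    return int(math.floor((dec_deg + 90.0) / zone_height_deg))

def split_into_blocks(rows, encode=zone_code):
    """Group rows by spatial code; each group is one data block."""
    blocks = defaultdict(list)
    for row in rows:
        blocks[encode(row["ra_deg"], row["dec_deg"])].append(row)
    return dict(blocks)

rows = [
    {"star_id": "SDSS J001", "ra_deg": 12.345678, "dec_deg": 12.34567},
    {"star_id": "SDSS J002", "ra_deg": 34.567891, "dec_deg": 27.567891},
    {"star_id": "SDSS J003", "ra_deg": 200.0, "dec_deg": 15.0},  # hypothetical row
]
blocks = split_into_blocks(rows)
```

Passing a different `encode` function (for example, one that returns a HEALPix pixel number) would leave the grouping logic unchanged, since only the spatial code of each row matters at this step.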
[0052] In operation S130, a data object-oriented spatial index is constructed for each of the multiple data blocks based on the spatial location of the data rows to obtain the block index of the data block.
[0053] For example, the sky area is divided into n pixels by the HEALPix index. Then, an R-tree index is built for each of these n pixels based on its spatial location to obtain the block index of the data block. Since there are n pixels, n block indexes are also built.
[0054] For example, in a file management system, after obtaining groups (i.e., data blocks) with different spatial encoding IDs, an R-tree index is constructed for each data block to obtain the corresponding block index. This R-tree block index is saved as a file and named with the spatial encoding ID. For example, 10942.rdat is the block index of the data block with spatial encoding ID 10942.
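The file naming scheme in [0054] can be sketched as follows. The source does not specify the on-disk format of a `.rdat` file, so Python's `pickle` is used here purely as a placeholder serialization; the helper names and the toy block index are hypothetical.

```python
import pickle
import tempfile
from pathlib import Path

def block_index_path(index_dir, spatial_code):
    """File name convention from the text: <spatial code ID>.rdat"""
    return Path(index_dir) / f"{spatial_code}.rdat"

def save_block_index(index_dir, spatial_code, block_index):
    path = block_index_path(index_dir, spatial_code)
    with open(path, "wb") as fh:
        pickle.dump(block_index, fh)  # placeholder format, an assumption
    return path

def load_block_index(index_dir, spatial_code):
    with open(block_index_path(index_dir, spatial_code), "rb") as fh:
        return pickle.load(fh)

# Toy block index: a bounding box plus the rows it covers.
toy_index = {"mbr": (12.0, 12.0, 13.0, 13.0), "rows": ["SDSS J001"]}

with tempfile.TemporaryDirectory() as tmp:
    saved_name = save_block_index(tmp, 10942, toy_index).name
    restored = load_block_index(tmp, 10942)
```

Naming each file after its spatial code means that a query only needs the code to locate the block index, with no separate lookup structure.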
[0055] According to an embodiment of this application, the spatial index is an R-tree index. This type of index constructs a bounding box around objects in space (such as points or polygons), and indexes data objects or storage pointers of data objects through the spatial range defined by the bounding box.
[0056] According to embodiments of this application, an R-tree index refers to an index of the R-tree family, including R-trees and their variants: R*-tree, R+-tree, etc. An R-tree (rectangular tree) is a tree structure commonly used for indexing multidimensional spatial data (such as geographic coordinates and three-dimensional geometric objects). It does not directly index the geometric objects themselves, but instead constructs the index using the minimum bounding rectangle (MBR, used for two-dimensional data) or the minimum bounding box (MBB, used for three-dimensional data) (collectively referred to as the minimum spatial extent) as the key.
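To make the role of the minimum bounding rectangle concrete, the following sketch builds a small static R-tree over point data and answers a rectangular range query. It uses a simple sort-and-pack bulk load rather than the dynamic insertion of the original R-tree or the reinsertion of the R*-tree, it ignores the RA wrap-around at 0°/360°, and all names and the node capacity are assumptions made for illustration.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    mbr: tuple                                    # (min_ra, min_dec, max_ra, max_dec)
    children: list = field(default_factory=list)  # child nodes (root / internal)
    rows: list = field(default_factory=list)      # data rows (leaf only)

def _chunks(items, size):
    return [items[i:i + size] for i in range(0, len(items), size)]

def _union(boxes):
    """Smallest rectangle that encloses all given rectangles."""
    return (min(b[0] for b in boxes), min(b[1] for b in boxes),
            max(b[2] for b in boxes), max(b[3] for b in boxes))

def _point_box(row):
    """A point is stored as a degenerate rectangle."""
    return (row["ra_deg"], row["dec_deg"], row["ra_deg"], row["dec_deg"])

def _intersects(a, b):
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]

def build_rtree(rows, capacity=4):
    """Pack sorted rows into leaves, then pack nodes upward until one root remains."""
    if not rows:
        raise ValueError("cannot index an empty data block")
    ordered = sorted(rows, key=lambda r: (r["ra_deg"], r["dec_deg"]))
    level = [Node(_union([_point_box(r) for r in chunk]), rows=chunk)
             for chunk in _chunks(ordered, capacity)]
    while len(level) > 1:
        level = [Node(_union([n.mbr for n in group]), children=group)
                 for group in _chunks(level, capacity)]
    return level[0]

def search(node, box):
    """Return rows inside box, skipping every subtree whose MBR misses it."""
    if not _intersects(node.mbr, box):
        return []
    if node.rows:  # leaf node
        return [r for r in node.rows if _intersects(_point_box(r), box)]
    hits = []
    for child in node.children:
        hits.extend(search(child, box))
    return hits

# Hypothetical 5 x 4 grid of points: ra in 0..4, dec in 0..3.
rows = [{"star_id": i, "ra_deg": float(i % 5), "dec_deg": float(i // 5)}
        for i in range(20)]
root = build_rtree(rows, capacity=4)
hits = search(root, (1.0, 1.0, 2.0, 2.0))
```

The pruning in `search` is what lets a query discard many candidate rows with one rectangle comparison instead of one distance computation per row.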
[0057] According to an embodiment of this application, the block index is constructed based on an R*-tree, which optimizes the node structure of the tree through forced reinsertion.
[0058] According to an embodiment of this application, the block index is constructed based on an R+-tree. The R+-tree allows the bounding box of an object to be split in different nodes of the tree to avoid overlap and improve query efficiency, but the construction cost is higher.
[0059] In operation S140, the astronomical data index is obtained by mapping each spatial code to a block index.
[0060] According to the embodiments of this application, after mapping each spatial code to a block index to obtain the astronomical data index, a two-layer index for physical storage is not actually constructed. Only the second-layer R-tree index is stored, while the correspondence between the spatial code ID and the R-tree index is saved, for example, through a mapping table. First, the pixels of the sky space division and their corresponding spatial code IDs are obtained using a sky region division protocol. For the data in the star catalog, the corresponding spatial code ID is calculated based on the coordinates, and data blocks of each group are obtained according to the spatial code ID. Then, an R-tree index is constructed for each group of data blocks in turn, and the correspondence between the spatial code ID and the corresponding R-tree index is established. The downloaded star catalog data is converted into a one-to-one correspondence of spatial code ID and R-tree index and stored as a global index file (i.e., astronomical data index). At this time, the index file stores all the original star catalog data, which is an indexed storage of the original star catalog data. In this way, when searching, after calculating the pixel block number (spatial code ID), the corresponding R-tree index is loaded according to the spatial code ID for spatial filtering.
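The lookup flow described in [0060] (compute the covered spatial codes, fetch the matching block indexes through the mapping table, then filter inside each block) can be sketched as below. Two simplifications are assumptions of this sketch: a declination-zone code stands in for the sky region division protocol, and each block index is a plain list scanned linearly where the application would search an R-tree. All names and sample rows are hypothetical.

```python
import math
from collections import defaultdict

ZONE_HEIGHT_DEG = 10.0  # hypothetical zone height

def zone_code(dec_deg):
    """Stand-in spatial code: index of the declination strip."""
    return int(math.floor((dec_deg + 90.0) / ZONE_HEIGHT_DEG))

def angular_distance_deg(ra1, dec1, ra2, dec2):
    ra1, dec1, ra2, dec2 = map(math.radians, (ra1, dec1, ra2, dec2))
    h = (math.sin((dec2 - dec1) / 2) ** 2
         + math.cos(dec1) * math.cos(dec2) * math.sin((ra2 - ra1) / 2) ** 2)
    return math.degrees(2 * math.asin(min(1.0, math.sqrt(h))))

def build_astronomical_index(rows):
    """Mapping table: spatial code -> block index (a list stands in for the R-tree)."""
    table = defaultdict(list)
    for row in rows:
        table[zone_code(row["dec_deg"])].append(row)
    return dict(table)

def covering_codes(dec0, radius_deg):
    """Spatial codes of all strips fully or partially covered by the cone."""
    lo = zone_code(max(-90.0, dec0 - radius_deg))
    hi = zone_code(min(90.0, dec0 + radius_deg))
    return range(lo, hi + 1)

def cone_retrieve(index, ra0, dec0, radius_deg):
    results = []
    for code in covering_codes(dec0, radius_deg):
        block = index.get(code)  # first filtering: mapping-table lookup
        if block is None:
            continue
        # Second filtering: keep rows inside the cone (R-tree search in the source).
        results.extend(
            r for r in block
            if angular_distance_deg(ra0, dec0, r["ra_deg"], r["dec_deg"]) <= radius_deg
        )
    return results

rows = [
    {"star_id": "A", "ra_deg": 10.0, "dec_deg": 9.5},
    {"star_id": "B", "ra_deg": 10.0, "dec_deg": 10.5},
    {"star_id": "C", "ra_deg": 10.0, "dec_deg": 35.0},
    {"star_id": "D", "ra_deg": 50.0, "dec_deg": 10.0},
]
index = build_astronomical_index(rows)
found = cone_retrieve(index, 10.0, 10.0, 1.0)
```

The index is built once and reused by every later query; only the covered codes and the per-block filtering depend on the query itself.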
[0061] For example, to obtain the astronomical data index by mapping each spatial code to a block index, the specific steps are as follows: A two-level index is constructed, where the leaf nodes of the index built based on the sky region partitioning protocol store the spatial code ID as the key and the value as a pointer to the R-Tree index corresponding to that spatial code ID. The first-level index (e.g., an index based on HEALPix) is stored as an index file, and all R-Tree indexes are stored as a single index file. The leaf nodes of the first-level index store pointers to the corresponding positions in the R-Tree file.
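The single-file variant in [0061] can be sketched as follows: all block indexes are written back to back into one container, and the first-level index keeps, for each spatial code, a pointer to the position of its block index. Here the pointer is an (offset, length) pair, the first-level index is a plain dictionary instead of a tree, the container is an in-memory buffer, and `pickle` is a placeholder serialization; all of these are assumptions for illustration.

```python
import io
import pickle

def pack_block_indexes(block_indexes):
    """Concatenate serialized block indexes; return the buffer and the pointer table."""
    buffer = io.BytesIO()
    pointers = {}  # spatial code -> (offset, length) within the buffer
    for code in sorted(block_indexes):
        payload = pickle.dumps(block_indexes[code])
        pointers[code] = (buffer.tell(), len(payload))
        buffer.write(payload)
    return buffer, pointers

def load_block_index(buffer, pointers, code):
    """Follow the pointer for one spatial code and deserialize only that block index."""
    offset, length = pointers[code]
    buffer.seek(offset)
    return pickle.loads(buffer.read(length))

# Toy block indexes; code 10943 and its rows are hypothetical.
blocks = {
    10942: {"rows": ["SDSS J001"]},
    10943: {"rows": ["SDSS J002", "SDSS J003"]},
}
buffer, pointers = pack_block_indexes(blocks)
second = load_block_index(buffer, pointers, 10943)
```

Reading by offset and length means a query loads only the block indexes it needs, even though all of them share one physical file.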
[0062] According to the embodiments of this application, the astronomical data index may also be obtained by mapping each spatial code to a block index one by one as follows: a two-level index is constructed in which each leaf node of the index built on the sky region partitioning protocol stores the spatial code ID as its key and the R-Tree index corresponding to that spatial code ID as its value.
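The mapping from spatial code ID to block index described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation of this application: `toy_code` is a hypothetical stand-in that uses 1-degree right ascension / declination cells instead of a real sky region division protocol such as HEALPix, and `sorted` stands in for a real R-tree builder.

```python
from collections import defaultdict

def build_two_level_index(rows, spatial_code_of, build_block_index):
    """Group rows by spatial code, then build one block index per group.

    Returns the mapping table: spatial code ID -> block index.
    """
    blocks = defaultdict(list)
    for row in rows:
        blocks[spatial_code_of(row)].append(row)
    return {code: build_block_index(block) for code, block in blocks.items()}

def toy_code(row):
    """Hypothetical spatial code: 1-degree RA/Dec cells (NOT HEALPix)."""
    ra, dec = row
    return int(dec + 90) * 360 + int(ra) % 360

rows = [(10.2, 5.1), (10.7, 5.9), (200.0, -30.5)]
# `sorted` is only a placeholder for an R-tree construction routine.
index = build_two_level_index(rows, toy_code, build_block_index=sorted)
print(len(index))  # two distinct spatial codes
```

At query time, only the block indexes whose spatial code appears in the covered set need to be looked up in `index`.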
[0063] According to the embodiments of this application, it supports efficient local operation, avoids network dependence, and requires no database management experience: the two-level index retrieval scheme provided by this application can be packaged into an independent local program, directly reading the star table file data and its index file stored on the local disk for querying, without importing data into a database or mastering professional database management knowledge, and without relying on real-time network connections to online data centers. Reason analysis: Local deployment allows data retrieval and index access to be completed entirely in the local CPU and disk I / O environment, avoiding the complex process of database installation, configuration, and maintenance; it avoids network latency, bandwidth limitations, packet loss, and other problems that may occur when accessing online data centers, especially cross-border platforms. Final effect: Even in unstable network environments or with limited bandwidth, this method can still maintain a stable and efficient retrieval speed, suitable for offline work scenarios and large-scale batch processing tasks; and users do not need database management experience to complete the operation; at the same time, it reduces dependence on external data centers, improving the flexibility, autonomy, and security of data use. It overcomes the shortcomings of "network and deployment-related limitations" in existing technologies.
[0064] The common process for retrieving astronomical data is as follows. Taking the cone search based on HEALPix as an example (as shown in Figure 6), a typical query process is usually as follows:
[0065] (1) The user inputs the center coordinates (ra0, dec0) and radius r (e.g., center point (α0, δ0), radius 1°);
[0066] (2) Calculate the set of pixel numbers covered by the query range based on the HEALPix order used. This set may consist of several pixels, the number of which depends on the retrieval radius and the HEALPix division order. For example, if the HEALPix order is set to 8, the celestial sphere is divided into 12 × 4^8 = 786,432 pixel blocks, each with an area of approximately 0.0525 square degrees. Since the diameter of the full moon is approximately 0.5 degrees, its area is approximately π × (0.25)² ≈ 0.196 square degrees, so the full moon covers approximately 0.196 / 0.0525 ≈ 3.7 order-8 pixels. The well-known Andromeda Galaxy (M31) extends approximately 3° × 1° on the sky, so its area can reach approximately 3 square degrees. An order-8 pixel is therefore much smaller than M31, and M31 can easily cover dozens or even hundreds of such pixels.
[0067] (3) Read the data file or database records corresponding to these pixels, traverse each target record, and calculate the angular distance between the target and the query center;
[0068] (4) Filter out the records that actually fall within the query range based on the angular distance (e.g., using Haversine or spherical cosine formula) and return the results.
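The pixel arithmetic used in step (2) can be checked with a short calculation. The sketch below relies only on the standard HEALPix pixel count of 12 × 4^k for order k and the full-sky area of 4π steradians; the function names are illustrative and do not come from this application.

```python
import math

# Full sky in square degrees: 4*pi steradians times (180/pi)^2.
FULL_SKY_DEG2 = 4 * math.pi * (180 / math.pi) ** 2

def healpix_npix(order):
    """Number of equal-area HEALPix pixels at the given order."""
    return 12 * 4 ** order

def pixel_area_deg2(order):
    """Area of one pixel in square degrees."""
    return FULL_SKY_DEG2 / healpix_npix(order)

order = 8
moon_area = math.pi * 0.25 ** 2  # full moon, diameter about 0.5 degrees
print(healpix_npix(order))                            # 786432
print(round(pixel_area_deg2(order), 4))               # 0.0525
print(round(moon_area / pixel_area_deg2(order), 1))   # 3.7
```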
[0069] One of the commonly used formulas for angular distance, the Haversine formula, is as follows:
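[0070] $$\theta = 2\arcsin\sqrt{\sin^2\left(\frac{\delta_2-\delta_1}{2}\right) + \cos\delta_1\cos\delta_2\,\sin^2\left(\frac{\alpha_2-\alpha_1}{2}\right)}$$
[0071] where $\delta_1$ and $\delta_2$ are the declinations of the two points whose angular distance $\theta$ is to be calculated, and $\alpha_1$ and $\alpha_2$ are their respective right ascensions.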
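As an illustration of the angular distance judgment in steps (3) and (4), the Haversine formula can be written as a small function. This is a sketch under the assumption that all coordinates are given in degrees; the function and parameter names are illustrative only.

```python
import math

def angular_distance_deg(ra1, dec1, ra2, dec2):
    """Haversine angular distance between two sky positions, in degrees."""
    a1, d1, a2, d2 = map(math.radians, (ra1, dec1, ra2, dec2))
    h = (math.sin((d2 - d1) / 2) ** 2
         + math.cos(d1) * math.cos(d2) * math.sin((a2 - a1) / 2) ** 2)
    # Clamp to 1.0 to guard against rounding just above the asin domain.
    return math.degrees(2 * math.asin(min(1.0, math.sqrt(h))))

def in_cone(ra, dec, ra0, dec0, radius_deg):
    """True if (ra, dec) lies within radius_deg of the query center."""
    return angular_distance_deg(ra, dec, ra0, dec0) <= radius_deg

print(in_cone(292.05, 10.0, 292.0, 10.0, 0.1))  # True
print(in_cone(293.0, 10.0, 292.0, 10.0, 0.1))   # False
```

Each call evaluates several trigonometric functions, which is why applying it to every record in the candidate pixels is costly.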
[0072] It should be emphasized that, according to the above process, each data entry that falls into the initial screening pixel needs to be judged by angular distance (i.e., each entry is calculated and determined to be within the query range), which is very time-consuming.
[0073] This embodiment of the application, through steps S110-S140, combines an astronomical index built based on a predetermined sky partitioning protocol with a spatial index oriented towards data objects (R-tree and its variants). This refines the original pixel-level index to the data object level. The final effect is as follows: compared to using indexes based on sky partitioning protocols such as HEALPix alone, this method reduces the number of objects requiring precise angular distance calculations, significantly reducing the computational load; compared to using R-Tree alone for full sky traversal, this method reduces the number of MBRs that need to be traversed, reducing I / O access and avoiding the performance overhead caused by large-scale projection transformations. It overcomes the shortcomings of existing technologies, such as "the need for angular distance calculation after initial screening leading to high computational overhead" and "the lack of an efficient batch spatial filtering mechanism."
[0074] According to an embodiment of this application, the R-tree index includes a root node, internal nodes, and leaf nodes, with at least one internal node between the root node and leaf nodes. The key of the root node and internal nodes of the block index is the minimum outer bounding space of the data block, and the value is the minimum outer bounding space of each child node stored in the root node. The key of the leaf nodes of the block index is the minimum outer bounding space of the data row of the data entry stored in the leaf node, and the value is a data row or a pointer to a data row. This embodiment defines a specific data structure based on the R-tree index. Essentially, the indexing algorithm based on a preset celestial sphere division is a variation of the spatially decomposed indexing concept in astronomical data. The R-tree family of indexes is an object-space-oriented index. By combining spatially decomposed indexes and object-space-oriented indexes, it achieves higher retrieval efficiency than a single indexing method.
[0075] According to embodiments of this application, the R-tree index consists of three types of nodes:
[0076] (1) Leaf nodes store the actual data rows or pointers to them. Each entry corresponds to one MBR / MBB (minimum bounding space). The spatial extent of a leaf node is a larger MBR / MBB that encloses all of its entries. The number of entries each leaf node can hold is configurable.
[0077] (2) Internal nodes, located between the leaf nodes and the root node, store the spatial extent of their child nodes. The extent of each internal node is the bounding rectangle (or cuboid) of the MBRs / MBBs of its child nodes. The number of child nodes each internal node can hold is configurable, and there can be multiple layers of internal nodes.
[0078] (3) The root node is the single top-level node of the R-tree. Its MBR covers the entire dataset. When the data grows until the number of entries in the root node exceeds the limit, the original root node splits into two internal nodes, and a new root node is formed above them.
[0079] According to an embodiment of this application, if the spatial location field is right ascension and declination, then the minimum bounding space of the R-tree index is the minimum bounding rectangle (MBR).
[0080] According to an embodiment of this application, if the spatial location is based on three-dimensional coordinates of the celestial sphere, then the minimum bounding space of the R-tree index is the minimum bounding cuboid (MBB).
[0081] According to embodiments of this application, the node capacity of internal nodes and leaf nodes when constructing the R-tree index is a set range. The set range [m, M] is used to constrain the range of child nodes or sub-entries stored in each node, where m is typically 30%-50% of M, set according to the situation. A smaller set node capacity results in a more compact node MBB / MBR, leading to better pruning during queries; a larger set node capacity results in a more loosely distributed node MBB / MBR, leading to worse pruning during queries. For example, a set range of [20, 50] means the node capacity when constructing the R-tree index is no more than 50 and no less than 20; an example set range of [4, 8] also exists.
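The relationship between node capacity and node MBRs can be illustrated with a simplified bottom-up packing of point data. This is a sketch under stated assumptions: it enforces only the upper bound M (the last group on a level may hold fewer than m entries), it orders points by right ascension and then declination, and it is not the insertion and split procedure of a full R-tree. It also assumes a non-empty list of points.

```python
def mbr_of(boxes):
    """Smallest rectangle (xmin, ymin, xmax, ymax) enclosing all boxes."""
    xs0, ys0, xs1, ys1 = zip(*boxes)
    return (min(xs0), min(ys0), max(xs1), max(ys1))

def pack(points, M=4):
    """Pack points into a tree whose nodes hold at most M children.

    Returns (root, height). Each point is an entry whose MBR is degenerate.
    """
    level = [{"mbr": (x, y, x, y), "row": (x, y)} for x, y in sorted(points)]
    height = 0
    while len(level) > 1 or height == 0:
        groups = [level[i:i + M] for i in range(0, len(level), M)]
        level = [{"mbr": mbr_of([c["mbr"] for c in g]), "children": g}
                 for g in groups]
        height += 1
    return level[0], height

points = [(float(i), float((i * 7) % 5)) for i in range(10)]
root, height = pack(points, M=4)
print(height, len(root["children"]), root["mbr"])
# 2 3 (0.0, 0.0, 9.0, 4.0)
```

With M = 4, ten points yield three leaf nodes under one root; a smaller M gives more, tighter leaf MBRs at the cost of a taller tree.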
[0082] According to embodiments of this application, the distribution of celestial targets in space is often uneven. Taking multi-level recursive partitioning methods such as HTM and HEALPix as examples, although they can divide the sky into sub-blocks of equal area, the actual number of celestial objects contained in each sub-block varies greatly, which can easily lead to an unbalanced load on data storage and retrieval. In contrast, the node partitioning of R-Tree does not rely on the consistency of the area, but constrains the number of sub-entries stored in each node by setting a node capacity range (e.g., [4,8]). In this way, regardless of whether the targets are densely or sparsely distributed in space, each node can maintain a relatively similar storage scale, avoiding the situation where some nodes are overloaded or too empty, thereby achieving a more balanced storage and retrieval load and avoiding large-scale scanning directly in dense areas. This solves the problem of "hotspot pixel data skew" that may occur in indexes of single celestial partitioning algorithms such as HEALPix when the spatial distribution of celestial targets is uneven; at the same time, it reduces the disk access bottleneck caused by excessive storage in a single node, so that queries maintain a stable response speed in both high-density and low-density areas. It overcomes the shortcomings of existing technologies, such as "uneven index granularity and data distribution leading to unstable I / O and computational load".
[0083] According to embodiments of this application, the predetermined sky area partitioning protocol is a multi-level equal-area partitioning method, and the partitioning order is determined based on the pixel cardinality of each dimension of the multi-level equal-area partitioning method. The choice of partitioning order aims to find the optimal balance between the accuracy (efficiency) of spatial queries and the volume (cost) of data storage. By selecting a reasonable partitioning order, the storage and preprocessing overhead caused by overly refined partitioning is avoided: this method uses a fixed and moderate resolution order in the first-level HEALPix filtering to avoid excessive partitioning with too high an order; simultaneously, the HEALPix filtering result is used as a positioning reference, and the second-level R-Tree performs high-precision filtering within a local area, thereby limiting the scope of the high-precision index to necessary local regions. Reasoning: Fixed medium-precision HEALPix partitioning can effectively narrow down the candidate dataset globally, while the second-level R-Tree filtering replaces the role of higher-order HEALPix partitioning within local regions, avoiding the need to build and maintain a massive number of small-pixel files globally to reduce the amount of precise computation; this division of labor maintains high retrieval accuracy while reducing the number of global files and the burden of metadata management. Beneficial effects: This scheme avoids the problems of a surge in the number of fragmented files, increased preprocessing time, metadata overhead, and storage fragmentation caused by high-order HEALPix partitioning without sacrificing retrieval accuracy and efficiency. At the same time, it maintains high efficiency in index building and file management through a combination of local high precision and global medium precision. 
It overcomes the defect in existing technologies of "the contradiction between fine-grained partitioning and preprocessing / storage costs."
[0084] According to an embodiment of this application, the astronomical data index is obtained by mapping each spatial code to a block index, including: constructing a numerical index for the spatial codes to obtain a first index, where the key of the leaf node of the first index is the spatial code; and setting the value of the leaf node of the first index to the block index or a pointer to the block index to obtain the astronomical data index. This improves retrieval computation efficiency: during the retrieval process, a fixed-level HEALPix grid is first used to spatially divide and initially filter the retrieval range, retaining only celestial data falling within the relevant pixels; then, this subset of data undergoes another spatial range filtering based on an R-Tree; finally, precise angular distance calculations are performed only on the data that passes the second-level filtering. Reasoning: HEALPix grid division and pixel number lookup involve only simple integer calculations, which are fast; R-Tree determines intersection relationships based on rectangular boundary comparisons, avoiding complex spherical trigonometric formula calculations. The two-level indexes work together to run the R-Tree on a smaller dataset after HEALPix filtering, further reducing unnecessary precise calculations.
[0085] According to embodiments of this application, the node set of the astronomical data index is partitioned and stored in multiple physical files, and the mapping relationship between the partitioned storage node set and the physical files is stored in a mapping table. The mapping table records the mapping relationship from each node identifier to its storage location descriptor. Considering that directly loading the entire R-Tree index file is very time-consuming, in order to accurately load the required R-Tree file, a partitioned R-Tree index is constructed based on the idea of partitioned indexing. Taking the HEALPix partition as an example, assuming that the HEALPix partition mentioned above uses an 8-level partition, it contains more than 780,000 pixel blocks. If no partitioned index is performed, the R-Tree index contains the coordinate range of more than 780,000 pixel blocks, resulting in an excessive storage volume occupied by the index file. The time to load the index file into the cache during the first query is also long. To optimize this situation, a partitioned index based on the HEALPix partition can be created to obtain different partitioned index files. During the query, the partitioned index file is loaded according to the partition where the query range is located, which can greatly reduce the file size of the loaded index file and correspondingly reduce the loading time of the index file. Therefore, when building an R-Tree index, the global data is first divided into logically independent spatial units through HEALPix first-level partitioning, and the R-Tree index files corresponding to each unit are independent of each other. Different HEALPix partition orders can be selected according to requirements. This divide-and-conquer structure avoids the problem of a single global index becoming too large and difficult to maintain. Each index file only manages local data, and the file size remains stable within a controllable range. 
Beneficial effects: 1) When the star table data scale expands to billions or even higher, only the corresponding HEALPix units and their R-Tree files need to be added, without rebuilding the global index. It can scale linearly with data growth, ensuring the long-term availability of the system in big data scenarios. 2) In terms of computing and storage resource utilization: index files of different HEALPix blocks can be loaded and queried in parallel, naturally adapting to multi-core CPUs or distributed nodes. Since query requests are already routed to a limited number of blocks at the HEALPix level, each computing node only needs to process the local index, and there is no global lock contention. 3) Supports distributed scheduling and parallel retrieval, significantly improving throughput. 4) In a cluster environment, HEALPix blocks can be dynamically allocated according to the load. The system can make full use of multi-core parallel computing and distributed environment to achieve high-concurrency retrieval and fast response, thereby improving the system's resource utilization efficiency.
[0086] According to an embodiment of this application, a file containing a spatial encoding ID and an R-tree index is split and stored, and the node markers of the split are recorded for easy retrieval.
[0087] According to embodiments of this application, the entire R-Tree index is stored as multiple index files. For example, assume that the first-level index is obtained by dividing the sky at order k based on HEALPix, giving 12 × 4^k pixel blocks, with one R-Tree index per pixel block. The sky is then divided again based on HEALPix at a second order to obtain a set of sky regions, and the R-Tree indexes of the pixel blocks are split into files according to the sky region to which each pixel block belongs. The number of sky regions determines the number of R-tree index files created. The pointer of the first-level index points to the R-tree storage location in the corresponding storage file.
[0088] According to the embodiments of this application, the first-level index is pruned or split into root nodes and then stored, while the splitting situation (such as the partition index strategy of MySQL) is recorded, thereby splitting a very large index file into multiple smaller storage files.
[0089] According to embodiments of this application, the index exhibits excellent scalability in incremental updates and maintenance. New data only needs to be inserted into its corresponding HEALPix block and the R-Tree index of that block needs to be updated. The partitioned index structure localizes data updates, avoiding frequent reconstruction of the global index. Beneficial effects include: facilitating daily star catalog data updates (e.g., observatories continuously releasing new observational data), supporting high-frequency small-batch updates, and ensuring the long-term stable operation of the retrieval system.
[0090] According to a second aspect of this application, a method for retrieving astronomical data based on an R-tree is provided, comprising: obtaining a target star catalog and a query range; filtering based on the query range and a predetermined sky region division protocol to obtain a spatial encoding set of pixel blocks covered or partially covered by the query range; performing a first filtering based on the spatial encoding set and the astronomical data index of the target star catalog to obtain a block index set, wherein the astronomical data index is constructed according to the construction method of the astronomical data index based on an R-tree provided in the embodiments of this application; performing a second filtering on each block index in the block index set to obtain data rows whose spatial locations are within the query range to obtain a data subset, and merging the data subsets retrieved from each block index to obtain the retrieval result. Taking the B+ tree index of the InnoDB storage engine in MySQL as an example, when the index of the prior art is retrieved, one IO reads one page of data (16KB). For leaf nodes, assuming a row of data is 1KB in size, then one page can only store 16 rows of data. For pixel blocks of astronomical star catalogs, the I / O load is also very high due to the large amount of data that needs to be returned. This application directly filters the block data based on R-trees and their variants and returns the filtered results, instead of returning the data blocks through indexing in the prior art and then filtering them, which greatly reduces the I / O load.
[0091] For example, when constructing the R-tree index, the R-tree index file is named with its corresponding ID, such as 10942.rdat for level 7; during retrieval, the ID number is calculated (such as 10942, 10943), and then the R-tree file corresponding to the filename "ID number.rdat" is directly loaded for retrieval (such as 10942.rdat, 10943.rdat).
[0092] This application combines an index based on a predetermined sky region partitioning protocol (or a pseudo-two-dimensional spherical index algorithm index) with an R-tree family of indexes. Data retrieval is performed based on the constructed index, and I / O performance is optimized with a caching mechanism. This can significantly improve efficiency and maintain good scalability in large-scale star catalog retrieval.
[0093] To accelerate the angular distance filtering, this application further constructs an R-Tree index based on the spatial position coordinates formed by right ascension and declination of the data blocks after the first-level indexing of the pseudo-two-dimensional spherical index algorithm, thereby obtaining a two-level index as the star table index (as shown in Figure 8):
[0094] First, the first-level index is built with a pseudo-two-dimensional spherical index algorithm based on a celestial sphere partitioning protocol. Taking the HEALPix algorithm with k=8 as an example, the celestial sphere is divided into 12 × 4^8 = 786,432 pixel blocks, and each of these pixel blocks is spatially encoded to obtain a spatial encoding ID, with encoding methods including NESTED encoding or RING encoding. The first-level index then constructs the correspondence between the spatial encoding ID and the data block (or sub-table) consisting of all data rows with that spatial encoding ID.
[0095] Furthermore, this application constructs an R-tree index for each data block based on its spatial location. As shown in Figure 7, the R-Tree index query principle is as follows: when querying, start from the root node and traverse from top to bottom, layer by layer, to find the MBR or MBB regions that intersect the query range. If a region intersects the query range, continue to visit its child nodes; if it does not, prune (skip) the branch so that the data in it is never accessed, thus avoiding unnecessary traversal.
[0096] The criterion for determining whether they intersect is whether the boundary coordinates of the query range overlap the node's MBR (or MBB) on every coordinate axis.
[0097] Query process: Starting from the root node, determine from top to bottom whether the node intersects with the query range.
[0098] Judgment conditions: Assume the node's MBR coordinates are (mbr_xmin, mbr_ymin, mbr_xmax, mbr_ymax), where mbr_xmin and mbr_xmax are the minimum and maximum right ascension of the MBR, and mbr_ymin and mbr_ymax are the minimum and maximum declination of the MBR.
[0099] Let the query range coordinates be (q_xmin, q_ymin, q_xmax, q_ymax). The node intersects the query range if and only if mbr_xmin ≤ q_xmax and mbr_xmax ≥ q_xmin, and mbr_ymin ≤ q_ymax and mbr_ymax ≥ q_ymin.
[0100] If the nodes intersect, continue visiting the child nodes; otherwise, prune and skip them. This mechanism avoids unnecessary traversals and reduces computation.
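The intersection test and top-down pruning described above can be sketched as follows. The `Node` class and the hand-built tree are illustrative assumptions, and the sketch treats right ascension and declination as plain planar coordinates, so it does not handle query ranges that wrap around right ascension 0°/360° or that include a pole.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    mbr: tuple                                    # (xmin, ymin, xmax, ymax)
    children: list = field(default_factory=list)  # child nodes (internal node)
    rows: list = field(default_factory=list)      # (ra, dec) points (leaf node)

def intersects(mbr, q):
    """Rectangle overlap test using the four comparisons given above."""
    mbr_xmin, mbr_ymin, mbr_xmax, mbr_ymax = mbr
    q_xmin, q_ymin, q_xmax, q_ymax = q
    return (mbr_xmin <= q_xmax and mbr_xmax >= q_xmin
            and mbr_ymin <= q_ymax and mbr_ymax >= q_ymin)

def search(node, q):
    """Return the points inside rectangle q, skipping non-intersecting branches."""
    if not intersects(node.mbr, q):
        return []  # prune: nothing below this node can match
    if not node.children:  # leaf node: test each stored point
        return [p for p in node.rows
                if q[0] <= p[0] <= q[2] and q[1] <= p[1] <= q[3]]
    hits = []
    for child in node.children:
        hits.extend(search(child, q))
    return hits

leaf_a = Node(mbr=(10.0, 5.0, 11.0, 6.0), rows=[(10.2, 5.1), (10.9, 5.8)])
leaf_b = Node(mbr=(20.0, 5.0, 21.0, 6.0), rows=[(20.5, 5.5)])
root = Node(mbr=(10.0, 5.0, 21.0, 6.0), children=[leaf_a, leaf_b])
print(search(root, (10.0, 5.0, 10.5, 5.5)))  # [(10.2, 5.1)]
```

In this example `leaf_b` is pruned after a single rectangle comparison, so its rows are never examined.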
[0101] The characteristics of R-Tree applications in astronomical data are as follows: Although the physical data storage of astronomical star catalogs is complex, the data used as keys is still spatial point data. Therefore, each celestial body can be regarded as an MBR / MBB (degenerated into a point). R-trees can make the amount of data within each node similar, so the load of each branch of the R-tree index can maintain a certain balance.
[0102] If the capacity of leaf nodes and internal nodes is set uniformly, the index imbalance problem caused by uneven spatial distribution of celestial bodies can be alleviated (more balanced than when HEALPix is used alone).
[0103] Compared to existing HEALPix-based indexes where the key of the leaf node stores the space-encoded ID and the value stores the actual data (data block or pointer to data block), the pseudo-two-dimensional spherical index algorithm of the first layer of this application (taking the HEALPix-based index as an example) stores the space-encoded ID of the leaf node and no longer stores the data block or a pointer to the storage location of the data block, but instead stores the R-Tree index or a pointer to the R-Tree index.
[0104] This application combines the advantages of a spherical two-dimensional index based on celestial region partitioning (using HEALPix as an example only) and R-Tree, avoiding the shortcomings of using either alone. The retrieval process, shown in Figure 9, is as follows:
[0105] (1) First-level filtering (or first-layer filtering) based on HEALPix partitions: after the user inputs the coordinates of the query center and the angular radius, the numbers of the pixel blocks covered by the query range are calculated using HEALPix, and only those pixel blocks are retained. For example, in Figure 4 the search range (shown as a red circle) covers the pixel blocks numbered 1 to 16.
[0106] (2) R-Tree Second-Level Filtering: For the star table file corresponding to each pixel block, the established R-Tree index is loaded according to the pixel block number (spatial encoding ID), and fast filtering based on MBB / MBR intersection judgment is performed. Only nodes within the query range are retained, and those outside the range are directly excluded, without the need for complex spherical distance calculation.
[0107] (3) Precise angular distance calculation: Precise angular distance calculation is performed only on the data within the nodes filtered by R-Tree to obtain the final result.
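The three stages above can be sketched end to end as follows. This is a simplified illustration, not the implementation of this application: the set of covered pixel block numbers is passed in as `covered_ids` rather than computed with HEALPix, the second stage is a flat rectangle comparison standing in for an R-tree traversal, and `cone_bbox` assumes the cone does not reach a pole or cross right ascension 0°/360°.

```python
import math

def angular_distance_deg(ra1, dec1, ra2, dec2):
    """Haversine angular distance in degrees."""
    a1, d1, a2, d2 = map(math.radians, (ra1, dec1, ra2, dec2))
    h = (math.sin((d2 - d1) / 2) ** 2
         + math.cos(d1) * math.cos(d2) * math.sin((a2 - a1) / 2) ** 2)
    return math.degrees(2 * math.asin(min(1.0, math.sqrt(h))))

def cone_bbox(ra0, dec0, r):
    """RA/Dec rectangle (xmin, ymin, xmax, ymax) enclosing the cone."""
    dra = math.degrees(math.asin(math.sin(math.radians(r))
                                 / math.cos(math.radians(dec0))))
    return (ra0 - dra, dec0 - r, ra0 + dra, dec0 + r)

def retrieve(blocks, covered_ids, ra0, dec0, r):
    """Return (results, checked): matches and number of exact distance tests."""
    xmin, ymin, xmax, ymax = cone_bbox(ra0, dec0, r)
    results, checked = [], 0
    for pid in covered_ids:                  # stage 1 output: covered blocks
        for ra, dec in blocks.get(pid, []):  # stage 2: rectangle comparison
            if xmin <= ra <= xmax and ymin <= dec <= ymax:
                checked += 1                 # stage 3: exact angular distance
                if angular_distance_deg(ra, dec, ra0, dec0) <= r:
                    results.append((ra, dec))
    return results, checked

blocks = {
    1: [(292.0, 10.0), (292.05, 10.02), (292.5, 10.5)],
    2: [(291.95, 9.97), (291.0, 9.0), (292.09, 10.09)],
    3: [(100.0, 0.0)],
}
results, checked = retrieve(blocks, covered_ids=[1, 2], ra0=292.0, dec0=10.0, r=0.1)
print(len(results), checked)  # 3 4
```

Of the six rows in the covered blocks, only four reach the trigonometric distance test, and the row near the corner of the rectangle is rejected there.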
[0108] According to an embodiment of this application, after the algorithm based on the sky area division protocol calculates the number of the covered pixel blocks, the corresponding R-tree index is loaded according to the pixel block number, MBB / MBR filtering is performed based on the R-tree index to obtain the search results, and the merged result is returned after merging the spatial filtering results of all pixel blocks.
[0109] According to an embodiment of this application, after the algorithm based on the sky area division protocol calculates the pixel block number to be filtered, the R-Tree index of the pixel block number is first found based on the first-level index and returned. Then, MBB / MBR filtering is performed based on the R-tree index to obtain the filtering result. After merging the spatial filtering results of all pixel blocks, the merged result is returned.
[0110] According to the embodiments of this application, after the algorithm based on the sky area division protocol calculates the pixel block number to be filtered, the R-Tree index of the pixel block number is first found based on the first-level index, and the MBB / MBR filtering of the R-tree index is further performed to obtain the filtering result. After merging the spatial filtering results of all pixel blocks, the merged result is returned.
[0111] According to an embodiment of this application, when the R-Tree index is stored in multiple files, the corresponding R-Tree file is first located based on the pixel block number covered by the filter, and then the corresponding R-Tree is loaded.
[0112] The R-Tree can be serialized for storage; at query time it only needs to be deserialized and read, which is fast.
[0113] For example, this invention uses an R-Tree index structure to organize the star table data and stores the R-Tree node and leaf data in binary serialization. The serialized R-Tree can be directly written to disk, retaining node information and minimum bounding rectangle (MBR) information, so that subsequent queries can directly deserialize and load it without re-parsing the original text data (such as CSV files). In file mode, since the original data is usually in text format, the first query requires file reading, line-by-line parsing, field splitting, and filtering operations, which is costly. By serializing the R-Tree, the index and leaf data are stored in binary form, maintaining the original structural information. Subsequent queries only need to deserialize and load from disk or memory, without repeating the expensive text parsing and split / filter operations, thus significantly reducing query time. This provides the following advantages: (1) When loading the R-tree for the first time, the complete index can be obtained through deserialization, eliminating the need to repeatedly parse text files and reducing CPU computation and I / O overhead; (2) Reduced system resource consumption: avoids frequent parsing of large-scale CSV files, reducing memory and processor usage; (3) Support for large-scale datasets: the serialized index can be persisted to disk or distributed storage, enabling fast access and efficient querying of massive amounts of data; (4) Improved repeatable query performance: since the index structure has been persisted, subsequent repeated queries can directly load the serialized structure, further shortening the query response time.
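A minimal sketch of binary serialization and reloading of one block index is given below. The use of Python's `pickle` module is an assumption made for illustration (this application does not specify the binary format), the dictionary used as the tree is a placeholder, and the file name follows the "ID number.rdat" naming convention mentioned earlier.

```python
import pickle
import tempfile
from pathlib import Path

def save_block_index(index, pixel_id, directory):
    """Write one block index to '<pixel_id>.rdat' in binary form."""
    path = Path(directory) / f"{pixel_id}.rdat"
    with open(path, "wb") as fh:
        pickle.dump(index, fh, protocol=pickle.HIGHEST_PROTOCOL)
    return path

def load_block_index(pixel_id, directory):
    """Deserialize the block index for a pixel block number."""
    with open(Path(directory) / f"{pixel_id}.rdat", "rb") as fh:
        return pickle.load(fh)

with tempfile.TemporaryDirectory() as tmp:
    # Placeholder tree: one leaf node with its MBR and two data rows.
    tree = {"mbr": (10.0, 5.0, 11.0, 6.0), "rows": [(10.2, 5.1), (10.9, 5.8)]}
    path = save_block_index(tree, 10942, tmp)
    restored = load_block_index(10942, tmp)

print(path.name)         # 10942.rdat
print(restored == tree)  # True
```

Since `pickle` can execute arbitrary code when loading, files in this format should only be read from trusted storage.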
[0114] According to embodiments of this application, astronomical data indexes are loaded into a cache. When the astronomical data index includes multiple physical files stored in partitions, one or more physical files corresponding to the spatial encoding set are loaded into the cache. This reduces duplicate data loading and improves cache hit rate: During the secondary filtering process, an LRU (Least Recently Used) caching strategy is adopted for the loaded R-Tree index files, temporarily storing the most recently accessed index files in memory. Reasoning: The LRU strategy prioritizes storing frequently accessed index files in the computer cache, avoiding the same index file being loaded from disk multiple times in a short period, thereby reducing disk read / write overhead. Final effect: Reduces disk I / O operations and improves the overall throughput of the retrieval process, especially in batch tasks involving multiple searches of adjacent regions. In astronomical cross-matching (which involves matching star catalog data from different sources based on celestial coordinates to identify records representing the same object in the same celestial region, allowing researchers to obtain rich physical data on the same celestial object in different catalogs and at different dimensions), the process typically requires batch and repetitive searches and comparisons of multiple catalogs in the same or adjacent celestial regions. This results in frequent access to the same spatial index file. LRU caching significantly reduces the time spent repeatedly loading index files, thereby accelerating the overall execution speed of cross-matching. This overcomes the challenges of I / O and parallel processing in existing technologies.
[0115] When loading the R-Tree index file, an LRU caching strategy is employed to temporarily store the index file in the computer cache. This avoids repeated loading and deserialization operations, further improving retrieval efficiency. Through this query method, some targets that previously required angular distance calculations are filtered out using the less computationally intensive R-Tree search, reducing the computational load of subsequent complex calculations and improving overall efficiency. It also avoids the burden of directly loading and deserializing large amounts of R-Tree files. Furthermore, the R-Tree index file can be reused long-term after a single loading, significantly saving researchers time in data sifting.
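The LRU caching of loaded index files can be sketched as follows. The class name, the `loader` callback, and the capacity of 2 are illustrative assumptions; a real loader would deserialize the index file for the given pixel block number.

```python
from collections import OrderedDict

class LRUIndexCache:
    """Keep the most recently used block indexes in memory."""

    def __init__(self, loader, capacity=2):
        self.loader = loader
        self.capacity = capacity
        self._items = OrderedDict()
        self.loads = 0  # number of times the loader was actually called

    def get(self, pixel_id):
        if pixel_id in self._items:
            self._items.move_to_end(pixel_id)  # mark as most recently used
            return self._items[pixel_id]
        self.loads += 1
        index = self.loader(pixel_id)
        self._items[pixel_id] = index
        if len(self._items) > self.capacity:
            self._items.popitem(last=False)    # evict least recently used
        return index

# Hypothetical loader that returns a label instead of reading a file.
cache = LRUIndexCache(loader=lambda pid: f"index-for-{pid}", capacity=2)
for pid in (10942, 10943, 10942, 10944, 10943):
    cache.get(pid)
print(cache.loads)  # 4
```

Five requests trigger four loads here: the repeated request for 10942 is served from memory, while 10943 has been evicted by the time it is requested again. A larger capacity would avoid that reload.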
[0116] According to embodiments of this application, when performing a join operation on multi-band data from multiple target star catalogs, the results can be returned faster compared to existing technologies.
[0117] Table 2 shows the retrieval tests in file mode. The tests included retrieval based on an R-Tree index built after partitioning the Gaia DR3 all-star table (containing 1.8 billion celestial object data entries, occupying 4.1TB of hard drive space) into a 7-level HEALPix file (with internal nodes and leaf nodes limited to a minimum of 2 and a maximum of 4 data entries), and retrieval directly from the files after the 7-level HEALPix partition. The table shows the time taken for both retrieval methods with a query center of (292°, 10°) and a query radius of 0.1°. Specifically, the number of candidate pixels represents the number of HEALPix pixel blocks covered by the query range; the number of loaded files represents the number of files actually containing celestial object data within the aforementioned pixel blocks (some pixel areas may not have celestial objects, hence no corresponding folder exists); the number of candidate file entries represents the total number of celestial objects contained in the aforementioned files; the number of angular distance calculations represents the number of targets that each retrieval method needs to process; the number of query results represents the number of celestial objects within the query range, i.e., the number of retrieval results; and finally, the total query time is the average of 10 experiments.
[0118] Table 2. Comparison of retrieval using the index construction method of this application (R-Tree & HEALPix) with retrieval using the HEALPix index alone.
[0119]
[0120] Table 2 shows that the query range covers 5 pixel blocks, all of which contain celestial objects, for a total of 342,749 targets within these 5 pixel blocks. The retrieval method using only the HEALPix index must calculate the angular distance from every one of these targets to the query center. The R-Tree & HEALPix retrieval method, thanks to the secondary filtering by the R-Tree, excludes most targets and requires only 13,246 calculations, less than 4% of the former (13,246 / 342,749 ≈ 3.9%).
[0121] In the test, retrieval through the R-Tree index built on the level-7 HEALPix file partition took only 1051.5 ms on average, whereas retrieval directly from the HEALPix-partitioned files took 9052 ms, a speedup of about 8.6 times.
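The effect of the secondary filtering can be sketched in Python as follows. This is a minimal illustration, not the implementation of this application: a plain RA/Dec bounding-box test stands in for the R-Tree search, the candidate points are synthetic, and the sketch assumes the query cone neither crosses the RA = 0°/360° boundary nor contains a celestial pole. All function names are hypothetical.

```python
import math
import random


def angular_distance_deg(ra1, dec1, ra2, dec2):
    """Great-circle separation via the haversine formula; all angles in degrees."""
    r1, d1, r2, d2 = map(math.radians, (ra1, dec1, ra2, dec2))
    h = (math.sin((d2 - d1) / 2) ** 2
         + math.cos(d1) * math.cos(d2) * math.sin((r2 - r1) / 2) ** 2)
    return math.degrees(2 * math.asin(min(1.0, math.sqrt(h))))


def cone_bounding_box(ra0, dec0, radius):
    """RA/Dec box enclosing the cone (assumes no RA wrap-around, no pole inside)."""
    d_ra = math.degrees(
        math.asin(math.sin(math.radians(radius)) / math.cos(math.radians(dec0))))
    return ra0 - d_ra, ra0 + d_ra, dec0 - radius, dec0 + radius


def cone_search(points, ra0, dec0, radius, prefilter):
    """Return (matching points, number of angular-distance calculations)."""
    if prefilter:
        # Cheap comparisons first: discard points outside the bounding box.
        lo_ra, hi_ra, lo_dec, hi_dec = cone_bounding_box(ra0, dec0, radius)
        points = [p for p in points
                  if lo_ra <= p[0] <= hi_ra and lo_dec <= p[1] <= hi_dec]
    # Expensive step: exact angular distance for the remaining candidates only.
    matches = [p for p in points
               if angular_distance_deg(p[0], p[1], ra0, dec0) <= radius]
    return matches, len(points)


# Synthetic candidates around the query center used in the test, (292°, 10°).
rng = random.Random(0)
candidates = [(rng.uniform(291.5, 292.5), rng.uniform(9.5, 10.5))
              for _ in range(5000)]
full, n_full = cone_search(candidates, 292.0, 10.0, 0.1, prefilter=False)
fast, n_fast = cone_search(candidates, 292.0, 10.0, 0.1, prefilter=True)
```

Both calls return the same matches, but the prefiltered call evaluates the angular distance only for the points that fall inside the bounding box, which is the kind of saving the comparison of calculation counts above reflects.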
[0122] Based on the above-described method for constructing an astronomical data index based on an R-tree, this application also provides an apparatus for constructing an astronomical data index based on an R-tree. The apparatus is described in detail below with reference to Figure 2.
[0123] Figure 2 shows a schematic block diagram of a construction apparatus for an R-tree-based astronomical data index according to an embodiment of this application.
[0124] As shown in Figure 2, the astronomical data index construction apparatus 200 based on R-tree in this embodiment includes an acquisition module 210, a data block partitioning module 220, a data block index construction module 230, and an astronomical data index construction module 240.
[0125] The acquisition module 210 is used to acquire an astronomical star catalog, which includes multiple data rows, each data row including a spatial location. In one embodiment, the acquisition module 210 can be used to perform the operation S110 described above, which will not be repeated here.
[0126] The data block partitioning module 220 is used to divide the celestial coordinate system into multiple pixel blocks with different spatial codes based on a predetermined celestial region partitioning protocol. It maps data rows to corresponding pixel blocks based on the spatial position of data rows in the astronomical star catalog to obtain the spatial code corresponding to the data row. Data rows with the same spatial code in the astronomical star catalog are partitioned into one data block to obtain multiple data blocks. In one embodiment, the data block partitioning module 220 can be used to perform the operation S120 described above, which will not be repeated here.
[0127] The data block index construction module 230 is used to construct an R-tree index for each data block based on the spatial location of the data rows to obtain the block index of the data block. In one embodiment, the data block index construction module 230 can be used to perform the operation S130 described above, which will not be repeated here.
[0128] The astronomical data index construction module 240 is used to obtain the astronomical data index by mapping each spatial code to a block index. In one embodiment, the astronomical data index construction module 240 can be used to perform the operation S140 described above, which will not be repeated here.
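The cooperation of the four modules described above can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not the implementation of this application: `spatial_code` is a stand-in equal-area grid (uniform in RA and sin(Dec)) rather than HEALPix, the block index is a small bulk-loaded packed R-tree whose nodes hold at most 4 entries (a minimum fill of 2 is not enforced), and a Python `dict` stands in for the numerical index on the spatial code. All names are hypothetical.

```python
import math
from collections import defaultdict


def spatial_code(ra, dec, n_ra=64, n_z=32):
    """Stand-in equal-area grid code (NOT HEALPix): uniform bins in RA and sin(Dec)."""
    i_ra = min(int(ra % 360.0 / 360.0 * n_ra), n_ra - 1)
    i_z = min(int((math.sin(math.radians(dec)) + 1.0) / 2.0 * n_z), n_z - 1)
    return i_z * n_ra + i_ra


def mbr(boxes):
    """Minimum bounding rectangle (ra_min, dec_min, ra_max, dec_max) of several boxes."""
    return (min(b[0] for b in boxes), min(b[1] for b in boxes),
            max(b[2] for b in boxes), max(b[3] for b in boxes))


def build_block_index(rows, capacity=4):
    """Bulk-load a packed R-tree for one data block; a node is (mbr, children, is_leaf)."""
    rows = sorted(rows)
    # Leaf level: each leaf stores up to `capacity` data rows and their MBR.
    level = [(mbr([(r[0], r[1], r[0], r[1]) for r in rows[i:i + capacity]]),
              rows[i:i + capacity], True)
             for i in range(0, len(rows), capacity)]
    # Upper levels: group up to `capacity` child nodes until one root remains.
    while len(level) > 1:
        level = [(mbr([n[0] for n in level[i:i + capacity]]),
                  level[i:i + capacity], False)
                 for i in range(0, len(level), capacity)]
    return level[0]


def search(node, box):
    """Return the data rows inside the query box (ra_min, dec_min, ra_max, dec_max)."""
    bounds, children, is_leaf = node
    if (bounds[2] < box[0] or bounds[0] > box[2]
            or bounds[3] < box[1] or bounds[1] > box[3]):
        return []  # node MBR does not intersect the query box: prune the subtree
    if is_leaf:
        return [r for r in children
                if box[0] <= r[0] <= box[2] and box[1] <= r[1] <= box[3]]
    return [r for child in children for r in search(child, box)]


def build_astronomical_index(catalog):
    """Partition (ra, dec) rows by spatial code, then map each code to its block index."""
    blocks = defaultdict(list)
    for row in catalog:
        blocks[spatial_code(*row)].append(row)
    return {code: build_block_index(rows) for code, rows in blocks.items()}
```

In this sketch `build_astronomical_index` plays the roles of modules 220 to 240 for a catalog of `(ra, dec)` tuples supplied by the acquisition step, and `search` shows how a block index prunes subtrees whose bounding rectangle misses the query box.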
[0129] According to embodiments of this application, any multiple modules among the acquisition module 210, data block partitioning module 220, data block index construction module 230, and astronomical data index construction module 240 can be merged into one module, or any one of these modules can be split into multiple modules. Alternatively, at least some of the functions of one or more of these modules can be combined with at least some of the functions of other modules and implemented in one module. According to embodiments of this application, at least one of the acquisition module 210, data block partitioning module 220, data block index construction module 230, and astronomical data index construction module 240 can be at least partially implemented as hardware circuitry, such as a field-programmable gate array (FPGA), a programmable logic array (PLA), a system-on-a-chip, a system-on-a-substrate, a system-on-package, an application-specific integrated circuit (ASIC), or implemented in hardware or firmware by any other reasonable means of integrating or packaging the circuitry, or implemented in software, hardware, or firmware, or in any suitable combination of any of these three implementation methods. Alternatively, at least one of the acquisition module 210, the data block partitioning module 220, the data block index construction module 230, and the astronomical data index construction module 240 can be at least partially implemented as a computer program module, which can perform corresponding functions when the computer program module is run.
[0130] Figure 3 schematically illustrates a block diagram of an electronic device suitable for implementing a method for constructing an R-tree-based astronomical data index according to an embodiment of this application.
[0131] As shown in Figure 3, an electronic device 300 according to an embodiment of this application includes a processor 301, which can perform various appropriate actions and processes according to a program stored in a read-only memory (ROM) 302 or a program loaded from a storage portion 308 into a random access memory (RAM) 303. The processor 301 may include, for example, a general-purpose microprocessor (e.g., a CPU), an instruction set processor and / or an associated chipset and / or a special-purpose microprocessor (e.g., an application-specific integrated circuit (ASIC)), etc. The processor 301 may also include onboard memory for caching purposes. The processor 301 may include a single processing unit or multiple processing units for performing different actions of the method flow according to an embodiment of this application.
[0132] RAM 303 stores various programs and data required for the operation of electronic device 300. Processor 301, ROM 302, and RAM 303 are interconnected via bus 304. Processor 301 executes various operations of the method flow according to embodiments of this application by executing programs in ROM 302 and / or RAM 303. It should be noted that the programs may also be stored in one or more memories other than ROM 302 and RAM 303. Processor 301 may also execute various operations of the method flow according to embodiments of this application by executing programs stored in said one or more memories.
[0133] According to embodiments of this application, the electronic device 300 may further include an input / output (I / O) interface 305, which is also connected to a bus 304. The electronic device 300 may also include one or more of the following components connected to the input / output (I / O) interface 305: an input section 306 including a keyboard, mouse, etc.; an output section 307 including a cathode ray tube (CRT), liquid crystal display (LCD), etc., and a speaker, etc.; a storage section 308 including a hard disk, etc.; and a communication section 309 including a network interface card such as a LAN card, modem, etc. The communication section 309 performs communication processing via a network such as the Internet. A drive 310 is also connected to the input / output (I / O) interface 305 as needed. A removable medium 311, such as a disk, optical disk, magneto-optical disk, semiconductor memory, etc., is installed on the drive 310 as needed so that computer programs read from it can be installed into the storage section 308 as needed.
[0134] This application also provides a computer-readable storage medium, which may be included in the device / apparatus / system described in the above embodiments; or it may exist independently and not assembled into the device / apparatus / system. The computer-readable storage medium carries one or more programs, which, when executed, implement the method according to the embodiments of this application.
[0135] According to embodiments of this application, the computer-readable storage medium can be a non-volatile computer-readable storage medium, including but not limited to: portable computer disks, hard disks, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), portable compact disk read-only memory (CD-ROM), optical storage devices, magnetic storage devices, or any suitable combination thereof. In this application, the computer-readable storage medium can be any tangible medium containing or storing a program that can be used by or in conjunction with an instruction execution system, apparatus, or device. For example, according to embodiments of this application, the computer-readable storage medium may include ROM 302 and / or RAM 303 and / or one or more memories other than ROM 302 and RAM 303 described above.
[0136] Embodiments of this application also include a computer program product comprising a computer program containing program code for performing the methods shown in the flowchart. When the computer program product is run on a computer system, the program code enables the computer system to implement the method for constructing an R-tree-based astronomical data index provided in the embodiments of this application.
[0137] When the computer program is executed by processor 301, it performs the functions defined in the system / apparatus of this application embodiment. According to the embodiments of this application, the systems, apparatuses, modules, units, etc., described above can be implemented by computer program modules.
[0138] In one embodiment, the computer program may rely on a tangible storage medium such as an optical storage device or a magnetic storage device. In another embodiment, the computer program may also be transmitted and distributed in the form of signals over a network medium, and may be downloaded and installed via communication section 309, and / or installed from removable medium 311. The program code contained in the computer program can be transmitted using any suitable network medium, including but not limited to: wireless, wired, etc., or any suitable combination thereof.
[0139] In such an embodiment, the computer program can be downloaded and installed from a network via the communication section 309, and / or installed from the removable medium 311. When the computer program is executed by the processor 301, it performs the functions defined in the system of this application embodiment. According to the embodiments of this application, the systems, devices, apparatuses, modules, units, etc., described above can be implemented by computer program modules.
[0140] According to embodiments of this application, program code for executing the computer programs provided in the embodiments of this application can be written in any combination of one or more programming languages. Specifically, these computer programs can be implemented using high-level procedural and / or object-oriented programming languages, and / or assembly / machine languages. Programming languages include, but are not limited to, languages such as Java, C++, Python, "C", or similar programming languages. The program code can be executed entirely on the user's computing device, partially on the user's device, partially on a remote computing device, or entirely on a remote computing device or server. In cases involving remote computing devices, the remote computing device can be connected to the user's computing device via any type of network, including a local area network (LAN) or a wide area network (WAN), or it can be connected to an external computing device (e.g., via the Internet using an Internet service provider).
[0141] The flowcharts and block diagrams in the accompanying drawings illustrate the architecture, functionality, and operation of possible implementations of systems, methods, and computer program products according to various embodiments of this application. In this regard, each block in a flowchart or block diagram may represent a module, segment, or portion of code containing one or more executable instructions for implementing a specified logical function. It should also be noted that in some alternative implementations, the functions indicated in the blocks may occur in a different order than those indicated in the drawings. For example, two consecutively indicated blocks may actually be executed substantially in parallel, and they may sometimes be executed in reverse order, depending on the functions involved. It should also be noted that each block in a block diagram or flowchart, and combinations of blocks in a block diagram or flowchart, may be implemented using a dedicated hardware-based system that performs the specified function or operation, or using a combination of dedicated hardware and computer instructions.
[0142] Those skilled in the art will understand that the features described in the various embodiments of this application can be combined and / or integrated in various ways, even if such combinations or integrations are not explicitly described in this application. In particular, the features described in the various embodiments of this application can be combined and / or integrated in various ways without departing from the spirit and teachings of this application. All such combinations and / or integrations fall within the scope of this application.
Claims
1. A method for constructing an astronomical data index based on an R-tree, characterized in that the method includes:
obtaining an astronomical star catalog, which includes multiple data rows and fields including spatial location;
dividing the celestial coordinate system into multiple pixel blocks with different spatial codes based on a predetermined sky region division protocol; mapping each data row to the corresponding pixel block based on the spatial location of the data row to obtain the spatial code corresponding to the data row; and dividing data rows with the same spatial code in the astronomical star catalog into one data block to obtain multiple data blocks;
for each of the multiple data blocks, constructing an R-tree index based on the spatial location of the data rows to obtain the block index of the data block, wherein the R-tree index includes a root node, internal nodes, and leaf nodes, and there is at least one internal node between the root node and the leaf nodes; the key of the root node and the internal nodes of the block index is the minimum bounding space of the data block, and the value is the minimum bounding space of each child node stored in the root node; the key of the leaf node of the block index is the minimum bounding space of the data row of the data entry stored in the leaf node, and the value is the data row or a pointer to the data row; and
obtaining the astronomical data index by mapping each of the spatial codes to the block index, including: obtaining a first index by constructing a numerical index on the spatial code, the key of the leaf node of the first index being the spatial code; and obtaining the astronomical data index by setting the value of the leaf node of the first index to the block index or a pointer to the block index.
2. The method according to claim 1, characterized in that, when constructing the R-tree index, the node capacity of the internal nodes and leaf nodes is within a set range.
3. The method according to claim 1, characterized in that the predetermined sky region division protocol is a multi-level equal area division method, and the division order is determined based on the pixel base number in each dimension of the multi-level equal area division method.
4. The method according to claim 1, characterized in that the node set of the astronomical data index is partitioned and stored in multiple physical files, and the mapping relationship between the partitioned storage node set and the physical files is stored in a mapping table, wherein the mapping table records the mapping relationship between each node identifier and its storage location descriptor.
5. A method for retrieving astronomical data based on R-trees, characterized in that the method includes:
obtaining a target star catalog and a query range;
obtaining, through filtering based on the query range and the predetermined sky region division protocol, a spatial encoding set of the pixel blocks covered or partially covered by the query range;
obtaining a block index set by performing a first filter on the astronomical data index of the target star catalog based on the spatial encoding set, wherein the astronomical data index is constructed according to the method described in any one of claims 1-4;
for each block index in the block index set, performing a second filter sequentially based on the block index to obtain the data rows whose spatial location is within the query range, thus obtaining a data subset; and
merging the data subsets retrieved from each block index to obtain the retrieval result.
6. The method according to claim 5, characterized in that the astronomical data index is loaded into the cache, and when the astronomical data index includes multiple physical files stored in partitions, one or more physical files corresponding to the spatial encoding set are loaded into the cache.
7. A computer-readable storage medium having a computer program or instructions stored thereon, characterized in that, when the computer program or instructions are executed by a processor, they implement the steps of the method according to any one of claims 1 to 6.
8. A computer program product, comprising a computer program or instructions, characterized in that, when the computer program or instructions are executed by a processor, they implement the steps of the method according to any one of claims 1 to 6.