A method for constructing, integrating, and retrieving custom indexes for large-scale, complex structured data
The method integrates custom index building and data parsing modules to address the management and retrieval of large-scale, complex scientific data, achieving flexible adaptation and efficient retrieval. It is applicable to marine environmental, geophysical, and biological genetic data.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Applications(China)
- Current Assignee / Owner
- NANJING TIANLEI DIGITAL INTELLIGENCE TECHNOLOGY CO LTD
- Filing Date
- 2026-02-12
- Publication Date
- 2026-06-12
AI Technical Summary
Existing technologies struggle to efficiently manage and retrieve large-scale, complex scientific data, especially in scenarios involving marine environmental, geophysical, and biological genetic data acquisition, where the data is massive in scale, complex in format, inefficient in retrieval, and lacks flexibility.
A custom indexing method is adopted, which allows users to configure spatial, feature, and time dimensions through the user interface to generate a multi-dimensional combined index table. Combined with the data parsing and integration module, data volume statistics are performed during data entry, and aggregation or flattening is performed during retrieval to adapt to the data structures of different scientific research fields.
It enables flexible adaptation to multi-source heterogeneous data without modifying the system logic, improving retrieval efficiency, reducing storage and computing costs, and supporting efficient data location and file filtering.
Figure CN122196236A_ABST
Description
Technical Field
[0001] This invention relates to scientific data management and retrieval technology, specifically to a method for constructing, integrating, and retrieving custom indexes for large-scale, complex structured data.
Background Technology
[0002] Scientific data applications often involve sensors and experimental instruments, as in marine environmental data acquisition, geophysical data acquisition, and biological genetic data acquisition.
[0003] The management and retrieval of this type of data face the following challenges:
[0004] First, individual files are huge, usually exceeding hundreds of MB, or even several GB or tens of GB.
[0005] Second, the number of files is enormous, with the cumulative number of files collected over the years potentially exceeding several million or even tens of millions.
[0006] Thirdly, for each type of data, due to differences in data collection instruments and equipment, processing and production software systems, species subcategories, and research fields, coupled with the lack of unified data standards, the data styles (file formats, parsing methods, data fields) are complex and varied.
[0007] Fourthly, the data structure is very complex. Normally, a data table may have only a few data fields (for example, a student record table has name, age, class, etc.), at most a dozen or dozens of fields, while scientific data (such as gene sequencing data) may have more than hundreds of fields, and each organism is different.
[0008] Fifth, data files usually come from different sources. Although the files have been processed to ensure the accuracy of the collected data, differences in collection and detection methods and purposes lead to data duplication, regional overlap, and missing data across multiple data files. For example, among environmental data files for the Indian Ocean, most files contain very little data that satisfies a condition such as shallow-sea salinity (within 100 meters of depth) or that falls within a subdivided area of interest to research users (such as the sea area near a certain port), so those files have no research value for that purpose.
[0009] The application of this type of data has the following characteristics:
[0010] Firstly, storage and computing costs are high. Data volumes often reach hundreds of terabytes or even petabytes, making it difficult to bear the overhead of redundant data storage and repeated processing.
[0011] Secondly, retrieval efficiency is low. Researchers often need to locate data within specific regions, feature ranges, and time periods from millions of documents. Existing methods based on document traversal or fine-grained entry indexing are either slow due to huge I / O overhead or impractical due to the explosion of index size.
[0012] Thirdly, there is a lack of flexibility. Multi-source heterogeneous data has complex formats, and fixed-pattern indexes are difficult to adapt flexibly to the unique data structures and query needs of different scientific research fields (such as oceanography, geology, and geophysics).
[0013] Developing data parsing, indexing, and retrieval methods for complex, large-scale scientific data to support researchers in accessing data is an urgent problem to be solved.
[0014] Existing research includes similar patents as follows:
[0015] The patent CN113642456A, titled "A Gene Sequencing Data Indexing and Rapid Retrieval System," designs an index structure for gene data based on sequence characteristics and file paths, supporting retrieval by gene name, sequencing depth, and other information. However, in practice, after the retrieval, it is still necessary to read the file content to verify whether the data meets the conditions, making the efficiency unacceptable in large-scale data files.
[0016] The patent CN114816789A, titled "Unified Management and Retrieval Platform for Multi-Source Heterogeneous Scientific Data," integrates multi-source data through a unified metadata model and establishes metadata indexes and content summary indexes. However, it also establishes a separate fine-grained index for each data entry. Although this approach appears sound, once the number of files reaches hundreds of thousands or more, the number of index entries grows so rapidly that any distributed cluster will find it difficult to handle.
Summary of the Invention
[0017] The purpose of this invention is to provide a method for building, integrating, and retrieving custom indexes for large-scale, complex structured data.
[0018] The technical solution to achieve the purpose of this invention is: a method for constructing, integrating, and retrieving a custom index for large-scale complex structured data, comprising the following steps:
[0019] Custom index construction steps: Receive the index dimension configuration defined by the user through the interactive interface. The index dimensions include spatial dimension, feature dimension, and time dimension. The spatial dimension configuration supports multi-level grid definition, the time dimension configuration supports multi-level segmentation definition, and the feature dimension configuration includes definitions of numerical intervals, categorical labels, or string matching rules. Based on the configuration, generate a multi-dimensional combined-level index table structure. The index table is used to store the data volume statistics of each data file under multiple dimension combinations.
[0020] Data parsing and integration steps: When data files are entered into the database, the file content is parsed, and each data record is mapped to the corresponding spatial grid, feature interval, and time segment according to the index dimension configuration. The data volume of each file ID-spatial grid ID-feature interval ID-time segment ID is counted, and the statistical results are written into the index table.
[0021] Retrieval steps: Receive user retrieval requests containing spatial regions, feature conditions, and time ranges; convert each request into a corresponding set of dimension identifiers, i.e., the sets of IDs that can be queried in the index table; query the index table based on the set of dimension identifiers to obtain data volume distribution information and a list of associated files; wherein, based on the difference between the dimension level of the user request and the dimension level at which the data is actually stored, perform data volume aggregation or proportional flattening (tiling).
[0022] Furthermore, the grid level of the spatial dimension is associated with the level of detail of the GIS visualization field of view, and is used to dynamically determine the grid level used for the query based on the field of view range when the user searches.
[0023] Furthermore, the index table is a multi-dimensional combined level index table, and each record includes: a unique index identifier, an associated file ID, a spatial grid ID, a feature interval ID, a time segment ID, and the amount of data under that combination.
[0024] Furthermore, the data parsing and integration step also includes: before parsing the file content, automatically matching the original data fields in the file with the standard index fields; if the matching fails, prompting the user to confirm the mapping relationship.
[0025] Furthermore, the retrieval step also includes data aggregation and tiling processing:
[0026] If the grid level requested by the user is higher (finer) than the grid level at which the data is actually stored, the data volume of the coarse-grained grid is spread across the requested fine-grained grids in proportion to the number of sub-grids; if the grid level requested by the user is lower (coarser) than the grid level at which the data is actually stored, the data volumes of multiple fine-grained grids are aggregated into the requested coarse-grained grid.
[0027] Furthermore, the time-dimensional retrieval request processing includes similar aggregation and tiling logic, which adapts to the time segment level requested by the user and the actual time segment level of the data storage.
[0028] Furthermore, the feature dimension includes three types: numerical, categorical, and string, wherein:
[0029] Numerical features support user-defined numerical range divisions to generate feature interval IDs;
[0030] Categorical features support user input of category labels for direct matching;
[0031] String features support user-defined fuzzy matching rules.
[0032] A custom index building, integration, and retrieval system for large-scale complex structured data, used to implement the custom index building, integration, and retrieval method for such large-scale complex structured data, includes:
[0033] A custom index building module is used to receive user-defined index dimension configurations and generate multi-dimensional combined-level index table structures;
[0034] The data parsing integration module is used to parse the content when data files are imported into the database, perform dimension mapping and data volume statistics according to the configuration, and write the data to the index table.
[0035] The retrieval engine module is used to parse user retrieval requests, convert them into a set of dimension identifiers, query the index table, and return the results; the retrieval engine module is also used to aggregate or flatten data based on differences in dimension levels.
[0036] A computer device includes a memory, a processor, and a computer program stored in the memory and executable on the processor, characterized in that, when the processor executes the program, it implements the method for building, integrating, and retrieving custom indexes for large-scale complex structured data.
[0037] A computer-readable storage medium having a computer program stored thereon, characterized in that, when the program is executed by a processor, it implements the method for building, integrating, and retrieving custom indexes for large-scale complex structured data.
[0038] Compared with the prior art, the significant advantages of this invention are:
[0039] (1) It supports full-dimensional customization of spatial grid, feature interval and time segment, which solves the problem that the existing patent index dimensions are fixed and cannot be flexibly adapted to multi-source heterogeneous data. Without modifying the system implementation logic (i.e. modifying the code), it can be seamlessly adapted to various types of scientific data such as marine environment, gene sequencing, geophysics and so on through user interface configuration.
[0040] (2) The data statistics process is moved to the data parsing and integration module. When the file is put into the database, the dimension matching and data volume statistics are completed. During the retrieval stage, only index query and aggregation need to be performed. The data volume distribution can be obtained without traversing the file content.
[0041] (3) A multi-dimensional combined-level index structure is adopted, and the total number of index records is controlled through multi-level management of spatial grids and time segments, so that the index reflects the data distribution while still covering large-scale collections (millions or even tens of millions of similar data files).
Attached Figure Description
[0042] Figure 1 is a system structure block diagram of an embodiment of the present invention, illustrating the relationship between the custom index building module, the data parsing and integration module, and the retrieval engine module.
[0043] Figure 2 is a flowchart of the custom index building module, illustrating the process of a user configuring index dimensions and generating the index table structure.
[0044] Figure 3 is a flowchart of the data parsing and integration module, illustrating the processes of file parsing, dimension mapping, data volume statistics, and index writing.
[0045] Figure 4 is a flowchart of the retrieval engine module, illustrating the process of parsing retrieval requests, querying index tables, performing aggregation / tiling processing, and returning results.
Detailed Implementation
[0046] To make the objectives, technical solutions, and advantages of this application clearer, the following detailed description is provided in conjunction with the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative and not intended to limit the scope of this application.
[0047] This invention provides a specific implementation method for constructing, integrating, and retrieving custom indexes for large-scale, complex structured data. The method mainly includes three core stages: custom index construction, data parsing and integration, and retrieval. A detailed description is provided below with reference to the accompanying drawings and modules.
[0048] (1) Custom index building module
[0049] Depending on the data file type and content, the index dimensions and structure are defined by the user for subsequent data parsing, while controlling the index size and improving retrieval efficiency.
[0050] The custom index building module needs to meet the following three requirements: First, users should be able to build targeted indexes based on the different types and contents of data files on the front-end interface; second, it should be able to quickly return the amount of data that meets the conditions in each file based on the input results of GIS selection and other latitude and longitude information and feature thresholds; and third, since the total number of data files of a certain type often exceeds millions or even tens of millions, it is necessary to ensure the speed of returning all results.
[0051] Before building the index, the total number of index records must be controlled; otherwise, when the number of data files is very large, retrieval time will be unacceptable to users. The design idea of this module is therefore to map continuous latitude and longitude information onto discrete data grids (the grid size can be specified by the user) and to let users define several intervals for continuous feature values. For example, environmental monitoring data includes elements such as temperature, humidity, salinity, and wind force. The temperature intervals can be defined by the user according to their research method as -30° to -10°, -10° to 10°, 10° to 30°, above 30°, and so on. The time dimension and depth information are divided into intervals in the same way.
[0052] To achieve the above requirements, this module provides the following customizable functions on the user interface and is divided into the following two parts:
[0053] A. User-defined configuration section
[0054] Users create a data file type through the front-end user interface. The system automatically provides three index types for users to configure: spatial dimension, feature dimension, and time dimension. The feature dimension is further subdivided into three sub-types: numeric, categorical, and string. The specific operation is as follows:
[0055] Spatial dimension: Allows users to divide space into data grids of different sizes. The system provides preset templates such as 1°×1°, 0.5°×0.5°, and 0.1°×0.1°. Users can also input a custom grid size, such as 0.6°×0.6°.
[0056] Feature dimension: divided into three types.
[0057] Numerical type: continuous values such as temperature, humidity, salinity, and wind speed. Users input or select core features from the data and define their own numerical ranges. Although depth is spatial information, it is also treated as a numerical feature in this system.
[0058] Categorical type: for example, the family and genus of a species. Users input category labels directly, rather than having them extracted by parsing large amounts of file content, which avoids the negative impact of large file sizes on the user experience during configuration.
[0059] String type: used to match strings with certain characteristics, such as abnormal or special values in biological gene data; these can also be input by the user.
[0060] Time dimension: The processing approach is similar to that of numerical element dimensions, the difference being that the time dimension has a periodicity. After selecting the basic time granularity (such as year, quarter, month, day), users can construct custom time segments by defining periods (such as every 7 days, every 3 months, every 1 year, etc.).
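As an illustration only, the user-defined configuration described above could be captured in a small data structure. The sketch below is not taken from the patent: the class names, field names, and the sample genus labels and string pattern are all hypothetical, while the grid size, temperature boundaries, and time period mirror the examples given in the text.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class NumericFeature:
    name: str
    # Interval boundaries; values between two boundaries share one interval ID.
    boundaries: List[float]


@dataclass
class TimeSegmentRule:
    granularity: str  # "year", "quarter", "month", or "day"
    period: int       # e.g. 7 with "day" means one segment every 7 days


@dataclass
class DimensionConfig:
    data_type: str
    grid_size_deg: float                      # e.g. 0.5 for a 0.5° x 0.5° grid
    time_rule: TimeSegmentRule
    numeric_features: List[NumericFeature] = field(default_factory=list)
    category_labels: Dict[str, List[str]] = field(default_factory=dict)
    string_patterns: Dict[str, List[str]] = field(default_factory=dict)


# Hypothetical configuration for one data file type.
config = DimensionConfig(
    data_type="marine_environment",
    grid_size_deg=0.5,
    time_rule=TimeSegmentRule(granularity="day", period=7),
    numeric_features=[NumericFeature("temperature", [-30.0, -10.0, 10.0, 30.0])],
    category_labels={"genus": ["Thunnus", "Sardinops"]},
    string_patterns={"sequence_flag": ["*N*"]},
)
```

Keeping the configuration as plain data is one way to let new data types be added through the interface without changing system logic, which is the property the text emphasizes.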
[0061] B. Index storage structure definition section
[0062] This section establishes a logical structure for a multi-dimensional combined-level index based on user configuration. This index is used to store statistical values of data volume for a single data file under the combination of spatial grid, feature interval, and time segmentation.
[0063] The index table design includes the following core fields: unique identifier for the composite index, associated file ID (which stores the actual file path and other metadata information such as source, uploader, and organization in another file information table), as well as spatial grid ID, feature interval ID, and time segment ID.
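A minimal sketch of the two tables described above, written as Python record types. The field names are assumptions chosen for illustration; the `data_count` field corresponds to the per-combination data volume that the index is said to store.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FileInfo:
    # Separate file information table: path and other metadata.
    file_id: int
    path: str
    source: str
    uploader: str
    organization: str


@dataclass(frozen=True)
class IndexRecord:
    # One row of the multi-dimensional combined-level index table.
    index_id: int             # unique identifier of the composite index row
    file_id: int              # refers to FileInfo.file_id
    grid_id: str              # spatial grid ID
    feature_interval_id: str  # feature interval ID
    time_segment_id: str      # time segment ID
    data_count: int           # number of data records under this combination


info = FileInfo(1, "/data/example.nc", "survey", "alice", "example-lab")
row = IndexRecord(1, info.file_id, "204_520", "temperature:3", "2020-01", 2)
```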
[0064] By adopting the above approach, the total size of the index can be effectively controlled within a reasonable range. For example, if a certain type of data contains 10 million files, with each file covering an average of 10 grids, containing 5 types of numerical elements divided into 4 intervals each, and 12 time segments, then the total index size would be 10 million × 10 × 5 × 4 × 12 = 24 billion records. This order of magnitude is within the capacity of all current distributed structured / semi-structured databases or full-text search engines like Elasticsearch.
[0065] However, this approach needs to address a potential problem: different data files may have vastly different temporal and spatial spans (for example, one file contains data from the entire Indian or Pacific Ocean, spanning 1960-2010, while other files only contain daily observation data from the coastal waters of Yantai and Shanghai). Forcing the use of fine-grained grids (0.1°×0.1°) and time segments (weekly) for indexing would dramatically increase the number of index records for files with large spans. Conversely, if the granularity is too coarse, the valuable data in some files would be diluted across large grids and time segments, making it difficult for research users to discover them.
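The product in this example can be checked directly. The short calculation below uses hypothetical variable names and assumes every grid, feature interval, and time segment combination is populated, so the result is an upper bound for the stated averages.

```python
files = 10_000_000          # 10 million files
grids_per_file = 10         # average grids covered per file
numeric_features = 5        # numerical element types
intervals_per_feature = 4   # intervals per numerical element
time_segments = 12          # time segments per file

records_per_file = (
    grids_per_file * numeric_features * intervals_per_feature * time_segments
)
total_records = files * records_per_file

print(records_per_file)  # 2400
print(total_records)     # 24000000000, i.e. 2.4e10 (24 billion)
```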
[0066] To address this, the present invention introduces the concept of multi-level granularity during configuration and retrieval. For example, a spatial data grid can include multiple levels such as Level 1 (1°×1°), Level 2 (0.5°×0.5°), and Level 3 (0.1°×0.1°); a time segment can include multiple levels such as Level 1 (year), Level 2 (month), Level 3 (week), and Level 4 (day). In the user-defined configuration interface, users can select a suitable storage grid level and time segment level for each file type based on their understanding of the data files. When a user performs a backend retrieval, if the granularity level corresponding to the requested field of view or time range is inconsistent with the data storage level, the system will perform adaptation processing through the aggregation and tiling mechanism described later.
[0067] After the user completes the configuration and clicks save, the system will automatically generate the physical index table structure corresponding to this type of data based on the information entered.
[0068] (2) Data parsing and integration module
[0069] This module, based on user-defined index configurations, parses file content and statistically analyzes data distribution during the data file import process, then populates the index table with the statistical results. This essentially moves the data statistical work, which researchers typically perform during retrieval, to the data import and parsing stage. Since users generally have a higher tolerance for data import processing time than for interactive retrieval, this design sacrifices some import time in exchange for optimal retrieval performance.
[0070] The data parsing integration module mainly consists of two parts: file content parsing and data index writing. It should be noted that the data files typically undergo preprocessing such as format conversion and cleaning before this module runs.
[0071] A. File Content Parsing Section
[0072] The first step is to read the file metadata and automatically match the original data field names within the file with the standard fields defined in the custom index dimension. If field name mismatches exist, they can be correlated using a preset data dictionary or rules. If automatic matching fails, a system alarm is triggered, prompting the user to manually confirm or establish a mapping relationship.
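The field matching step can be sketched as a lookup against the standard fields followed by a preset dictionary. This is only an illustration: the standard field set, the dictionary entries, and the function name `match_fields` are hypothetical.

```python
# Hypothetical standard fields defined by the custom index dimensions.
STANDARD_FIELDS = {"latitude", "longitude", "time", "temperature", "salinity"}

# Hypothetical preset data dictionary mapping known aliases to standard fields.
FIELD_DICTIONARY = {
    "lat": "latitude",
    "lon": "longitude",
    "datetime": "time",
    "temp": "temperature",
    "sal": "salinity",
}


def match_fields(raw_fields):
    """Return (mapping, unmatched) for the raw field names found in a file."""
    mapping, unmatched = {}, []
    for raw in raw_fields:
        key = raw.strip().lower()
        if key in STANDARD_FIELDS:
            mapping[raw] = key
        elif key in FIELD_DICTIONARY:
            mapping[raw] = FIELD_DICTIONARY[key]
        else:
            # Automatic matching failed; the user must confirm a mapping.
            unmatched.append(raw)
    return mapping, unmatched


mapping, unmatched = match_fields(["LAT", "lon", "Temp", "psal_adj"])
print(mapping)    # {'LAT': 'latitude', 'lon': 'longitude', 'Temp': 'temperature'}
print(unmatched)  # ['psal_adj']
```

In this sketch, a non-empty `unmatched` list is the point at which the system would raise its alarm and ask the user to establish the mapping manually.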
[0073] The second step is to read the file data content, complete the matching of spatial, temporal, and feature dimensions, and perform data volume accumulation and statistics, including the following:
[0074] Spatial Dimension: Extract the latitude and longitude information of each data point and calculate the spatial grid ID to which the latitude and longitude belong based on the spatial grid level selected by the user;
[0075] Time dimension: Extract the file collection time, and calculate the time segment ID to which the collection time belongs based on the time segment level selected by the user;
[0076] Element Dimension: For numerical elements, match the user-defined element range to which it belongs to obtain the element range ID; for categorical elements, directly match the category label entered by the user; for string elements, determine whether it is associated with a specific string defined by the user through fuzzy matching rules.
[0077] The third step is data volume statistics. For each combination of file ID + spatial grid ID + feature interval ID + time segment ID, the number of data records that fall under that combination is accumulated during parsing: each matching record increases the count for its combination by 1.
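The second and third steps can be sketched together as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the grid ID format (`row_col` counted from -90° latitude and -180° longitude), the half-open interval convention, the monthly segment ID, and the use of a per-record timestamp are all choices made for the example.

```python
import math
from bisect import bisect_right
from collections import Counter
from datetime import datetime

GRID_SIZE_DEG = 0.5                           # storage grid size chosen by the user
TEMP_BOUNDARIES = [-30.0, -10.0, 10.0, 30.0]  # user-defined temperature boundaries


def grid_id(lat, lon, size=GRID_SIZE_DEG):
    """Map a latitude/longitude pair to a discrete grid cell ID."""
    row = math.floor((lat + 90.0) / size)
    col = math.floor((lon + 180.0) / size)
    return f"{row}_{col}"


def interval_id(value, boundaries=TEMP_BOUNDARIES):
    """Interval 0 is below the first boundary; interval i covers [b[i-1], b[i])."""
    return bisect_right(boundaries, value)


def month_segment_id(ts):
    return ts.strftime("%Y-%m")


def count_file(file_id, records):
    """Count records per (file, grid, feature interval, time segment) combination."""
    counts = Counter()
    for lat, lon, ts, temp in records:
        key = (
            file_id,
            grid_id(lat, lon),
            f"temperature:{interval_id(temp)}",
            month_segment_id(ts),
        )
        counts[key] += 1  # each matching record adds 1 to its combination
    return counts


records = [
    (12.30, 80.10, datetime(2020, 1, 5), 25.0),
    (12.40, 80.20, datetime(2020, 1, 9), 27.5),
    (12.60, 80.20, datetime(2020, 2, 1), 31.0),
]
counts = count_file(1, records)
print(counts[(1, "204_520", "temperature:3", "2020-01")])  # 2
print(counts[(1, "205_520", "temperature:4", "2020-02")])  # 1
```

Each entry of `counts` corresponds to one row to be written into the index table.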
[0078] B. Data Index Writing Section
[0079] Write the two types of results obtained from parsing the file content into the database:
[0080] Write file metadata information (such as file ID, path, source, etc.) into a separate file information table.
[0081] The statistical results generated in the second and third steps, namely each combination of file ID-spatial grid ID-feature interval ID-time segment ID and its corresponding data volume, are written one by one into the multi-dimensional combination-level index table corresponding to this data type.
[0082] (3) Retrieval engine module
[0083] When a user submits a retrieval request, the retrieval engine module queries the index table based on the area information, feature range, and time range selected (or input) by the user, and returns the distribution of data that meets the criteria together with a list of associated data files. This allows research users to assess the value of data files and decide whether to download them. This module comprises the following three parts:
[0084] A. Search Request Parsing Section
[0085] This section receives user search requests from application systems. These requests typically include: a GIS region (a polygon composed of multiple latitude and longitude points), feature type and its specific values / ranges, and date and time ranges. This section is responsible for converting these request parameters into a set of dimension identifiers (IDs) that can be directly queried in the index table.
[0086] Spatial Request Transformation: Based on the region information and Level of Detail (LOD) carried in the request, it is transformed into a list of spatial grid IDs. The system internally predefines the correspondence between the LOD of the GIS visualization view and the data grid level. First, it determines the grid level to be used for the query based on the current view, then calculates the IDs of all grids of that level covered by the polygonal region, generating an ID list. For multi-level grid data, its parent grids must also be considered.
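The spatial request transformation can be sketched as below. Two simplifications are assumed and should be read as such: the LOD thresholds in `level_for_lod` are invented for illustration, and the polygon is approximated by its bounding box rather than by exact polygon coverage. The `row_col` grid ID format is also an assumption.

```python
import math

# Grid size in degrees for each grid level (values taken from the text's example).
GRID_LEVELS = {1: 1.0, 2: 0.5, 3: 0.1}


def level_for_lod(lod):
    """Hypothetical mapping from GIS level of detail (zoom) to grid level."""
    if lod <= 5:
        return 1
    if lod <= 9:
        return 2
    return 3


def grid_ids_for_polygon(polygon, lod):
    """Return (level, grid IDs) for the cells touched by the polygon's bounding box.

    polygon is a list of (lat, lon) points.
    """
    level = level_for_lod(lod)
    size = GRID_LEVELS[level]
    lats = [p[0] for p in polygon]
    lons = [p[1] for p in polygon]
    row_min = math.floor((min(lats) + 90.0) / size)
    row_max = math.floor((max(lats) + 90.0) / size)
    col_min = math.floor((min(lons) + 180.0) / size)
    col_max = math.floor((max(lons) + 180.0) / size)
    ids = [
        f"{r}_{c}"
        for r in range(row_min, row_max + 1)
        for c in range(col_min, col_max + 1)
    ]
    return level, ids


polygon = [(10.2, 60.2), (10.2, 61.8), (11.8, 61.8), (11.8, 60.2)]
level, ids = grid_ids_for_polygon(polygon, lod=4)
print(level)  # 1
print(ids)    # ['100_240', '100_241', '101_240', '101_241']
```

A production system would likely replace the bounding box with a true polygon and grid intersection test, since a bounding box can include cells the polygon does not cover.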
[0087] Element request transformation: Based on the element type, the conditions entered by the user are transformed into query conditions;
[0088] If it is a numerical value, the user-input feature threshold range will be matched with the feature interval IDs in the custom index to generate a list of feature interval IDs.
[0089] If the data is categorized, the user-input category tags will be directly converted into equivalent query conditions;
[0090] If it is a string, convert the user-defined pattern into a fuzzy matching query condition supported by the database.
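The three feature conversions above can be sketched as follows. The interval convention (interval i covers the half-open range between consecutive boundaries), the condition dictionary, and the `*` / `?` wildcard syntax translated to an SQL LIKE pattern are all assumptions for illustration; the text does not specify them.

```python
from bisect import bisect_right

TEMP_BOUNDARIES = [-30.0, -10.0, 10.0, 30.0]  # hypothetical user-defined boundaries


def interval_ids_for_range(low, high, boundaries=TEMP_BOUNDARIES):
    """Numerical: IDs of all intervals that overlap the requested [low, high] range."""
    first = bisect_right(boundaries, low)
    last = bisect_right(boundaries, high)
    return list(range(first, last + 1))


def category_condition(field_name, label):
    """Categorical: the label becomes an equality condition."""
    return {"field": field_name, "op": "eq", "value": label}


def like_pattern(user_pattern):
    """String: translate '*' and '?' wildcards into an SQL LIKE pattern."""
    escaped = (
        user_pattern.replace("\\", "\\\\").replace("%", "\\%").replace("_", "\\_")
    )
    return escaped.replace("*", "%").replace("?", "_")


print(interval_ids_for_range(5.0, 20.0))     # [2, 3]
print(category_condition("genus", "Thunnus"))
print(like_pattern("ATG*N"))                 # ATG%N
```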
[0091] Time Request Conversion: Convert the user-input date and time range into a list of time segment IDs. If the request includes time segment level information, the time range is directly split according to that level. If not specified, the system automatically matches a level based on the length of the time range according to preset rules (e.g., more than 3 months are classified as monthly level, less than 1 month as daily level) and then generates the corresponding list of time segment IDs.
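A sketch of the time request conversion follows. The thresholds approximate "3 months" as 90 days and "1 month" as 30 days, and ranges in between are assigned a weekly level; both are assumptions, since the text gives only the two example rules. The segment ID formats are likewise hypothetical.

```python
from datetime import date, timedelta


def pick_level(start, end):
    """Choose a time segment level from the length of the requested range."""
    days = (end - start).days
    if days > 90:
        return "month"
    if days < 30:
        return "day"
    return "week"  # assumption: the text does not specify this middle case


def segment_id(day, level):
    if level == "year":
        return day.strftime("%Y")
    if level == "month":
        return day.strftime("%Y-%m")
    if level == "week":
        iso = day.isocalendar()
        return f"{iso[0]}-W{iso[1]:02d}"
    return day.isoformat()


def segment_ids(start, end, level=None):
    """Return (level, ordered list of time segment IDs covering [start, end])."""
    level = level or pick_level(start, end)
    ids = []
    day = start
    while day <= end:
        seg = segment_id(day, level)
        if not ids or ids[-1] != seg:
            ids.append(seg)
        day += timedelta(days=1)
    return level, ids


print(segment_ids(date(2020, 1, 15), date(2020, 5, 2)))
# ('month', ['2020-01', '2020-02', '2020-03', '2020-04', '2020-05'])
print(segment_ids(date(2020, 3, 1), date(2020, 3, 3)))
# ('day', ['2020-03-01', '2020-03-02', '2020-03-03'])
```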
[0092] B. Data Query and Calculation Section
[0093] Using the spatial grid ID list, feature interval ID list (or query conditions), and time segment ID list generated in the previous section, query the multi-dimensional combined-level index table, filter out the index entries that match all dimension identifiers, sum the data volume under each spatial grid, and collect the list of associated file IDs.
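The query and calculation step can be illustrated with an in-memory filter over index rows. In practice this would be a database query; the row layout `(file_id, grid_id, feature_interval_id, time_segment_id, data_count)` and the function name are assumptions for the sketch.

```python
from collections import defaultdict


def query_index(index_rows, grid_ids, interval_ids, segment_ids):
    """Return (data volume per grid, sorted file IDs) for rows matching all dimensions."""
    grid_ids = set(grid_ids)
    interval_ids = set(interval_ids)
    segment_ids = set(segment_ids)
    per_grid = defaultdict(int)
    files = set()
    for file_id, grid, interval, segment, count in index_rows:
        if grid in grid_ids and interval in interval_ids and segment in segment_ids:
            per_grid[grid] += count
            files.add(file_id)
    return dict(per_grid), sorted(files)


index_rows = [
    (1, "204_520", "temperature:3", "2020-01", 2),
    (1, "205_520", "temperature:4", "2020-02", 1),
    (2, "204_520", "temperature:3", "2020-01", 5),
    (3, "300_100", "temperature:3", "2020-01", 9),
]
volumes, file_ids = query_index(
    index_rows,
    grid_ids=["204_520", "205_520"],
    interval_ids=["temperature:3"],
    segment_ids=["2020-01", "2020-02"],
)
print(volumes)   # {'204_520': 7}
print(file_ids)  # [1, 2]
```

No file content is read at this stage; only the precomputed counts are consulted.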
[0094] C. Data Aggregation and Tiling Processing
[0095] This section handles the key case in which the dimension level used to process a user request differs from the dimension level at which the data is actually stored.
[0096] Tiling: When the user requests a grid level that is finer than the actual data storage grid level, the data volume in the coarse-grained grid needs to be proportionally distributed across its multiple fine-grained sub-grids. For example, suppose the user requests grid level 3 (fine-grained) while the target data file's storage grid level is 2 (coarse-grained), one level 2 grid contains nine level 3 sub-grids and stores a total of 90 data entries, and the user's field of view covers only four of these nine sub-grids. The tiling strategy distributes the data volume evenly, allocating 10 data entries to each fine-grained sub-grid, so each of the four sub-grids in view is reported with 10 entries. While this approach cannot perfectly match the actual data distribution, it avoids the severe performance degradation caused by index explosion, representing a compromise between fidelity to the data distribution and system performance. The time dimension can be processed using the same logic.
[0097] Aggregation: This is the opposite of tiling. When the grid level requested by the user is lower (i.e., coarser) than the grid level at which the data is actually stored, the data volumes of multiple fine-grained grids need to be accumulated into their respective coarse-grained grids. For example, if a request queries a first-level grid but the data is indexed by second-level grids, the data volumes of the second-level grids are summed to give the data volume of the first-level grid to which they belong.
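Both adaptations can be sketched in a few lines. The function names, the `row_col` grid ID format, and the assumption that each coarse cell contains `factor × factor` fine cells are illustrative choices, not details from the patent.

```python
def tile(coarse_count, n_subgrids, requested_subgrids):
    """Spread a coarse grid's count evenly over its sub-grids.

    Only the sub-grids inside the user's field of view are returned.
    """
    share = coarse_count / n_subgrids
    return {g: share for g in requested_subgrids}


def parent_grid(grid_id, factor=2):
    """Coarse cell containing a fine cell, assuming factor x factor fine cells per coarse cell."""
    row, col = (int(x) for x in grid_id.split("_"))
    return f"{row // factor}_{col // factor}"


def aggregate(fine_counts, factor=2):
    """Sum fine-grained counts into their coarse-grained parent grids."""
    totals = {}
    for grid, count in fine_counts.items():
        parent = parent_grid(grid, factor)
        totals[parent] = totals.get(parent, 0) + count
    return totals


# Tiling: 90 entries in a coarse grid with nine sub-grids, four of them in view.
print(tile(90, 9, ["a", "b", "c", "d"]))
# {'a': 10.0, 'b': 10.0, 'c': 10.0, 'd': 10.0}

# Aggregation: three fine grids summed into two coarse parents.
print(aggregate({"204_520": 3, "205_521": 4, "206_520": 5}))
# {'102_260': 7, '103_260': 5}
```

Tiling yields an estimate because the true distribution inside the coarse grid is unknown, whereas aggregation is an exact sum of stored counts.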
[0098] In summary, this invention provides a complete solution for managing large-scale, complex data structures through the collaborative work of the three modules mentioned above. First, users can flexibly customize index dimensions to adapt to multi-source data. Then, during data import, time-consuming parsing and statistical work is completed, and a lightweight composite index is built. Finally, during retrieval, efficient index queries and an intelligent granularity adaptation mechanism quickly return data distribution results. This invention is applicable to data acquisition scenarios such as marine environment, geophysics, and biological genes, supporting research users in achieving efficient data location and high-quality file selection through a complete workflow including GIS selection, threshold adjustment, value judgment, and file download.
[0099] The technical features of the above embodiments can be combined in any way. For the sake of brevity, not all possible combinations of the technical features in the above embodiments are described. However, as long as there is no contradiction in the combination of these technical features, they should be considered to be within the scope of this specification.
[0100] The embodiments described above are merely illustrative of several implementation methods of this application, and while the descriptions are specific and detailed, they should not be construed as limiting the scope of this application. It should be noted that those skilled in the art can make various modifications and improvements without departing from the concept of this application, and these all fall within the protection scope of this application. Therefore, the protection scope of this application should be determined by the appended claims.
Claims
1. A method for constructing, integrating, and retrieving custom indexes for large-scale, complex structured data, characterized in that it includes the following steps:
Custom index construction steps: Receive the index dimension configuration defined by the user through the interactive interface. The index dimensions include spatial dimension, feature dimension, and time dimension. The spatial dimension configuration supports multi-level grid definition, the time dimension configuration supports multi-level segmentation definition, and the feature dimension configuration includes definitions of numerical intervals, categorical labels, or string matching rules. Based on the configuration, generate a multi-dimensional combined-level index table structure. The index table is used to store the data volume statistics of each data file under multiple dimension combinations.
Data parsing and integration steps: When data files are entered into the database, the file content is parsed, and each data record is mapped to the corresponding spatial grid, feature interval, and time segment according to the index dimension configuration. The amount of data under each combination of file ID-spatial grid ID-feature interval ID-time segment ID is counted, and the statistical results are written into the index table.
Search steps: Receive user search requests containing spatial regions, feature conditions, and time ranges; convert user search requests into a corresponding set of dimension identifiers, i.e., sets of various IDs that can be queried in the index table; query the index table based on the set of dimension identifiers to obtain data volume distribution information and a list of associated files; wherein, based on the difference between the dimension level of the user search request and the dimension level of the actual data storage, perform data volume aggregation or proportional flattening processing.
2. The method for constructing, integrating, and retrieving custom indexes for large-scale, complex structured data according to claim 1, characterized in that the grid level of the spatial dimension is associated with the level of detail of the GIS visualization field of view, so that the grid level used for a query is determined dynamically from the extent of the field of view when the user performs a search.
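One possible way to tie the grid level to the field of view is sketched below. It assumes that the cell width halves at each level, starting from 360 degrees at level 0, and that the view should span at least a target number of cells; the function name `grid_level_for_view` and both parameters are hypothetical, and the application does not specify this rule.

```python
def grid_level_for_view(view_width_deg, target_cells=16, max_level=10):
    """Pick the coarsest grid level whose cells are small enough for the view.

    Assumption: level k has cells 360 / 2**k degrees wide.
    """
    for level in range(max_level + 1):
        cell_width = 360.0 / (2 ** level)
        if view_width_deg / cell_width >= target_cells:
            return level
    return max_level  # view is narrower than the finest level can resolve
```

Zooming in shrinks `view_width_deg`, which pushes the selected level toward finer grids until `max_level` is reached.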
3. The method for constructing, integrating, and retrieving custom indexes for large-scale, complex structured data according to claim 1, characterized in that the index table is a multi-dimensional combined-level index table, each record of which includes: a unique index identifier, an associated file ID, a spatial grid ID, a feature interval ID, a time segment ID, and the amount of data under that combination.
4. The method for constructing, integrating, and retrieving custom indexes for large-scale, complex structured data according to claim 1, characterized in that the data parsing and integration step further includes: before parsing the file content, automatically matching the original data fields in the file against the standard index fields; and, if the matching fails, prompting the user to confirm the mapping relationship.
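The field-matching step of claim 4 might look like the following sketch, which assumes matching by a case-insensitive alias table. `STANDARD_ALIASES` and `match_fields` are hypothetical names, and the alias lists are illustrative only; the application does not state how the automatic match is performed.

```python
# Hypothetical alias table: standard index field -> accepted raw field names.
STANDARD_ALIASES = {
    "longitude": {"lon", "long", "longitude", "x"},
    "latitude": {"lat", "latitude", "y"},
    "time": {"time", "timestamp", "datetime", "date"},
}


def match_fields(raw_fields, aliases=STANDARD_ALIASES):
    """Return (mapping, unmatched).

    mapping maps each standard field to the first raw field that matches it;
    unmatched lists the standard fields the user still has to confirm.
    """
    mapping = {}
    for std, names in aliases.items():
        for raw in raw_fields:
            if raw.strip().lower() in names:
                mapping[std] = raw
                break
    unmatched = [std for std in aliases if std not in mapping]
    return mapping, unmatched
```

A non-empty `unmatched` list is the point at which the interface would ask the user to confirm the mapping manually.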
5. The method for constructing, integrating, and retrieving custom indexes for large-scale, complex structured data according to claim 1, characterized in that the retrieval step further includes data aggregation and flattening: if the grid level requested by the user is higher (finer) than the grid level at which the data is actually stored, the amount of data in each coarse-grained grid is spread over the requested fine-grained grids in proportion to the number of sub-grids; if the grid level requested by the user is lower (coarser) than the grid level at which the data is actually stored, the data volumes of the multiple fine-grained grids are aggregated into the requested coarse-grained grid.
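The aggregation and flattening logic of claim 5 can be sketched as follows, assuming a quadtree-style grid in which every cell has exactly four sub-grids and a cell is identified by a `(level, col, row)` tuple. That ID scheme and the function names are assumptions made for illustration.

```python
from collections import defaultdict


def children(cell):
    """The four sub-grids of a cell, one level finer."""
    level, col, row = cell
    return [(level + 1, 2 * col + dx, 2 * row + dy) for dx in (0, 1) for dy in (0, 1)]


def parent(cell):
    """The enclosing cell, one level coarser."""
    level, col, row = cell
    return (level - 1, col // 2, row // 2)


def flatten(counts):
    """Spread each coarse cell's count equally over its sub-grids."""
    out = defaultdict(float)
    for cell, n in counts.items():
        subs = children(cell)
        for sub in subs:
            out[sub] += n / len(subs)
    return dict(out)


def aggregate(counts):
    """Sum the counts of fine cells into their enclosing coarse cell."""
    out = defaultdict(float)
    for cell, n in counts.items():
        out[parent(cell)] += n
    return dict(out)


def adapt(counts, requested_level):
    """Convert counts to the requested level (all input cells share one level)."""
    if not counts:
        return {}
    stored_level = next(iter(counts))[0]
    while stored_level < requested_level:
        counts = flatten(counts)
        stored_level += 1
    while stored_level > requested_level:
        counts = aggregate(counts)
        stored_level -= 1
    return counts
```

Flattening yields estimated, possibly fractional, counts because the true distribution inside a coarse cell is unknown, whereas aggregation is exact. Both preserve the total data volume.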
6. The method for constructing, integrating, and retrieving custom indexes for large-scale, complex structured data according to claim 5, characterized in that the processing of retrieval requests in the time dimension includes aggregation and flattening logic analogous to that of the spatial dimension, which reconciles the time segment level requested by the user with the time segment level at which the data is actually stored.
7. The method for constructing, integrating, and retrieving custom indexes for large-scale, complex structured data according to claim 1, characterized in that the feature dimension includes three types: numerical, categorical, and string; numerical features support user-defined numerical range divisions, from which feature interval IDs are generated; categorical features support user-entered category labels for direct matching; and string features support user-defined fuzzy matching rules.
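The three feature types of claim 7 could each be mapped to an ID as in the sketch below. It assumes half-open numerical intervals, exact label matching, and shell-style wildcard patterns as a stand-in for the fuzzy matching rules, whose form the application does not define. All function names are hypothetical.

```python
from bisect import bisect_right
from fnmatch import fnmatchcase


def numeric_interval_id(value, edges):
    """Return i such that edges[i] <= value < edges[i+1], or None if out of range.

    edges must be sorted in ascending order.
    """
    i = bisect_right(edges, value) - 1
    return i if 0 <= i < len(edges) - 1 else None


def categorical_id(label, tags):
    """Direct match: the label itself serves as the ID if it is a known tag."""
    return label if label in tags else None


def string_rule_id(text, patterns):
    """Index of the first wildcard pattern that matches text, or None."""
    for i, pattern in enumerate(patterns):
        if fnmatchcase(text, pattern):
            return i
    return None
```

Returning `None` for unmatched values leaves it to the caller to decide whether such records are skipped or counted under a catch-all ID.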
8. A system for constructing, integrating, and retrieving custom indexes for large-scale, complex structured data, characterized in that the system implements the method for constructing, integrating, and retrieving custom indexes for large-scale, complex structured data according to any one of claims 1 to 7, and comprises:
a custom index construction module, used to receive user-defined index dimension configurations and generate a multi-dimensional combined-level index table structure;
a data parsing and integration module, used to parse the content of data files when they are imported into the database, perform dimension mapping and data volume statistics according to the configuration, and write the results to the index table; and
a retrieval engine module, used to parse user retrieval requests, convert them into a set of dimension identifiers, query the index table, and return the results, the retrieval engine module also being used to aggregate or flatten data volumes based on differences in dimension levels.
9. A computer device, comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, characterized in that, when the processor executes the program, it implements the method for constructing, integrating, and retrieving custom indexes for large-scale, complex structured data according to any one of claims 1 to 7.
10. A computer-readable storage medium having a computer program stored thereon, characterized in that, when the program is executed by a processor, it implements the method for constructing, integrating, and retrieving custom indexes for large-scale, complex structured data according to any one of claims 1 to 7.