A method for constructing an index and retrieving data for large-scale gene expression data
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- COMP NETWORK INFORMATION CENT CHINESE ACADEMY OF SCI
- Filing Date
- 2023-03-13
- Publication Date
- 2026-06-30
AI Technical Summary
The problem addressed is how, for large-scale gene expression data, to perform continuous streaming processing, transformation, and parsing; construct a unified search view that displays key information; and handle the relationships between data so as to improve user search efficiency and data utilization.
The method uses streaming processing to parse large data files, builds a parsing program that can resume after an interruption, stores gene expression data of different categories in a NoSQL database, handles the relationships between data through association IDs, provides a unified search view that also displays category-specific information, and uses ElasticSearch to store and retrieve the indexes and association relationships.
It enables efficient and interruptible processing and unified view display of large-scale gene expression data, improves the efficiency and accuracy of data retrieval, reduces storage space waste, and meets the requirements for building a unified view of large-scale gene expression data.
Figure CN116414834B_ABST
Abstract
Description
Technical Field
[0001] This invention belongs to the field of applied bioinformatics technology and relates to a method for constructing an index for large-scale gene expression data and a data retrieval method. It is mainly applied to data mining methods using publicly shared gene expression data in biological, genetic and other related fields.
Background Technology
[0002] Gene expression data contains rich information about gene activity and can be used to study biological questions such as perturbed genes, relationships between genes, and gene regulatory mechanisms. Gene expression data analysis has been widely applied in biomedical fields such as gene function prediction, disease mechanisms, clinical diagnosis and treatment, drug screening and development, and pathogen infection.
[0003] With the development of high-throughput detection technologies such as cDNA microarrays, researchers can qualitatively or quantitatively detect gene expression products at the whole-genome level. A large amount of public gene expression data has been accumulated; for example, NCBI's Gene Expression Omnibus (GEO) database is an international public repository used to archive and freely distribute microarray, next-generation sequencing, and other forms of high-throughput functional genomics data submitted by the research community (Tanya Barrett, Ron Edgar, Methods in Enzymology, 2006). As of November 2022, GEO contained over 5.6 million datasets from 24,523 platforms, comprising over 5.39 million samples.
[0004] In publicly shared gene expression data, Samples records describe the metadata of a single sample and the gene expression abundance obtained. Platforms records consist of a summary description of the microarray or sequencer; for microarray-based platforms, this also includes a data table defining the template. Series records link a group of related samples together and provide a focus and description of the entire study. Series records may also contain tables describing data extraction, summarizing conclusions, or analysis. Datasets represent collections of genomic samples that are biologically and statistically comparable. Profiles records contain gene expression maps derived from the dataset database. Profiles data show the expression levels of individual genes in each dataset across all samples, and can be used to quickly determine whether a gene is differentially expressed under different experimental conditions. These five types of data are stored separately by category but are interconnected. For example, a platform record can reference multiple samples from different submitters, while a sample entity can only reference one platform but can be included in multiple series.
[0005] Currently, integrating public data has become a crucial requirement for gene expression data analysis. The sheer volume of public data necessitates continuous streaming processing and unified view presentation techniques in large-scale gene expression data mining and analysis. Furthermore, user efficiency also impacts the effectiveness of data utilization. Therefore, in practical applications, particularly in the implementation of data association and precise retrieval for large-scale gene expression data, key issues to be addressed include: how to perform continuous streaming processing on large data files; how to provide users with a unified view while selectively displaying key information for each data type; how to process and store tabular data for user retrieval and information acquisition; and how to return related data and provide direct links after user searches.
Summary of the Invention
[0006] To address the problems existing in the prior art, the purpose of this invention is to provide a method for constructing an index for large-scale gene expression data and a method for data retrieval.
[0007] This invention proposes streaming processing for large data files and transformation and parsing for large matrix files to solve the difficulties in the actual index construction process. Simultaneously, this invention constructs a unified retrieval view based on multiple underlying indexes and provides a data retrieval method capable of obtaining the relationships between data to provide users with a more complete data view.
[0008] Therefore, the present invention needs to solve the following specific problems:
[0009] (1) Continuous streaming processing of large data files: The raw gene expression data consists of a large number of independent compressed files. A resumable parsing program is constructed so that, when an interruption occurs, parsing can continue from the point of interruption.
[0010] (2) Conversion and parsing of large matrix files: Some data records contain tabular data in text format. These data need to be extracted and converted into a corresponding in-memory model such as a two-dimensional matrix, on which calculations are then performed according to the characteristics of the data. During parsing, the results are filled into the corresponding fields so that users can search them or upper-level applications can visualize them.
[0011] (3) Providing a unified search view together with key information for each type of data: The differences between data types call for a separate parser for each type, but these differences should be transparent to the user. The search results should still reflect the differences; that is, in addition to the common information, the information unique to each type of data must also be displayed.
[0012] (4) Handling the relationships between data, including storage and referencing: Gene expression data are related to each other. To display these relationships when searching and presenting results, they must be extracted and stored when the metadata is constructed.
[0013] The present invention solves the above problems by means of:
[0014] Regarding problem (1): This invention stores the absolute paths of all target files to be parsed line by line in a text file. During the parsing process, it tracks the line of the file currently being parsed. If an interruption occurs, the line number is recorded and persisted to the hard disk or saved to another cache component. When the parsing program resumes, the line number is read and the files before that line number are ignored so that the parsing can continue from the file that was interrupted last time.
[0015] Regarding problem (2): The original data includes key:value lines and tab-separated table records. For the table data, this invention first reads the original file line by line and stores the tab-separated table records in a temporary CSV file. Then, when constructing the metadata, the temporary CSV file is read and its contents are used to construct a two-dimensional matrix. Matrix processing tools are then used to process the rows and columns and fill in the key-value pairs.
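The two-pass handling of table data described above can be sketched roughly as follows. This is a minimal illustration, not the patented parser: the function names `split_record` and `load_matrix` are hypothetical, and the rule "a line containing a tab is a table row" is an assumption made only for this example.

```python
import csv
import tempfile


def split_record(lines, csv_file):
    """Write tab-separated table rows to csv_file; return the other lines.

    Assumption for this sketch: any line containing a tab is a table row.
    """
    writer = csv.writer(csv_file)
    other_lines = []
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        if "\t" in line:
            writer.writerow(line.split("\t"))
        else:
            other_lines.append(line)
    return other_lines


def load_matrix(csv_file):
    """Read the temporary CSV back into column names, row IDs, and a 2-D matrix."""
    csv_file.seek(0)
    rows = list(csv.reader(csv_file))
    header, body = rows[0], rows[1:]
    row_ids = [row[0] for row in body]
    matrix = [[float(value) for value in row[1:]] for row in body]
    return header[1:], row_ids, matrix


# Hypothetical raw record: one key:value line followed by a small table.
raw = [
    "title: example record\n",
    "ID_REF\tS1\tS2\n",
    "probe_a\t0.5\t1.5\n",
    "probe_b\t-0.25\t0.75\n",
]

with tempfile.TemporaryFile("w+", newline="") as tmp:
    other_lines = split_record(raw, tmp)
    columns, row_ids, matrix = load_matrix(tmp)
```

Writing the table to a temporary file first keeps the line-by-line pass cheap; the matrix is only materialized when the metadata for that record is built.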
[0016] Regarding problem (3): This invention constructs different parsers for different categories of gene expression data. The parsers are built in essentially the same way, but because the target data differ, the specific processing differs. For example, different fields need to be extracted, and the values of the extracted fields are not represented identically in the original files. Information common to all categories is stored as key:value pairs with the same name; information unique to a category is stored as its own key:value pairs; information that needs to be displayed but not retrieved is stored only, as shown in the index section of Figure 1. The parsed data is indexed and stored using NoSQL, and the characteristics of NoSQL are used to make the index differences transparent. This invention constructs a corresponding metadata database for each category of gene expression data, sets the same alias for every metadata database, and exposes only this alias to upper-layer applications. The search results return all fields, which the upper-layer application populates and displays to the user according to the data type.
[0017] Regarding problem (4): The subcategories of gene expression data are interconnected, and the retrieval results need to display the target data together with the related entries from other categories. Therefore, when constructing the index metadata dictionary, the IDs of related data need to be stored and indexed together. On the other hand, when displaying detailed information about a given record, there are duplicate many-to-many mappings, meaning that certain types of related information may be referenced repeatedly. Storing this duplicated related-entity information would result in a huge waste of storage space. This invention draws on the way relational databases associate multiple tables through foreign keys: each record stores only the IDs of its related data, while the related entity information is stored once, in its own data record. This approach requires an additional query (i.e., obtaining the related entity through the related ID) when constructing the complete query results.
[0018] The technical solution adopted by the present invention to solve the above problems is as follows:
[0019] A method for constructing an index for large-scale gene expression data, comprising the following steps:
[0020] 1) Construct a corresponding parser for each category of gene expression data; the parser corresponding to the gene expression data of category i is denoted as parser i; i = 1 to N, where N is the total number of categories;
[0021] 2) For each gene expression data in category i, parser i is used to parse it, obtain the metadata of the gene expression data and save it to a JSON document. Then, according to the retrieval requirements, different fields in the JSON document of category i are set to different index types, such as text, keyword, dates, etc., to build the index of category i.
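As an illustration of assigning index types per field, the sketch below writes a mapping in the shape Elasticsearch (version 7 and later) accepts as an index-creation body. Only the field types text, keyword, and date come from the description; every field name except `TaxonmyID` is hypothetical, and the choice of which field gets which type is an assumption.

```python
import json

# Hypothetical mapping for a "series" index. Field names are illustrative only.
series_mapping = {
    "mappings": {
        "properties": {
            "title": {"type": "text"},              # full-text search
            "TaxonmyID": {"type": "keyword"},       # exact-match filtering
            "submission_date": {"type": "date"},    # range queries on dates
            "related_ids": {"type": "keyword"},     # IDs of associated records
            # Stored for display but not searchable (assumed use of "index": False).
            "contact_note": {"type": "text", "index": False},
        }
    }
}


def field_types(mapping):
    """Return a {field name: field type} summary of a mapping body."""
    properties = mapping["mappings"]["properties"]
    return {name: spec["type"] for name, spec in properties.items()}


print(json.dumps(field_types(series_mapping), indent=2))
```

Giving semantically identical fields the same name and type in every category's mapping is what later allows one query to run across all indexes.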
[0022] 3) Set the same alias for the indexes of every category of data and expose only the alias to the upper-layer application. Set the same name and index settings for keys with the same semantics in each index. Take the union of the keys of all indexes and project the key values returned by each type of index onto this union to obtain the common information and unique information of each type of data and generate a unified search view.
[0023] Furthermore, the gene expression data includes DataSets, Series, Platforms, Samples, and Profiles.
[0024] Furthermore, the parser corresponding to the Profiles data parses the Profiles data in the following way: First, for the two-dimensional table data in the Profiles data, calculate the rank value of each sample in the two-dimensional table according to the columns of the two-dimensional table, and calculate the variance of different samples in the two-dimensional table according to the rows of the two-dimensional table; then take the maximum and minimum values of the rank values according to the rows and fill them into the JSON document corresponding to the Profiles data.
[0025] Furthermore, if an interruption occurs during the parsing of the target file, the corresponding parser will be restarted and will start from the file that was not processed last time. It will read the target text file line by line, and perform string splitting and pattern matching on the read lines as needed, extract metadata and store it in the corresponding JSON document.
[0026] Furthermore, when the parser parses the target file, it first obtains the absolute path of the target file and stores it in the text file path_list.txt, and maintains two global variables: curProcessLine to record the line number of the target file currently being processed, and curCounter to record the number of documents currently processed; if an interruption occurs, the two global variables curProcessLine and curCounter are persisted; when the corresponding parser is restarted, the two variables curProcessLine and curCounter are obtained.
[0027] A data retrieval method based on a constructed index, comprising the following steps:
[0028] 1) The application layer queries the index alias for the document ID to be looked up, and all metadata containing that document ID is returned;
[0029] 2) Based on the returned metadata containing the document ID, query other datasets to obtain the document containing the document ID and the associated data documents;
[0030] 3) Generate a data view based on the metadata of the data document returned in step 2).
[0031] The advantages of this invention are as follows:
[0032] This invention proposes an indexing method for large-scale gene expression data that addresses the construction of a unified view over different types of gene expression data, and introduces a method for processing tabular data represented as text. For the relationships between different types of gene expression data, a method for storing and referencing these relationships is presented. Furthermore, for large datasets, a resumable parsing method is provided.
Attached Figure Description
[0033] Figure 1 is a flowchart of gene expression data processing.
[0034] Figure 2 is a diagram of how the API implementation hides the differences between the parsed indexes.
[0035] Figure 3 is an example diagram of processing experimental records as a two-dimensional matrix.
[0036] Figure 4 is a diagram of relationship referencing.
Detailed Implementation
[0037] The present invention will now be described in further detail with reference to the accompanying drawings. The examples given are only for explaining the present invention and are not intended to limit the scope of the present invention.
[0038] This invention uses the processing of GEO raw data as an implementation example. GEO raw data is represented as key:value pairs, one per line; a key may be repeated to represent a value spanning multiple lines. Experimental data consists of tab-separated numerical records, one per line. While the format of each data type is generally consistent, the specific information differs, necessitating separate parsing processes, i.e., separate parsers. Furthermore, because datasets, series, platforms, samples, and profiles are interconnected, these relationships must be considered during processing in order to preserve them and reduce data redundancy.
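A small sketch of reading such key:value lines, collecting repeated keys into lists, might look like the following. The function name `parse_key_value_lines`, the sample keys, and the choice to keep every repeated value in a list are assumptions for illustration only.

```python
from collections import defaultdict


def parse_key_value_lines(lines, sep=":"):
    """Collect key:value lines into {key: [values...]}.

    A repeated key appends to the same list, so a value spanning several
    lines stays together in its original order.
    """
    fields = defaultdict(list)
    for line in lines:
        line = line.strip()
        if not line or sep not in line:
            continue  # not a key:value line (for example, a table row)
        key, _, value = line.partition(sep)
        fields[key.strip()].append(value.strip())
    return dict(fields)


# Hypothetical record text; the keys are illustrative, not GEO's exact labels.
record = parse_key_value_lines([
    "title: Example series",
    "summary: first line of the summary",
    "summary: second line of the summary",
    "sample_id: 1655691",
    "sample_id: 1655692",
])
```

Each parser would then decide per key whether to join the list into one text value or keep it as an array, for example an array of related IDs.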
[0039] Considering the characteristics of the raw data, the specific implementation process of this invention to address the above problems is as follows:
[0040] Specific procedures:
[0041] For the five types of GEO raw data mentioned above, an index is built for each type of data separately. The raw target data for each type of data is represented as a series of compressed text files. The parsing program processes these target data files sequentially. If the processing is interrupted, the parser restarts and starts from the last unfinished file, ignoring the already processed files. In the specific parsing process for each type of data, the parser reads the target text file line by line and performs string splitting, pattern matching, and other operations on the read lines as needed to extract fields and build a JSON document. It also ensures that information with the same semantics is extracted and constructed as identical key-value pairs. If the raw text file contains tabular data that requires retrieval, it is processed as a matrix. The association information of the currently parsed data entries is extracted and stored in the JSON document. After the indexes for the five sub-types of data are built, a unified logical index is constructed based on these indexes using the features of the storage index engine. Data retrieval is performed based on this unified logical index.
[0042] (1) Continuous streaming processing that can be interrupted
[0043] In the specific implementation, the absolute paths of the target files are first obtained using `os.walk` and stored in a text file (path_list.txt), one per line. During parsing, two global variables are maintained: `curProcessLine` records the line number in `path_list.txt` of the file currently being processed, and `curCounter` counts how many documents have been processed. If processing is interrupted for any reason, these two global variables are persisted. When processing resumes, they are read back from the hard drive, and files whose line numbers are lower than `curProcessLine` are skipped. This implementation also makes reprocessing easy: simply reset the two variables to 1. In the parsing script, command-line arguments let the user choose whether to re-parse or continue parsing. Furthermore, some documents from the file being parsed at the moment of interruption may already have been written to the database; using the unique ID of each document as the value of the special field `_id` prevents the same document from being stored twice.
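The resume logic above can be sketched as follows. The names `curProcessLine`, `curCounter`, `path_list.txt`, and `os.walk` come from the description; the JSON state file, the helper names, and saving the state after every file (rather than only at the moment of interruption) are assumptions of this sketch.

```python
import json
import os
import tempfile

FRESH_STATE = {"curProcessLine": 1, "curCounter": 1}  # both reset to 1, as in the text


def build_path_list(root, list_path):
    """Write the absolute path of every file under root, one per line."""
    with open(list_path, "w", encoding="utf-8") as out:
        for dirpath, _dirnames, filenames in os.walk(root):
            for name in sorted(filenames):
                out.write(os.path.abspath(os.path.join(dirpath, name)) + "\n")


def load_state(state_path):
    if os.path.exists(state_path):
        with open(state_path, encoding="utf-8") as f:
            return json.load(f)
    return dict(FRESH_STATE)


def save_state(state_path, state):
    tmp_path = state_path + ".tmp"
    with open(tmp_path, "w", encoding="utf-8") as f:
        json.dump(state, f)
    os.replace(tmp_path, state_path)  # swap the new state file into place


def run(list_path, state_path, parse_file, restart=False):
    """Parse listed files in order, skipping lines below curProcessLine."""
    state = dict(FRESH_STATE) if restart else load_state(state_path)
    with open(list_path, encoding="utf-8") as paths:
        for line_no, line in enumerate(paths, start=1):
            if line_no < state["curProcessLine"]:
                continue  # already finished in an earlier run
            state["curCounter"] += parse_file(line.rstrip("\n"))
            state["curProcessLine"] = line_no + 1
            save_state(state_path, state)
    return state


# Demo: the hypothetical parser fails once on b.txt, then the run is resumed.
seen = []
fail_once = {"armed": True}


def parse_file(path):
    if path.endswith("b.txt") and fail_once["armed"]:
        fail_once["armed"] = False
        raise RuntimeError("simulated interruption")
    seen.append(os.path.basename(path))
    return 1  # number of documents produced from this file


with tempfile.TemporaryDirectory() as work:
    data_dir = os.path.join(work, "data")
    os.mkdir(data_dir)
    for name in ("a.txt", "b.txt", "c.txt"):
        with open(os.path.join(data_dir, name), "w", encoding="utf-8") as f:
            f.write(name)
    list_path = os.path.join(work, "path_list.txt")
    state_path = os.path.join(work, "state.json")
    build_path_list(data_dir, list_path)
    try:
        run(list_path, state_path, parse_file)
    except RuntimeError:
        pass  # the interrupted file stays at curProcessLine
    final_state = run(list_path, state_path, parse_file)
```

Because the interrupted file is parsed again from its start on resume, this scheme relies on the `_id` de-duplication described above to avoid storing its documents twice.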
[0044] (2) Processing of matrix data
[0045] The GEO raw data contains tabular data in several places. For example, the raw data for the samples is a series file. One series record corresponds to multiple sample records. The sample records corresponding to series 67781 include sample 1655691, sample 1655692, sample 1655693, etc., as shown in Table 1.
[0046] Table 1 Example Experimental Data
[0047]
| ID_REF | Sample 1655691 | Sample 1655692 | Sample 1655693 | …… |
| --- | --- | --- | --- | --- |
| 8295_0003_0035 | 0.919258137 | 0.697608257 | 0.457529294 | …… |
| 8295_0003_0041 | -0.156084579 | -0.394908346 | -0.023449314 | …… |
[0048] Each sample record mentioned above actually needs its own column of this table as the value of one of its fields.
[0049] For example, Profiles data requires first calculating the rank value (rank) of the samples by column in the two-dimensional table, then calculating the variance of the different samples by row, and finally filling each Profile data entry with the values from its row. Specifically, the csv module is used to map the CSV file into a two-dimensional matrix in memory, and the DataFrame from the pandas module is used to perform the row, column, and variance calculations on the matrix. Finally, the maximum and minimum rank values are taken by row and filled into a JSON document for storage in Elasticsearch. The calculation process is shown in Figure 3; the specific steps are: ① Calculate the standard deviation by row, ② Calculate the rank value by column, ③ Take the percentage, ④ Take the maximum and minimum rank values by column, and ⑤ Take the original array by row.
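The rank and variance step can be sketched without third-party dependencies as below; the source uses pandas, so this is a stand-in rather than the described implementation. Assumptions: ranks are ordinal with 1 for the smallest value and ties broken by row order, the variance is the sample variance, the percentage step is omitted, and the output field names are hypothetical.

```python
from statistics import variance


def column_ranks(matrix):
    """Rank the values within each column (1 = smallest value in that column)."""
    n_rows, n_cols = len(matrix), len(matrix[0])
    ranks = [[0] * n_cols for _ in range(n_rows)]
    for col in range(n_cols):
        order = sorted(range(n_rows), key=lambda row: matrix[row][col])
        for position, row in enumerate(order, start=1):
            ranks[row][col] = position
    return ranks


def summarize_profiles(row_ids, matrix):
    """Build one document per row with its values, variance, and rank extremes."""
    ranks = column_ranks(matrix)
    documents = []
    for row_id, values, row_ranks in zip(row_ids, matrix, ranks):
        documents.append({
            "_id": row_id,
            "values": values,                 # the original row
            "variance": variance(values),     # across samples, by row
            "min_rank": min(row_ranks),       # rank extremes, by row
            "max_rank": max(row_ranks),
        })
    return documents


# Rows are probes, columns are samples; the numbers are made up.
docs = summarize_profiles(
    ["probe_1", "probe_2", "probe_3"],
    [[1.0, 4.0, 2.0],
     [3.0, 2.0, 1.0],
     [2.0, 6.0, 3.0]],
)
```

Precomputing these per-row summaries at indexing time means a search can filter or sort on them directly instead of recomputing over the matrix.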
[0050] (3) Provide a unified search view and provide differentiated information for each type of data.
[0051] In the specific implementation, ElasticSearch is used as the metadata storage and indexing engine. A parser is built for each type of data to produce JSON-formatted documents and store them in different indexes. Transparency is achieved as follows: the same alias is set for the index of every type of data, and only this alias is exposed to the upper-layer application. Across the different indexes, semantically identical keys have the same name and index settings, so a search on a given field is run across all metadata. For example, using the same name, such as TaxonmyID, for species IDs means that a search on a given TaxonmyID value retrieves all platforms, samples, series, datasets, and expression profiles with that value. To display the type-specific information, the API implementation takes the union of the keys of all indexes and then projects the key-value pairs returned by each type of index onto this union to obtain the common and unique information of each type of data. The interface implementation is shown in Figure 2.
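One possible reading of the union-and-projection step is sketched below on in-memory search hits. Treating "common" keys as those present in every index is an interpretation, and the function name `unified_view` and all field names other than `TaxonmyID` are hypothetical.

```python
def unified_view(hits_by_index):
    """Split each hit into fields common to all indexes and type-specific fields."""
    keys_by_index = {}
    for index_name, hits in hits_by_index.items():
        keys = set()
        for hit in hits:
            keys.update(hit)
        keys_by_index[index_name] = keys

    all_keys = set().union(*keys_by_index.values())
    common_keys = set.intersection(*keys_by_index.values()) if keys_by_index else set()

    rows = []
    for index_name, hits in hits_by_index.items():
        for hit in hits:
            rows.append({
                "type": index_name,
                "common": {key: hit.get(key) for key in common_keys},
                "unique": {key: hit[key] for key in all_keys - common_keys if key in hit},
            })
    return sorted(all_keys), rows


# Hypothetical hits returned through the shared alias, grouped by source index.
all_keys, rows = unified_view({
    "samples": [{"id": "S1", "TaxonmyID": "9606", "channel_count": 1}],
    "platforms": [{"id": "P1", "TaxonmyID": "9606", "technology": "microarray"}],
})
```

The upper-layer application can render the `common` part in a shared result layout and append the `unique` part according to the record type.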
[0052] (4) Handling relationships between data
[0053] The data in each subclass of GEO are interconnected, as shown in Table 2 below.
[0054] Table 2. Relationships among various GEO data types
[0055]
[0056]
[0057] In practice, when parsing the target data, the IDs of the related data are extracted from the original data and combined with the target data's own ID to form an array, which is then stored as a single key-value pair. For example, according to Table 2, datasets, platforms, series, and expression profiles are all associated with samples. Therefore, when these data are processed into metadata, the sample ID is included in their ID field. Based on ElasticSearch's inverted index characteristics, when querying for sample ID 1655691 in the index alias, the query will return all documents whose IDs contain sample 1655691, i.e., this record and all related data documents.
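The effect of indexing the ID array can be imitated with a small in-memory inverted index, as sketched below. The sample and series numbers come from the example above; the `type:number` ID format, the `ids` field name, and the platform and second-sample documents are hypothetical.

```python
from collections import defaultdict


def build_inverted_index(documents):
    """Map every ID appearing in a document's ID array to the documents holding it."""
    postings = defaultdict(set)
    for doc in documents:
        for related_id in doc["ids"]:
            postings[related_id].add(doc["_id"])
    return postings


# Each document stores its own ID together with the IDs of related records.
documents = [
    {"_id": "sample:1655691", "ids": ["sample:1655691", "series:67781", "platform:P1"]},
    {"_id": "sample:1655692", "ids": ["sample:1655692", "series:67781", "platform:P1"]},
    {"_id": "series:67781", "ids": ["series:67781", "sample:1655691", "sample:1655692"]},
    {"_id": "platform:P1", "ids": ["platform:P1", "sample:1655691", "sample:1655692"]},
]

postings = build_inverted_index(documents)
related_to_sample = sorted(postings["sample:1655691"])
```

A single lookup on one sample ID therefore returns the sample itself plus every record that lists it, without a join.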
[0058] The referencing of relationships borrows from the concepts of pointers (references) in programming languages and foreign keys in databases: only the IDs of related data are stored in the JSON document, rather than the entity information of the related data. For example, each record in an expression profile dataset contains information about a series of samples belonging to its dataset: the sample name, the experimental value, the rank of the value, etc. The sample names are the same for all expression profile records belonging to a specific dataset, and a dataset typically corresponds to thousands to tens of thousands of expression profile records. Storing the sample name information in every expression profile record would consume a huge amount of storage space. Therefore, the sample name information, as an entity, is stored as an attribute of the dataset record, while only the dataset ID is stored in the expression profile record. When retrieving an expression profile record, all of its information is first looked up in the expression profile index, and then an additional query on the dataset index is performed based on its dataset attribute to retrieve the sample name information, thus constructing the complete expression profile information. This process of retrieving complete metadata based on relationships is shown in Figure 4.
[0059] The application of this correlation between different data is carried out throughout the entire gene expression data analysis process.
[0060] Taking expression profile data query as an example, when querying data records related to a specific ProfileID via the application programming interface (API), the process first involves querying the expression profile index based on the ProfileID to obtain the metadata information of that profile data record. This metadata includes the dataset ID to which the profile data record belongs. From this dataset ID, the dataset index is used to retrieve Samples information. The API integrates data from different indexes to construct the profile data and related data from other categories. The specific steps are: ① The application queries the API interface for all metadata information of the expression profile based on the ProfileID; ② The API interface queries the expression profile index based on the ProfileID; ③ The metadata information of the profile data record is returned, including the dataset ID to which the profile data record belongs; ④ The dataset ID is used to query the dataset index; ⑤ The dataset returns the queried Samples information to the API interface; ⑥ The API interface constructs complete profile metadata based on the returned Samples information and returns it to the application.
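The six steps above amount to a lookup followed by one extra lookup through the stored dataset ID. A minimal sketch over in-memory dictionaries standing in for the two indexes follows; the function name `get_profile`, the field names, and the profile and dataset IDs are hypothetical, while the sample names and values reuse Table 1.

```python
# Stand-ins for the expression profile index and the dataset index.
profile_index = {
    "profile:1": {
        "id_ref": "8295_0003_0035",
        "dataset_id": "dataset:10",  # only the ID is stored, not the sample names
        "values": [0.919258137, 0.697608257],
    },
}
dataset_index = {
    "dataset:10": {"sample_names": ["Sample 1655691", "Sample 1655692"]},
}


def get_profile(profile_id, profile_index, dataset_index):
    """Return a complete profile record by following its dataset reference."""
    profile = dict(profile_index[profile_id])         # steps 2 and 3
    dataset = dataset_index[profile["dataset_id"]]    # steps 4 and 5
    profile["samples"] = [                            # step 6: merge both results
        {"name": name, "value": value}
        for name, value in zip(dataset["sample_names"], profile["values"])
    ]
    return profile


complete = get_profile("profile:1", profile_index, dataset_index)
```

The second lookup costs one more query per request, which is the trade-off accepted in exchange for storing the sample names once per dataset instead of once per profile record.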
[0061] Although specific embodiments of the invention have been disclosed for illustrative purposes to aid in understanding and implementing the invention, those skilled in the art will understand that various substitutions, variations, and modifications are possible without departing from the spirit and scope of the invention and the appended claims. Therefore, the invention should not be limited to the content disclosed in the preferred embodiments, and the scope of protection claimed by the invention is defined by the claims.
Claims
1. A method for constructing an index for large-scale gene expression data, comprising the following steps: 1) Construct a corresponding parser for each category of gene expression data; The parser corresponding to the gene expression data of category i is denoted as parser i; i = 1 to N, where N is the total number of categories; the categories include datasets, series, platforms, samples, and expression profiles; 2) For each gene expression data of category i, parser i is used to parse it to obtain the metadata of the gene expression data and save it to a JSON document. Then, according to the retrieval requirements, different fields in the JSON document of category i are set with different index types to obtain the index of category i. When constructing the JSON document, the IDs of related data are stored and indexed together. 3) Set the same alias for the indexes of data of the same category and expose the alias only to the upper layer application. Set the same name and index setting for the keys with the same semantics in each index. Take the union of the keys of each index and project the key values returned by each type of index based on this union to obtain the common information and unique information of each type of data and generate a unified search view.
2. The method according to claim 1, characterized in that, The parser corresponding to the expression profile parses the expression profile in the following way: First, for the two-dimensional table data in the expression profile, calculate the rank value of each sample in the two-dimensional table according to the columns of the two-dimensional table, and calculate the variance of different samples in the two-dimensional table according to the rows of the two-dimensional table; then take the maximum and minimum values of the rank values according to the rows and fill them into the JSON document corresponding to the expression profile.
3. The method according to claim 1 or 2, characterized in that, If an interruption occurs during the parsing of the target file, the corresponding parser will be restarted and will start from the file that was not processed last time. It will read the target text file line by line, and perform string splitting and pattern matching on the read lines as needed, extract metadata and store it in the corresponding JSON document.
4. The method according to claim 3, characterized in that, When the parser parses the target file, it first obtains the absolute path of the target file and stores it in the text file path_list.txt. It maintains two global variables: curProcessLine, which records the line number of the target file being processed, and curCounter, which records the number of documents that have been processed. If an interruption occurs, the two global variables curProcessLine and curCounter are persisted. When the corresponding parser is restarted, the two variables curProcessLine and curCounter are retrieved.
5. A data retrieval method based on the index constructed according to claim 1, characterized in that, Based on the document ID to be queried, a query is performed in the alias of the index to return all metadata containing that document ID. Then, based on the returned metadata containing that document ID, the dataset is queried to obtain the document containing that document ID and the data document associated with it.