Method and apparatus for storing and collecting statistical information of graph database based on LSM tree

By generating and storing the key information and index key information of the target data in the LSM tree, the problem of low reading and query efficiency of graph databases is solved, and the information of nodes and edges with the same target characteristics can be quickly counted, thus improving the efficiency of information statistics.

CN118861367B · Active · Publication Date: 2026-06-23 · TSINGHUA UNIVERSITY +1

Patent Information

Authority / Receiving Office: CN · China
Patent Type: Patents(China)
Current Assignee / Owner: TSINGHUA UNIVERSITY
Filing Date: 2024-07-05
Publication Date: 2026-06-23


Abstract

The present disclosure relates to the technical field of data storage, and includes a graph database storage and statistical information collection method and device based on an LSM tree. By responding to a write request of target data, a data type represented by the target data in a graph database is obtained; key information of the target data is generated according to a preset encoding rule and the data type matched with an information statistical requirement; the information statistical requirement is used to indicate statistics of nodes and edges with a target feature; the preset encoding rule includes that, for each group of target nodes and target edges with a connection relationship, first key information of the target nodes and second key information of the target edges include a same key part, and the key part is used to indicate the target feature; a first key-value pair including the key information is generated; and the target data is stored into an in-memory table according to the first key-value pair based on an LSM tree; nodes and edges with the same target feature can be obtained by scanning the key part at one time, and the information statistical efficiency is improved.

Description

Technical Field

[0001] This disclosure relates to the field of data storage technology, and in particular to a graph database storage and statistical information collection method and apparatus based on LSM trees.

Background Technology

[0002] Graph databases are a type of non-relational database that uses graph theory to store, map, and analyze data. In a graph database, data is stored in the form of nodes and edges, where nodes represent entities (i.e., objects or concepts in the real world) and edges represent relationships between entities.

[0003] To adapt to large-scale graph database scenarios with many nodes and edges, a Log-Structured Merge-Tree (LSM tree) can be used to store the graph database data. An LSM tree is a multi-level key-value (KV) store that relies on sequential disk writes. The read / write process of an LSM tree is as follows. Upon receiving a write request, the data to be written is written directly to an in-memory table (memtable), which serves as a cache for the graph database. When the memtable is full, it is converted into an immutable memtable, and a new memtable is created for subsequent write requests. The contents of the immutable memtable are then written to disk, forming a Sorted String Table (SSTable) file. Both the memtable and the immutable memtable are sorted by data key; therefore, the SSTable file is also sorted by key. Furthermore, SSTable files are organized hierarchically. Level 0 is written directly from memory. When the data in Level 0 reaches a certain size, it is automatically merged into Level 1; when Level 1 reaches a certain size, it is merged into the next level, and so on. During merging, duplicate or deleted data is removed.

[0004] When a read request is received, the system first searches the memtable in memory. If the data is not found there, the system searches the SSTable files level by level, starting from Level 0.
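The read path described above can be illustrated with a minimal sketch. It assumes, purely for illustration, that the memtable is a Python dict and that each level is a list of key-sorted runs ordered newest first; the function name `lsm_get` and this in-memory layout are hypothetical and are not taken from the disclosure.

```python
from bisect import bisect_left


def lsm_get(key, memtable, levels):
    """Look up `key` in the memtable first, then level by level.

    memtable: dict mapping key -> value (stands in for the in-memory table).
    levels:   levels[0] is Level 0; each level is a list of runs ordered
              newest first; each run is a key-sorted list of (key, value).
    """
    if key in memtable:
        return memtable[key]
    for runs in levels:
        for run in runs:
            # (key,) sorts just before any (key, value) pair with that key,
            # so bisect_left lands on the matching entry if one exists.
            i = bisect_left(run, (key,))
            if i < len(run) and run[i][0] == key:
                return run[i][1]
    return None
```

A key that exists only in a deep level forces a search through every shallower level first, which is the read-side cost that paragraph [0005] refers to.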

[0005] As can be seen from the above LSM tree writing and reading process, although LSM tree-based graph databases can provide high data writing efficiency, complex merging and layer-by-layer search operations can lead to low data reading and query efficiency, resulting in low efficiency in statistical information collection.

Summary of the Invention

[0006] In view of this, this disclosure proposes a graph database storage and statistical information collection method and apparatus based on LSM trees, which can obtain nodes and edges with the same target characteristics in one go by scanning the key part of nodes and edges, thereby improving the efficiency of information statistics.

[0007] According to one aspect of this disclosure, a graph database storage method based on an LSM tree is provided, the method comprising:

[0008] In response to a write request to write target data, the data type of the target data represented in the graph database is obtained; wherein, the data type in the graph database includes nodes and edges for connecting different nodes;

[0009] According to a preset encoding rule matching the information statistics requirements and the data type, key information of the target data is generated; the information statistics requirements are used to indicate the nodes and edges with target features; the preset encoding rule includes: for each group of target nodes and target edges with connection relationships, the first key information of the target node and the second key information of the target edge include the same key part, the key part is used to indicate the target features; wherein, the key information is the first key information or the second key information;

[0010] Generate a first key-value pair including the key information;

[0011] Based on the LSM tree, the target data is stored in a memory table according to the first key-value pair.

[0012] In one possible implementation, the information statistics requirement is used to indicate the statistics of edges sharing the same node, and the target features include node features; accordingly,

[0013] The step of generating key information for the target data according to a preset encoding rule matching the information statistics requirements and the data type includes:

[0014] When the data type of the target data is a node, the first key information is generated based on the node identifier of the target data and the data type.

[0015] When the data type of the target data is an edge, obtain the first key information of the target node connected by the target edge represented by the target data; generate the second key information based on the node identifier in the first key information, the edge identifier of the target data, and the data type;

[0016] The key portion includes the node identifier, which is used to indicate the node characteristics.

[0017] In one possible implementation, the information statistics requirement is used to instruct the statistics of nodes with the same label and the edges connected to each node with the same label, wherein the target features include the label features of the nodes and the node features; accordingly,

[0018] The step of generating key information for the target data according to a preset encoding rule matching the information statistics requirements and the data type includes:

[0019] When the data type of the target data is a node, the first key information is generated based on the node identifier, the node tag identifier, and the data type of the target data;

[0020] When the data type of the target data is an edge, obtain the first key information of the target nodes connected by the target edge represented by the target data; generate the second key information based on the node identifier, the node label identifier, the edge identifier of the target data, and the data type in the first key information;

[0021] The key portion includes the tag identifier of the node and the node identifier, wherein the tag identifier is used to indicate the tag feature and the node identifier is used to indicate the node feature.

[0022] In one possible implementation, the key portion is a prefix of the key information.

[0023] In one possible implementation, the information statistics requirement is further used to instruct the statistics of value information with the same local key information in each first key-value pair; correspondingly, the method further includes:

[0024] An index key is generated based on the local key information and the value information in the first key-value pair;

[0025] Generate an index value based on the key information in the first key-value pair;

[0026] The second key-value pair, consisting of the index key and the index value, is stored in a pre-created index table in the memory table.

[0027] In one possible implementation, the local key information includes a tag identifier, and the value information includes attribute information;

[0028] Accordingly, generating an index key based on the local key information and the value information in the first key-value pair includes:

[0029] The index key is generated based on the tag identifier and the attribute information in the key information.

[0030] According to another aspect of this disclosure, a method for collecting statistical information from a graph database based on an LSM tree is provided, the method comprising:

[0031] A graph database is obtained, and each target data in the graph database is stored in a memory table based on an LSM tree according to a first key-value pair. The first key-value pair includes key information, which is generated according to a preset encoding rule matching the information statistics requirements and the data type of the target data in the graph database. The preset encoding rule includes: for each pair of target nodes and target edges with a connection relationship, the first key information of the target node and the second key information of the target edge include the same key portion. The information statistics requirements are used to indicate the statistics of nodes and edges with target features, and the key portion is used to indicate the target features. Wherein, the key information is either the first key information or the second key information.

[0032] Based on the aforementioned information statistics requirements, the nodes and edges with target characteristics are counted in the graph database.

[0033] In one possible implementation, the step of statistically analyzing nodes and edges with target characteristics in the graph database based on the information statistical requirements includes:

[0034] When merging the sorted string table file at level n of the LSM tree into the sorted string table file at level n+1, the nodes and edges with the target features are counted based on the key portion obtained by scanning the key information during the merging process; wherein n is a positive integer, and the sorted string table file at each level is obtained by merging the memory table level by level.
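As a rough illustration of counting during a merge, the sketch below merges two key-sorted runs and, while doing so, tallies entries per key prefix. The function name `merge_and_count`, the list-of-tuples representation of a sorted string table, and the choice of a fixed-length prefix are all assumptions made for this example.

```python
from collections import Counter


def merge_and_count(newer, older, prefix_len):
    """Merge two key-sorted runs of (key, value) pairs.

    Entries in `newer` win over entries in `older` with the same key.
    While merging, count the surviving entries per key prefix, so the
    statistics come from the same pass that performs the merge.
    """
    merged, counts = [], Counter()
    i = j = 0
    while i < len(newer) or j < len(older):
        take_newer = j == len(older) or (
            i < len(newer) and newer[i][0] <= older[j][0]
        )
        if take_newer:
            if j < len(older) and newer[i][0] == older[j][0]:
                j += 1  # drop the older duplicate of this key
            entry = newer[i]
            i += 1
        else:
            entry = older[j]
            j += 1
        merged.append(entry)
        counts[entry[0][:prefix_len]] += 1
    return merged, counts
```

With a 4-byte node identifier as the prefix, `counts` holds, for each node, the node's own entry plus the edges stored under it, without a separate scan of the keys.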

[0035] In one possible implementation, the information statistics requirement is further used to instruct the statistics of value information with the same local key information in each first key-value pair; the memory table also stores an index table, the index table including second key-value pairs consisting of an index key and an index value, the index key being generated based on the local key information and the value information in the first key-value pair, and the index value being generated based on the key information in the first key-value pair; the method further includes:

[0036] Based on the local key information obtained by scanning each index key during the merging process, the value information with the same local key information is statistically analyzed.

[0037] In one possible implementation, the method further includes:

[0038] In response to a query instruction that queries for nodes and / or edges with target value information, the index keys of each second key-value pair in the index table are scanned to obtain the target key-value pair with target value information;

[0039] In the key information of each first key-value pair in the LSM tree, the first key-value pair including the target index value in the target key-value pair is searched to obtain the node and / or edge with the value information.
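The index table and the two-step query can be sketched as follows. The byte layout of the index key (a 2-byte label identifier, a 2-byte attribute identifier, then the raw attribute value), the function names, and the use of a sorted list for the index table and a dict for the first key-value pairs are assumptions for illustration only.

```python
import struct


def index_entry(first_key, label_id, property_id, property_val):
    """Build a second key-value pair (index key, index value).

    The index key combines local key information (the label identifier)
    with value information (attribute identifier and attribute value);
    the index value is the key information of the first key-value pair.
    """
    index_key = struct.pack(">HH", label_id, property_id) + property_val
    return index_key, first_key


def find_by_property(index_table, first_pairs, label_id, property_id, property_val):
    """Scan the index keys, then fetch the matching first key-value pairs."""
    target = struct.pack(">HH", label_id, property_id) + property_val
    hits = [index_value for index_key, index_value in index_table
            if index_key == target]
    return [(key, first_pairs[key]) for key in hits if key in first_pairs]
```

Keeping the index table as a list of pairs lets several nodes share one index key; how the disclosure handles such collisions is not stated in this section.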

[0040] In one possible implementation, after statistically analyzing the nodes and / or edges with the target characteristics in the graph database based on the information statistical requirements, the method further includes:

[0041] Update the query strategy based on statistical results.

[0042] According to another aspect of this disclosure, a data processing apparatus is provided, comprising: a processor; and a memory for storing processor-executable instructions; wherein the processor is configured to implement the above-described method when executing instructions stored in the memory.

[0043] According to another aspect of this disclosure, a non-volatile computer-readable storage medium is provided that stores computer program instructions thereon, wherein the computer program instructions, when executed by a processor, implement the above-described method.

[0044] According to another aspect of this disclosure, a computer program product is provided, including computer-readable code, or a non-volatile computer-readable storage medium carrying computer-readable code, wherein when the computer-readable code is run in a processor of an electronic device, the processor in the electronic device performs the above-described method.

[0045] In response to write requests for target data, the data type of the target data represented in the graph database is obtained. Key information of the target data is generated according to preset encoding rules and data types that match the information statistics requirements. The information statistics requirements are used to indicate the nodes and edges with target characteristics. The preset encoding rules include: for each pair of target nodes and edges with connectivity, the first key information of the target node and the second key information of the target edge include the same key portion, which indicates the target characteristics. A first key-value pair including the key information is generated. The target data is stored in an in-memory table based on the first key-value pair using an LSM tree. Nodes and edges with the same target characteristics can be counted at once by scanning the key portions of nodes and edges, without needing to parse the complete key-value pair of a node to find its adjacent edges. This solves the problem of low efficiency in traditional information statistics methods and improves the efficiency of counting nodes and edges with the same target characteristics.

[0046] In addition, by performing information statistics during the data layer merging of the LSM tree, the key information obtained during the merging process can be used for information statistics without consuming additional resources to scan the key information, thus saving the resources occupied by information statistics.

[0047] Other features and aspects of this disclosure will become clear from the following detailed description of exemplary embodiments with reference to the accompanying drawings.

Brief Description of the Drawings

[0048] The accompanying drawings, which are included in and form part of this specification, illustrate exemplary embodiments, features, and aspects of this disclosure together with the specification and serve to explain the principles of this disclosure.

[0049] Figure 1 is a flowchart of a graph database storage method based on an LSM tree according to an embodiment of the present disclosure.

[0050] Figure 2 is a schematic diagram of first key information and second key information according to an embodiment of the present disclosure.

[0051] Figure 3 is a schematic diagram of first key information and second key information according to another embodiment of the present disclosure.

[0052] Figure 4 is a schematic diagram of value information according to an embodiment of the present disclosure.

[0053] Figure 5 is a schematic diagram of a process for storing target data based on an LSM tree according to an embodiment of the present disclosure.

[0054] Figure 6 is a schematic diagram of an index key and an index value according to an embodiment of the present disclosure.

[0055] Figure 7 is a flowchart of a graph database statistical information collection method based on an LSM tree according to an embodiment of the present disclosure.

[0056] Figure 8 is a block diagram of an LSM tree-based graph database storage device according to an embodiment of the present disclosure.

[0057] Figure 9 is a block diagram of a graph database statistical information collection device based on an LSM tree according to an embodiment of the present disclosure.

[0058] Figure 10 is a block diagram of an LSM tree-based graph database storage device or statistical information collection device according to an embodiment of this application.

Detailed Implementation

[0059] Various exemplary embodiments, features, and aspects of this disclosure will now be described in detail with reference to the accompanying drawings. The same reference numerals in the drawings denote elements that have the same or similar functions. Although various aspects of the embodiments are shown in the drawings, they are not necessarily drawn to scale unless specifically indicated otherwise.

[0060] The term “exemplary” as used herein means “serving as an example, embodiment, or illustration.” Any embodiment illustrated herein as “exemplary” is not necessarily to be construed as superior to or better than other embodiments.

[0061] Furthermore, to better illustrate this disclosure, numerous specific details are set forth in the following detailed description. Those skilled in the art will understand that this disclosure can be practiced without certain specific details. In some instances, methods, means, components, and circuits well known to those skilled in the art have not been described in detail in order to highlight the main points of this disclosure.

[0062] Figure 1 is a flowchart of an LSM tree-based graph database storage method according to an embodiment of this disclosure. This embodiment is described with the method applied in an electronic device with data processing and storage capabilities, which may be a user terminal or a server. The user terminal includes, but is not limited to, a computer, a mobile phone, or a tablet computer. This embodiment does not limit the type of electronic device or user terminal. As shown in Figure 1, the method includes:

[0063] Step 101: In response to the write request to write the target data, obtain the data type of the target data represented in the graph database.

[0064] A write request is used to write target data into a graph database. A graph database represents each piece of data as nodes and edges; correspondingly, the data types in a graph database include data representing nodes and data representing edges used to connect different nodes.

[0065] Optionally, the data type is carried in the write request. In this case, the data type of the target data is obtained by parsing the write request; or, the user-defined data type is received after the write request is obtained. This embodiment does not limit the method of obtaining the data type.

[0066] A node (Vertex) represents an entity in a graph database. Nodes are typically used to represent objects or concepts in the real world, such as people, places, and events. The specific meaning of a node is determined by the application scenario of the graph database. This embodiment does not limit the application scenario of the graph database or the meaning of the nodes.

[0067] Edges represent relationships between entities in a graph database, that is, relationships between nodes. For example, an edge can represent a friendship between two people, a geographical relationship between two locations, or a purchase relationship between goods, etc. This embodiment does not limit the meaning of edges.

[0068] Optionally, an edge can be a directed edge or an undirected edge. In the case of a directed edge, the two nodes connected by an edge can be the source and the target of the edge, respectively. Correspondingly, an edge connected to a node can be an outgoing edge or an incoming edge of that node.

[0069] In one example, nodes and / or edges have labels that describe the type or category of the node or edge. Labels help to quickly identify and query nodes or edges of a specific type. Optionally, a node or edge has one or at least two labels.

[0070] In another example, nodes and / or edges have properties that describe specific information about the node or edge. For example, properties describing a node representing a person include age and name; and properties describing an edge representing a friendship include intimacy level.

[0071] Step 102: Generate key information of the target data according to the preset encoding rules and data types that match the information statistics requirements; the information statistics requirements are used to indicate the nodes and edges with target features; the preset encoding rules include: for each group of target nodes and target edges with connection relationships, the first key information of the target node and the second key information of the target edge include the same key part, and the key part is used to indicate the target features; wherein, the key information is the first key information or the second key information.

[0072] In this embodiment, the key information of the target data is encoded according to the information statistics requirements, so that the electronic device can obtain statistical results that meet the information statistics requirements by scanning the key information, thereby improving the efficiency of information statistics.

[0073] Here, the key portion refers to a part of the key information. It can be a prefix of the key information or another part of the key information; this embodiment does not limit the position of the key portion in the key information. This embodiment takes the case where the key portion is a prefix of the key information as an example. In this case, during data querying and information statistics, key-value pairs with a specific prefix can be found quickly through a prefix scan, improving the efficiency of both.

[0074] Optionally, depending on the information statistics requirement, the first key information and the second key information can be implemented in ways that include, but are not limited to, the following:

[0075] The first implementation: the information statistics requirement indicates the statistics of edges that share the same node, and the target features include node features. Accordingly, generating the key information of the target data according to the preset encoding rule matching the information statistics requirement and the data type includes: when the data type of the target data is a node, generating the first key information based on the node identifier of the target data and the data type; when the data type of the target data is an edge, obtaining the first key information of the target node connected by the target edge represented by the target data, and generating the second key information based on the node identifier in the first key information, the edge identifier of the target data, and the data type.

[0076] At this point, the key portion includes a node identifier, which indicates the node's characteristics. Scanning the key information allows electronic devices to statistically analyze nodes and edges sharing the same node identifier, thus fulfilling information statistics requirements.

[0077] For example, the preset encoding rule is shown in Figure 2. The first key information of any node (i.e., the vertex key in Figure 2) includes the node identifier vertex_id and the data type kind of the node.

[0078] The second key information of any edge (i.e., the edge key in Figure 2) includes the node identifier src_vertex_id of the starting point connected by the edge, the edge identifier edge_id of the edge, and the data type kind of the edge.

[0079] Here, vertex_id occupies 4 bytes (shown as 4B in Figure 2), edge_id occupies 2 bytes (2B), and the data type kind occupies 1 byte (1B).

[0080] As can be seen from Figure 2, the node identifier is the key part of the first key information and the second key information, and is also their prefix.
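A minimal sketch of this first encoding rule follows. The field widths (4B, 2B, 1B) come from the text; the field order inside the edge key, the big-endian packing, the numeric codes for kind, and the function names are assumptions for illustration, since Figure 2 is not reproduced here.

```python
import struct
from bisect import bisect_left

KIND_VERTEX, KIND_EDGE = 0, 1  # hypothetical codes for the 1-byte kind field


def vertex_key(vertex_id):
    """First key information: vertex_id (4B) + kind (1B)."""
    return struct.pack(">IB", vertex_id, KIND_VERTEX)


def edge_key(src_vertex_id, edge_id):
    """Second key information: src_vertex_id (4B) + kind (1B) + edge_id (2B)."""
    return struct.pack(">IBH", src_vertex_id, KIND_EDGE, edge_id)


def prefix_scan(sorted_keys, prefix):
    """Return all keys that start with `prefix` from a sorted key list."""
    start = bisect_left(sorted_keys, prefix)
    hits = []
    for key in sorted_keys[start:]:
        if not key.startswith(prefix):
            break  # sorted order: no later key can match
        hits.append(key)
    return hits
```

Big-endian packing is chosen so that byte-wise key order matches numeric order of the identifiers; a scan with the 4-byte prefix of node 1, `struct.pack(">I", 1)`, then returns the node's own key followed by the keys of the edges stored under it.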

[0081] The second implementation: the information statistics requirement indicates the statistics of nodes with the same label and of the edges connected to each node with that label, and the target features include the label features and node features of the nodes. Accordingly, generating the key information of the target data according to the preset encoding rule matching the information statistics requirement and the data type includes: when the data type of the target data is a node, generating the first key information based on the node identifier of the target data, the label identifier of the node, and the data type; when the data type of the target data is an edge, obtaining the first key information of the target node connected by the target edge represented by the target data, and generating the second key information based on the node identifier and the label identifier of the node in the first key information, the edge identifier of the target data, and the data type.

[0082] At this point, the key portion includes the node's label identifier and node identifier. The label identifier indicates label characteristics, and the node identifier indicates node characteristics. Scanning the key information allows electronic devices to statistically analyze nodes and edges with the same label identifier and node identifier, thus fulfilling information statistics requirements.

[0083] For example, the preset encoding rule is shown in Figure 3. The first key information of any node (i.e., the vertex key in Figure 3) includes the label identifier label_id, the node identifier vertex_id, and the data type kind of the node.

[0084] The second key information of any edge (i.e., the edge key in Figure 3) includes the label identifier src_label_id of the starting point connected by the edge, the node identifier src_vertex_id of the starting point, the data type kind of the edge, the label identifier edge_label_id of the edge, and the edge identifier edge_id of the edge.

[0085] Here, vertex_id occupies 4 bytes (shown as 4B in Figure 3), edge_id occupies 2 bytes (2B), the data type kind occupies 1 byte (1B), and label_id occupies 2 bytes (2B).

[0086] As can be seen from Figure 3, the label identifier and node identifier of the starting point form the key part of the first key information and the second key information, and are also their prefix. Since the prefix of an edge key is the same as the prefix of the key of its starting point, all outgoing edges of each node can be counted when performing a prefix scan of the LSM tree.
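The label-prefixed layout can be sketched in the same way. The field order follows the listing in paragraph [0084]; the big-endian packing, the kind codes, and the function names are assumptions for illustration.

```python
import struct

KIND_VERTEX, KIND_EDGE = 0, 1  # hypothetical codes for the 1-byte kind field


def labeled_vertex_key(label_id, vertex_id):
    """label_id (2B) + vertex_id (4B) + kind (1B)."""
    return struct.pack(">HIB", label_id, vertex_id, KIND_VERTEX)


def labeled_edge_key(src_label_id, src_vertex_id, edge_label_id, edge_id):
    """src_label_id (2B) + src_vertex_id (4B) + kind (1B)
    + edge_label_id (2B) + edge_id (2B)."""
    return struct.pack(">HIBHH", src_label_id, src_vertex_id,
                       KIND_EDGE, edge_label_id, edge_id)


def out_degrees_for_label(sorted_keys, label_id):
    """One pass over the keys of a label: map vertex_id -> outgoing edge count."""
    prefix = struct.pack(">H", label_id)
    degrees = {}
    for key in sorted_keys:
        if key[:2] < prefix:
            continue
        if key[:2] > prefix:
            break  # sorted order: the label's key range has ended
        _, vertex_id, kind = struct.unpack_from(">HIB", key)
        degrees.setdefault(vertex_id, 0)
        if kind == KIND_EDGE:
            degrees[vertex_id] += 1
    return degrees
```

Because every node and edge key of a label sits in one contiguous key range, the number of nodes with that label and the out-degree of each of them fall out of a single scan.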

[0087] Optionally, if the edge is a directed edge, the node identifier in the second key information includes the node identifier of the starting point and the node identifier of the ending point of the edge connection; correspondingly, the key part may include the node identifier of the starting point or the node identifier of the ending point based on information statistics requirements.

[0088] For example, if the information statistics requirement is to count all outgoing edges from any starting point, then the key part includes the node identifier of the starting point.

[0089] For example, if the information statistics requirement is to count all incoming edges of any endpoint, then the key part includes the node identifier of the endpoint.

[0090] For example, if the information statistics requirement is to count all outgoing edges from any starting point and all incoming edges from any ending point, then the key part includes the node identifier of the starting point or the node identifier of the ending point. In this case, in order to determine whether the edge connected to each node is an incoming edge or an outgoing edge during the statistics, the second key information also includes an edge type identifier, which is used to indicate whether the edge is an incoming edge or an outgoing edge of the node indicated by the node identifier in the key part.
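One way to picture the edge type identifier is sketched below. It assumes that each directed edge is written twice, once under its starting point as an outgoing edge and once under its ending point as an incoming edge; this double write, the byte layout, the type codes, and the function names are assumptions made for the example and are not stated in the text.

```python
import struct

KIND_EDGE = 1              # hypothetical code for the 1-byte kind field
OUT_EDGE, IN_EDGE = 0, 1   # hypothetical edge type identifiers


def directed_edge_keys(src_vertex_id, dst_vertex_id, edge_id):
    """Return (out_key, in_key) for one directed edge.

    Layout: owner vertex_id (4B) + kind (1B) + edge type (1B)
            + peer vertex_id (4B) + edge_id (2B).
    """
    out_key = struct.pack(">IBBIH", src_vertex_id, KIND_EDGE, OUT_EDGE,
                          dst_vertex_id, edge_id)
    in_key = struct.pack(">IBBIH", dst_vertex_id, KIND_EDGE, IN_EDGE,
                         src_vertex_id, edge_id)
    return out_key, in_key


def in_out_counts(edge_keys, vertex_id):
    """Count outgoing and incoming edges stored under one node prefix."""
    prefix = struct.pack(">I", vertex_id)
    counts = {"out": 0, "in": 0}
    for key in edge_keys:
        if not key.startswith(prefix):
            continue
        _, _, edge_type, _, _ = struct.unpack(">IBBIH", key)
        counts["out" if edge_type == OUT_EDGE else "in"] += 1
    return counts
```

The edge type byte is what lets a scan over a single node prefix tell the two directions apart.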

[0091] Take as an example the case shown in Figure 3, where the second key information (edge key) includes the node identifier src_vertex_id of the starting point and the node identifier dst_vertex_id of the ending point, and the key part includes only src_vertex_id. In this case, all outgoing edges from any starting point can be counted. From the second key information of an outgoing edge, the node identifier dst_vertex_id of the ending point connected by that edge can be obtained, and the ending point can then be located based on dst_vertex_id.

[0092] Optionally, if the second key information also includes node label identifiers, then the node label identifiers in the second key information include the label identifier of the starting point and the label identifier of the ending point. Accordingly, the key portion may include the label identifier of the starting point and the node identifier based on information statistics requirements; or, it may include the label identifier of the ending point and the node identifier.

[0093] For example, if the information statistics requirement is to count each starting point with any label and all outgoing edges connected to each starting point, then the key part includes the label and node identifier of the starting point.

[0094] For example, if the information statistics requirement is to count each endpoint with any label and all incoming edges connected to each endpoint, then the key part includes the label identifier and the node identifier of the endpoint.

[0095] For example, if the information statistics requirement is to count all starting points with any label and all outgoing edges connected to each starting point, and to count all ending points with that label and all incoming edges connected to each ending point, then the key part includes the label and node identifier of the starting point, or the label and node identifier of the ending point. In this case, in order to determine whether the edge connected to a given node is an incoming edge or an outgoing edge during the statistics, the second key information also includes an edge type identifier.

[0096] Take as an example the case shown in Figure 4, where the second key information (edge key) includes the label identifier src_label_id and node identifier src_vertex_id of the starting point, the label identifier dst_label_id and node identifier dst_vertex_id of the ending point, and the data type kind, and the key part includes src_label_id and src_vertex_id. In this case, all starting points with a given label identifier, and all outgoing edges connected to each of them, can be counted. From the second key information of an outgoing edge, the node identifier dst_vertex_id of the ending point connected by that edge can be obtained, and the ending point can then be located based on dst_vertex_id.

[0097] Step 103: Generate the first key-value pair including key information.

[0098] Optionally, the encoding rules for the value information of nodes and edges may be the same or different. This embodiment does not limit the encoding rules for value information. In this embodiment, the example of the encoding rules for the value information of nodes and edges being the same is used for illustration. The encoding rules include: generating value information based on attribute information.

[0099] Referring to Figure 5, the value information includes the attribute identifier `property_id` and the attribute value `property_val`. In other words, the value information of a node includes the node's attribute identifier `property_id` and attribute value `property_val`; the value information of an edge includes the edge's attribute identifier `property_id` and attribute value `property_val`. Therefore, the attribute information includes both the attribute identifier and the attribute value.

[0100] In other embodiments, the value information may be other information, and this embodiment does not limit the way the value information is generated.

[0101] For each node, the node's first key information and the node's value information constitute the node's first key-value pair; for each edge, the edge's second key information and the edge's value information constitute the edge's first key-value pair.
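As a rough sketch of how a first key-value pair could be assembled, the following Python fragment encodes the value information as property_id followed by property_val. The length-prefixed byte layout and the names KeyValue, encode_value, decode_value and first_key_value_pair are hypothetical; the embodiment does not prescribe a concrete value encoding.

```python
import struct
from typing import NamedTuple

class KeyValue(NamedTuple):
    key: bytes    # first key information (node) or second key information (edge)
    value: bytes  # encoded value information

def encode_value(property_id: int, property_val: bytes) -> bytes:
    # Assumed layout: 4-byte property_id, 4-byte length, then the raw value bytes.
    return struct.pack(">II", property_id, len(property_val)) + property_val

def decode_value(value: bytes) -> tuple:
    property_id, length = struct.unpack_from(">II", value, 0)
    return property_id, value[8:8 + length]

def first_key_value_pair(key_info: bytes, property_id: int,
                         property_val: bytes) -> KeyValue:
    """Combine key information and value information into one key-value pair."""
    return KeyValue(key_info, encode_value(property_id, property_val))
```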

[0102] Step 104: Store the target data into a memory table based on the first key-value pair using the LSM tree.

[0103] Referring to Figure 6, which shows the LSM-based process for storing target data: the target data is sorted according to its key information and first stored in the LSM tree's memtable, and the write operation is recorded in the write-ahead log (WAL). When the memtable is full, it is converted into an immutable memtable. At the same time, a new memtable is created for subsequent write requests, and the contents of the immutable memtable are written to the level 0 Sorted String Table (SSTable) file on disk. When the level 0 SSTable file is full, it is merged into the level 1 SSTable file. When the level 1 SSTable file is full, it is merged into the level 2 SSTable file, and so on, until the data is merged into the highest-level SSTable file of the LSM tree.
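The write path described above can be sketched as a toy in-memory model. This is not the embodiment's implementation: Python dicts stand in for the memtable and for the SSTable files at each level, capacities are counted in entries, and the class name TinyLSM and its parameters are hypothetical.

```python
class TinyLSM:
    """Toy LSM tree: dicts stand in for the memtable and each level's SSTable files."""

    def __init__(self, memtable_limit=2, level_capacity=2, num_levels=3):
        self.memtable_limit = memtable_limit
        self.level_capacity = level_capacity
        self.wal = []                                   # write-ahead log
        self.memtable = {}
        self.levels = [{} for _ in range(num_levels)]   # level 0 .. highest level

    def put(self, key, value):
        self.wal.append((key, value))   # record the write before applying it
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_limit:
            self._flush()

    def _flush(self):
        immutable = self.memtable       # the full memtable becomes immutable
        self.memtable = {}              # a new memtable serves later writes
        self._merge_into(0, immutable)  # write the immutable memtable to level 0
        self.wal.clear()                # flushed entries no longer need the log

    def _merge_into(self, level, incoming):
        merged = {**self.levels[level], **incoming}     # newer data wins
        is_last = level == len(self.levels) - 1
        if len(merged) > self.level_capacity and not is_last:
            self.levels[level] = {}
            self._merge_into(level + 1, merged)         # level is full: merge down
        else:
            self.levels[level] = merged

    def get(self, key):
        if key in self.memtable:
            return self.memtable[key]
        for level in self.levels:       # lower levels hold newer data
            if key in level:
                return level[key]
        return None

    def sstable(self, level):
        """Entries of one level in key order, as an SSTable would store them."""
        return sorted(self.levels[level].items())
```

A real LSM tree keeps several files per level and grows the capacity from level to level; the single-dict-per-level model only shows the order of operations.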

[0104] In summary, the LSM tree-based graph database storage method provided in this embodiment obtains, in response to a write request for writing target data, the data type that the target data represents in the graph database. It generates key information of the target data according to the data type and a preset encoding rule that matches the information statistics requirement, where the information statistics requirement indicates that nodes and edges with target features are to be counted. The preset encoding rule includes: for each group of target nodes and target edges with a connection relationship, the first key information of the target node and the second key information of the target edge include the same key part, and the key part is used to indicate the target features. A first key-value pair including the key information is generated, and the target data is stored in a memory table according to the first key-value pair based on the LSM tree. Nodes and edges with the same target features can therefore be counted in a single scan of the key parts of nodes and edges, without having to parse the complete key-value pair of a node to find the edges adjacent to it. This solves the problem of low efficiency in traditional information statistics methods and improves the efficiency of counting nodes and edges with the same target features.

[0105] In addition, by setting the key part as a prefix of the key information of nodes or edges, information statistics can be performed during the prefix scan of the LSM tree, further improving the efficiency of information statistics.
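To illustrate why a prefix placement helps, the sketch below runs a prefix scan over a sorted list of keys using binary search. The readable byte-string keys and the function name prefix_scan are assumptions for illustration only; an actual LSM engine would expose its own prefix iterator.

```python
from bisect import bisect_left

def prefix_scan(sorted_keys, prefix):
    """Yield every key in a sorted list that starts with the given prefix."""
    i = bisect_left(sorted_keys, prefix)   # first key >= prefix
    while i < len(sorted_keys) and sorted_keys[i].startswith(prefix):
        yield sorted_keys[i]
        i += 1

# Hypothetical keys of the form label|vertex|kind[|destination].
keys = sorted([
    b"L1|v1|E|v2",
    b"L1|v1|V",
    b"L1|v2|V",
    b"L2|v3|V",
])
node_v1_and_its_edges = list(prefix_scan(keys, b"L1|v1|"))
```

Because all keys sharing a prefix are contiguous in sorted order, the scan stops at the first key that no longer matches.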

[0106] Optionally, the information statistics requirement may also be used to perform statistics on the value information in the first key-value pair. In this case, the complete first key-value pair needs to be parsed to achieve the statistics on the value information, resulting in low information statistics efficiency. Based on this, if the information statistics requirement is also used to instruct the statistics on the value information with the same local key information in each first key-value pair, then after step 103, the method further includes: generating an index key based on the local key information and value information in the first key-value pair; generating an index value based on the key information in the first key-value pair; and storing the second key-value pair composed of the index key and the index value into a pre-created index table in the memory table.

[0107] At this point, by scanning the index key through an additional index table, value information with the same local key information can be found without parsing the complete key-value pairs, which can improve the efficiency of statistically analyzing value information with the same local key information.

[0108] The local key information refers to a part of the key information in the first key-value pair. This part can be the key part in the above embodiment, or it can be other parts of the key information besides the key part, or it can be a part of the key part and / or other parts. This embodiment does not limit the implementation method of the local key information.

[0109] In one example, the information statistics requirement is also used to instruct the statistics of attribute information with the same label in each first key-value pair. That is, the local key information includes the label identifier, and the value information includes the attribute information. Accordingly, based on the local key information and value information in the first key-value pair, an index key is generated, including: generating the index key based on the label identifier and attribute information in the key information.

[0110] Referring to Figure 6, the index key includes the label identifier (label_id) and the attribute identifier (property_id) and attribute value (property_val) from the attribute information. The index value includes the first key information of the node (vertexkey) or the second key information of the edge (edgekey). The index key determines the index value, and the index value determines the specific node or edge.

[0111] In Figure 7, the mem-comparable property value in the index key means that the property value is converted into a memory-comparable form, so that the index table can be sorted by key and the index keys can be range-scanned in the LSM tree.
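One common way to obtain a memory-comparable form for signed integers is to add an offset so the value becomes unsigned and then store it big-endian. The sketch below shows that technique under stated assumptions: the source does not say which conversion it uses, and the field widths and the names mem_comparable_int and index_key are hypothetical.

```python
import struct

def mem_comparable_int(value: int) -> bytes:
    """Encode a signed 64-bit integer so that byte order equals numeric order."""
    # Shifting by 2**63 maps [-2**63, 2**63 - 1] onto [0, 2**64 - 1].
    return struct.pack(">Q", value + (1 << 63))

def index_key(label_id: int, property_id: int, property_val: int) -> bytes:
    """Assumed index key layout: label_id, property_id, mem-comparable value."""
    return struct.pack(">QI", label_id, property_id) + mem_comparable_int(property_val)

ages = [30, -5, 0, 18, 2**40]
encoded_in_byte_order = sorted(mem_comparable_int(a) for a in ages)
```

Sorting the encoded bytes gives the same order as sorting the numbers, which is what lets a range scan over index keys behave like a range query on the attribute value.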

[0112] In this embodiment, when the information statistics requirement is also used to indicate the value information with the same local key information in each first key-value pair, an index key is generated based on the local key information and value information in the first key-value pair; an index value is generated based on the key information in the first key-value pair; and the second key-value pair composed of the index key and index value is stored in a pre-created index table in the memory table. This avoids the problem of low information statistics efficiency due to the need to parse the complete first key-value pair, thus improving the efficiency of value information statistics.

[0113] Figure 1 is a flowchart of a graph database statistical information collection method based on an LSM tree according to an embodiment of the present disclosure. This embodiment describes the method as applied in an electronic device with data processing and storage capabilities, which can be a user terminal or a server. The user terminal includes, but is not limited to, a computer, a mobile phone, or a tablet computer. This embodiment does not limit the type of electronic device or user terminal. Optionally, the electronic device in this embodiment and the electronic device described with reference to Figure 7 can be the same device or different devices. As shown in Figure 5, the method includes:

[0114] Step 701: Obtain the graph database.

[0115] Each target data in the graph database is stored in an in-memory table based on an LSM tree according to a first key-value pair. The first key-value pair includes key information, which is generated according to a preset encoding rule that matches the information statistics requirements and the data type of the target data in the graph database. The preset encoding rule includes: for each pair of target nodes and target edges with a connection relationship, the first key information of the target node and the second key information of the target edge include the same key part. The information statistics requirements are used to indicate the nodes and / or edges with target features, and the key part is used to indicate the target features. The key information is either the first key information or the second key information.

[0116] For a detailed description of this step, please refer to the above-described embodiment of the graph database storage method. This embodiment will not repeat the details here.

[0117] Step 702: Based on the information statistics requirements, count the nodes and edges with target features in the graph database.

[0118] Optionally, in response to an information statistics command, the electronic device performs statistics on nodes and edges with target characteristics in the graph database based on the information statistics requirements. The information statistics command may be generated upon receiving an information statistics operation performed by a user.

[0119] Alternatively, when merging the sorted string table file at level n of the LSM tree into the sorted string table file at level n+1, the nodes and edges possessing the target characteristics are counted based on the key portion obtained from scanning the key information during the merging process. Since the merging process already involves scanning key information, this embodiment reuses the results (including the key portion) obtained from that scan for information statistics, eliminating the need for an additional scan of the key information and saving the resources consumed by information statistics. Referring to the merging process shown in Figure 5 and to Figure 6, it can be seen that information statistics are performed when the sorted string table file at the current level is merged into the sorted string table file at the next higher level.

[0120] Here, n is a positive integer, and the sorted string table file at each level is obtained by merging the memory table layer by layer.

[0121] Optionally, during the file merging process of the LSM tree, the key information is scanned using a prefix scan method to merge the data, and a Bloom filter is constructed using the hash value of the key part in the key information to obtain a prefix filter for counting nodes and edges with a certain target feature.

[0122] Illustratively, counting the nodes and edges possessing the target characteristics based on the key portion obtained from scanning key information during the merging process includes:

[0123] For each graph element in the graph database (a graph element is a node or an edge), obtaining the key part of the graph element by prefix scanning;

[0124] In the prefix Bloom filter, determining whether the position indicated by the hash value of the key part is 1;

[0125] If it is 1, obtaining the key information to which the key part belongs, determining the data type of the graph element according to the data type in the key information, and incrementing the statistical information of that data type corresponding to the key part by 1;

[0126] If it is not 1, marking the position indicated by the hash value of the key part as 1 in the prefix Bloom filter, obtaining the key information to which the key part belongs, determining the data type of the graph element according to the data type in the key information, and incrementing the statistical information of that data type corresponding to the key part by 1.

[0127] Here, the value at each position of the prefix Bloom filter is initialized to 0, and the statistical information of the data type corresponding to each key part is initialized to 0.
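The steps above can be sketched in Python as follows. The sketch assumes a single hash function that maps a key part to one bit position, as the description suggests; the SHA-256 based hash, the bit-array size, and the names PrefixBloom and count_during_merge are hypothetical choices, and each scanned element is modeled as a (key_part, kind) pair.

```python
import hashlib
from collections import defaultdict

class PrefixBloom:
    """Single-hash Bloom filter over key parts; every position starts at 0."""

    def __init__(self, num_bits=1024):
        self.bits = [0] * num_bits

    def _position(self, key_part: bytes) -> int:
        digest = hashlib.sha256(key_part).digest()
        return int.from_bytes(digest[:8], "big") % len(self.bits)

    def test_and_set(self, key_part: bytes) -> bool:
        """Return whether the position was already 1, then mark it as 1."""
        pos = self._position(key_part)
        seen = self.bits[pos] == 1
        self.bits[pos] = 1
        return seen

def count_during_merge(scanned):
    """scanned: iterable of (key_part, kind) pairs produced by the merge scan."""
    bloom = PrefixBloom()
    stats = defaultdict(lambda: defaultdict(int))   # key_part -> kind -> count
    for key_part, kind in scanned:
        bloom.test_and_set(key_part)    # mark the position if it was still 0
        stats[key_part][kind] += 1      # the count is incremented in both cases
    return bloom, stats
```

After the merge, the filter can answer "has this key part possibly been seen" without consulting the statistics, while the nested counters hold the per-type totals for each key part.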

[0128] For example, the information statistics requirement is used to indicate counting the edges connected to the same node. The target features include node features, and the key part includes the node identifier, which is used to indicate the node features. Accordingly, through the above statistical process, a Map structure is maintained during the merging process. Each time a node identifier vertex_id is scanned and its data type kind indicates an edge key, the edge count under that vertex_id is incremented by 1 in the Map structure, thus obtaining the number of edges connected to each node.
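A minimal sketch of this counting, assuming keys are modeled as (vertex_id, kind, edge_id) tuples and that the two sorted runs being merged contain distinct keys. The names KIND_VERTEX, KIND_EDGE and merge_and_count_edges are hypothetical, and a Counter plays the role of the Map structure.

```python
import heapq
from collections import Counter

KIND_VERTEX, KIND_EDGE = 0, 1   # assumed values of the data type "kind"

def merge_and_count_edges(level_n, level_n_plus_1):
    """Merge two sorted runs of keys and count edges per vertex in the same pass."""
    edge_counts = Counter()     # vertex_id -> number of edges
    merged = []
    for key in heapq.merge(level_n, level_n_plus_1):
        vertex_id, kind, _edge_id = key
        if kind == KIND_EDGE:
            edge_counts[vertex_id] += 1
        merged.append(key)
    return merged, edge_counts

level_n = [(1, KIND_VERTEX, 0), (1, KIND_EDGE, 2)]
level_n_plus_1 = [(1, KIND_EDGE, 3), (2, KIND_VERTEX, 0), (2, KIND_EDGE, 1)]
merged, edge_counts = merge_and_count_edges(level_n, level_n_plus_1)
```

The statistics come from the same iteration that produces the merged output, so no second scan over the keys is needed.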

[0129] For example, the information statistics requirement is used to indicate counting the nodes with the same label and the edges connected to each node with that label. The target features include the node's label features and node features. The key part includes the node identifier and the node's label identifier. The node identifier indicates the node features, and the label identifier indicates the node's label features. Accordingly, through the above statistical process, a Map structure is maintained during the merging process. Each time a key under a label identifier `label_id` is scanned whose data type `kind` indicates an edge, the edge count under that label_id is incremented by 1 in the Map structure; similarly, each time a key under a node identifier `vertex_id` is scanned whose data type `kind` indicates a node, the node count under that label_id is incremented by 1 in the Map structure. This process yields the nodes with the same label and the edges connected to each node with that label.
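For the label-based case, a sketch under similar assumptions: keys are modeled as (label_id, vertex_id, kind, edge_id) tuples already in sorted order, so all keys with the same label are adjacent. The function name count_by_label and the kind constants are hypothetical.

```python
from itertools import groupby

KIND_VERTEX, KIND_EDGE = 0, 1   # assumed values of the data type "kind"

def count_by_label(sorted_keys):
    """Count nodes and edges under each label_id in one pass over sorted keys."""
    stats = {}
    for label_id, group in groupby(sorted_keys, key=lambda key: key[0]):
        nodes = edges = 0
        for _label_id, _vertex_id, kind, _edge_id in group:
            if kind == KIND_VERTEX:
                nodes += 1
            elif kind == KIND_EDGE:
                edges += 1
        stats[label_id] = {"nodes": nodes, "edges": edges}
    return stats

keys = sorted([
    (10, 1, KIND_VERTEX, 0), (10, 1, KIND_EDGE, 100), (10, 1, KIND_EDGE, 101),
    (10, 2, KIND_VERTEX, 0), (10, 2, KIND_EDGE, 102),
    (20, 3, KIND_VERTEX, 0),
])
label_stats = count_by_label(keys)
```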

[0130] In other embodiments, other scanning methods can be used during the merging of LSM tree files, such as key scanning, i.e., scanning the entire key; alternatively, no Bloom filter is used during the scan. This embodiment does not limit the scanning method used during the merging of LSM tree files.

[0131] Optionally, the information statistics requirement is also used to instruct the statistics of value information with the same local key information in each first key-value pair; correspondingly, the memory table also stores an index table, which includes second key-value pairs consisting of index keys and index values. The index keys are generated based on the local key information and value information in the first key-value pairs, and the index values are generated based on the key information in the first key-value pairs. The construction process of the index table is described in the above storage method embodiment and will not be repeated here. Correspondingly, the information statistics method also includes:

[0132] Based on the local key information obtained by scanning each index key during the merging process, the value information with the same local key information is statistically analyzed.

[0133] For example, the information statistics requirement is also used to instruct the statistics of attribute information with the same label in each first key-value pair. Referring to Figure 6, the index key includes the label identifier (label_id) and the attribute identifier (property_id) and attribute value (property_val) from the attribute information. By scanning the index keys, the different attribute information under the same label identifier (label_id) can be obtained.

[0134] Optionally, once the index table is available, the electronic device can also support a targeted query for specific value information. In this case, the information statistics method further includes: in response to a query instruction to query nodes and / or edges with target value information, scanning the index keys of the second key-value pairs in the index table to obtain the target key-value pairs that contain the target value information; and searching the key information of the first key-value pairs in the LSM tree for the first key-value pairs whose key information matches the target index values in the target key-value pairs, to obtain the nodes and / or edges with the target value information.

[0135] For example, still taking the case shown in Figure 8, in which the value information is attribute information: if the query instruction indicates that the nodes and edges with the target attribute information are to be queried, the index keys in the index table that match the target attribute information can be determined by scanning the index keys. The index value in each target key-value pair containing such an index key can then be obtained, which gives the node identifier of each node and the edge identifier of each edge with the target attribute information. By scanning the keys of the first key-value pairs, the node indicated by the node identifier and the edge indicated by the edge identifier can be obtained.
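The two-step lookup can be sketched as follows. The index table is modeled as a list of (index key, index value) pairs and the primary store as a dict from key information to value information; the string keys, the tuple layout of the index key, and the name query_by_property are assumptions for illustration.

```python
def query_by_property(index_table, primary_table, property_id, property_val):
    """Find nodes and edges whose attribute matches, via the index table."""
    results = []
    for (_label_id, pid, pval), primary_key in index_table:
        # Step 1: scan index keys for the target attribute information.
        if pid == property_id and pval == property_val:
            # Step 2: the index value is the key information of the node or edge.
            results.append((primary_key, primary_table[primary_key]))
    return results

# Hypothetical data: attribute 7 is "name".
primary_table = {
    "vertexkey:1": (7, "Alice"),
    "vertexkey:2": (7, "Bob"),
    "edgekey:1->2": (7, "Alice"),
}
index_table = [
    ((10, 7, "Alice"), "vertexkey:1"),
    ((10, 7, "Bob"), "vertexkey:2"),
    ((30, 7, "Alice"), "edgekey:1->2"),
]
hits = query_by_property(index_table, primary_table, 7, "Alice")
```

Only the matching entries are fetched from the primary store; the values of non-matching nodes and edges are never parsed.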

[0136] Optionally, after step 702, that is, after counting nodes and / or edges with target characteristics in the graph database based on information statistics requirements, the method further includes: updating the query strategy based on the statistical results.

[0137] After obtaining the statistical results, the electronic device can output the statistical results; and / or, it can update the query strategy based on the statistical results.

[0138] In one example, if the statistical results indicate that the number of edges corresponding to the target feature is greater than a first edge count threshold, then the first indexing strategy is used when querying the edges corresponding to that target feature; if the statistical results indicate that the number of edges corresponding to the target feature is less than or equal to a second edge count threshold, then the second indexing strategy is used when querying the edges corresponding to that target feature; wherein, the indexing efficiency of the first indexing strategy is higher than the indexing efficiency of the second indexing strategy. The first edge count threshold is greater than or equal to the second edge count threshold.

[0139] For example, the first indexing strategy is based on B-tree indexing, and the second indexing strategy is linear search. This embodiment does not limit the implementation of the first and second indexing strategies.
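A minimal sketch of such a threshold rule is shown below. The enum IndexStrategy and the function choose_strategy are hypothetical names, and the behavior when the edge count falls between the two thresholds is an assumption (the current strategy is kept), since the description does not specify that case.

```python
from enum import Enum

class IndexStrategy(Enum):
    B_TREE = "b-tree index"        # first indexing strategy in the example
    LINEAR = "linear search"       # second indexing strategy in the example

def choose_strategy(edge_count, first_threshold, second_threshold, current):
    """Pick the indexing strategy for a target feature from its edge count."""
    if first_threshold < second_threshold:
        raise ValueError("first threshold must be >= second threshold")
    if edge_count > first_threshold:
        return IndexStrategy.B_TREE
    if edge_count <= second_threshold:
        return IndexStrategy.LINEAR
    return current   # assumption: between the thresholds, keep the current strategy
```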

[0140] In another example, if the statistical results indicate that the number of edges corresponding to the target feature is greater than the third edge count threshold, the query strategy is set to start querying the edges corresponding to the target feature from the hot data layer; if the statistical results indicate that the number of edges corresponding to the target feature is less than or equal to the fourth edge count threshold, the query strategy is set to start querying the edges corresponding to the target feature from the cold data layer.

[0141] Here, the hot data layer refers to the most frequently accessed data layer of the LSM tree, and the cold data layer refers to the least frequently accessed data layer of the LSM tree; the third edge count threshold is greater than or equal to the fourth edge count threshold.

[0142] In other embodiments, the query strategy can be updated based on statistical results in other ways, which will not be listed here.

[0143] Optionally, the statistical results can be represented by an equal-width histogram. For example, an equal-width histogram can be used to represent the statistical results of different attribute information under the same label identifier label_id. In other embodiments, the results can also be represented by a line chart. This embodiment does not limit the way the statistical results are represented.
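As an illustration of an equal-width histogram over numeric attribute values, the sketch below splits the range between the smallest and largest value into buckets of equal width. The function name equal_width_histogram and the sample values are hypothetical.

```python
def equal_width_histogram(values, num_buckets):
    """Return (bucket_edges, counts) for buckets of equal width over the value range."""
    if not values or num_buckets < 1:
        raise ValueError("need at least one value and one bucket")
    lo, hi = min(values), max(values)
    width = (hi - lo) / num_buckets or 1   # avoid zero width when all values are equal
    counts = [0] * num_buckets
    for v in values:
        # The maximum value would land one past the end, so clamp it into the last bucket.
        idx = min(int((v - lo) / width), num_buckets - 1)
        counts[idx] += 1
    edges = [lo + i * width for i in range(num_buckets + 1)]
    return edges, counts

# Hypothetical attribute values (for example, an "age" property) under one label_id.
edges, counts = equal_width_histogram([18, 21, 25, 30, 34, 40, 58], num_buckets=4)
```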

[0144] In summary, the LSM-based graph database statistical information collection method provided in this embodiment acquires a graph database, and stores each target data in the graph database into a memory table based on an LSM tree according to a first key-value pair. The first key-value pair includes key information, which is generated according to a preset encoding rule matching the information statistical requirements and the data type of the target data in the graph database. The preset encoding rule includes: for each group of target nodes and target edges with a connection relationship, the first key information of the target node and the second key information of the target edge include the same key part. The information statistical requirements are used to indicate the nodes and edges with target features, and the key part is used to indicate the target features. Based on the information statistical requirements, the nodes and edges with target features are counted in the graph database. Nodes and edges with the same target features can be counted at once by scanning the key parts of nodes and edges, without having to parse the complete key-value pair of the node to find the edge adjacent to the node. This solves the problem of low efficiency in traditional information statistical methods and can improve the efficiency of counting nodes and edges with the same target features.

[0145] In addition, by performing information statistics during the data layer merging of the LSM tree, the key information obtained during the merging process can be used for information statistics without consuming additional resources to scan the key information, thus saving the resources occupied by information statistics.

[0146] In addition, updating query strategies based on statistical results can improve the efficiency of data retrieval.

[0147] Figure 9 is a block diagram of a graph database storage device based on an LSM tree according to an embodiment of this application. The device includes: a type acquisition module 810, a key generation module 820, a key-value pair generation module 830, and a data storage module 840.

[0148] The type acquisition module 810 is used to acquire the data type represented by the target data in the graph database in response to a write request for writing target data; wherein, the data type in the graph database includes nodes and edges for connecting different nodes;

[0149] The key generation module 820 is used to generate key information of the target data according to a preset encoding rule and the data type that matches the information statistics requirements; the information statistics requirements are used to indicate the nodes and edges with target features; the preset encoding rule includes: for each group of target nodes and target edges with a connection relationship, the first key information of the target node and the second key information of the target edge include the same key part, and the key part is used to indicate the target features; wherein, the key information is the first key information or the second key information;

[0150] Key-value pair generation module 830 is used to generate a first key-value pair including the key information;

[0151] The data storage module 840 is used to store the target data into a memory table based on the LSM tree and the first key-value pair.

[0152] For relevant details, please refer to the above method implementation examples.

[0153] Figure 10 is a block diagram of a graph database statistical information collection device based on an LSM tree according to an embodiment of this application. The device includes a database acquisition module 910 and an information statistics module 920.

[0154] Database acquisition module 910 is used to acquire a graph database, wherein each target data in the graph database is stored in a memory table based on an LSM tree according to a first key-value pair; the first key-value pair includes key information, which is generated according to a preset encoding rule matching information statistics requirements and the data type of the target data in the graph database; the preset encoding rule includes: for each group of target nodes and target edges with a connection relationship, the first key information of the target node and the second key information of the target edge include the same key part; the information statistics requirements are used to indicate the statistics of nodes and edges with target features, and the key part is used to indicate the target features; wherein, the key information is either the first key information or the second key information;

[0155] The information statistics module 920 is used to count nodes and edges with target features in the graph database based on the information statistics requirements.

[0156] For relevant details, please refer to the above method implementation examples.

[0157] In some embodiments, the functions or modules of the apparatus provided in this disclosure can be used to perform the methods described in the above method embodiments. The specific implementation can be referred to the description of the above method embodiments, and for the sake of brevity, it will not be repeated here.

[0158] This disclosure also proposes a computer-readable storage medium storing computer program instructions that, when executed by a processor, implement the above-described method. The computer-readable storage medium can be volatile or non-volatile.

[0159] This disclosure also proposes an electronic device, including: a processor; and a memory for storing processor-executable instructions; wherein the processor is configured to implement the above-described method when executing the instructions stored in the memory. For an example of the electronic device, refer to Figure 10.

[0160] This disclosure also provides a computer program product, including computer-readable code, or a non-volatile computer-readable storage medium carrying computer-readable code, wherein when the computer-readable code is run in a processor of an electronic device, the processor in the electronic device performs the above-described method.

[0161] Figure 10 is a block diagram of an LSM-based graph database storage device or statistical information collection device according to an embodiment of this application. For example, device 1900 can be provided as a server or terminal device. The device 1900 includes a processing component 1922, which further includes one or more processors, and memory resources represented by memory 1932 for storing instructions, such as application programs, that can be executed by the processing component 1922. The application programs stored in memory 1932 may include one or more modules, each corresponding to a set of instructions. Furthermore, the processing component 1922 is configured to execute the instructions to perform the methods described above.

[0162] Device 1900 may also include a power supply component 1926 configured to perform power management of device 1900, a wired or wireless network interface 1950 configured to connect device 1900 to a network, and an input / output interface 1958 (I / O interface). Device 1900 can operate based on an operating system stored in memory 1932, such as Windows Server™, macOS X™, Unix™, Linux™, FreeBSD™, or similar.

[0163] In an exemplary embodiment, a non-volatile computer-readable storage medium is also provided, such as a memory 1932 including computer program instructions that can be executed by a processing component 1922 of the device 1900 to perform the above-described method.

[0164] The various embodiments of this disclosure have been described above. These descriptions are exemplary and not exhaustive, nor are they limited to the disclosed embodiments. Many modifications and variations will be apparent to those skilled in the art without departing from the scope and spirit of the described embodiments. The terminology used herein is chosen to best explain the principles, practical application, or technical improvements to the embodiments in the market, or to enable others skilled in the art to understand the embodiments disclosed herein.

Claims

1. A graph database storage method based on LSM trees, characterized in that, The method includes: In response to a write request to write target data, the data type of the target data represented in the graph database is obtained; wherein, the data type in the graph database includes nodes and edges for connecting different nodes; According to a preset encoding rule matching the information statistics requirements and the data type, key information of the target data is generated; the information statistics requirements are used to indicate the nodes and edges with target features; the preset encoding rule includes: for each group of target nodes and target edges with connection relationships, the first key information of the target node and the second key information of the target edge include the same key part, the key part is used to indicate the target features; wherein, the key information is the first key information or the second key information; Generate a first key-value pair including the key information; Based on the LSM tree, the target data is stored in a memory table according to the first key-value pairs; the information statistics requirement is also used to instruct the statistics of value information with the same local key information in each first key-value pair; correspondingly, the method further includes: An index key is generated based on the local key information and the value information in the first key-value pair; Generate an index value based on the key information in the first key-value pair; The second key-value pair, consisting of the index key and the index value, is stored in a pre-created index table in the memory table.

2. The method according to claim 1, characterized in that, The information statistics requirement is used to indicate the statistics of edges that share the same node, and the target features include node features; accordingly, The step of generating key information for the target data according to a preset encoding rule matching the information statistics requirements and the data type includes: When the data type of the target data is a node, the first key information is generated based on the node identifier of the target data and the data type. When the data type of the target data is an edge, obtain the first key information of the target node connected by the target edge represented by the target data; generate the second key information based on the node identifier in the first key information, the edge identifier of the target data, and the data type; The key portion includes the node identifier, which is used to indicate the node characteristics.

3. The method according to claim 1, characterized in that, The information statistics requirement is used to indicate the statistics of nodes with the same label and the edges connected to each node with the label. The target features include the label features and node features of the nodes. Accordingly, The step of generating key information for the target data according to a preset encoding rule matching the information statistics requirements and the data type includes: When the data type of the target data is a node, the first key information is generated based on the node identifier, the node label identifier, and the data type of the target data; When the data type of the target data is an edge, obtain the first key information of the target nodes connected by the target edge represented by the target data; generate the second key information based on the node identifier and the node label identifier in the first key information, the edge identifier of the target data, and the data type; The key portion includes the label identifier of the node and the node identifier, wherein the label identifier is used to indicate the label features and the node identifier is used to indicate the node features.

4. The method according to any one of claims 1 to 3, characterized in that the key portion is a prefix of the key information.

5. The method according to claim 1, characterized in that the local key information includes a label identifier, and the value information includes attribute information; accordingly, generating the index key based on the local key information and the value information in the first key-value pair includes: generating the index key based on the label identifier in the key information and the attribute information.
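As an illustration of claim 5, the sketch below builds one index entry in Python. It assumes that the first key information starts with a 4-byte label identifier, that the value information is a dictionary of attributes, and that the index key concatenates the label identifier, the attribute name, a zero byte, and the attribute value as text. None of these choices come from the claim, and the function name is hypothetical.

```python
import struct


def index_entry(first_key: bytes, value: dict, attr: str) -> tuple:
    """Return (index key, index value) for one attribute of one record.

    The index key combines the label identifier taken from the key
    information with the attribute information taken from the value;
    the index value is the key information itself.
    """
    (label_id,) = struct.unpack_from(">I", first_key)
    attr_bytes = str(value[attr]).encode("utf-8")
    index_key = struct.pack(">I", label_id) + attr.encode("utf-8") + b"\x00" + attr_bytes
    return index_key, first_key
```

In this sketch two records with the same label and the same attribute value yield the same index key; how such collisions are stored is left open here.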

6. A method for collecting statistical information from a graph database based on an LSM tree, characterized in that the method includes: obtaining a graph database, wherein each target data in the graph database is stored in a memory table based on an LSM tree according to a first key-value pair; the first key-value pair includes key information, which is generated according to a preset encoding rule matching an information statistics requirement and the data type of the target data in the graph database; the preset encoding rule includes: for each group of target nodes and target edges with a connection relationship, the first key information of the target node and the second key information of the target edge include the same key portion; the information statistics requirement is used to indicate statistics of nodes and edges with target features, and the key portion is used to indicate the target features; wherein the key information is either the first key information or the second key information; and performing statistics on the nodes and edges with the target features in the graph database based on the information statistics requirement; wherein the information statistics requirement is also used to indicate statistics of value information with the same local key information in each first key-value pair; the memory table also stores an index table, which includes second key-value pairs each consisting of an index key and an index value, wherein the index key is generated based on the local key information and the value information in the first key-value pair, and the index value is generated based on the key information in the first key-value pair; the method further includes: when merging the sorted string table file at level n of the LSM tree into the sorted string table file at level n+1, counting the value information with the same local key information based on the local key information obtained by scanning each index key during the merging process; wherein n is a positive integer, and the sorted string table file at each level is obtained by merging the memory table level by level.
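The statistics step performed during merging in claim 6 can be sketched as follows in Python. Two sorted lists of (index key, index value) pairs stand in for the sorted string table files at levels n and n+1. The index key layout (4-byte label identifier as the local key information, followed by the attribute bytes) and the function name are assumptions, and the sketch ignores deduplication of overwritten or deleted entries, which a real compaction would have to handle.

```python
import heapq
from collections import Counter, defaultdict


def compact_and_count(level_n: list, level_n1: list) -> tuple:
    """Merge two sorted runs of index entries and, in the same pass,
    count the value information seen under each local key (label id)."""
    merged = []
    stats = defaultdict(Counter)
    for index_key, index_value in heapq.merge(level_n, level_n1):
        merged.append((index_key, index_value))
        label_id = int.from_bytes(index_key[:4], "big")  # local key information
        attr_value = index_key[4:]                       # value information
        stats[label_id][attr_value] += 1
    return merged, stats
```

The point of the sketch is that the statistics are a by-product of a scan the merge already performs, so no separate pass over the data is needed.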

7. The method according to claim 6, characterized in that performing statistics on the nodes and edges with the target features in the graph database based on the information statistics requirement includes: when merging the sorted string table file at level n of the LSM tree into the sorted string table file at level n+1, counting the nodes and edges with the target features based on the key portion obtained by scanning the key information during the merging process.
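A possible shape of the counting in claim 7 is sketched below in Python, under the assumption that the key portion is an 8-byte node identifier prefix followed by a 1-byte data-type tag (0 for a node, 1 for an edge). The layout, the tag values, and the function name are illustrative only.

```python
import heapq
from itertools import groupby

NODE, EDGE = 0, 1     # hypothetical data-type tags
KEY_PORTION_LEN = 8   # assumed width of the shared key portion


def merge_and_count_edges(run_a: list, run_b: list) -> tuple:
    """Merge two sorted runs of keys and count, per key portion,
    how many edge keys share that portion."""
    merged = list(heapq.merge(run_a, run_b))
    edges_per_node = {}
    # Keys with the same prefix are adjacent after the merge, so one
    # grouping pass over the merged output is enough.
    for prefix, group in groupby(merged, key=lambda k: k[:KEY_PORTION_LEN]):
        node_id = int.from_bytes(prefix, "big")
        edges_per_node[node_id] = sum(1 for k in group if k[KEY_PORTION_LEN] == EDGE)
    return merged, edges_per_node
```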

8. The method according to claim 6, characterized in that the method further includes: in response to a query instruction that queries for nodes and/or edges with target value information, scanning the index key of each second key-value pair in the index table to obtain the target key-value pair with the target value information; and searching the key information of each first key-value pair in the LSM tree for the first key-value pair that includes the target index value of the target key-value pair, to obtain the nodes and/or edges with the target value information.
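The two-step lookup of claim 8 can be sketched in Python as follows. A list of (index key, index value) pairs stands in for the index table and a dictionary stands in for the first key-value pairs of the LSM tree; the index key layout (4-byte label identifier followed by the attribute bytes) and the function name are assumptions.

```python
def query_by_value(index_table: list, primary_table: dict,
                   label_id: int, attr_value: bytes) -> list:
    """Find records whose value information matches attr_value.

    Step 1 scans the index keys for the target value; step 2 uses each
    matching index value as the key information to fetch the record.
    """
    target = label_id.to_bytes(4, "big") + attr_value
    results = []
    for index_key, index_value in index_table:
        if index_key == target and index_value in primary_table:
            results.append((index_value, primary_table[index_value]))
    return results
```

A linear scan keeps the sketch short; over a sorted index table the same lookup could instead seek directly to the target index key.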

9. The method according to any one of claims 6 to 8, characterized in that, after performing statistics on the nodes and/or edges with the target features in the graph database based on the information statistics requirement, the method further includes: updating a query strategy based on the statistical results.

10. A data processing apparatus, characterized in that it includes: a processor; and a memory for storing processor-executable instructions; wherein the processor is configured to, when executing the instructions stored in the memory, implement the LSM tree-based graph database storage method according to any one of claims 1 to 5, or implement the LSM tree-based graph database statistical information collection method according to any one of claims 6 to 9.

11. A non-volatile computer-readable storage medium storing computer program instructions thereon, characterized in that, when the computer program instructions are executed by a processor, they implement the LSM tree-based graph database storage method according to any one of claims 1 to 5, or implement the LSM tree-based graph database statistical information collection method according to any one of claims 6 to 9.