[0028] The implementation process of the present invention will be described in detail below in conjunction with the embodiments and the drawings of the specification.
[0029] The relational database query method based on ontology and restricted natural language processing of the present invention includes the following 6 steps:
[0030] 1) Ontology to PCDO graph data structure: Ontology is a specification proposed by the World Wide Web Consortium (W3C) to describe various resource information on the World Wide Web. The ontology in the present invention is based on the schema information of the relational database. It is constructed by certain rules and used to describe various resource information in the database. The ontology construction rules are as follows:
[0031] (A) Constructing classes in the ontology: For all the relational tables in the relational database, a corresponding class is constructed in the ontology, and the class and the relational table are in one-to-one correspondence;
[0032] (B) Construct a data type property in the ontology (dataTypeProperty): For each column c in each relation table t, construct a data type property corresponding to c in the ontology, and the class to which the data type property belongs is the relation The class corresponding to table t;
[0033] (C) Construct objectProperty in the ontology: For each foreign key f that associates two relational tables t1 and t2 in the relational database, a f is constructed between the two classes corresponding to the relational tables t1 and t2 Corresponding object properties, the two classes connected by the object properties are the classes corresponding to the relationship tables t1 and t2.
[0034] The ontology only contains class, object attributes and data type attributes. In order to make full use of the information in the ontology, we propose the PCDO graph data structure (hereinafter referred to as PCDO graph). The PCDO graph mainly contains two sub-data structures, namely Node and Edge.
[0035] The data structure of the node is shown in Table 1:
[0036] Table 1 Data structure of the node
[0037]
[0038] The data structure of the node contains six attributes: Type, Name, Edges, Keyword, Value, and KeywordType. The Type attribute is used to identify the type of the node. The node types include class nodes (C_Node) and attribute nodes (P_Node); the Name attribute is used to identify the name of the node; the Edges attribute is used to record the neighboring nodes All edges; Keyword attribute, Value attribute, and KeywordType attribute are all blanked in the conversion process of step 1), and the initialization process of these three attributes is described in detail in step 4).
[0039] The data structure of the edge is shown in Table 2:
[0040] Table 2 side data structure
[0041]
[0042] The edge data structure contains four attributes: Type, Name, Node1, and Node2. The Type attribute is used to identify the type of edge. The edge types include data type attribute edge (D_Edge) and object attribute edge (O_Edge); Name attribute is used to identify the name of the edge; Node1 and Node2 are used to record the current edge connection Two nodes. Because the PCDO graph is an undirected graph, the attribute values of the two attributes of Node1 and Node2 are interchangeable.
[0043] The conversion steps from the body to the PCDO diagram:
[0044] (1) Conversion of class node (C_Node): All classes in the ontology are converted to a node, the Type attribute of the node is set to "C_Node", the Name attribute of the node is set to the name of the corresponding class, The Edges property is set to empty;
[0045] (2) Conversion of attribute node (P_Node): All the data type attributes in the ontology are respectively converted into a node, the Type attribute of the node is set to "P_Node", and the Name attribute of the node is set to the corresponding data type attribute Name, the Edges property of the node is set to empty;
[0046] (3) Conversion of data type attribute edge (D_Edge): add an edge between attribute node P and class node C to which the attribute node belongs, set the Type attribute of the edge to "D_Edge", and the edge Node1 and Node2 The properties are set to P and C respectively, the Name property of the data type property side is set to "hasProperty", and the currently converted data type property side is added to the Edges property of node P and node C respectively;
[0047] (4) Conversion of object attribute edges (O_Edge): All object attributes in the ontology are respectively converted into an edge between the two class nodes C1 and C2 connected by it, and the Type attribute of the edge is set to "O_Edge", The attributes of Node1 and Node2 are respectively set to C1 and C2, the Name attribute of the edge is set to the name of the corresponding object attribute, and the currently converted object attribute edges are added to the Edges attributes of nodes C1 and C2 respectively.
[0048] 2) Build a special dictionary and keyword index for word segmentation:
[0049] Traverse all the relational tables in the database, read each record in the relational table in turn, write the record value into the dictionary, and use the record value as the key at the same time, and the corresponding relational table name and column name of the record value in the database As a value, a key-value pair is formed as a keyword index. In a preferred embodiment of the present invention, the non-relational database adopts the MongoDB database, and the key-value pairs are stored in the MongoDB database as a keyword index. Of course, the method of the present invention is not limited to the MongoDB database, and all non-relational (NoSQL) databases can be used here.
[0050] The structure and examples of key-value pairs are shown in Table 3 below. For example, there is a record under the "Name" column in the "Students" table as "Zhang San". When the record of "Zhang San" is read, "Zhang San" is used as the key, and the TableName property is set to " The name of the relation table corresponding to "Zhang San" is "student", and the ColumnName attribute is set to the column name "name" of the column corresponding to "Zhang San" to form a key-value pair and store it in the keyword index. After the dictionary is built, it is saved in the form of a file on the disk, and every time the dictionary is used, it is read from the corresponding location in the disk. The purpose of constructing a special dictionary for word segmentation is to decompose keywords from natural language queries input by users.
[0051] Table 3 Key-value pair structure and examples
[0052]
[0053] In the process of traversing each relational table, the table name and all column names of the relational table are used as keys at the same time to construct the key-value pairs shown in Table 3 and store them in the keyword index: For the table name of the relational table, Set the TableName property to "table" and set the ColumnName property to the table name of the corresponding relationship table; for the column name of the relationship table, set the TableName property to "column", and set the ColumnName property to the column name of the corresponding column.
[0054] In addition, a keyword may correspond to multiple elements in the database. For example, the above "Zhang San" may also correspond to another column of another table, and this key-value pair also needs to be stored in the keyword index. When the user queries "Zhang San", by querying the keyword index, all table names and column names corresponding to the word "Zhang San" can be quickly obtained.
[0055] 3) Use a dedicated dictionary for word segmentation: After the system receives the user's natural language query, it uses the dedicated dictionary constructed in step 2) to decompose the natural language into multiple meaningful keywords;
[0056] 4) Combine figure 2 Explain the construction of the query space: for each keyword decomposed in step 3), by querying the keyword index, all the relational table names and column names corresponding to the word in the database can be obtained. According to step 1), a relation table name corresponds to a class in the ontology, which corresponds to a class node in the PCDO diagram; a column name in the relation table corresponds to a data type attribute in the ontology, and the data type attribute also corresponds to An attribute node in the PCDO diagram. When a keyword corresponds to multiple relational table names and column names, it also corresponds to multiple nodes in the PCDO diagram. Hereinafter, the node in the PCDO graph corresponding to the keyword is called the mapping node of the keyword.
[0057] The Keyword attribute of the mapping node is set to the corresponding keyword. If there are multiple mapping nodes for a keyword, use the Value attribute to distinguish, and Value can be a different number. The KeywordType attribute includes three values "table", "column", and "value", which are determined according to the keyword index. If the TableName value in the keyword index is "table", the KeywordType attribute is set to "table"; if the TableName value in the keyword index is "column", the KeywordType attribute is set to "column"; in other cases, the KeywordType attribute is set Is "value".
[0058] The connected components of all mapping nodes corresponding to all keywords are extracted from the PCDO graph constructed in step 1) as the query space. Because the query space must be a subset of the PCDO graph, and after finding the connected components of all mapping nodes through the keyword index, the search range can be effectively reduced.
[0059] 5) Combine image 3 Explain the connected subgraph search method: According to the query space constructed in step 4), search for connected subgraphs containing all keywords. The main steps are as follows:
[0060] a) Randomly select an unprocessed connected component in the query space, find all the mapping nodes in the connected component, and put it into the set X.
[0061] b) If there are n (n≥2) mapping nodes Node1 and Node in the set X 2 …Node n If the Keyword attribute is the same, the set X will be expanded into n sets X according to the different Value attributes 1 , X 2 …X n , And then divide X by Node 1 , Node 2 …Node n All other mapping nodes are added to X 1 , X 2 …X n And delete set X.
[0062] c) For X 1 , X 2 …X n Repeat step b) repeatedly until the Keyword attribute of each mapping node in each set is different, and finally m non-extendable sets are obtained;
[0063] d) Randomly select one of the m sets and mark it as W, and arbitrarily select a mapping node as the initial node for breadth first search (BFS);
[0064] e) In the BFS traversal process, if a new mapping node is encountered, the path from this node to the initial node is recorded in the set Set, and the newly encountered mapping node is deleted from W;
[0065] f) Repeat e) until all the mapping nodes in the set W have been traversed. At this time, the records in the set Set record the nodes and edges that connect all the mapping nodes in W together, that is, all the nodes and edges in an association W are searched Map the path of the node, and the path is a subset of the PCDO graph constructed in step 1). In the following, such connected paths are called PCDO subgraphs;
[0066] g) Repeat d) until all the collections are processed;
[0067] h) Repeat a) until all sub-query spaces are processed;
[0068] i) Sort all the obtained PCDO subgraphs according to the number of keywords contained in them from large to small. For PCDO subgraphs with the same number of keywords, then according to the number of edges included, from small to large Sort, and finally select the top k PCDO subgraphs in the sort. k needs to determine an appropriate value according to the size of the specific database, or be specified by the user, here only represents an appropriate number;
[0069] 6) Generate SQL statements: According to step 5), k PCDO subgraphs are obtained, and the PCDO subgraphs are converted into SQL statements. The SQL statement format is as follows:
[0070] select
[0071] from
[0072] where
[0073] The conversion rules from PCDO subgraph to SQL:
[0074] a) Fill in "*" in the select clause, which means that all columns in the database that meet the query conditions will be returned to the user;
[0075] b) The from clause fills in the corresponding relation table according to all the nodes in the PCDO subgraph;
[0076] c) The where clause fills in the corresponding foreign key relationship based on the object attribute side of the PCDO subgraph (if there is no object attribute side, leave it blank);
[0077] d) The where clause fills in the value corresponding to the attribute node based on the attribute node in the PCDO subgraph.
[0078] After the above SQL statement is generated, the database is queried through the database query interface, and the result is finally returned to the user.
[0079] The following describes the implementation process of the present invention in detail with a simplified application example:
[0080] 1) Ontology to PCDO diagram: The relational database involved in this example is a database that records information about students' course selection. The data contains the following information:
[0081] Relationship table: students, electives, courses, colleges
[0082] Listed: Student. Name, Student. Student ID, Course Selection. Course Code, Course Selection. Student ID, Course. Department Code, Course. Course Code, College. Department Code, College. Department Remarks
[0083] Foreign key: Course selection. Student ID=student. Student ID, course selection. Course code=course. Course code, course. Department code=College. Department code
[0084] Therefore, the ontology extracted from the schema information of the relational database is an ontology describing the relevant information of students' course selection. The detailed information contained in the ontology is as follows:
[0085] Category: student, course selection, course, college;
[0086] Data type attributes: student_name, student_student ID, course selection_course code, course selection_student ID, course_department code, course_course code, college_department code, college_department remarks;
[0087] Object attributes: student_student ID_selection_student ID, course selection_course code_course_course code, course_department code_college_department code.
[0088] The ontology converted from the relational model is as follows:
[0089]
[0090]
[0091] xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
[0092] xmlns:xsd="http://www.w3.org/2001/XMLSchema#"
[0093] xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#"
[0094] xmlns:owl="http://www.w3.org/2002/07/owl#"
[0095] xmlns="http://www.project.com/d2o_owl#"
[0096] xml:base="http://www.project.com/d2o_owl">
[0097]
[0098]
[0099]
[0100]
[0101]
[0102]
[0103]
[0104]
[0105]
[0106]
[0107]
[0108]
[0109]
[0110]
[0111]
[0112]
[0113]
[0114]
[0115]
[0116]
[0117]
[0118]
[0119]
[0120]
[0121]
[0122]
[0123]
[0124]
[0125]
[0126]
[0127]
[0128]
[0129]
[0130]
[0131]
[0132]
[0133]
[0134]
[0135]
[0136]
[0137]
[0138]
[0139]
[0140]
[0141]
[0142]
[0143]
[0144]
[0145]
[0146]
[0147]
[0148] Use the conversion rules in step 1) to convert the ontology into a PCDO graph data structure, such as Figure 4 Shown.
[0149] The classes in the ontology are converted to the class nodes in the PCDO diagram, that is, the elliptical nodes in the diagram; the data type attributes in the ontology are converted to the attribute nodes in the PCDO diagram, that is, the rectangular nodes in the diagram; the attribute nodes and it The corresponding class nodes are connected by a hasProperty edge; the object properties in the ontology are transformed into edges connecting two class nodes.
[0150] 2) Build a special dictionary and keyword index for word segmentation. The contents of the keyword index are as follows:
[0151] "student":{
[0152] "TableName":"table",
[0153] "ColumnName":"Student"}
[0154] "course":{
[0155] "TableName":"table",
[0156] "ColumnName":"Course"}
[0157] "College": {
[0158] "TableName":"table",
[0159] "ColumnName":"College"}
[0160] "student ID":{
[0161] "TableName":"column",
[0162] "ColumnName":"Student"}
[0163] "09131011":{
[0164] "TableName": "Student"
[0165] "ColumnName":"Student ID",}
[0166] Among them, in the key-value pair corresponding to the key "student", the TableName attribute is "table" and the ColumnName attribute is "student", which means that "student" corresponds to a relational table here, and the relational table is named "student";
[0167] In the key-value pair corresponding to the key "student ID", the TableName field is "column" and the ColumnName field is "student", which means that the "student ID" corresponds to a column name here, and the column name is "student ID";
[0168] The ColumnName field corresponding to the key "09131011" is "Student ID", and the TableName field is "Student", which means that the key "09131011" corresponds to a specific value under the "Student ID" column in the "Student" table.
[0169] 3) Word segmentation: For example, the user enters the query sentence "Find the college of the course selected by the student with student number 09131011", and after word segmentation, five meaningful keywords are obtained: student number, 09131011, student, course, college.
[0170] 4) Constructing the query space: Using the keyword index obtained in 2), the mapping relationship between keywords and nodes in the PCDO graph can be obtained as follows:
[0171] Table 4 Keyword mapping PCDO graph node
[0172]
[0173] According to the mapping relationship between keywords and PCDO graph nodes, a query space such as Figure 5 As shown (the node in bold is the node to which the keyword is mapped).
[0174] 5) Connected subgraph search: According to the search algorithm, find one or more connected subgraphs that can connect all keywords together in the PCDO graph. The search results are as follows Image 6 Shown.
[0175] 6) Generate SQL statements according to the conversion rules from PCDO subgraph to SQL:
[0176] Fill "*" in the select clause to get the select clause: select*;
[0177] Fill in the name of the relation table corresponding to the class node in the from clause to obtain the from clause:
[0178] from students, electives, courses, colleges;
[0179] The where clause is converted to the corresponding foreign key according to the object attribute edge, Image 6 There are three object attribute sides in the middle: student_student number_selection_student number, course_selection_course code_course_course code, course_department code_college_department code, convert respectively, and get the where clause: where student. Student ID = Course Selection. Student ID and Course Selection. Course Code = Course. Course Code and Course. Department Code = College. Department Code
[0180] Finally, the attribute node is processed: the mapping node of "09131011" is "student_student ID", after conversion, student. Student ID = "09131011" is obtained, which is added to the where clause.
[0181] The final generated SQL statement is:
[0182] select*
[0183] from students, electives, courses, colleges
[0184] where student. student ID = course selection. student ID and course selection. Course code = course. Course code and course. Department code = college. Department code and student. Student ID = "09131011"
[0185] After the SQL statement is generated, the database is queried through the database query interface, and the result is finally returned to the user.