The invention relates to a Spark SQL-based distributed full-text retrieval system and method. The
system comprises an SQL translation layer, a data source management layer, a parallel computing layer, and a distributed storage layer. An SQL-based full-text retrieval method is proposed, together with the process by which full-text retrieval SQL statements are translated across the modules of the SQL translation layer. A method for parallelizing the full-text retrieval process is designed in the data source management module. In the retrieval optimization module, two index storage models are designed, each with a corresponding strategy for reducing the original table data during a query; in particular, a partition-aligned join algorithm with O(n) complexity, used to reduce the original table data during a query, is designed for the specified-column index
storage model. Under the two storage models, index construction time is reduced to 0.6% and 0.5%, respectively, of that of a traditional database, query time is reduced to 1% and 10%, respectively, of that of a traditional database, and index storage size is reduced to 55.0% of that of a traditional database. The method strengthens the data analysis capabilities of Spark SQL and meets the requirements of migrating traditional business workloads and of performing full-text retrieval on massive data in existing businesses.
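
The linear-time join described above can be illustrated with a minimal sketch. It is not the claimed implementation: it assumes that index hits and table rows are stored in matching partitions, that both are sorted by a unique row identifier, and that a two-pointer merge is an acceptable stand-in for the join step. The names `aligned_join`, `partition_aligned_join`, and the `(row_id, text)` row layout are hypothetical.

```python
from typing import Dict, List, Tuple

# Hypothetical row layout: (row_id, text). Row ids are assumed unique and sorted.
Row = Tuple[int, str]


def aligned_join(hit_ids: List[int], rows: List[Row]) -> List[Row]:
    """Two-pointer merge of sorted index hits against rows sorted by row_id.

    Each pointer only moves forward, so the cost is linear in
    len(hit_ids) + len(rows).
    """
    out: List[Row] = []
    i = j = 0
    while i < len(hit_ids) and j < len(rows):
        row_id = rows[j][0]
        if hit_ids[i] == row_id:
            out.append(rows[j])  # row matched by the index: keep it
            i += 1
            j += 1
        elif hit_ids[i] < row_id:
            i += 1  # hit has no corresponding row in this partition
        else:
            j += 1  # row was not matched by the index: skip it
    return out


def partition_aligned_join(
    index_hits: Dict[int, List[int]], table: Dict[int, List[Row]]
) -> List[Row]:
    """Join each index partition only with the table partition of the same id.

    Partitions without hits are never read, which is the assumed source of
    the reduction in original table data.
    """
    result: List[Row] = []
    for pid in sorted(index_hits):
        result.extend(aligned_join(index_hits[pid], table.get(pid, [])))
    return result


# Toy data: two table partitions and the row ids an index lookup returned.
table = {
    0: [(1, "spark sql basics"), (2, "full text search"), (3, "storage layer")],
    1: [(4, "text retrieval on spark"), (5, "index build"), (6, "query text")],
}
index_hits = {0: [2], 1: [4, 6]}

matches = partition_aligned_join(index_hits, table)
print(matches)
```

Because the partitions are processed independently in this sketch, each per-partition merge could in principle run as a separate parallel task; how the invention schedules that work is not stated in the abstract.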