A regional geological condition analysis system based on LLM and RAG technologies

By constructing a regional geological condition analysis system based on LLM and RAG technologies, the invention addresses numerical hallucination and low spatiotemporal matching in the content that large language models generate for geological analysis, enabling high-precision geological analysis reports with accurate and reliable results.

CN122309713A · Pending · Publication Date: 2026-06-30 · 武汉智博创享科技股份有限公司

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Application (China)
Current Assignee / Owner
武汉智博创享科技股份有限公司
Filing Date
2026-04-07
Publication Date
2026-06-30

AI Technical Summary

Technical Problem

Existing geological analysis techniques based on large language models lack physical fact constraints, resulting in numerical hallucinations in the generated content. Furthermore, reliance on purely semantic retrieval leads to low spatiotemporal matching of geological data, failing to meet engineering-level accuracy requirements.

Method used

A regional geological condition analysis system based on LLM and RAG technologies is constructed, including a data access module, a data processing module, a knowledge base construction module, a reasoning and retrieval module, a generation and verification module, and a visualization module. By constructing a fact anchor database, a geological knowledge graph, and a composite index vector, the system performs graph reasoning and generation verification to ensure the accuracy and spatiotemporal matching of the generated content.

Benefits of technology

The generated content is checked against geological and physical fact constraints, which supports engineering-level accuracy and reliability of the data. The spatiotemporal matching and recall rate of geological data retrieval are improved, addressing the spatial misalignment and stratigraphic confusion seen in traditional retrieval.

✦ Generated by Eureka AI based on patent content.


Abstract

This application relates to the field of geological information technology and discloses a regional geological condition analysis system based on LLM and RAG technologies. The system includes modules for data access, data processing, knowledge base construction, reasoning and retrieval, generation and verification, and visualization. The method includes: performing hierarchical processing on multi-source geological data; extracting physical attributes from structured hard data to construct a fact anchor database and generating a composite index vector that integrates spatiotemporal semantics; in response to a query request, using a geological knowledge graph to perform meta-path reasoning that expands the query intent; retrieving geological document slices by calculating semantic, spatial, and stratigraphic matching degrees; calling a large language model to generate an initial report, parsing key geological assertions and verifying them against the fact anchor database, and performing physical consistency correction on conflicting data. By introducing fact anchor verification and graph reasoning mechanisms, the invention addresses the hallucination and spatiotemporal misalignment that large models introduce into geological analysis, thereby improving the accuracy of reports.

Description

Technical Field

[0001] This invention relates to the field of geological information technology, specifically to a regional geological condition analysis system based on LLM and RAG technologies.

Background Technology

[0002] With the development of artificial intelligence technology, large language models (LLMs) have demonstrated powerful capabilities in processing massive amounts of unstructured text. To overcome the "hallucination" phenomenon and the problem of knowledge lag in specialized fields, retrieval-augmented generation (RAG) technology has been widely used. RAG improves the credibility and interpretability of generated content to a certain extent by supplying retrieval results from external knowledge bases as contextual input to the model.

[0003] However, directly applying the general RAG framework to regional geological condition analysis still faces challenges in meeting engineering-level accuracy requirements. Existing retrieval technologies primarily rely on semantic similarity of text for matching. However, geological data possesses strong spatiotemporal attributes. Semantically similar descriptions (such as "strong karst development"), if separated from specific geographic coordinates and geological ages, may lead to spatial misalignment or stratigraphic confusion in search results, failing to accurately respond to the analytical needs of specific sites or strata. Furthermore, geological analysis reports have strict constraints on numerical accuracy and physical logic. Existing generative systems lack verification mechanisms for structured hard data (such as borehole measurement data and in-situ test indicators), resulting in conflicts between the geological parameters generated by large models and actual physical exploration data. This makes them difficult to directly apply to engineering construction and disaster prevention and mitigation fields where safety and accuracy requirements are extremely high.

[0004] Therefore, this invention proposes a regional geological condition analysis system based on LLM and RAG technologies to address the shortcomings of existing technologies.

Summary of the Invention

[0005] To address the shortcomings of existing technologies, this invention provides a regional geological condition analysis system based on LLM and RAG technologies. The system addresses two problems of existing geological analysis technologies based on large language models: numerical hallucinations in the generated content caused by the lack of physical fact constraints, and low spatiotemporal matching of geological data caused by reliance on semantic retrieval.

[0006] To achieve the above objectives, the present invention provides the following technical solution: a regional geological condition analysis system based on LLM and RAG technologies, comprising: a data access module, configured to receive multi-source heterogeneous geological raw data, including unstructured text data, structured hard data, and three-dimensional spatial data; a data processing module, configured to clean and deconstruct the geological raw data, including performing semantic segmentation and metadata extraction on the unstructured text data, and extracting physical attributes and geometric features from the structured hard data and the three-dimensional spatial data to construct a fact anchor database; a knowledge base construction module, configured to store a geological knowledge graph and composite index vectors, wherein the geological knowledge graph contains a network of geological entity relationships and the composite index vectors are generated by fusing semantic, spatial, and temporal features; a reasoning retrieval module, configured to identify the core entities in a query request, perform graph reasoning to generate an extended query vector, and recall geological document slices based on the extended query vector and a weighted spatiotemporal semantic scoring algorithm; a generation verification module, configured to call the large language model to generate an initial geological analysis report, parse the geological assertions in the initial report and compare them with the data in the fact anchor database, and perform a correction operation when a conflict is detected; and a visualization module, configured to call and render 3D geological model slices, borehole columnar sections, or geological profiles based on the spatial attributes of the retrieval results.

[0007] Preferably, when processing the unstructured text data, the data processing module performs a text segmentation operation based on metadata inheritance: parsing the document's hierarchical structure tree to identify the titles and corresponding body paragraphs; when generating basic text blocks, tracing back the parent path to which the basic text block belongs, and concatenating the titles at each level as metadata to the front of the basic text block to form slice content containing limited domain information; performing named entity recognition on the slice content, extracting implicit spatial coordinate descriptions and geological time entities, and mapping them as spatial labels and time labels respectively.

[0008] Preferably, when constructing the fact anchor database, the data processing module performs the following operations: for borehole data, extracting the borehole coordinates and layer depths, calculating the absolute spatial coordinates of each layer interface by combining the inclination data, and transforming the borehole layer records into borehole anchor points containing geotechnical physical properties; for three-dimensional geological model data, parsing geometric primitive information and transforming continuous geological objects into discrete model anchor points through spatial grid sampling or centroid extraction; and storing the borehole anchor points and model anchor points uniformly and establishing a multi-dimensional spatial index, wherein the data structure of each anchor point includes a globally unique index key, a planar coordinate vector, a vertical effective depth range, a standardized attribute vector, and data source metadata.

[0009] Preferably, the geological knowledge graph in the knowledge base construction module is constructed in the following way: defining a geological domain ontology and instantiating the extracted geological objects as graph nodes; parsing the spatial topological state between geological bodies in the 3D geological model; if geometric surfaces intersect and carry a fault attribute, establishing "cuts" semantic relationship edges; if the normal vectors of the contact surfaces are continuous, establishing "conformable contact" semantic relationship edges; constructing a heterogeneous graph network, using graph convolutional networks to aggregate the neighborhood features of nodes, calculating the Euclidean distance between text entity nodes and model entity nodes in the embedding space, and establishing alignment links for cross-modal entities.

[0010] Preferably, the knowledge base construction module generates the composite index vector in the following ways: extracting the basic semantic vector of the text slice using a pre-trained language model; mapping the two-dimensional plane coordinates into a high-dimensional space vector using a multi-scale sinusoidal position coding algorithm; mapping the absolute age value or relative stratigraphic index value of geological time into a time feature vector; aligning the dimensions of each component using a learnable projection matrix; and merging the basic semantic vector, the high-dimensional space vector, and the time feature vector into a composite index vector using a weighted concatenation strategy.
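The fusion described in [0010] can be sketched as follows. This is a minimal illustration, not the patented implementation: the function names, the geometric ladder of wavelengths, and the idea of passing the projection matrices in as plain nested lists (in practice they would be learned) are all assumptions, and the semantic and time feature vectors are taken as precomputed inputs.

```python
import math

def sinusoidal_encode(x, y, num_scales=4, base_wavelength=10.0):
    """Multi-scale sinusoidal encoding of planar coordinates.
    The wavelength ladder (10, 100, 1000, ...) is an assumed choice."""
    feats = []
    for k in range(num_scales):
        wavelength = base_wavelength * (10 ** k)
        for v in (x, y):
            phase = 2.0 * math.pi * v / wavelength
            feats.append(math.sin(phase))
            feats.append(math.cos(phase))
    return feats  # length = 4 * num_scales

def project(matrix, vec):
    """Matrix-vector product; stands in for a learnable projection."""
    return [sum(m * v for m, v in zip(row, vec)) for row in matrix]

def composite_index_vector(sem_vec, xy, time_vec, projections, weights):
    """Project each component to a common width, scale it by its
    weight, and concatenate (weighted concatenation)."""
    parts = [sem_vec, sinusoidal_encode(*xy), time_vec]
    out = []
    for part, proj, w in zip(parts, projections, weights):
        out.extend(w * p for p in project(proj, part))
    return out
```

Scaling each block before concatenation lets a later dot-product or cosine comparison weight the semantic, spatial, and temporal parts differently without changing the index layout.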

[0011] Preferably, when generating the extended query vector, the reasoning retrieval module performs the following operations: extracting key geological entities from the query request as seed entities and anchoring them in the geological knowledge graph; based on predefined geological metapaths containing causal chains and attribute associations, performing multi-hop reasoning starting from the seed entities to mine implicit related entities and add them to the extended entity set; using a semantic encoder to generate the semantic vector of the original query and obtaining the pre-trained embedding vector of each entity in the extended entity set; calculating weights based on semantic relevance and the number of hops in the reasoning path; and performing weighted fusion of the semantic vector of the original query and the vector of the extended entity set.
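A toy version of the meta-path expansion and weighted fusion in [0011] might look like the sketch below. The graph layout (adjacency lists of `(relation, target)` pairs), the relation names, the exponential hop decay, and the final normalization are assumptions made for illustration; the source only states that weights depend on semantic relevance and hop count.

```python
import math

def expand_entities(graph, seeds, metapaths):
    """Follow each meta-path (a sequence of relation types) from every
    seed entity. Returns {entity: smallest hop count at which it was reached}."""
    found = {}
    for seed in seeds:
        for path in metapaths:
            frontier = {seed}
            for hop, rel in enumerate(path, start=1):
                frontier = {dst for src in frontier
                            for (r, dst) in graph.get(src, []) if r == rel}
                for ent in frontier:
                    if ent not in seeds:
                        found[ent] = min(hop, found.get(ent, hop))
    return found

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def fuse_query(query_vec, hops, embeddings, decay=0.5):
    """Add each expanded entity embedding to the query vector, weighted by
    semantic relevance and an assumed per-hop decay, then L2-normalize."""
    fused = list(query_vec)
    for ent, hop in hops.items():
        emb = embeddings[ent]
        w = max(cosine(query_vec, emb), 0.0) * (decay ** hop)
        fused = [f + w * e for f, e in zip(fused, emb)]
    norm = math.sqrt(sum(f * f for f in fused))
    return [f / norm for f in fused] if norm else fused
```

For example, with a hypothetical meta-path `("caused_by", "has_attribute")`, a query about karst would reach a lithology entity at hop 1 and one of its attributes at hop 2, with the second contributing less to the fused vector.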

[0012] Preferably, the weighted spatiotemporal semantic scoring algorithm executed by the reasoning retrieval module includes: calculating the semantic similarity between the extended query vector and the composite index vector of the candidate document slices; calculating the Euclidean distance between the coordinates of the query center point and the coordinates of the spatial anchor point associated with the document slice, and calculating the spatial proximity based on the Gaussian kernel function; extracting the geological age or sequence index implied in the query and comparing it with the geological attributes of the document slices to calculate the stratigraphic sequence matching degree; obtaining a comprehensive matching score by linearly weighting and fusing the semantic similarity, spatial proximity and stratigraphic sequence matching degree, and re-ranking the recalled document slices accordingly.
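The scoring in [0012] combines three terms linearly. The sketch below assumes cosine similarity for the semantic term, a Gaussian kernel exp(-d^2 / (2 sigma^2)) for spatial proximity, and a simple linear falloff on the difference of sequence indices for the stratigraphic term; the weights, `sigma`, and the dictionary field names are illustrative placeholders rather than values from the source.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def spatial_proximity(q_xy, d_xy, sigma):
    """Gaussian kernel on the planar Euclidean distance."""
    dist = math.hypot(q_xy[0] - d_xy[0], q_xy[1] - d_xy[1])
    return math.exp(-(dist ** 2) / (2.0 * sigma ** 2))

def sequence_match(q_index, d_index, span=3):
    """1.0 for the same stratigraphic index, falling linearly to 0.0
    once the indices differ by `span` or more (assumed form)."""
    return max(0.0, 1.0 - abs(q_index - d_index) / span)

def rerank(query, slices, weights=(0.5, 0.3, 0.2), sigma=500.0):
    """Return (slice id, score) pairs sorted by the fused score."""
    w_sem, w_spa, w_seq = weights
    scored = []
    for s in slices:
        score = (w_sem * cosine(query["vec"], s["vec"])
                 + w_spa * spatial_proximity(query["xy"], s["xy"], sigma)
                 + w_seq * sequence_match(query["seq"], s["seq"]))
        scored.append((s["id"], score))
    return sorted(scored, key=lambda t: t[1], reverse=True)
```

With these settings, a slice that is a perfect semantic match but 5 km away scores below a slightly less similar slice 100 m from the query point, which is the behavior the paragraph describes.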

[0013] Preferably, the generation controller in the generation verification module is configured to: receive the query request and the retrieved Top-K document slice set; construct a prompt context according to a preset template, the prompt context including a system instruction area injected with anti-hallucination instructions, a factual basis area in which the slices are arranged by relevance score and assigned index identifiers, and a task generation area filled with the query request; and input the prompt context into a large language model to generate, through autoregressive decoding, the initial geological analysis report containing the reference indices.
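The three-area template in [0013] could be assembled as in the following sketch. The section headers, the instruction wording, and the `text` and `score` field names are assumptions; the source does not give the actual template.

```python
def build_prompt(query, slices, top_k=3):
    """Assemble a prompt with a system instruction area, a numbered
    evidence area ordered by relevance score, and a task area."""
    ranked = sorted(slices, key=lambda s: s["score"], reverse=True)[:top_k]
    evidence = "\n".join(
        f"[{i}] {s['text']}" for i, s in enumerate(ranked, start=1)
    )
    system = (
        "You are a geological analyst. Use only the numbered evidence below. "
        "Cite evidence as [n]. If the evidence does not contain a value, "
        "say so instead of estimating."
    )
    return (
        f"### System\n{system}\n\n"
        f"### Evidence\n{evidence}\n\n"
        f"### Task\n{query}\n"
    )
```

Numbering the evidence lets the downstream verification step trace each cited statement in the report back to a specific slice.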

[0014] Preferably, when performing the correction operation, the generation verification module is configured to: extract key statements containing numerical values, location, and attribute descriptions from the initial geological analysis report and convert them into geological assertion tuples, wherein each geological assertion tuple contains a spatial range, an attribute name, and a generated value; use the spatial range as the index key to query the fact anchor database, obtain the set of valid anchor points within the corresponding range, and calculate the reference truth value; calculate the relative error or semantic matching score between the generated value and the reference truth value; if the relative error exceeds a preset tolerance threshold or the semantic matching score is lower than the threshold, determine that a conflict exists; and replace the erroneous numerical values in the initial geological analysis report with the reference truth value or rewrite the relevant attribute descriptions, and add data source annotations.
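The numeric branch of the check in [0014] can be illustrated as below. This sketch makes several assumptions: the reference truth value is taken as the mean over matching anchors, the tolerance is 10 %, the anchor lookup is a linear scan rather than an indexed query, and the field names are hypothetical. The semantic matching branch for qualitative attributes is omitted.

```python
import math
from dataclasses import dataclass
from statistics import mean

@dataclass
class Anchor:
    key: str
    xy: tuple
    z_range: tuple   # (z_min, z_max)
    attrs: dict
    source: str      # e.g. original borehole number

@dataclass
class Assertion:
    xy: tuple
    z: float
    radius: float
    attr: str
    value: float

def reference_value(assertion, anchors):
    """Mean attribute value over anchors inside the asserted spatial range."""
    hits = [a for a in anchors
            if math.dist(a.xy, assertion.xy) <= assertion.radius
            and a.z_range[0] <= assertion.z <= a.z_range[1]
            and assertion.attr in a.attrs]
    if not hits:
        return None, []
    return mean(a.attrs[assertion.attr] for a in hits), [a.source for a in hits]

def verify(assertion, anchors, tolerance=0.10):
    """Flag a conflict when the relative error exceeds the tolerance and
    return the reference value plus its sources for annotation."""
    ref, sources = reference_value(assertion, anchors)
    if ref is None:
        return {"status": "unverified", "value": assertion.value, "sources": []}
    rel_err = abs(assertion.value - ref) / abs(ref) if ref else abs(assertion.value)
    if rel_err > tolerance:
        return {"status": "corrected", "value": ref, "sources": sources}
    return {"status": "ok", "value": assertion.value, "sources": sources}
```

Returning an explicit "unverified" status when no anchor covers the asserted location keeps the absence of evidence distinct from agreement with it.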

[0015] Preferably, the visualization module is configured to: parse the coordinate set of all spatial points involved in the verified geological analysis report and calculate the axis-aligned bounding box; calculate the optimal observation point position and line-of-sight target point of the virtual camera based on the axis-aligned bounding box; retrieve model data that have spatial intersection with the axis-aligned bounding box from the 3D spatial database using an octree or 3D R-Tree index structure; dynamically schedule model data at different levels of detail based on the camera distance, and establish an interactive mapping relationship between geological entity identifiers in text paragraphs and geometric primitive object identifiers in the 3D scene.
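The bounding-box and camera steps in [0015] can be sketched as follows. The camera placement here is one simple heuristic (fit the box's bounding sphere inside the field of view along a fixed viewing direction), which is an assumption; the source does not specify how the optimal viewpoint is computed. The `intersects` helper shows only the box-overlap predicate that an octree or 3D R-Tree would evaluate, not the index itself.

```python
import math

def aabb(points):
    """Axis-aligned bounding box of a set of (x, y, z) points."""
    xs, ys, zs = zip(*points)
    return (min(xs), min(ys), min(zs)), (max(xs), max(ys), max(zs))

def camera_for(box, fov_deg=60.0, direction=(1.0, -1.0, 1.0)):
    """Look at the box center from far enough away that the box's
    bounding sphere fits inside the field of view (assumed heuristic)."""
    lo, hi = box
    target = tuple((a + b) / 2 for a, b in zip(lo, hi))
    radius = 0.5 * math.dist(lo, hi)
    distance = radius / math.sin(math.radians(fov_deg) / 2)
    norm = math.sqrt(sum(d * d for d in direction))
    eye = tuple(t + distance * d / norm for t, d in zip(target, direction))
    return eye, target

def intersects(a, b):
    """True if two axis-aligned boxes overlap on every axis."""
    return all(a[0][i] <= b[1][i] and b[0][i] <= a[1][i] for i in range(3))
```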

[0016] This invention provides a regional geological condition analysis system based on LLM and RAG technologies. It has the following beneficial effects:

1. This invention addresses the hallucination problem in numerical geological analysis with large language models by constructing a fact anchor database based on structured hard data and introducing a generation and verification module. The system parses the generated unstructured text into verifiable geological assertions and compares them with physical anchor data derived from boreholes and 3D models. Automatic correction is triggered once a numerical or attribute deviation exceeds a threshold. This mechanism keeps the final analysis report within the constraints of geological and physical facts, supporting the accuracy and reliability of engineering-level data.

[0017] 2. This invention utilizes a geological knowledge graph and reasoning retrieval module to achieve deep semantic query expansion based on meta-paths. For causal relationships or technical terms implicit in user queries, the graph reasoning engine can mine potential related entities along predefined reasoning templates and merge the expanded results into an enhanced query vector. This mechanism breaks through the limitations of traditional keyword retrieval, enabling the recall of semantically related but literally mismatched geological document slices, significantly improving the recall and completeness of the retrieval.

[0018] 3. This invention employs a composite index vector integrating semantic, spatial, and temporal features, along with a hybrid retrieval strategy, improving the spatiotemporal matching degree of geological data retrieval. By semantically segmenting text data and combining multi-scale spatial coding and geological time embedding techniques, the system can simultaneously measure semantic similarity, spatial proximity, and stratigraphic matching degree in geological document retrieval. This filters out interfering information that is semantically highly relevant but geographically distant or geologically inconsistent, so that the contextual basis used to generate analysis reports has a high degree of regional specificity and temporal logic.

Attached Figure Description

[0019] Figure 1 is a schematic diagram of the system architecture of the present invention; Figure 2 is a schematic diagram of the method flow of the present invention; Figure 3 is a schematic diagram illustrating the principle of knowledge graph-based query expansion and multidimensional hybrid retrieval of the present invention; Figure 4 is a logical diagram of the geological assertion verification and self-correction mechanism of the present invention.

[0020] In the figures, 110 is the data access module; 120 is the data processing module; 130 is the knowledge base construction module; 140 is the reasoning and retrieval module; 150 is the generation and verification module; and 160 is the visualization module.

Detailed Implementation

[0021] The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some embodiments of the present invention, and not all embodiments. Based on the embodiments of the present invention, all other embodiments obtained by those skilled in the art without creative effort are within the scope of protection of the present invention.

[0022] Referring to Figure 1, which is a schematic diagram of a regional geological condition analysis system based on LLM and RAG technologies according to an embodiment of the present invention, the system comprises the following modules. The data access module 110 is configured to receive multi-source heterogeneous geological raw data, including unstructured text data, structured hard data, and three-dimensional spatial data.

[0023] The data processing module 120 is configured to clean and deconstruct the raw geological data, including performing semantic segmentation and metadata extraction on unstructured text data, and extracting physical attributes and geometric features from structured hard data and three-dimensional spatial data to construct a fact anchor database.

[0024] The knowledge base construction module 130 is configured to store geological knowledge graphs and composite index vectors. The geological knowledge graph contains a network of geological entity relationships, and the composite index vectors are generated by fusing semantic features, spatial features, and temporal features.

[0025] The reasoning retrieval module 140 is configured to identify the core entities in the query request and perform graph reasoning to generate an extended query vector, and recall geological document slices based on the extended query vector and a weighted spatiotemporal semantic scoring algorithm.

[0026] The generation verification module 150 is configured to call the large language model to generate an initial geological analysis report, parse the geological assertions in the initial geological analysis report and compare them with the data in the fact anchor database, and perform correction operations when conflicts are detected.

[0027] The visualization module 160 is configured to call and render three-dimensional geological model slices, borehole columnar sections, or geological profiles based on the spatial attributes of the search results.

[0028] Referring to Figure 2, which is a flowchart of a regional geological condition analysis method based on LLM and RAG technologies according to an embodiment of the present invention, the method comprises the following steps:
S100: perform hierarchical processing on the accessed multi-source heterogeneous geological raw data, extract physical attributes from the structured hard data to construct a fact anchor database, and transform the processed data into composite index vectors containing semantic, spatial, and temporal features;
S200: in response to a user-input query request, identify core entities and perform multi-hop reasoning in the geological knowledge graph, mining related concepts to generate an extended query vector;
S300: use the extended query vector to perform hybrid retrieval, recalling geological document slices by calculating semantic similarity, spatial proximity, and stratigraphic sequence matching degree;
S400: input the geological document slices into a large language model to generate an initial analysis report, parse the key geological assertions and verify them against the fact anchor database, and perform corrections if data conflicts exist;
S500: output the verified geological analysis report and render the corresponding 3D geological model or geological map based on the spatial scope involved.

[0029] Referring to Figure 2, the data access module 110 is configured to establish an input channel for multi-source heterogeneous data. For geological raw data in different storage formats, the data access module 110 integrates multiple adapter interfaces to achieve standardized data reading. For borehole data tables and geophysical parameter tables stored in relational databases (such as MySQL and PostgreSQL), the data access module 110 establishes a data pipeline through database connection protocols (such as JDBC or ODBC) and executes SQL queries to extract records in batches. For unstructured document files (such as geological exploration reports and standards in PDF and Word formats), the data access module 110 uses a Document Object Model (DOM) parser or Optical Character Recognition (OCR) interface to convert the file stream into an editable text stream. For three-dimensional spatial data (such as geological models in OBJ, GOCAD, and 3D Tiles formats), the data access module 110 calls a geometric kernel parser to read vertex coordinates, mesh topology, and texture mapping information.

[0030] During the data access phase, the system implements a data layering and governance strategy, dividing the raw data into a soft data set $D_{soft}$ and a hard data set $D_{hard}$ according to the modality in which the data carries information. The soft data set $D_{soft}$ mainly includes qualitative data described primarily in natural language, such as geological description texts, design documents, and accompanying two-dimensional images; this type of data is semantically rich but lacks structured logic that machines can compute directly. The hard data set $D_{hard}$ mainly includes quantitative data based on numerical values, coordinates, and geometric topology, such as borehole columnar section data, in-situ test data, and three-dimensional geological models; this type of data has clear spatial coordinate attributes and physical attribute constraints.

[0031] The unstructured data processing unit in data processing module 120 performs semantically enhanced segmentation on the soft data set $D_{soft}$. This process aims to address the semantic loss caused by context truncation during the segmentation of long documents.

[0032] S110, perform text segmentation operation based on metadata inheritance; the unstructured data processing unit first parses the hierarchical structure tree of the document, identifies the chapter, section, subsection titles and corresponding body paragraphs in the document; when segmenting the body paragraphs to generate basic text blocks, the system forcibly traces back the parent path to which the text block belongs, and splices the titles at all levels as metadata to the front of the basic text block.

[0033] Let the original document $D$ be divided into a sequence of slices $\{c_1, c_2, \dots, c_n\}$. For any slice $c_i$, its content is formally represented as $c_i = H_i \oplus B_i$, where $B_i$ denotes the basic text content of the $i$-th slice, $H_i$ denotes the sequence of hierarchical headings to which the $i$-th slice belongs, and $\oplus$ denotes string concatenation. For example, when processing a description of "silty soil" in the "4.1 Stratigraphy and Lithology" section under "Chapter 4 Engineering Geological Conditions", the generated slice content explicitly includes the qualifying domain information "Engineering Geological Conditions - Stratigraphy and Lithology", so that the slice has a clear semantic context during subsequent retrieval.
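The heading-inheritance rule above can be sketched in a few lines. The heading detector here is a deliberately naive assumption (lines starting with "Chapter N" or a dotted section number such as "4.1"); a real document parser would read the structure tree instead, and body text that happens to start with a dotted number would be misclassified by this toy pattern.

```python
import re

# Assumed heading pattern: "Chapter 4 Title" or "4.1 Title" / "4.1.2 Title".
HEADING = re.compile(r"^(Chapter\s+\d+|\d+(?:\.\d+)+)\s+(.*)$")

def heading_level(label):
    """'Chapter 4' is level 1; '4.1' is level 2; '4.1.2' is level 3."""
    if label.startswith("Chapter"):
        return 1
    return label.count(".") + 1

def segment(lines):
    """Prefix every body paragraph with the titles on its parent path."""
    path = []    # list of (level, title) from the root to the current section
    slices = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        m = HEADING.match(line)
        if m:
            level = heading_level(m.group(1))
            path = [(l, t) for l, t in path if l < level] + [(level, m.group(2))]
        else:
            prefix = " - ".join(t for _, t in path)
            slices.append(f"{prefix}: {line}" if prefix else line)
    return slices
```

Applied to the example in the text, a paragraph under "4.1 Stratigraphy and Lithology" is emitted with the prefix "Engineering Geological Conditions - Stratigraphy and Lithology".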

[0034] S120: perform spatiotemporal label extraction and mapping. After completing the segmentation, the system performs named entity recognition (NER) or regular expression matching on each slice $c_i$ to extract implicit spatial and temporal features from the text.

[0035] For spatial features, the system uses regular expressions to match absolute coordinate descriptions in the text (such as "X=...Y=...") and directly extracts the coordinate values as spatial labels. If the text only contains place names (such as "Plot A" or "Fault Zone B"), the system calls a pre-built gazetteer or geocoding service to map the place name to the corresponding geospatial grid code (such as a Geohash code or S2 cell ID). Let the mapping function be $f_{geo}$; the extracted spatial tag is then represented as $s_i = f_{geo}(loc_i)$, where $loc_i$ is the place name found in the $i$-th slice. For time features, the system identifies geological time entities in the text (such as "Cretaceous" and "Quaternary") and maps them to absolute geological age ranges based on the International Chronostratigraphic Chart of the International Commission on Stratigraphy (ICS). If the text describes a relative stratigraphic sequence (such as "overlying layer" or "underlying layer"), the system assigns it a relative stratigraphic sequence index value based on the regional stratigraphic column knowledge base. The extracted time tag $\tau_i$ is used for subsequent time dimension encoding. For the specific implementation of the named entity recognition algorithm and regular expression matching, those skilled in the art can use existing natural language processing toolkits, which will not be elaborated here.
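A small regex-based version of this extraction might look like the sketch below. Everything specific in it is an assumption for illustration: the coordinate pattern, the gazetteer passed in as a plain dictionary with placeholder grid codes, and the age table, whose values are rounded approximations that should be checked against the current ICS chart before any real use.

```python
import re

# Assumed coordinate syntax such as "X=3512.5, Y=4870".
COORD = re.compile(
    r"X\s*=\s*(-?\d+(?:\.\d+)?)\s*[,;]?\s*Y\s*=\s*(-?\d+(?:\.\d+)?)"
)

# Approximate (older, younger) bounds in millions of years; illustrative only.
GEOLOGIC_AGES_MA = {
    "Cretaceous": (145.0, 66.0),
    "Quaternary": (2.58, 0.0),
}

def extract_tags(text, gazetteer):
    """Return a spatial tag (coordinates, else gazetteer grid code)
    and a time tag (period name plus age range) for one slice."""
    tags = {"space": None, "time": None}
    m = COORD.search(text)
    if m:
        tags["space"] = ("xy", float(m.group(1)), float(m.group(2)))
    else:
        for name, code in gazetteer.items():
            if name in text:
                tags["space"] = ("grid", code)
                break
    lowered = text.lower()
    for period, age_range in GEOLOGIC_AGES_MA.items():
        if period.lower() in lowered:
            tags["time"] = (period, age_range)
            break
    return tags
```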

[0036] The structured data processing unit in data processing module 120 performs anchor point extraction and transformation operations on structured hard data; this step aims to transform heterogeneous engineering numerical records into unified spatial fact anchor points in order to build a fact anchor point database that can logically constrain the generated content.

[0037] S130: perform spatial discretization and attribute mapping of borehole data. The structured data processing unit reads the borehole coordinate table, inclinometer data table, and soil stratification table from the borehole database. For each borehole record, the system first extracts the three-dimensional coordinates $(x_0, y_0, z_0)$ of the borehole collar. For the $j$-th layer record in the borehole, it reads the top depth $d_j^{top}$, the bottom depth $d_j^{bot}$, and the corresponding set of geological attributes $A_j$ (including lithology name, standard penetration test blow count, water content, and physical and mechanical properties).

[0038] To transform a one-dimensional depth record into a solid segment in three-dimensional space, the system calculates the absolute spatial coordinates of each layer interface using the inclinometer data. Assuming the borehole is vertical (for inclined holes, the conventional minimum curvature method is used for trajectory calculation, which will not be elaborated here), the vertical span $[z_j^{bot}, z_j^{top}]$ of the $j$-th layer in three-dimensional space is calculated as $z_j^{top} = z_0 - d_j^{top}$ and $z_j^{bot} = z_0 - d_j^{bot}$, where $z_0$ is the collar elevation and $d_j^{top}$, $d_j^{bot}$ are the top and bottom depths of the layer. Based on the above calculations, the system generates a set of anchor points corresponding to the borehole data; each layer is discretized into one or more spatial anchor points with attributes. Taking the center position of the $j$-th layer, the borehole anchor point is represented as $P_j^{bh} = (ID_{bh}, x_0, y_0, z_j^{c}, A_j)$ with $z_j^{c} = (z_j^{top} + z_j^{bot})/2$, where $ID_{bh}$ is the unique identifier of the borehole and is used for data traceability.
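For the vertical-hole case, the depth-to-elevation conversion reduces to subtracting each depth from the collar elevation. The sketch below assumes elevations increase upward, one anchor per layer placed at the layer center, and hypothetical field names; inclined holes (minimum curvature) are not handled.

```python
def borehole_anchors(hole_id, collar, layers):
    """Convert layer depth records of a vertical borehole into anchors.

    collar: (x0, y0, z0) with z0 the collar elevation.
    layers: iterable of (top_depth, bottom_depth, attrs) measured downhole.
    """
    x0, y0, z0 = collar
    anchors = []
    for j, (d_top, d_bot, attrs) in enumerate(layers, start=1):
        z_top = z0 - d_top   # elevation of the layer top
        z_bot = z0 - d_bot   # elevation of the layer bottom
        anchors.append({
            "id": f"{hole_id}-L{j}",
            "hole": hole_id,               # kept for traceability
            "x": x0,
            "y": y0,
            "z_center": (z_top + z_bot) / 2,
            "z_range": (z_bot, z_top),
            "attrs": attrs,
        })
    return anchors
```

For a collar at elevation 50 m, a layer logged from 3 m to 8 m depth becomes an anchor spanning elevations 42 m to 47 m, centered at 44.5 m.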

[0039] S140 performs geometric analysis and solidification sampling of the 3D geological model; for the 3D geological model file (such as data stored in the form of triangular mesh or volume elements), the structured data processing unit parses the geometric primitive information of the model; for each independent geological object in the model (such as a specific stratum, fault or ore body), the system reads its vertex list, face index list and associated attribute table.

[0040] To transform continuous geometric models into discrete verification anchor points, the system performs spatial grid sampling or centroid extraction within the model bounding box. For volumetric models, it directly extracts the coordinates of the volume element centers and their attribute values; for surface models constructed using the boundary representation method (B-Rep), it calculates the spatial extent of the entity within the model using a ray casting algorithm and generates a set of spatial points representing the geological entity. For the $k$-th sampling point, the model anchor point is represented as $P_k^{model} = (ID_{obj}, (x_k, y_k, z_k), A_{obj})$, where $ID_{obj}$ is the unique code of the geological object, $(x_k, y_k, z_k)$ are the three-dimensional coordinates of the sampling point, and $A_{obj}$ is the set of attributes of the geological body to which this coordinate point belongs.

[0041] S150: construct and index the fact anchor database. The system integrates the borehole anchors and model anchors generated in steps S130 and S140 into a relational database or spatiotemporal database to form the fact anchor database. The storage schema of this database is defined as a quintuple $Anchor = (K, \mathbf{p}_{xy}, [z_{min}, z_{max}], \mathbf{v}_{attr}, M_{src})$, where $K$ is the globally unique index key of the anchor point; $\mathbf{p}_{xy}$ is the planar coordinate vector; $[z_{min}, z_{max}]$ is the effective depth range of the anchor point in the vertical direction; $\mathbf{v}_{attr}$ is the standardized attribute vector, containing lithology category codes and physical and mechanical parameter values; and $M_{src}$ is the metadata field for the data source, pointing to the original borehole number or model file name.

[0042] To support subsequent high-frequency spatial queries and validations, the structured data processing unit creates a multi-dimensional spatial index (such as an R-Tree index or a GiST index) on the coordinate field of the fact anchor database. This spatial index allows the system, on receiving a verification request, to quickly retrieve all fact anchors within a radius $r$ of a given coordinate $\mathbf{q}$, i.e., the set $\{a \mid \lVert \mathbf{p}_a - \mathbf{q} \rVert \le r\}$, where $\mathbf{p}_a$ is the position of anchor $a$. This provides a physical factual basis for verifying the logical consistency of the generated content. Through the above steps, the system completes the transformation from raw engineering data to calculable and verifiable facts.
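The quintuple record and the radius query can be illustrated with a small in-memory index. Note the substitution: a uniform grid hash is used here as a simple stand-in for the R-Tree or GiST index named in the text, and the class and field names are hypothetical.

```python
import math
from collections import defaultdict
from dataclasses import dataclass

@dataclass
class FactAnchor:
    key: str          # globally unique index key
    xy: tuple         # planar coordinate vector
    z_range: tuple    # (z_min, z_max) effective vertical range
    attrs: dict       # standardized attributes
    source: str       # borehole number or model file name

class AnchorIndex:
    """Uniform grid over (x, y); a stand-in for an R-Tree / GiST index."""

    def __init__(self, cell=100.0):
        self.cell = cell
        self.cells = defaultdict(list)

    def _cell(self, x, y):
        return (math.floor(x / self.cell), math.floor(y / self.cell))

    def insert(self, anchor):
        self.cells[self._cell(*anchor.xy)].append(anchor)

    def within(self, x, y, radius):
        """All anchors whose planar distance to (x, y) is <= radius."""
        cx0, cy0 = self._cell(x - radius, y - radius)
        cx1, cy1 = self._cell(x + radius, y + radius)
        hits = []
        for cx in range(cx0, cx1 + 1):
            for cy in range(cy0, cy1 + 1):
                for a in self.cells.get((cx, cy), []):
                    if math.hypot(a.xy[0] - x, a.xy[1] - y) <= radius:
                        hits.append(a)
        return hits
```

Only the grid cells overlapping the query circle's bounding square are scanned, which is the same pruning idea a tree index applies hierarchically.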

[0043] The geological knowledge graph storage unit in the knowledge base construction module 130 performs the topological semantic construction operation of the geological knowledge graph; this step aims to transform the scattered geological entities and their implicit three-dimensional spatial relationships into explicit graph structure data to support subsequent associative reasoning.

[0044] S210, define the geological domain ontology and initialize the graph nodes; the system presets a geological domain ontology model, which includes a set of geological entity types and a set of attributes; the geological entity types cover stratigraphic units, fault structures, fold structures and hydrogeological units; the attribute set covers stratigraphic thickness, permeability coefficient and rock mechanical parameters; the system traverses the preprocessed text data and 3D model metadata, and instantiates the extracted geological objects as nodes in the graph.

[0045] S220 performs semantic transformation of 3D geometric topological relationships. Unlike traditional methods that rely solely on text co-occurrence to extract relationships, this step uses computational geometry algorithms to parse the spatial topological state between geological bodies in the 3D geological model and maps it to relationship edges with geological semantics. The system reads mesh objects from the 3D geological model and, for any two spatially adjacent geological body objects $O_i$ and $O_j$, calculates the result of their geometric Boolean operation and the orientation of their normal vectors. If the geometry of $O_i$ is detected to intersect the geometric surface of $O_j$, and $O_i$ is marked with a fault attribute, a "cuts" semantic relationship edge pointing from $O_i$ to $O_j$ is established in the graph; if $O_i$ is detected to be in contact with $O_j$ and the normal vectors of the contact surfaces are continuous, a "conformable contact" semantic relationship edge is established; if the contact surface has erosion surface features, an "unconformable contact" relationship edge is established. Through this process, the geometric positional relationships in the physical world are transformed into logical reasoning paths in the knowledge graph.
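The mapping from geometric tests to relationship edges can be sketched as a small decision function. This is only an illustrative sketch: the function name `classify_contact` and its Boolean inputs are hypothetical, the geometric predicates (Boolean intersection, normal continuity, erosion surface detection) are assumed to have been computed elsewhere, and the precedence among the rules is an assumption. The labels use conformable and unconformable contact, the standard stratigraphic terms for the two contact relations described above.

```python
def classify_contact(intersects, a_is_fault, normals_continuous, has_erosion_surface):
    """Map precomputed geometric predicates for (A, B) to an edge label.

    Returns a relation name for an edge A -> B, or None if no rule fires.
    Assumption: fault cutting is checked first, then erosion features.
    """
    if intersects and a_is_fault:
        return "cuts"
    if has_erosion_surface:
        return "unconformable_contact"
    if normals_continuous:
        return "conformable_contact"
    return None


def build_edge(a_id, b_id, **predicates):
    """Return an (A, relation, B) triple, or None when no relation applies."""
    relation = classify_contact(**predicates)
    return (a_id, relation, b_id) if relation else None


edge = build_edge(
    "Fault_01_Obj", "Layer_03_Obj",
    intersects=True, a_is_fault=True,
    normals_continuous=False, has_erosion_surface=False,
)
```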

[0046] S230 performs multi-source entity alignment based on graph neural networks. Addressing the issue of inconsistent identification of the same geological entity in text descriptions (e.g., "F1 fault") and 3D model naming (e.g., "Fault_01_Obj"), the system constructs a heterogeneous graph network and aggregates the neighborhood features of nodes using graph convolutional networks (GCNs). The system uses the semantic vectors of text entities and the spatial centroid coordinates of model entities as initial feature inputs, updating node representations through multi-layer graph convolution operations. It calculates the Euclidean distance between text entity nodes and model entity nodes in the embedding space; if the distance is less than a preset threshold, they are determined to be the same entity, and an alignment link is established, thereby achieving cross-modal knowledge fusion.
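The final alignment decision, comparing embedding distances against a threshold, can be sketched as follows. The graph convolution itself is not reproduced: the embeddings are assumed to have been produced already, the function name `align_entities` and the sample vectors are hypothetical, and linking each text entity only to its nearest model entity is an added assumption.

```python
from math import dist


def align_entities(text_emb, model_emb, threshold):
    """Link each text entity to its nearest model entity in embedding space.

    A link is created only if the Euclidean distance is below `threshold`.
    """
    links = []
    for t_name, t_vec in text_emb.items():
        m_name, m_vec = min(model_emb.items(), key=lambda kv: dist(t_vec, kv[1]))
        if dist(t_vec, m_vec) < threshold:
            links.append((t_name, m_name))
    return links


# Illustrative embeddings only (2-D for readability).
text_emb = {"F1 fault": (0.10, 0.90)}
model_emb = {"Fault_01_Obj": (0.12, 0.88), "Layer_03_Obj": (0.90, 0.10)}
links = align_entities(text_emb, model_emb, threshold=0.1)
```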

[0047] The spatiotemporal dual-encoding vector library in the knowledge base construction module 130 performs the spatiotemporal dual-encoding index generation operation; the core of this step is to map discrete geographic coordinates and geological time into a continuous vector space and deeply integrate it with text semantic vectors, thereby constructing a composite index that supports multidimensional retrieval.

[0048] S240 generates text semantic vectors; the system calls a pre-trained language model fine-tuned on a geological corpus (such as BERT-Geology), feeds the text slice $T$ into the model encoder, and extracts the mean of its last hidden states as the base semantic vector $V_{sem}$.

[0049] S250, generate spatiotemporal encoding embeddings; this step converts the spatial labels (coordinates) and time labels (geological ages) associated with text slices into computable vector features.

[0050] For spatial characteristics, to preserve geographic proximity and directionality, the system employs a multi-scale sinusoidal position encoding algorithm to map the two-dimensional planar coordinates $(x, y)$ into a high-dimensional spatial vector. Each coordinate dimension $p \in \{x, y\}$ is mapped to a vector $PE(p)$ whose $2i$-th and $(2i+1)$-th components are calculated as: $PE(p, 2i) = \sin\left(p / 10000^{2i/d}\right)$; $PE(p, 2i+1) = \cos\left(p / 10000^{2i/d}\right)$, where $i$ is the dimension index and $d$ is the total dimension of the spatial encoding. The final spatial encoding vector $V_{spa}$ is composed of the x-axis encoding and the y-axis encoding. This encoding method enables the model to directly perceive the Euclidean distance relationship in physical space through the vector dot product.
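A minimal sketch of multi-scale sinusoidal coordinate encoding is given below to make the construction concrete. The function names, the per-axis dimension, and the base constant 10000 (the conventional choice for sinusoidal position encoding) are assumptions made for illustration.

```python
from math import cos, sin


def encode_axis(p, dim, base=10000.0):
    """Encode one coordinate `p` as `dim` interleaved sin/cos values.

    `dim` must be even; each pair uses a different wavelength, which is
    what makes the encoding multi-scale.
    """
    vec = []
    for i in range(dim // 2):
        scale = base ** (2 * i / dim)
        vec.append(sin(p / scale))
        vec.append(cos(p / scale))
    return vec


def encode_xy(x, y, dim_per_axis=8):
    """Concatenate the x-axis and y-axis encodings into one spatial vector."""
    return encode_axis(x, dim_per_axis) + encode_axis(y, dim_per_axis)


v_spa = encode_xy(1250.0, 830.0)
```

Because each sin/cos pair satisfies $\sin a \sin b + \cos a \cos b = \cos(a - b)$, the dot product of two encodings of the same axis depends only on the coordinate difference, which is the property the paragraph above relies on.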

[0051] For temporal characteristics, the system feeds the absolute age value $t$ of the geological time (unit: million years, Ma) into a one-dimensional temporal embedding layer, which maps it to a temporal feature vector $V_{time}$ through a learnable linear transformation and an activation function. For descriptions that contain only a relative stratigraphic order (such as "overlying"), the system converts them into the corresponding stratigraphic sequence index values for embedding.

[0052] S260, construct the composite index vector; the system employs a weighted concatenation strategy to fuse the semantic vector $V_{sem}$, the spatial vector $V_{spa}$, and the temporal vector $V_{time}$ into the final composite index vector. To balance the influence of the different feature dimensions on the retrieval results, the system introduces learnable projection matrices to perform dimension alignment and feature filtering for each component. The composite vector is generated as: $V_{comp} = V_{sem} \oplus (W_s V_{spa}) \oplus (W_t V_{time})$, where $W_s$ and $W_t$ are the spatial feature projection matrix and the temporal feature projection matrix, respectively, and $\oplus$ is the vector concatenation operation. The generated vector $V_{comp}$ is stored in a vector database as the direct index for subsequent hybrid retrieval. This composite vector structure ensures that the retrieval process can simultaneously respond to multi-dimensional query intents such as "what the content is about," "where it is located," and "what geological period it belongs to."
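The projection-then-concatenation step can be sketched in a few lines. This is an assumption-laden illustration: the function names are hypothetical, the projection matrices are fixed example values rather than learned parameters, and plain Python lists stand in for tensors.

```python
def matvec(matrix, vec):
    """Multiply a row-major matrix by a vector."""
    return [sum(w * v for w, v in zip(row, vec)) for row in matrix]


def composite_vector(v_sem, v_spa, v_time, w_spa, w_time):
    """Concatenate the semantic vector with the projected spatial and
    temporal vectors to form one composite index vector."""
    return list(v_sem) + matvec(w_spa, v_spa) + matvec(w_time, v_time)


# Example values only: a 3-D spatial vector projected to 2-D,
# and a 1-D temporal vector scaled by 0.25.
v_comp = composite_vector(
    v_sem=[1.0, 2.0],
    v_spa=[1.0, 0.0, 1.0],
    v_time=[2.0],
    w_spa=[[0.5, 0.0, 0.0], [0.0, 0.0, 0.5]],
    w_time=[[0.25]],
)
```

Keeping the three components in fixed positions is what later allows the semantic part to be sliced back out of the stored vector by a dimension segmentation rule.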

[0053] Referring to Figure 3, the graph reasoning engine in the reasoning retrieval module 140 is configured to perform query expansion operations based on the knowledge graph. This operation addresses the low retrieval recall caused by incomplete user query expressions or implicit connections between geological terms (for example, a user asks about "karst" but is actually concerned about "collapse risk").

[0054] S310 performs core entity recognition and graph anchoring; the graph inference engine receives natural language query text input by the user. The system uses a pre-built geological domain named entity recognition model (e.g., based on BiLSTM-CRF architecture or BERT fine-tuning model) to extract key geological entities (such as place names, stratigraphic ages, lithology, and structural names) from the query text as a seed entity set. The system maps each entity in the seed entity set to a unique node identifier in the geological knowledge graph storage unit. If there is entity ambiguity, the system calculates the cosine similarity between the context semantic vector of the query text and the neighborhood feature vector of the candidate node in the graph, and selects the node with the highest similarity to complete the entity link.

[0055] S320 executes multi-hop reasoning based on meta-paths. The graph reasoning engine starts from a seed entity and performs a depth-first or breadth-first search within the geological knowledge graph to uncover implicitly related entities. To constrain the relevance of the reasoning process, the system predefines multiple sets of geological meta-paths as reasoning templates; for example, the meta-path for engineering risk analysis is defined as "Geological entity → Contains → Soil and rock type → Has attributes → Physical and mechanical properties → Causes → Engineering risk". The system traverses the graph along this path and adds the nodes at the end of the path (such as "collapsibility", "weak interlayer", "water inrush") to the extended entity set. This process expands what was originally a keyword-based query into a semantic network that includes causal chains and attribute associations.
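Meta-path traversal of this kind can be sketched as a breadth-first walk that only follows edges whose relation matches the next step of the template. The graph contents, relation names, and the function `follow_meta_path` below are hypothetical illustrations, not data or identifiers from the system.

```python
def follow_meta_path(graph, seeds, meta_path):
    """Walk the graph from `seeds`, following one relation per hop.

    `graph` maps a node to a list of (relation, target) pairs;
    `meta_path` is the ordered list of relations to follow.
    Returns the set of nodes reached at the end of the path.
    """
    frontier = set(seeds)
    for relation in meta_path:
        frontier = {
            target
            for node in frontier
            for rel, target in graph.get(node, [])
            if rel == relation
        }
    return frontier


# Toy knowledge graph for illustration only.
graph = {
    "karst zone": [("contains", "limestone"), ("located_in", "site A")],
    "limestone": [("has_attribute", "high solubility")],
    "high solubility": [("causes", "collapse risk"), ("causes", "water inrush")],
}
extended = follow_meta_path(
    graph, ["karst zone"], ["contains", "has_attribute", "causes"]
)
```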

[0056] S330, generate semantically enhanced extended query vectors; to integrate the reasoning results into the retrieval process, the system performs a vectorized weighted fusion of the original query and the extended entities. First, a semantic encoder converts the original query text $Q$ into the base query vector $V_q$; at the same time, for each entity $e_j$ in the extended entity set $E_{ext}$, the system obtains its feature vector $V_{e_j}$ in the graph embedding space.

[0057] To avoid noise introduced by irrelevant entities, the system calculates the semantic relevance of each extended entity to the original query vector and uses it as a weight. The final extended query vector $V_q'$ is calculated as: $V_q' = V_q + \lambda \sum_{e_j \in E_{ext}} \alpha_j V_{e_j}$, where $V_q$ is the semantic vector of the original query; $\lambda$ is the global adjustment coefficient that controls the influence of the extended information; $V_{e_j}$ is the pre-trained embedding vector of the extended entity $e_j$; and $\alpha_j$ is the local attention weight of that entity, which depends on the number of hops of the entity in the reasoning path and its relevance score to the query intent. Through this step, the system constructs an enhanced query vector that includes both the user's explicit intent and the knowledge of geological experts (internalized through the graph structure), providing a computational basis for subsequent high-precision hybrid retrieval.
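The weighted fusion of the original query vector with the extended entity vectors can be sketched as below. The function names are hypothetical, and the specific weight form (relevance divided by hop count) is only one plausible choice; the text states that the weight depends on both quantities but does not fix the formula.

```python
def attention_weight(relevance, hops):
    """Assumed weight form: more relevant and closer entities count more."""
    return relevance / hops


def expand_query(v_q, entity_vecs, weights, lam=0.3):
    """Add the weighted sum of entity vectors, scaled by `lam`, to `v_q`."""
    out = list(v_q)
    for vec, alpha in zip(entity_vecs, weights):
        for k, value in enumerate(vec):
            out[k] += lam * alpha * value
    return out


# Example values only.
v_expanded = expand_query(
    v_q=[1.0, 0.0],
    entity_vecs=[[0.0, 1.0], [1.0, 1.0]],
    weights=[0.5, 0.25],
    lam=0.5,
)
```

A small `lam` keeps the user's explicit intent dominant, while a larger value lets the graph-derived context steer the retrieval more strongly.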

[0058] The hybrid retrieval executor performs a weighted spatiotemporal semantic hybrid retrieval operation. This step aims to use extended query vectors to recall records from massive geological document slices that are highly consistent with the query intent in terms of semantic content, geographical location, and geological age, thus solving the spatial misalignment and stratigraphic confusion problems caused by traditional retrieval relying solely on keyword matching.

[0059] S340, construct a multi-dimensional comprehensive scoring model; the system defines a retrieval scoring function $Score(Q, D_i)$ to quantify the degree of matching between the query $Q$ and any candidate document slice $D_i$; the scoring function integrates measures along three independent dimensions: semantic similarity, spatial proximity, and stratigraphic sequence matching.

[0060] For the semantic similarity dimension, the system uses preset dimension segmentation rules to decouple the semantic component $V_d^{sem}$ from the stored composite index vector of the document slice, and extracts the semantic component $V_q^{sem}$ from the extended query vector. The cosine similarity between the two measures the relevance of the query intent to the document content in terms of linguistic features. The semantic similarity $S_{sem}$ is calculated as: $S_{sem} = \frac{V_q^{sem} \cdot V_d^{sem}}{\lVert V_q^{sem} \rVert \, \lVert V_d^{sem} \rVert}$, where $\lVert \cdot \rVert$ denotes the Euclidean norm of a vector.

[0061] For the spatial proximity dimension, the system obtains the coordinates $P_q$ of the query center point (obtained by parsing place-name entities in the query) and the spatial anchor point coordinates $P_d$ associated with the document slice. To follow the first law of geography, that things closer in space are more strongly related, the system employs distance decay logic based on a Gaussian kernel function. This logic ensures that retrieval results prioritize records from nearby locations and suppress noisy data from distant locations. The spatial proximity $S_{spa}$ is calculated as: $S_{spa} = \exp\left(-\frac{d(P_q, P_d)^2}{2\sigma^2}\right)$, where $d(P_q, P_d)$ is the Euclidean distance between the query point and the document anchor point, and $\sigma$ is a spatial bandwidth parameter that controls the sensitivity of the distance decay; the system dynamically adjusts the value of $\sigma$ according to the scale of the search range (e.g., site-level, region-level).

[0062] For the stratigraphic sequence matching dimension, the system compares the geological time interval or standard stratigraphic sequence index $T_q$ implied in the query with the geological age attribute $T_d$ annotated on the document slice; a matching score is calculated using an indicator function or a sequence distance function to ensure that the retrieval results conform to the temporal logic of geological evolution. The stratigraphic sequence matching degree $S_{strat}$ is computed from the following terms: $I(\cdot)$, a Boolean indicator function that takes the value 1 if the document's stratigraphic sequence falls within the query's sequence range and 0 otherwise; $\Delta(T_q, T_d)$, the sequence interval distance in the stratigraphic column; and $\beta$, the weighting coefficient for sequence proximity.

[0063] Finally, the system calculates the comprehensive matching score between the query and the document slice by linearly weighting and fusing the scores of the three dimensions above (semantic similarity $S_{sem}$, spatial proximity $S_{spa}$, and stratigraphic sequence matching degree $S_{strat}$): $Score(Q, D_i) = w_1 S_{sem} + w_2 S_{spa} + w_3 S_{strat}$, where $w_1$, $w_2$, and $w_3$ are the weight coefficients of the semantic, spatial, and stratigraphic dimensions, respectively, and satisfy $w_1 + w_2 + w_3 = 1$. The system adaptively adjusts the weights based on the classification result of the query intent; for example, it automatically increases the value of $w_2$ for "location-sensitive" queries.
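The three per-dimension scores and their linear fusion can be sketched as follows. All function names and default weights are hypothetical. The Gaussian kernel is written with a $2\sigma^2$ denominator, and the stratigraphic score uses an assumed form (1 inside the query range, otherwise a proximity term that decays with the sequence gap), since the text names the ingredients but the exact expression is an assumption here.

```python
from math import dist, exp, sqrt


def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    nu = sqrt(sum(a * a for a in u))
    nv = sqrt(sum(b * b for b in v))
    if nu == 0 or nv == 0:
        return 0.0
    return sum(a * b for a, b in zip(u, v)) / (nu * nv)


def spatial_proximity(p_q, p_d, sigma):
    """Gaussian distance decay with bandwidth `sigma`."""
    return exp(-dist(p_q, p_d) ** 2 / (2 * sigma ** 2))


def strat_match(q_range, d_index, beta=0.5):
    """Assumed form: 1.0 inside the queried sequence range, else a
    proximity term that shrinks as the sequence gap grows."""
    lo, hi = q_range
    if lo <= d_index <= hi:
        return 1.0
    gap = min(abs(d_index - lo), abs(d_index - hi))
    return beta / (1 + gap)


def hybrid_score(s_sem, s_spa, s_strat, weights=(0.5, 0.3, 0.2)):
    """Linear fusion; the weights are expected to sum to 1."""
    w1, w2, w3 = weights
    return w1 * s_sem + w2 * s_spa + w3 * s_strat


score = hybrid_score(
    cosine([1.0, 0.0], [1.0, 0.0]),
    spatial_proximity((0.0, 0.0), (0.0, 0.0), sigma=100.0),
    strat_match((3, 5), 4),
)
```

For a location-sensitive query, raising the second weight (and lowering the others so the sum stays 1) makes distant documents fall in the ranking faster.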

[0064] S350 performs Top-K recall and re-ranking; the hybrid retrieval executor adopts a two-stage "coarse recall, fine ranking" strategy. First, using the Hierarchical Navigable Small World (HNSW) algorithm or an inverted index structure, it quickly retrieves the top $N$ candidate slices ($N > K$) from the vector database based on composite-vector similarity; subsequently, for these $N$ candidate slices, the comprehensive matching score is calculated precisely and the candidates are re-ranked. During this process, document quality factors (including data source authority, chart richness, and data timeliness) are introduced to perform a secondary fine-grained ranking of the candidate slices, and the top $K$ document slices are ultimately selected as contextual information and fed into the subsequent generation module. Through this weighted hybrid retrieval strategy, the system can effectively filter out invalid and interfering information that is "semantically related but geographically far apart" or "overlapping in location but mismatched in geological age," ensuring the accuracy of geological analysis.
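The two-stage strategy can be sketched with plain sorting. This is a simplified illustration: an exact sort stands in for the approximate nearest-neighbor index, the dictionary keys and the function `two_stage_retrieve` are hypothetical, and multiplying the matching score by a quality factor is an assumed way of combining them.

```python
def two_stage_retrieve(slices, n, k):
    """Coarse recall of the top-n by vector similarity, then re-rank the
    survivors by (matching score * quality factor) and keep the top-k."""
    coarse = sorted(slices, key=lambda s: s["coarse"], reverse=True)[:n]
    fine = sorted(coarse, key=lambda s: s["score"] * s["quality"], reverse=True)
    return fine[:k]


# Example candidates; all numbers are invented for illustration.
slices = [
    {"id": "A", "coarse": 0.9, "score": 0.60, "quality": 1.0},
    {"id": "B", "coarse": 0.8, "score": 0.90, "quality": 0.9},
    {"id": "C", "coarse": 0.7, "score": 0.80, "quality": 1.0},
    {"id": "D", "coarse": 0.2, "score": 0.99, "quality": 1.0},
]
top = [s["id"] for s in two_stage_retrieve(slices, n=3, k=2)]
```

Slice D illustrates the trade-off of the design: a candidate dropped in the coarse stage is never re-scored, so the coarse candidate count has to be large enough to keep good matches in play.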

[0065] Referring to Figure 4, the generation controller in the generation verification module 150 performs context construction and initial report generation operations; this step uses the retrieved high-quality geological data fragments to guide the large language model to generate logically consistent analytical text, applying the retrieval-augmented generation (RAG) paradigm to the geological professional field.

[0066] S410, construct a structured prompt context; the generation controller receives the user's query text $Q$ and the Top-K document slice set $D_{topK}$ output by the hybrid retrieval. The system does not simply concatenate the slices; instead, it constructs the input according to a preset prompt template. This template contains three functional areas: a system instruction area, a factual basis area, and a task generation area. In the system instruction area, constraint instructions such as "base the answer on geological facts," "cite sources," and "do not hallucinate" are injected. In the factual basis area, the document slices in $D_{topK}$ are sorted in descending order of relevance score, and each slice is assigned a unique index identifier (e.g., [Ref-1], [Ref-2]). In the task generation area, the user query $Q$ and the output format requirements are filled in. The complete prompt context $C$ is then input to the large language model.
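A three-area prompt of this kind can be assembled as in the sketch below. The section headings, instruction wording, dictionary keys, and the function `build_prompt` are hypothetical; only the three-area layout, the descending relevance order, and the [Ref-n] identifiers follow the description above.

```python
def build_prompt(query, slices):
    """Assemble system instructions, indexed facts, and the task."""
    ranked = sorted(slices, key=lambda s: s["score"], reverse=True)
    facts = "\n".join(
        f"[Ref-{i}] {s['text']}" for i, s in enumerate(ranked, start=1)
    )
    return (
        "### System instructions\n"
        "Answer strictly from the geological facts below, cite sources "
        "as [Ref-n], and do not state values without a supporting reference.\n\n"
        "### Factual basis\n"
        f"{facts}\n\n"
        "### Task\n"
        f"{query}\n"
        "Output format: structured geological analysis report."
    )


# Example input; the slice texts are invented for illustration.
prompt = build_prompt(
    "Assess foundation conditions near borehole ZK03.",
    [
        {"text": "Regional overview of Quaternary deposits.", "score": 0.41},
        {"text": "ZK03 log: silty clay at 12-18 m depth.", "score": 0.93},
    ],
)
```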

[0067] S420 generates the initial geological analysis report. Based on the input context $C$, the large language model generates the initial geological analysis text $R_{init}$ through autoregressive decoding. During this process, following the instructions, the model marks the index of the referenced slice whenever it states a specific geological parameter or conclusion.

[0068] The anchor point validator in the generation verification module 150 performs hard constraint verification operations based on the anchor point data. This step constitutes the core self-correcting closed loop of the system: it addresses the "hallucination" problem that large models are prone to when dealing with numerical values and spatial relationships by comparing the generated unstructured text with the structured hard data in the database to ensure physical consistency.

[0069] S430, parse geological assertion tuples; the anchor point validator scans the initial text $R_{init}$, extracts key statements containing numerical values, location information, and attribute descriptions, and transforms them into a set of verifiable geological assertion tuples. For the $k$-th assertion, the tuple structure is defined as $a_k = (Loc_k, Attr_k, Val_k)$, where $Loc_k$ indicates the spatial range or specific location that the assertion refers to (e.g., "at a depth of 15 meters in borehole ZK03"); $Attr_k$ indicates the name of the geological attribute (such as "characteristic value of bearing capacity"); and $Val_k$ is the numerical value or qualitative description given in the generated text. This step uses a rule-based parser or a dedicated extraction model with a small number of parameters, capable of recognizing logical relationships such as "greater than," "less than," and "between."

[0070] S440 performs conflict detection and physical consistency verification. For each assertion $a_k = (Loc_k, Attr_k, Val_k)$ in the assertion set, the system uses $Loc_k$ as the spatial index key and $Attr_k$ as the attribute filter key to query the fact anchor database; the database uses the R-Tree index to quickly retrieve the set of valid anchors within that spatial range and calculates a statistical value (such as the mean, the median, or the exact recorded value) as the reference truth value $Val_{ref}$.

[0071] The system defines a verification function to determine consistency. When the attribute is numeric, it calculates the relative error $\epsilon$ between the generated value $Val_k$ and the reference truth value $Val_{ref}$: $\epsilon = \frac{|Val_k - Val_{ref}|}{|Val_{ref}|}$; if $\epsilon$ exceeds the preset tolerance threshold $\tau$ (for example, 0.05, i.e., a 5% error limit), the assertion is judged to have a "numerical conflict". When the attribute is a qualitative description (such as lithology), the semantic matching score between the generated description and the anchor attribute is calculated; if the score is lower than the threshold, it is judged to be an "attribute conflict". If there is no corresponding data record in the database, the assertion is marked as an "unfounded assertion".
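The numeric branch of the verification logic can be sketched as follows. The function name `verify_numeric`, the status strings, and the choice of the mean as the reference statistic are assumptions for illustration; the 0.05 tolerance is the example value given in the text.

```python
def verify_numeric(generated, reference_values, tolerance=0.05):
    """Compare a generated value with anchor data.

    Returns (status, reference), where status is one of
    'consistent', 'numerical conflict', or 'unfounded assertion'.
    Assumption: the mean of the anchor values is the reference truth.
    """
    if not reference_values:
        return "unfounded assertion", None
    reference = sum(reference_values) / len(reference_values)
    if reference == 0:
        # Relative error is undefined at zero; fall back to exact match.
        rel_err = 0.0 if generated == 0 else float("inf")
    else:
        rel_err = abs(generated - reference) / abs(reference)
    status = "numerical conflict" if rel_err > tolerance else "consistent"
    return status, reference


# Example: the report states 200 kPa but the anchor records 180 kPa.
status, reference = verify_numeric(200.0, [180.0])
```

Here the relative error is 20/180, roughly 11%, so the assertion is flagged and the reference value becomes available for the correction step.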

[0072] S450 performs self-correction and traceability annotation; when a conflict is detected, the anchor point validator activates the correction mechanism. For numerical conflicts, the system directly replaces the incorrect value in the text with the reference truth value and adds a verification note to the correction (e.g., "Corrected from ZK03 measured data"). For attribute conflicts, the system rewrites the relevant sentences using the standard attribute descriptions in the anchor database; for unfounded assertions, the system highlights them in the final report or removes them directly. The text $R_{final}$ obtained after verification and correction is supported by rigorous data, which ensures the reliability of the geological analysis conclusions.

[0073] The rendering scheduling unit in visualization module 160 performs spatial indexing and scene construction operations. This step aims to address the lack of intuitive spatial awareness in plain text reports by automatically retrieving and displaying the corresponding 3D geological scene through calculation and analysis of the report's spatial fingerprint.

[0074] S510, parse the spatial bounding box and calculate the viewport; the system first receives the final geological analysis report $R_{final}$ generated after verification. The rendering scheduling unit traverses the coordinates of all geological entities and anchor point data associated with the report; let the set of all spatial points involved in the report be $S = \{(x_i, y_i, z_i)\}$. To determine the optimal viewport extent of the 3D scene, the system calculates the axis-aligned bounding box of the point set, defined as: $BBox = [x_{min} - \delta, x_{max} + \delta] \times [y_{min} - \delta, y_{max} + \delta] \times [z_{min} - \delta, z_{max} + \delta]$, where the minima and maxima are taken over $S$ and $\delta$ is a preset spatial buffer distance used to preserve the surrounding environmental context within the visible range, preventing target geological bodies from lying too close to the edge of the viewport. Based on the calculated $BBox$, the system further calculates the optimal observation point position $P_{cam}$ of the virtual camera and the line-of-sight target point $P_{target}$; $P_{target}$ is set as the geometric center of the bounding box, and $P_{cam}$ is then back-calculated from the camera's field of view $\theta$ (FOV) and the diagonal length $L$ of the bounding box, to ensure that the bounding box falls completely within the view frustum.
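The bounding box and camera placement can be sketched as below. The function names are hypothetical, and the distance rule is one conservative assumption: the box is enclosed in a sphere of radius half the diagonal, and the camera is placed far enough away that this sphere fits a symmetric field of view (aspect ratio is ignored).

```python
from math import radians, sin, sqrt


def padded_bbox(points, buffer):
    """Axis-aligned bounding box of 3-D points, grown by `buffer`."""
    xs, ys, zs = zip(*points)
    lo = (min(xs) - buffer, min(ys) - buffer, min(zs) - buffer)
    hi = (max(xs) + buffer, max(ys) + buffer, max(zs) + buffer)
    return lo, hi


def camera_setup(bbox, fov_deg):
    """Return (target, distance): the box center and a camera distance
    at which the box's bounding sphere fits inside the field of view."""
    lo, hi = bbox
    target = tuple((a + b) / 2 for a, b in zip(lo, hi))
    diagonal = sqrt(sum((b - a) ** 2 for a, b in zip(lo, hi)))
    distance = (diagonal / 2) / sin(radians(fov_deg) / 2)
    return target, distance


# Example points only.
bbox = padded_bbox([(0.0, 0.0, 0.0), (10.0, 10.0, 10.0)], buffer=0.0)
target, distance = camera_setup(bbox, fov_deg=60.0)
```

The camera position then follows by moving `distance` units away from `target` along the chosen viewing direction.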

[0075] S520 performs multi-resolution spatial indexing and data loading; based on the determined bounding box range, the system initiates a query request to the 3D spatial database. For massive 3D geological model data (such as oblique photogrammetry models or geological body meshes), the system uses an octree or a 3D R-Tree index structure to quickly retrieve the tiles or mesh blocks that spatially intersect the bounding box.

[0076] To optimize rendering performance, the system dynamically schedules model data at different levels of detail (LOD) according to the camera distance: when the distance is small, high-precision triangular meshes and textures are loaded; when the distance is large, only simplified low-poly data is loaded. For borehole data, the system retrieves all borehole records within the specified range and converts them into 3D cylindrical primitives or bar chart objects. For analyses involving geological profiles, the system uses the defined section-plane parameters to perform dynamic Boolean cutting operations on the geological body model, or generates dynamic profile geometry data based on texture mapping of pre-generated slices.
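Distance-based LOD scheduling reduces to a threshold lookup, sketched below. The function name and the threshold values are hypothetical; level 0 denotes the finest mesh and higher levels denote progressively simplified data.

```python
from bisect import bisect_right


def select_lod(camera_distance, thresholds=(500.0, 2000.0)):
    """Return the LOD level for a camera distance.

    `thresholds` must be sorted ascending; a distance below the first
    threshold yields level 0 (highest detail).
    """
    return bisect_right(thresholds, camera_distance)


levels = [select_lod(d) for d in (100.0, 500.0, 5000.0)]
```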

[0077] S530 establishes a two-way interactive mapping mechanism between text and graphics; to achieve deep integration between the analysis report and the 3D scene, the system constructs an entity ID mapping table in the front-end rendering layer. This mapping table maintains a one-to-one correspondence between the geological entity identifiers $ID_{text}$ in text paragraphs and the identifiers $ID_{3D}$ of geometric primitive objects in the 3D scene.

[0078] When a user triggers a click or hover event in the text area of the analysis report, the system captures the associated text entity identifier $ID_{text}$, looks up the mapping table to get the corresponding $ID_{3D}$, and sends a highlighting command to the 3D rendering engine, which changes the material color of the corresponding geological body to a highly conspicuous color (such as red) and simultaneously drives the virtual camera to execute a smooth fly-in animation that focuses on the geological body. Conversely, when a user clicks to pick a geological model primitive in the 3D scene, the system uses a ray casting algorithm to obtain the primitive's $ID_{3D}$, uses a reverse index to locate the corresponding paragraph in the report, and triggers automatic scrolling of the text box, placing the relevant analysis content at the center of the reader's field of view. For the specific implementation of the ray casting algorithm and DOM event listening, those skilled in the art can use existing graphics rendering engine interfaces and front-end frameworks. Through these steps, the system transforms a static geological report into a spatially aware, interactive digital twin scene.
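The one-to-one mapping with forward and reverse lookup can be sketched as a pair of dictionaries. The class name `EntityMap`, its methods, and the sample identifiers are hypothetical, and the rendering and scrolling calls that would consume the lookups are omitted.

```python
class EntityMap:
    """Bidirectional one-to-one map between text entity IDs and 3-D
    primitive IDs."""

    def __init__(self):
        self._text_to_3d = {}
        self._3d_to_text = {}

    def link(self, text_id, object_id):
        """Register a pair, rejecting anything that breaks one-to-one."""
        if text_id in self._text_to_3d or object_id in self._3d_to_text:
            raise ValueError("identifier already linked")
        self._text_to_3d[text_id] = object_id
        self._3d_to_text[object_id] = text_id

    def object_for(self, text_id):
        """Text click or hover: which primitive should be highlighted?"""
        return self._text_to_3d.get(text_id)

    def text_for(self, object_id):
        """Primitive picked in the scene: which paragraph to scroll to?"""
        return self._3d_to_text.get(object_id)


mapping = EntityMap()
mapping.link("para-12:F1-fault", "Fault_01_Obj")
```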

[0079] Although embodiments of the invention have been shown and described, it will be understood by those skilled in the art that various changes, modifications, substitutions and alterations can be made to these embodiments without departing from the principles and spirit of the invention, the scope of which is defined by the appended claims and their equivalents.

Claims

1. A regional geological condition analysis system based on LLM and RAG technologies, characterized in that it comprises: a data access module configured to receive multi-source heterogeneous geological raw data, including unstructured text data, structured hard data, and three-dimensional spatial data; a data processing module configured to clean and deconstruct the geological raw data, including performing semantic segmentation and metadata extraction on the unstructured text data, and extracting physical attributes and geometric features from the structured hard data and the three-dimensional spatial data to construct a fact anchor database; a knowledge base construction module configured to store geological knowledge graphs and composite index vectors, wherein the geological knowledge graphs contain a network of geological entity relationships and the composite index vectors are generated by fusing semantic features, spatial features, and temporal features; a reasoning retrieval module configured to identify the core entities in a query request, perform graph reasoning to generate an extended query vector, and recall geological document slices based on the extended query vector and a weighted spatiotemporal semantic scoring algorithm; a generation verification module configured to call the large language model to generate an initial geological analysis report, parse the geological assertions in the initial geological analysis report and compare them with the data in the fact anchor database, and perform a correction operation when a conflict is detected; and a visualization module configured to retrieve and render 3D geological model slices, borehole columnar sections, or geological profiles based on the spatial attributes of the retrieval results.

2. The regional geological condition analysis system based on LLM and RAG technologies according to claim 1, characterized in that, when processing the unstructured text data, the data processing module performs a text chunking operation based on metadata inheritance: parsing the document's hierarchical structure tree to identify headings and their corresponding body paragraphs; when generating a basic text block, tracing back the parent path to which the basic text block belongs and prepending the headings at each level to the basic text block as metadata, to form slice content that carries its qualifying context; and performing named entity recognition on the slice content to extract the implicit spatial coordinate descriptions and geological time entities, which are mapped to spatial labels and time labels respectively.

3. The regional geological condition analysis system based on LLM and RAG technologies according to claim 1, characterized in that, When constructing the fact anchor database, the data processing module performs the following operations: For borehole data, the borehole coordinates and layer depths are extracted, and the absolute spatial coordinates of the layer interface are calculated by combining the inclination data. The borehole layer records are then converted into borehole anchor points containing geotechnical physical properties. For three-dimensional geological model data, the geometric primitive information is analyzed, and continuous geological objects are transformed into discrete model anchor points through spatial grid sampling or centroid extraction. The borehole anchor points and model anchor points are stored together and a multi-dimensional spatial index is established. The data structure of each anchor point includes a globally unique index key, a planar coordinate vector, a vertical effective depth range, a standardized attribute vector, and metadata of the data source.

4. The regional geological condition analysis system based on LLM and RAG technologies according to claim 1, characterized in that the geological knowledge graph in the knowledge base construction module is constructed in the following way: defining a geological ontology and instantiating the extracted geological objects as graph nodes; analyzing the spatial topology between geological bodies in a 3D geological model, establishing a "cuts" semantic relationship edge if geometric surfaces intersect and a fault attribute is present, and establishing a "conformable contact" semantic relationship edge if the normal vectors of the contact surfaces are detected to be continuous; and constructing a heterogeneous graph network, aggregating the neighborhood features of nodes using graph convolutional networks, calculating the Euclidean distance between text entity nodes and model entity nodes in the embedding space, and establishing alignment links between cross-modal entities.

5. The regional geological condition analysis system based on LLM and RAG technologies according to claim 1, characterized in that, The knowledge base construction module generates composite index vectors in the following ways: Extract the basic semantic vectors of text slices using a pre-trained language model; A multi-scale sinusoidal position coding algorithm is used to map two-dimensional plane coordinates into high-dimensional space vectors; Map the absolute age value or relative sequence index value of geological time to a time feature vector; The components are dimensionally aligned using a learnable projection matrix, and a weighted concatenation strategy is used to fuse the basic semantic vector, high-dimensional space vector, and temporal feature vector into a composite index vector.

6. The regional geological condition analysis system based on LLM and RAG technologies according to claim 1, characterized in that, when generating the extended query vector, the reasoning retrieval module performs the following operations: extracting key geological entities from the query request as seed entities and anchoring them in the geological knowledge graph; based on predefined geological meta-paths containing causal chains and attribute associations, performing multi-hop reasoning starting from the seed entities to uncover implicitly associated entities and adding them to the extended entity set; using a semantic encoder to generate the semantic vector of the original query, and obtaining the pre-trained embedding vector of each entity in the extended entity set; and calculating weights based on the semantic relevance and the number of hops in the reasoning path, and performing a weighted fusion of the semantic vector of the original query and the vectors of the extended entity set.

7. The regional geological condition analysis system based on LLM and RAG technologies according to claim 1, characterized in that, The weighted spatiotemporal semantic scoring algorithm executed by the reasoning retrieval module includes: Calculate the semantic similarity between the extended query vector and the composite index vector of the candidate document slice; Calculate the Euclidean distance between the coordinates of the query center point and the coordinates of the spatial anchor points associated with the document slice, and calculate the spatial proximity based on the Gaussian kernel function; Extract the implied geological age or sequence index from the query and compare it with the geological attributes of the document slice to calculate the sequence matching degree. By linearly weighting and fusing the semantic similarity, spatial proximity, and stratigraphic sequence matching degree, a comprehensive matching score is obtained, and the recalled document slices are reordered accordingly.

8. The regional geological condition analysis system based on LLM and RAG technologies according to claim 1, characterized in that the generation controller in the generation verification module is configured to: receive the query request and the retrieved Top-K document slice set; construct the prompt context according to a preset template, the prompt context including a system instruction area injected with anti-hallucination instructions, a factual basis area arranged by relevance score and assigned index identifiers, and a task generation area filled with the query request; and input the prompt context into a large language model to generate, through autoregressive decoding, the initial geological analysis report containing reference indexes.

9. The regional geological condition analysis system based on LLM and RAG technologies according to claim 1, characterized in that, when performing the correction operation, the generation verification module is configured to: extract key statements containing numerical values, locations, and attribute descriptions from the initial geological analysis report and transform them into geological assertion tuples, each containing a spatial range, an attribute name, and a generated value; using the spatial range as the index key, query the fact anchor database to obtain the set of valid anchors within the corresponding range and calculate a reference true value; calculate the relative error or the semantic matching score between the generated value and the reference true value; determine that a conflict exists if the relative error exceeds a preset tolerance threshold or the semantic matching score is lower than a threshold; and replace the erroneous values in the initial geological analysis report with the reference true values or rewrite the relevant attribute descriptions, and add data source annotations.
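
The numerical branch of the correction operation in claim 9 can be sketched as follows. The sketch assumes a rectangular spatial range, takes the arithmetic mean of the valid anchors as the reference true value, and uses a 5 % tolerance; none of these choices is stated in the claim, and the class, field, and borehole names are hypothetical. The semantic matching branch for non-numerical attributes is not covered.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class Anchor:
    """One record of the fact anchor database (hypothetical layout)."""
    x: float
    y: float
    attribute: str
    value: float
    source: str


@dataclass
class Assertion:
    """Geological assertion tuple: spatial range, attribute name, generated value."""
    bbox: tuple  # (xmin, ymin, xmax, ymax)
    attribute: str
    generated_value: float


def anchors_in_range(assertion, anchors):
    """Valid anchors: same attribute and located inside the spatial range."""
    xmin, ymin, xmax, ymax = assertion.bbox
    return [
        a for a in anchors
        if a.attribute == assertion.attribute
        and xmin <= a.x <= xmax and ymin <= a.y <= ymax
    ]


def verify_assertion(assertion, anchors, tolerance=0.05):
    """Compare the generated value with the reference value and correct it
    when the relative error exceeds the tolerance."""
    hits = anchors_in_range(assertion, anchors)
    if not hits:
        return {"value": assertion.generated_value, "status": "unverified",
                "sources": []}
    reference = mean([a.value for a in hits])
    rel_error = abs(assertion.generated_value - reference) / max(abs(reference), 1e-12)
    sources = sorted({a.source for a in hits})
    if rel_error > tolerance:
        return {"value": reference, "status": "corrected", "sources": sources}
    return {"value": assertion.generated_value, "status": "consistent",
            "sources": sources}


# Hypothetical anchors: two boreholes inside the range, one far outside it.
anchors = [
    Anchor(1.0, 1.0, "groundwater_depth_m", 10.0, "BH-01"),
    Anchor(2.0, 2.0, "groundwater_depth_m", 12.0, "BH-02"),
    Anchor(50.0, 50.0, "groundwater_depth_m", 99.0, "BH-03"),
]
result = verify_assertion(
    Assertion((0.0, 0.0, 5.0, 5.0), "groundwater_depth_m", 15.0), anchors
)
```

Here the generated value of 15.0 m deviates from the in-range mean of 11.0 m by about 36 %, so it is replaced, and the returned `sources` list supplies the data source annotation.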

10. The regional geological condition analysis system based on LLM and RAG technologies according to claim 1, characterized in that the visualization module is configured to: parse the coordinate set of all spatial points involved in the verified geological analysis report and calculate its axis-aligned bounding box; calculate the optimal observation point position and the line-of-sight target point of the virtual camera based on the axis-aligned bounding box; retrieve, using an octree or a three-dimensional R-Tree index structure, model data from a three-dimensional spatial database that spatially intersects the axis-aligned bounding box; and dynamically schedule model data at different levels of detail based on the camera distance, and establish an interactive mapping relationship between geological entity identifiers in text paragraphs and geometric primitive object identifiers in the 3D scene.
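
The geometric steps of claim 10 can be sketched in Python as below. The claim does not define "optimal", so the sketch assumes the camera looks at the box center from a fixed direction at the distance where the box's bounding sphere fits the field of view. A plain intersection test stands in for the octree or three-dimensional R-Tree lookup, and the field of view, view direction, and level-of-detail thresholds are hypothetical values.

```python
import math


def aabb(points):
    """Axis-aligned bounding box of 3D points as (min_corner, max_corner)."""
    xs, ys, zs = zip(*points)
    return (min(xs), min(ys), min(zs)), (max(xs), max(ys), max(zs))


def camera_for_box(box, fov_deg=60.0, direction=(1.0, -1.0, 1.0)):
    """Return (eye, target, distance) so that the bounding sphere of the
    box fits inside a view cone with the given full field of view."""
    lo, hi = box
    target = tuple((a + b) / 2.0 for a, b in zip(lo, hi))
    radius = math.dist(lo, hi) / 2.0
    distance = radius / math.sin(math.radians(fov_deg) / 2.0)
    norm = math.sqrt(sum(d * d for d in direction))
    eye = tuple(c + distance * d / norm for c, d in zip(target, direction))
    return eye, target, distance


def boxes_intersect(a, b):
    """True if two axis-aligned boxes overlap or touch on every axis."""
    (a_lo, a_hi), (b_lo, b_hi) = a, b
    return all(a_lo[i] <= b_hi[i] and b_lo[i] <= a_hi[i] for i in range(3))


def select_lod(distance, thresholds=(500.0, 2000.0)):
    """Level of detail: 0 is the finest; higher levels are coarser."""
    for level, limit in enumerate(thresholds):
        if distance < limit:
            return level
    return len(thresholds)


# Hypothetical report points and model boxes.
box = aabb([(0.0, 0.0, 0.0), (2.0, 4.0, 6.0), (1.0, 1.0, 1.0)])
eye, target, distance = camera_for_box(box)
models = {
    "stratum_A": ((1.0, 1.0, 1.0), (3.0, 3.0, 3.0)),
    "stratum_B": ((10.0, 10.0, 10.0), (12.0, 12.0, 12.0)),
}
visible = [name for name, mbox in models.items() if boxes_intersect(box, mbox)]
```

A production system would replace the linear scan over `models` with the spatial index named in the claim, and the identifiers in `visible` are the natural keys for the text-to-scene interactive mapping.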