A chemical structure conformation prediction method based on a lightweight language model
By constructing a lightweight language model and introducing a low-rank adaptation module and a structured bias attention mask, combined with a physical-syntax dual-stream hybrid loss function and a graph neural network potential function, the problems of high computational resources and insufficient generation accuracy in existing technologies are solved, and efficient and accurate crystal structure prediction is achieved.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Applications(China)
- Current Assignee / Owner
- QUANZHOU INST OF EQUIP MFG
- Filing Date
- 2026-05-22
- Publication Date
- 2026-06-19
Smart Images

Figure CN122245569A_ABST
Abstract
Description
Technical Field
[0001] This invention relates to the field of materials science and technology, and specifically to a method for predicting chemical structure configurations based on a lightweight language model. Background Technology
[0002] The chemical structure configuration described in this application specifically refers to a crystal structure. Crystal structure is fundamental information describing the relationship between the macroscopic properties and microscopic composition of materials. The accurate generation and prediction of crystal structures are of great significance in fields such as new material design, functional material screening, and material performance analysis. How to efficiently generate crystal structures that satisfy crystallographic laws under given chemical composition or constraints has always been a research focus in the field of materials informatics.
[0003] Existing methods for generating or predicting crystal structures mainly include physical model-based methods and data-driven methods. Physical model-based prediction methods typically rely on first-principles calculations or empirical potential functions, constructing energy functions and searching in a high-dimensional structure space to obtain stable structures. These methods generally ensure the physical plausibility of the generated structures, but the computational process is complex and costly, and the search efficiency decreases with increasing system complexity, making it difficult to meet the practical needs of high-throughput crystal structure generation.
[0004] As materials databases continue to expand, researchers have begun to introduce machine learning models for crystal structure prediction. These methods typically train existing crystal structure data to learn the statistical distribution of crystal structures and then use generative models to construct new candidate structures. While these methods reduce computational costs to some extent, many still rely on highly structured or specifically designed data representations, resulting in complex model designs. They also face significant challenges in handling constraints such as crystallographic symmetry and periodic boundaries, thus limiting their generalization ability and applicability.
[0005] In recent years, the development of pre-trained language models has provided new ideas for crystal structure generation. Some studies treat crystal structure description files as structured text sequences and use language models to autoregressively generate crystal structure parameters and atomic coordinates. This type of method can reduce explicit feature engineering and has strong expressive power and flexibility. However, most existing language model-based crystal structure generation techniques directly adopt general model structures and training strategies, lacking specific designs for the characteristics of crystal structure generation tasks. In practical applications, they still have the following shortcomings: First, the model's ability to perceive crystallographic physical constraints and parameter dependencies is limited, easily generating physically inconsistent structures; second, the generation accuracy of numerical parameters is difficult to guarantee; and third, the model training and fine-tuning process is highly dependent on computational resources, which is not conducive to deployment in low-resource environments.
[0006] Therefore, how to improve the modeling ability and numerical accuracy of crystallographic physical constraints in the crystal structure generation process while reducing the consumption of computing resources has become an urgent technical problem to be solved in the existing technology. Summary of the Invention
[0007] The purpose of this invention is to provide a chemical structure configuration prediction method based on a lightweight language model that improves prediction accuracy.
[0008] To achieve the above objectives, the present invention adopts the following technical solution: A method for constructing a lightweight language model for predicting chemical structure configurations, wherein the chemical structure configuration is a crystal structure, is disclosed. The lightweight language model is based on a pre-trained autoregressive language model with a parameter scale of less than 10 bytes. A low-rank adaptation module is introduced into the target linear transformation layer of the lightweight language model, and the rank of the low-rank adaptation module is dynamically allocated. The hierarchical adaptation function is used to dynamically define the rank of the module. Rank of the layer : ; in, This indicates the total number of Transformer blocks in the predefined lightweight language model. It is a shallow basic rank. For the deepest maximum rank, It is a non-linear growth coefficient. Represents the floor function; No. The formula for updating the linear layer weights of a layer is: ; in, Initialize as a zero matrix. Gaussian initialization is used. It is a scaling constant. In a lightweight language model, the first The original pre-trained weight matrix frozen in layers. This represents the weight matrix after low-rank adaptation update. A structured bias attention mask is introduced into the attention calculation of this lightweight language model, and the attention calculation of this lightweight language model is expressed by the following formula: ; in, The dimension of the key vector. Scaling factor These represent the query matrix, key matrix, and value matrix in the attention mechanism, respectively. This represents the normalization function, used to map the attention scores at each position to attention weights, and T represents the matrix transpose operation. For the total mask matrix, This represents the attention calculation function. = + , For standard causal mask, For crystal structure bias mask; For the currently generated number The token and the historically generated token Each token, its mask value The definition is as follows: ; in: This is the set of tokens corresponding to the atomic fraction coordinate values, i.e., the x, y, z coordinates of the atom currently being predicted. For the set of tokens corresponding to the lattice constant, This is the bias parameter for positive attention.
[0009] A method for predicting the configuration of a chemical structure based on a lightweight language model, wherein the chemical structure configuration is a crystal structure, includes the following steps performed sequentially: S1: Obtain training data, perform text processing on the training data, and obtain the CIF text of the training data; S2: Construct the CIF text into an instruction fine-tuning format and obtain the training dataset; S3: Input the training dataset into the lightweight language model constructed by the method described above for predicting chemical structure configurations, and perform supervised fine-tuning to obtain the fine-tuned lightweight language model. S4: Construct a physical-syntax dual-stream hybrid loss function for back-tuning the lightweight language model, and use this physical-syntax dual-stream hybrid loss function to adjust the lightweight language model to obtain a trained lightweight language model; S5: Input the chemical formula and the corresponding space group number into the trained lightweight language model to perform inference, obtain the predicted crystal structure, and the corresponding CIF file.
[0010] Preferably, the physical-syntax dual-stream hybrid loss function in step S4 It is expressed by the following formula: ; in, For balance coefficient, For text cross-entropy loss, T represents the total length of the target sequence. This represents the actual token at position t in the target sequence, where t represents the position index in the target sequence. This represents the conditional probability distribution function of the model given historical tokens. To measure the loss numerically, , It is the set of indices of all numerical tokens in the target sequence. It is the set of the K candidate tokens with the highest predicted probability by the model. This refers to a specific candidate token in the candidate token set. This represents the historical token sequence up to the t-th position. It is a numerical distance function: ; in, This indicates an operation that converts the token to a floating-point number. This is a non-numerical penalty constant.
[0011] Preferably, the prediction method further includes the following steps performed sequentially: S6: Based on the preset sampling temperature parameters, perform N independent inferences under the same conditions to obtain N independent and complete candidate predicted crystal structures and their corresponding CIF files; S7: Perform syntax rule filtering on the CIF file of the candidate predicted crystal structure; S8: The graph neural network potential function M3GNet is used as the physical scorer to score the predicted crystal structures filtered by this syntax rule. The crystal structure with the lowest predicted potential energy and the highest stability is obtained as the final output predicted crystal structure. This scoring process is expressed by the following formula: ; in, This represents the potential energy value predicted by the M3GNet model. This is the set of candidate predicted crystal structures obtained after grammatical filtering. Represents the first in the set One candidate predicted crystal structure, This indicates the operation for determining the minimum value.
[0012] Preferably, the syntax rule filtering in step S7 includes a CIF language parsing check and a consistency verification of elemental composition with the input chemical formula.
[0013] By adopting the aforementioned design scheme, the beneficial effects of the present invention are as follows: In this application, dynamic rank allocation is introduced when constructing a lightweight language model, which improves the model's ability to represent complex three-dimensional spatial group constraints, resolves the contradiction between parameter efficiency and geometric accuracy, and introduces a structured bias attention mask in attention calculation to increase the model's attention weight to physical constraints. When back-adjusting the lightweight language model, a numerical regression mechanism is introduced using a physical-syntactic dual-stream hybrid loss function. This guides the lattice constants and atomic coordinates generated by the lightweight language model to mathematically approximate the true values as closely as possible, thereby significantly reducing the geometric distortion rate of the generated structure and effectively improving prediction accuracy. Attached Figure Description
[0014] Figure 1 This is a schematic diagram of the architecture of the low-rank adapter module of the present invention; Figure 2 This is a schematic diagram of the structured bias attention mask of the present invention; Figure 3 This is an example of a training sample for the present invention; Figure 4 This is a flowchart illustrating the calculation of the physical-syntax two-stream hybrid loss function of the present invention. Figure 5 This is a flowchart illustrating the inference process of the lightweight language model of the present invention. Figure 6 An example of the CIF text of the crystal structure generated in this invention and its three-dimensional visualization diagram. Detailed Implementation
[0015] To make the objectives, technical solutions, and advantages of this invention clearer, the invention will be further described in detail below with reference to the accompanying drawings. Obviously, the described embodiments are merely some embodiments of this invention, and not all embodiments. Based on the embodiments of this invention, all other embodiments obtained by those skilled in the art without creative effort are within the scope of protection of this invention.
[0016] The terms "first," "second," "third," etc., used in the specification, claims, and accompanying drawings of this invention are used to distinguish different objects, not to describe a specific order. Furthermore, the terms "comprising" and "having," and any variations thereof, are intended to cover non-exclusive inclusion. For example, a process, method, system, product, or apparatus that includes a series of steps or units is not limited to the listed steps or units, but may optionally include steps or units not listed, or may optionally include other steps or units inherent to these processes, methods, products, or apparatuses.
[0017] A method for constructing a lightweight language model for predicting chemical structure configurations, where the chemical structure configuration is a crystal structure, is presented. This lightweight language model is an improvement upon pre-trained autoregressive language models with a parameter scale of less than 10 bytes, such as Phi-3-Mini (3.8 bytes), Qwen2.5-7 bytes, and Llama-3.1-8 bytes. Figure 1 As shown, a low-rank adaptation module is introduced into the target linear transformation layer of this lightweight language model, and the rank of the low-rank adaptation module is dynamically allocated. The following hierarchical adaptation function is used to dynamically define the rank of the low-rank adaptation module. Rank of the layer : ; in, This indicates the total number of Transformer blocks in the predefined lightweight language model. It is a shallow basic rank. To achieve the deepest maximum rank, this embodiment takes... , ; For non-linear growth coefficients, when The time increases linearly; in this embodiment, we take... To accelerate deep capacity expansion. By introducing a nonlinear growth coefficient. The above function constitutes a nonlinear interpolation strategy that can achieve a smooth and non-uniform parameter transition between the rank boundaries of shallow and deep layers. This represents the floor function, used to map the real-valued rank calculated by the hierarchical fitness function to an integer rank; the The formula for updating the linear layer weights of a layer is: ; in, Initialize as a zero matrix. Gaussian initialization is used. This is the scaling factor. In a lightweight language model, the first The original pre-trained weight matrix frozen in layers. This represents the weight matrix after low-rank adaptation update; Because traditional low-rank adaptation (LoRA) techniques typically use a uniform rank across all Transformer layers... However, crystal structure generation tasks exhibit significant hierarchical heterogeneity: the shallow layers of lightweight language models are primarily responsible for capturing the basic grammatical rules of CIF files, such as keyword and symbol matching, which falls under low-dimensional pattern recognition; while the deep networks are responsible for constructing spatial topological relationships between atoms and unit cell constraints, which falls under high-dimensional geometric reasoning. The fixed-rank strategy leads to an excess of parameters in the shallow layers (prone to overfitting) and insufficient capacity in the deep layers (underfitting), limiting the prediction accuracy of lightweight language models for complex crystal structures. Therefore, a nonlinear interpolation strategy is used to maintain a lower rank (e.g., ...) when processing the shallow layer (primarily responsible for the text formatting specifications of CIF files) grammatical features. This involves suppressing overfitting through strong compression, ensuring that the generated CIF file contains no missing parentheses and that keywords are correct. When processing deep abstract geometric features (primarily responsible for understanding the spatial mapping relationships between atoms), the rank is dynamically increased (e.g., ...). This approach provides the model with a larger parameter space to fit complex nonlinear crystal geometries. This non-uniform distribution strategy significantly improves the model's ability to represent complex three-dimensional spatial group constraints with only a slight increase in the total number of trainable parameters, resolving the contradiction between parameter efficiency and geometric accuracy.
[0018] A structured bias attention mask is introduced into the attention calculation of this lightweight language model, and the attention calculation of this lightweight language model is expressed by the following formula: ; in, The dimension of the key vector. Scaling factor These represent the query matrix, key matrix, and value matrix in the attention mechanism, respectively. This represents the normalization function, used to map the attention scores at each position to attention weights, and T represents the matrix transpose operation. For the total mask matrix, This represents the attention calculation function. = + , For standard causal mask, For crystal structure bias mask; For the currently generated number The token and the historically generated token Each token, its mask value The definition is as follows: ; in: This is the set of tokens corresponding to the atomic fraction coordinate values, i.e., the x, y, z coordinates of the atom currently being predicted. For the set of tokens corresponding to the lattice constant, This is the bias parameter for positive attention.
[0019] like Figure 2 As shown, the dark gray area represents the causal mask, and the white area represents the standard attention computation area. The red highlighted area illustrates the structure-aware bias, i.e., the injection logic of the crystal structure bias mask: when the current generation step of the model is an atomic coordinate value (e.g., Query Token "0.25"), a positive bias is superimposed at the corresponding historical lattice constant value (e.g., Key Token "5.64"). This increases the model's attention weight to physical constraints in the attention calculation process on the right.
[0020] By injecting a positive bias before Softmax normalization This increases the attention weight of atomic coordinate tokens on lattice parameter tokens. This is equivalent to implanting the physical rule that "coordinates depend on the unit cell" into the neural network. When the model attempts to generate atomic coordinates, the attention mechanism focuses on the previously generated lattice parameters, rather than irrelevant descriptive text. This significantly enhances the consistency constraint between the generated geometry and the unit cell parameters at the algorithm level, greatly reducing the probability of generating unreasonable structures.
[0021] This embodiment also provides a method for predicting chemical structure configurations based on a lightweight language model, which uses the lightweight language model constructed using the above method.
[0022] A method for predicting the configuration of a chemical structure based on a lightweight language model, wherein the chemical structure configuration is a crystal structure, includes the following steps performed sequentially: S1: Obtain training data, process the training data into text, and obtain the CIF text of the training data; In order to convert the three-dimensional crystal structure into a sequence form that can be processed by a lightweight language model, the physical data first needs to be text-encoded.
[0023] This application uses the MP-20 dataset, constructed based on Materials Project data, as training data. Each crystal sample in MP-20 is stored as a structured object and can be serialized into a JSON representation via a standard interface. This structured object contains a unique identifier for the material, its chemical composition, space group information, thermodynamic properties, and a complete description of its crystal structure. The crystal structure is explicitly represented through lattice parameters, unit cell angles, and fractional coordinates of each atom, possessing a clear and complete physical semantics. However, this nested structured representation is not suitable for direct modeling and processing by lightweight language models.
[0024] To transform the crystal structure into an input representation suitable for lightweight language models, this application employs a crystal structure parsing tool to parse the crystal structure object in the original structured data. This tool is a common open-source software library capable of reading, manipulating, and converting crystallographic data formats, such as pymatgen. The original structured data refers to the underlying data format in which the crystal sample is stored in the computer, i.e., JSON data or object format with nested levels. This data is then exported as a standardized CIF text representation. During this process, key information such as lattice constants, cell angles, space group identifiers, and the element types and fractional coordinates of each atom are fully encoded into the CIF text. Subsequently, the CIF file is no longer used as a geometric structure object but is treated as a linear text sequence with strict syntactic and numerical constraints. Through this textualization process, the three-dimensional crystal structure, while fully preserving its geometric and symmetry information in a crystallographic sense, is transformed into a serialized representation suitable for autoregressive modeling in lightweight language models.
[0025] After the crystal structure was textualized, all samples were further formatted into the data format required for training the lightweight language model. Specifically, the samples were written to a CSV file for training, where the complete CIF text was stored as a multi-line string in the cif field, while material-related property information (such as chemical formula, space group number, etc.) was saved as a separate field. This CSV file can serve as a direct data source for the supervised fine-tuning phase, and the lightweight language model no longer encounters the original structured JSON data or geometric representation during training.
[0026] S2: Construct the CIF text into an instruction fine-tuning format, preferably ChatML format, and obtain the training dataset. In this embodiment, each training sample consists of three parts: system prompts, user input, and model output. The system prompts define the crystallography expert role and constrain the generation rules; the user input provides specific material condition information, including chemical formula and space group number; the model output is the corresponding textual representation of the crystal structure. Under this setting, the crystal structure generation task is uniformly described as the problem of generating a sequence of CIF texts that conform to the crystallography syntax rules under given material condition constraints, such as... Figure 3 The image shown is an example of a training sample for this application.
[0027] S3: Input the training dataset into the lightweight language model constructed by the above method for predicting chemical structure configurations, and perform supervised fine-tuning to obtain the fine-tuned lightweight language model. S4: Construct a physical-syntax dual-stream hybrid loss function for back-tuning the lightweight language model, and use this physical-syntax dual-stream hybrid loss function to adjust the lightweight language model to obtain a trained lightweight language model; in this embodiment, as... Figure 4 As shown, the physical-syntax two-stream hybrid loss function in step S4 It is expressed by the following formula: ; in, For balance coefficient, For text cross-entropy loss, T represents the total length of the target sequence. This represents the actual token at position t in the target sequence, where t represents the position index in the target sequence. This represents the conditional probability distribution function of the model given historical tokens. To measure the loss numerically, , It is the set of indices of all numerical tokens in the target sequence. It is the set of the K candidate tokens with the highest predicted probability by the model. This refers to a specific candidate token in the candidate token set. This represents the historical token sequence up to the t-th position. It is a numerical distance function: ; in, This indicates an operation that converts the token to a floating-point number. This is a non-numerical penalty constant.
[0028] This application's physics-grammar dual-stream hybrid loss function introduces a numerical regression mechanism. It can identify extremely close values (5.59 and 5.58) and thus impose a smaller penalty on them, while imposing a larger penalty on non-numerical predictions. This gives the lightweight language model the ability to perceive numerical magnitudes, moving beyond rote memorization of numerical symbols to understanding the continuity of physical quantities. (Distance metric loss) The lightweight language model receives gradients pointing in the direction of smaller numerical errors during backpropagation. This is equivalent to embedding a regression mechanism in the text generation task, guiding the lattice constants and atomic coordinates generated by the lightweight language model to mathematically approximate the true values as closely as possible, thereby significantly reducing the geometric distortion rate of the generated structure (such as cell parameter errors and anomalous atomic spacing).
[0029] S5: Input the chemical formula and the corresponding space group number into the trained lightweight language model to perform inference, obtain the predicted crystal structure, and the corresponding CIF file.
[0030] In this implementation, such as Figure 5 As shown, the inference process of the trained lightweight language model is as follows: the input chemical formula and space group number are presented as prompts in a structured format and then input into the trained lightweight language model. Based on the currently accumulated content, the trained lightweight language model predicts the probability distribution of the next token and samples a token from this distribution. The sampled token is then added to the crystal structure content being generated. This process is iteratively executed until a predefined termination condition is met. The predicted crystal structure and its corresponding CIF file can be generated through an autoregressive approach. Visualization of this CIF file using tools such as VESTA yields results such as... Figure 6 The three-dimensional crystal structure shown allows for a visual inspection of the geometric rationality of the generated structure in terms of atomic arrangement and unit cell parameters.
[0031] As a preferred approach, the prediction method further includes the following steps performed sequentially: S6: Based on the preset sampling temperature parameters, in this embodiment, the temperature parameters are set to 0.7-1.0, and N independent inferences are performed under the same conditions to obtain N independent and complete predicted crystal structures and corresponding CIF files; in this embodiment, N = 10, and the sampling temperature parameter is set to 0.9, which guides the model to explore more extensively in the probability space. This process significantly improves the configuration diversity of candidate structures and effectively alleviates the problem of easily getting trapped in local optima in a single generation.
[0032] S7: Perform syntax rule filtering on the CIF file of the predicted crystal structure; this syntax rule filtering includes CIF syntax resolvability checks and consistency verification of elemental composition with the input chemical formula. Any candidate sample that cannot be resolved into a valid CIF structure, or whose elemental ratio does not match the given conditions, will be directly eliminated, thereby ensuring that subsequent physical evaluation is performed only on valid candidates with reasonable structure and composition.
[0033] S8: A graph neural network potential function (M3GNet) is used as the physical scorer to score the predicted crystal structures filtered by this syntax rule. Based on the principle of energy minimization, the crystal structure with the lowest predicted potential energy and the highest stability is selected as the final output predicted crystal structure. This scoring process is expressed by the following formula: ; in, This represents the potential energy value predicted by the M3GNet model. This is the set of candidate predicted crystal structures obtained after grammatical filtering. Represents the first in the set One candidate predicted crystal structure, The minimum value is represented by the operation used to determine the candidate predicted crystal structure with the lowest potential energy from the set of candidate predicted crystal structures as the final output predicted crystal structure. By introducing the M3GNet potential energy evaluation, high-energy unstable structures in the generation process are effectively eliminated, making the final output result approximate the thermodynamic ground state with the accuracy of DFT calculation.
[0034] Through the above process, the system can automatically select the best predicted crystal structure from multiple generated predicted crystal structures based on the principle of minimizing energy, and use it as the final output. This screening mechanism significantly reduces the predicted potential energy of the final output structure without introducing a global search or iterative energy optimization process, thereby effectively improving the thermodynamic stability, physical rationality, and overall reliability of the crystal structure prediction results.
[0035] The specific embodiments described above further illustrate the purpose, technical solution, and beneficial effects of the present invention. It should be understood that the above description is only a specific embodiment of the present invention and is not intended to limit the scope of protection of the present invention. Any modifications, equivalent substitutions, improvements, etc., made within the spirit and principles of the present invention should be included within the scope of protection of the present invention.
Claims
1. A method for constructing a lightweight language model for predicting chemical structure configurations, wherein the chemical structure configuration is a crystal structure, characterized in that: A lightweight language model is constructed, which is based on a small and medium parameter scale pre-training autoregressive language model below 10B, a low-rank adaptive module is introduced in a target linear transformation layer of the lightweight language model, and the rank of the low-rank adaptive module is dynamically allocated, and the following hierarchical adaptive function is used to dynamically define the rank of the low-rank adaptive module layer : ; in, This indicates the total number of Transformer blocks in the predefined lightweight language model. It is a shallow basic rank. For the deepest maximum rank, It is a non-linear growth coefficient. Represents the floor function; No. The formula for updating the linear layer weights of a layer is: ; in, Initialize as a zero matrix. Gaussian initialization is used. It is a scaling constant. In a lightweight language model, the first The original pre-trained weight matrix frozen in layers. This represents the weight matrix after low-rank adaptation update; A structured bias attention mask is introduced into the attention calculation of this lightweight language model, and the attention calculation of this lightweight language model is expressed by the following formula: ; in, The dimension of the key vector. Scaling factor These represent the query matrix, key matrix, and value matrix in the attention mechanism, respectively. This represents the normalization function, used to map the attention scores at each position to attention weights, and T represents the matrix transpose operation. For the total mask matrix, This represents the attention calculation function. = + , For standard causal mask, For crystal structure bias mask; For the currently generated number The token and the historically generated token Each token, its mask value The definition is as follows: ; in: This is the set of tokens corresponding to the fractional coordinates of the atoms, i.e., the x, y, z coordinates of the atom currently being predicted. For the set of tokens corresponding to the lattice constant, This is the bias parameter for positive attention.
2. A method for predicting chemical structure configuration based on a lightweight language model, wherein the chemical structure configuration is a crystal structure, characterized in that: The steps are as follows, performed sequentially: S1: Obtain training data, perform text processing on the training data, and obtain the CIF text of the training data; S2: Construct the CIF text into an instruction fine-tuning format and obtain the training dataset; S3: Input the training dataset into the lightweight language model constructed by the lightweight language model construction method for predicting chemical structure configuration as described in claim 1 above, and perform supervised fine-tuning to obtain the fine-tuned lightweight language model. S4: Construct a physical-syntax dual-stream hybrid loss function for back-tuning the lightweight language model, and use this physical-syntax dual-stream hybrid loss function to adjust the lightweight language model to obtain a trained lightweight language model; S5: Input the chemical formula and the corresponding space group number into the trained lightweight language model to perform inference, obtain the predicted crystal structure, and the corresponding CIF file.
3. The chemical structure configuration prediction method based on a lightweight language model as described in claim 2, characterized in that: The physical-syntax two-stream hybrid loss function in step S4 It is expressed by the following formula: ; in, For balance coefficient, For text cross-entropy loss, T represents the total length of the target sequence. This represents the actual token at position t in the target sequence, where t represents the position index in the target sequence. This represents the conditional probability distribution of the model given historical tokens. To measure the loss numerically, , It is the set of indices of all numerical tokens in the target sequence. It is the set of the K candidate tokens with the highest predicted probability by the model. This refers to a specific candidate token in the candidate token set. This represents the historical token sequence up to the t-th position. It is a numerical distance function: ; in, This indicates an operation that converts the token to a floating-point number. This is a non-numerical penalty constant.
4. The chemical structure configuration prediction method based on a lightweight language model as described in claim 3, characterized in that: The prediction method also includes the following steps performed sequentially: S6: Based on the preset sampling temperature parameters, perform N independent inferences under the same conditions to obtain N independent and complete candidate predicted crystal structures and their corresponding CIF files; S7: Perform syntax rule filtering on the CIF file of the candidate predicted crystal structure; S8: The graph neural network potential function M3GNet is used as the physical scorer to score the predicted crystal structures filtered by this syntax rule. The crystal structure with the lowest predicted potential energy and the highest stability is obtained as the final output predicted crystal structure. This scoring process is expressed by the following formula: ; in, This represents the potential energy value predicted by the M3GNet model. This is the set of candidate predicted crystal structures obtained after grammatical filtering. Represents the first in the set One candidate predicted crystal structure, This indicates the operation for determining the minimum value.
5. The chemical structure configuration prediction method based on a lightweight language model as described in claim 4, characterized in that: The syntax rule filtering in step S7 includes CIF language parsing checks and consistency verification of elemental composition with the input chemical formula.