Method for predicting molecular structure based on dual-core one-dimensional nuclear magnetic resonance raw spectrum

By directly preprocessing and encoding the raw spectral sequences of hydrogen and carbon nuclei and applying cross-nuclear attention updates, the method addresses the insufficient use of spectral information in existing methods and achieves more stable and unified molecular structure prediction.

CN122245485A · Pending · Publication Date: 2026-06-19 · NANJING TECH UNIV

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Applications(China)
Current Assignee / Owner
NANJING TECH UNIV
Filing Date
2026-03-20
Publication Date
2026-06-19

AI Technical Summary

Technical Problem

Existing methods rely on artificial peak tables or intermediate features, failing to effectively utilize the continuous peak shapes and weak peak details of the original NMR spectra. Furthermore, the modeling of the complementary relationship between hydrogen and carbon nuclei spectra is insufficient, resulting in inadequate molecular structure prediction capabilities.

Method used

The original spectral sequences of hydrogen and carbon nuclei are directly obtained, preprocessed, and encoded; grounding resampling is performed using learnable query vectors; and a unified spectral memory sequence and a global spectral vector are constructed through intranuclear self-attention and cross-nuclear bidirectional cross-attention updates. These are then input into a cross-modal molecular structure prediction network for joint prediction.

Benefits of technology

It reduces the information loss caused by peak table processing, enhances the utilization of complementary information between proton and carbon spectra, and improves the uniformity and stability of molecular structure prediction.

✦ Generated by Eureka AI based on patent content.

Figure CN122245485A_ABST

Abstract

This invention discloses a molecular structure prediction method based on dual-core one-dimensional nuclear magnetic resonance (NMR) raw spectra. The method directly acquires the raw spectral sequences of hydrogen and carbon nuclei and applies preprocessing, raw spectrum encoding, and fixed-query resampling to each, yielding dual-core grounding latent variables. Within the latent variable space, it then performs intranuclear self-attention updates and cross-nuclear bidirectional cross-attention updates to obtain dual-core inference latent variables. Chemical shift auxiliary supervision targets are constructed from the true chemical shift information. A unified spectral memory sequence and a global spectral vector are then built from the dual-core inference latent variables and input into a cross-modal molecular structure prediction network, which jointly performs spectrum-molecule comparison, spectrum-molecule matching, and molecular structure generation and outputs candidate molecular structures. The invention improves the use of continuous information in raw spectra, the modeling of complementary dual-core information, and unified molecular structure prediction.

Description

Technical Field

[0001] This invention relates to the fields of nuclear magnetic resonance (NMR) spectrum analysis, intelligent prediction of molecular structure, and cross-modal generation modeling technology, specifically to a method for predicting molecular structure based on a dual-core one-dimensional NMR raw spectrum.

Background Technology

[0002] Nuclear magnetic resonance (NMR) spectra are an important basis for molecular structure analysis.

[0003] Existing structure prediction methods based on NMR spectra mostly extract peak tables, peak position lists, chemical shift lists, or other artificially constructed intermediate features from the spectrum, and then use rule retrieval, discriminant models, or generative models to infer the structure. These methods typically struggle to directly utilize continuous peak shapes, weak peak details, peak width variations, and peak overlap relationships in the original spectrum.

[0004] On the other hand, there are significant differences between one-dimensional NMR spectra of hydrogen nuclei and one-dimensional NMR spectra of carbon nuclei in terms of peak shape, peak width, information density, and local structural significance. Simply splicing the two types of spectra together is insufficient to simultaneously preserve both mononuclear details and cross-nuclear complementarity within a unified characterization space.

[0005] Furthermore, existing cross-modal methods often focus only on spectrum-to-structure discrimination or only on conditional generation, lacking a unified modeling mechanism that simultaneously accounts for spectrum-molecule global consistency, sample pairing discrimination ability, and structure generation ability.

Summary of the Invention

[0006] I. Purpose of the Invention

[0007] The purpose of this invention is to provide a molecular structure prediction method based on dual-core one-dimensional nuclear magnetic resonance raw spectra, in order to solve the problems of existing methods relying on artificial peak tables or intermediate features, insufficient utilization of continuous information in the raw spectrum, insufficient modeling of dual-core complementary relationships, and insufficient ability to predict molecular structures from spectra.

[0008] II. Technical Solution

[0009] To achieve the above objectives, the present invention adopts the following technical solution:

[0010] First, the original spectral sequences of hydrogen and carbon nuclei are directly obtained, and the original spectra are preprocessed and encoded. Then, a fixed number of learnable query vectors are used to perform grounding resampling on the dense spectral features to obtain dual-core grounding latent variables.

[0011] Then, intranuclear self-attention updates and cross-nuclear bidirectional cross-attention updates are performed within the latent variable space to obtain dual-core inference latent variables;

[0012] The dual-core inference latent variables are then constructed into a unified spectral memory sequence and a global spectral vector. Finally, the spectral memory sequence and the global spectral vector are input into a cross-modal molecular structure prediction network to jointly perform spectrum-molecule comparison, spectrum-molecule matching, and molecular structure generation, and output one or more candidate molecular structures.

[0013] Specifically, the molecular structure prediction method based on dual-core one-dimensional nuclear magnetic resonance raw spectra of the present invention includes the following steps:

[0014] S1. Obtain the original one-dimensional nuclear magnetic resonance (NMR) spectrum sequence of hydrogen nuclei and the original one-dimensional nuclear magnetic resonance (NMR) spectrum sequence of carbon nuclei corresponding to the sample to be tested, and perform preprocessing.

[0015] S2, the original spectrum sequences of the hydrogen nucleus one-dimensional nuclear magnetic resonance (NMR) and the original spectrum sequences of the carbon nucleus one-dimensional NMR are encoded to obtain the corresponding hydrogen spectrum dense feature sequences and carbon spectrum dense feature sequences.

[0016] S3, using a fixed number of learnable query vectors to perform grounding resampling on the hydrogen spectrum dense feature sequence and the carbon spectrum dense feature sequence respectively, to obtain hydrogen nucleus grounding latent variables and carbon nucleus grounding latent variables;

[0017] S4, perform intranuclear self-attention update and cross-nuclear bidirectional cross-attention update on the hydrogen nucleus grounding latent variable and the carbon nucleus grounding latent variable to obtain the hydrogen nucleus inference latent variable and the carbon nucleus inference latent variable;

[0018] S5, construct a chemical shift auxiliary supervision target based on the real chemical shift information, and perform chemical shift auxiliary supervision on the spectral representation based on that target;

[0019] S6, construct a unified spectral memory sequence from the hydrogen nucleus inference latent variables and the carbon nucleus inference latent variables, and construct a global spectral vector based on the hydrogen nucleus inference latent variables and the carbon nucleus inference latent variables;

[0020] S7. Input the spectral memory sequence and the global spectral vector into the cross-modal molecular structure prediction network, and jointly perform spectral-molecule comparison, spectral-molecule matching and molecular structure generation to output one or more candidate molecular structures.

[0021] Further:

[0022] The preprocessing in step S1 includes at least orientation unification, median centering, and maximum absolute value normalization of the original spectral sequence.

[0023] The original spectral encoding in step S2 includes convolutional expansion, multi-scale convolutional block, positional embedding injection, and self-attention encoding.

[0024] The proton and carbon branches use different sets of multi-scale convolution kernels. The convolution kernel sets for the proton branch are 17, 65, and 129, while those for the carbon branch are 21 and 81. The convolution block stride is 8.

[0025] The length of the proton spectrum dense feature sequence and the carbon spectrum dense feature sequence obtained in step S2 are both 1250, and the feature dimension is 512.

[0026] Further:

[0027] The fixed number of learnable query vectors mentioned in step S3 are proton spectrum query vectors and carbon spectrum query vectors, with 128 proton spectrum query vectors and 128 carbon spectrum query vectors.

[0028] The grounding resampling in step S3 reads information from the corresponding dense feature sequence through multi-layer cross-attention and feedforward networks; preferably, after three layers of grounding resampling, the hydrogen nucleus grounding latent variable and the carbon nucleus grounding latent variable are output.

[0029] Further:

[0030] The intranuclear self-attention update in step S4 includes performing self-attention and feedforward network updates on the hydrogen nucleus grounding latent variable and the carbon nucleus grounding latent variable, respectively.

[0031] The cross-nuclear bidirectional cross-attention update in step S4 includes cross-attention updates where hydrogen latent variables read information from carbon latent variables and cross-attention updates where carbon latent variables read information from hydrogen latent variables.

[0032] In step S4, the cross-nuclear bidirectional cross-attention update introduces a gating residual coefficient to control the cross-nuclear information injection amplitude; preferably, after two layers of latent space inference, the hydrogen nucleus inference latent variables and carbon nucleus inference latent variables are output.

[0033] Further:

[0034] The chemical shift auxiliary supervision targets in step S5 include a peak presence target map, a peak density target map, and a position shift target map.

[0035] In step S5, a dense chemical shift prediction head and a detailed chemical shift prediction head are set respectively. The dense chemical shift prediction head is applied to the dense feature sequence, and the detailed chemical shift prediction head is applied to the latent variable representation.

[0036] In step S5, the detailed chemical shift prediction head generates peak presence and position shift predictions based on a mixture of latent variable representation and position grid, and in a preferred embodiment, it operates on the hydrogen nucleus inference latent variable and the carbon nucleus inference latent variable.

[0037] Further:

[0038] The spectral memory sequence in step S6 is formed by splicing hydrogen nucleus inference latent variables and carbon nucleus inference latent variables, and adds nucleus type embedding and query number embedding;

[0039] The global spectral vector is obtained by performing attention pooling on the hydrogen nucleus inference latent variables and carbon nucleus inference latent variables respectively, concatenating the results and linearly mapping them;

[0040] The global spectrogram vector is not inserted into the spectrogram memory sequence.
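The preferred-embodiment values listed in paragraphs [0022] to [0040] can be gathered into one configuration object. The following is a minimal sketch only: the class name and field names are hypothetical, and only the numeric values come from the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DualNucleusConfig:
    """Hypothetical container for the preferred-embodiment hyperparameters."""

    proton_kernels: tuple = (17, 65, 129)  # multi-scale kernel sizes, proton branch
    carbon_kernels: tuple = (21, 81)       # multi-scale kernel sizes, carbon branch
    patch_stride: int = 8                  # convolution block stride
    dense_len: int = 1250                  # dense feature sequence length per nucleus
    feature_dim: int = 512                 # dense feature dimension
    num_queries_h: int = 128               # proton spectrum query vectors
    num_queries_c: int = 128               # carbon spectrum query vectors
    resample_layers: int = 3               # grounding resampling layers
    reasoning_layers: int = 2              # latent space inference layers

    @property
    def memory_len(self) -> int:
        # The memory holds only the H and C latent tokens; no global token is inserted.
        return self.num_queries_h + self.num_queries_c

    @property
    def compression_ratio(self) -> float:
        # Dense tokens per latent slot for one nucleus.
        return self.dense_len / self.num_queries_h


cfg = DualNucleusConfig()
print(cfg.memory_len, cfg.compression_ratio)
```

Under these values the unified spectral memory would hold 256 tokens, and each latent slot summarizes a little under ten dense tokens.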

[0041] A molecular structure prediction device based on a dual-core one-dimensional nuclear magnetic resonance raw spectrum using the method of the present invention includes:

[0042] The data acquisition module is used to acquire the raw one-dimensional nuclear magnetic resonance (NMR) spectrum sequences of hydrogen nuclei and carbon nuclei.

[0043] The spectrum encoding module is used to preprocess and encode the original spectrum sequence to obtain the hydrogen spectrum dense feature sequence and the carbon spectrum dense feature sequence;

[0044] The resampling module is used to obtain the hydrogen nucleus grounding latent variable and the carbon nucleus grounding latent variable based on a fixed number of learnable query vectors;

[0045] The latent variable inference module is used to obtain latent variables for hydrogen nucleus inference and latent variables for carbon nucleus inference.

[0046] The spectral memory construction module is used to construct a unified spectral memory sequence and a global spectral vector;

[0047] The structure prediction module is used to jointly perform spectrum-molecule comparison, spectrum-molecule matching and molecular structure generation, and output one or more candidate molecular structures.

[0048] A computer program written based on the method of the present invention is stored in a readable storage medium. An apparatus for implementing the method includes a processor and the readable storage medium (memory); when the processor executes the computer program, the method of the present invention is implemented.

[0049] III. Beneficial Effects

[0050] Compared with the prior art, the present invention has the following beneficial effects:

[0051] First, the original spectrum is used directly as the main input, reducing the information loss caused by peak table processing;

[0052] Second, by compressing long spectral sequences with fixed-capacity latent variables, the burden of cross-modal modeling is reduced;

[0053] Third, dual-core latent space reasoning enhances the utilization of complementary information between proton and carbon spectra;

[0054] Fourth, by using a unified spectral memory to drive comparison, matching, and generation, molecular structure prediction achieves better uniformity and stability.

Attached Figure Description

[0055] Figure 1 is an overall structural diagram of the molecular structure prediction method according to an embodiment of the present invention.

[0056] Figure 2 is a structural diagram of the information extraction module according to an embodiment of the present invention.

[0057] Figure 3 is a structural diagram of the original spectrum encoding and chemical shift auxiliary supervision according to an embodiment of the present invention.

Detailed Implementation

[0058] The present invention will be further described below with reference to the accompanying drawings and specific embodiments.

[0059] I. Input Definition and Preprocessing

[0060] Let the batch size be B, and denote the original one-dimensional NMR spectra of the hydrogen and carbon nuclei as X_H and X_C, respectively.

[0061] To unify the spectral orientation and stabilize the numerical range, the original spectrum is preprocessed as shown in equations (1) to (3) before data construction or model input:

[0062]

[0063] Equation (1) unifies the spectral direction through Reverse(·);

[0064] Equation (2) subtracts the median, computed with median(·), to center the spectrum and suppress baseline shift;

[0065] Equation (3) uses the max(·) function to achieve maximum absolute value normalization, which is used to control the input scale.

[0066] Where x represents the input spectrum sequence, and ε is a very small positive number introduced to prevent the denominator from being zero during the normalization process.

[0067] By unifying the orientation, centering the median, and normalizing the amplitude, the impact of inconsistent spectral orientation, baseline shift, and amplitude scale differences on the subsequent coding process can be reduced.

[0068] In the preferred embodiment, the above-mentioned directional unification and normalization processing has been completed in the data construction stage.
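The three preprocessing steps described for equations (1) to (3) can be sketched as follows. This is an illustration under stated assumptions, not the filed formulas: the function name, the `reverse` flag (whether a given spectrum needs flipping is assumed to be known per dataset), and the default value of `eps` are hypothetical.

```python
from statistics import median


def preprocess_spectrum(x, reverse=False, eps=1e-8):
    """Orientation unification, median centering, max-abs normalization."""
    # Step (1): unify orientation; the reverse flag is an assumed input.
    seq = list(reversed(x)) if reverse else list(x)
    # Step (2): subtract the median to suppress baseline shift.
    med = median(seq)
    centered = [v - med for v in seq]
    # Step (3): divide by the largest absolute value; eps keeps the
    # denominator away from zero for an all-constant spectrum.
    scale = max(abs(v) for v in centered) + eps
    return [v / scale for v in centered]


raw = [0.1, 0.1, 0.5, 2.1, 0.4, 0.1, 0.1]
out = preprocess_spectrum(raw, reverse=True)
print(out)
```

After this step the baseline sits at zero and the tallest peak has magnitude close to one, which is the input scale the encoder branches would see.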

[0069] II. Original Spectrum Encoding

[0070] The proton and carbon spectra are encoded using separate branches. Each branch consists of an expand convolutional layer, a multi-scale convolutional block layer, a positional embedding layer, and a self-attention encoder. After convolutional expansion, an intermediate representation with expanded channel count is obtained. Then, multi-scale block division is completed through one-dimensional convolution with different kernel sizes to form a dense spectral feature sequence. The process is shown in equations (4) to (6):

[0071]

[0072] Equation (4) performs convolution expansion operations on the original spectrum using ConvExpand(·).

[0073] Equation (5) performs block operations on the proton and carbon spectra using Patcher_H(·) and Patcher_C(·), respectively.

[0074] Equation (6) performs encoding operations on the proton spectrum branch and the carbon spectrum branch using Encoder_H(·) and Encoder_C(·), respectively;

[0075] LN(T_H) and LN(T_C) denote the layer normalization operations of the proton and carbon spectrum branches, respectively;

[0076] Pos(g_H) and Pos(g_C) denote the Fourier position embeddings constructed according to the sequence lengths of the proton and carbon spectra, respectively;

[0077] T_{H,enc} and T_{C,enc} denote the dense feature sequences of the hydrogen spectrum and carbon spectrum, respectively.
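Equations (4) to (6) are not reproduced in this text. One reading consistent with the descriptions in paragraphs [0072] to [0077] is sketched below. The intermediate names U_H and U_C are hypothetical, and adding the position embedding to the normalized tokens is an assumption based on the phrase "positional embedding injection"; the filed formulas may differ.

```latex
\begin{aligned}
U_H &= \mathrm{ConvExpand}(X_H), &
U_C &= \mathrm{ConvExpand}(X_C) && \text{(4, assumed form)} \\
T_H &= \mathrm{Patcher}_H(U_H), &
T_C &= \mathrm{Patcher}_C(U_C) && \text{(5, assumed form)} \\
T_{H,enc} &= \mathrm{Encoder}_H\big(\mathrm{LN}(T_H) + \mathrm{Pos}(g_H)\big), &
T_{C,enc} &= \mathrm{Encoder}_C\big(\mathrm{LN}(T_C) + \mathrm{Pos}(g_C)\big) && \text{(6, assumed form)}
\end{aligned}
```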

[0078] In a preferred embodiment, the multi-scale convolution kernel sizes of the proton spectrum branch are 17, 65, and 129, those of the carbon spectrum branch are 21 and 81, and the convolution block stride is 8.

[0079] Because one-dimensional nuclear magnetic resonance spectra of hydrogen nuclei typically have a large number of peaks, dense peak shapes, and richer local detail variations, the hydrogen spectrum branch uses a multi-scale convolutional kernel that simultaneously contains both small and large receptive fields to balance local shape modeling of narrow peaks with modeling of inter-peak correlations over a longer range.

[0080] One-dimensional NMR spectra of carbon nuclei have relatively fewer peaks and sparser peak shapes. Therefore, different combinations of convolution kernels are used in carbon spectrum branches to balance local peak response extraction and global contour characterization.

[0081] Setting the convolution block step size to 8 can effectively compress the sequence length while preserving the main structural information of the spectrogram, thereby reducing the computational burden of subsequent self-attention encoding and latent variable inference.

[0082] After encoding, the length of both the proton spectrum dense feature sequence and the carbon spectrum dense feature sequence is 1250, and the feature dimension is 512.
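The relation between the stride of 8 and the dense sequence length of 1250 can be checked with the standard one-dimensional convolution length formula. The sketch below rests on two assumptions that the text does not state: a raw spectrum length of 10,000 points, and a padding of kernel // 2 so that every kernel size produces the same output length. How the multi-scale outputs are merged is not addressed here.

```python
def conv1d_out_len(length, kernel, stride, padding):
    """Output length of a 1-D convolution with dilation 1."""
    return (length + 2 * padding - kernel) // stride + 1


INPUT_LEN = 10_000  # assumed raw spectrum length; not given in the text
STRIDE = 8          # convolution block stride from the preferred embodiment

for branch, kernels in {"H": (17, 65, 129), "C": (21, 81)}.items():
    # Assumed padding of kernel // 2 for each (odd) kernel size.
    lengths = [conv1d_out_len(INPUT_LEN, k, STRIDE, k // 2) for k in kernels]
    print(branch, lengths)
# H [1250, 1250, 1250]
# C [1250, 1250]
```

With odd kernels and this padding, the output length depends only on the input length and the stride, which is why branches with different kernel sets can still share a sequence length.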

[0083] III. Grounding Resampling and Fixed Capacity Latent Variables

[0084] In order to compress long spectral sequences into a fixed-capacity representation, this invention sets a fixed number of learnable query vectors for both proton and carbon spectra, and reads information from dense feature sequences through cross-attention to form grounded latent variables.

[0085] In this invention, "grounding" means that a fixed number of learnable query vectors do not evolve independently of the spectrum, but rather read information from the dense spectral feature sequence of the corresponding nuclear species through cross-attention, thereby establishing a correspondence between the latent variable representation and the actual spectral content.

[0086] In implementation, the grounding resampling block uses a two-stage residual update involving layer normalization, cross-attention, and a feedforward network. The process is shown in equations (7) to (9):

[0087]

[0088] Where LN(·) represents layer normalization operation; MHA(·) represents multi-head attention mechanism; FFN(·) represents feedforward neural network;

[0089] Q^{l} is the latent variable query representation at layer l, and \tilde{Q}^{l} is the intermediate result after the layer-l query representation completes the cross-attention update.

[0090] The operation of Equation (7) is to use the l-th layer query representation as the query term, perform cross-attention reading on the dense spectral features, and obtain the intermediate latent variable representation aligned with the spectral content.

[0091] The operation of Equation (8) is to further perform feedforward network and residual update based on the cross-attention update result to enhance the nonlinear expressive power of latent variables.

[0092] The operation of Equation (9) is to normalize the final query representation after multi-layer grounding resampling to obtain the hydrogen nucleus grounding latent variable and the carbon nucleus grounding latent variable respectively.

[0093] Z_{H,ground} and Z_{C,ground} denote the grounding latent variables of the hydrogen and carbon nuclei, respectively; L_r denotes the number of grounding resampling layers; Q_H^{L_r} and Q_C^{L_r} denote the final query representations of the proton and carbon spectrum branches after L_r layers of grounding resampling, which serve as the input to the subsequent normalization and latent variable construction.

[0094] In the preferred embodiment, the number of query vectors for the hydrogen spectrum and the carbon spectrum are both 128, and the number of grounding resampling layers is 3, thereby obtaining the hydrogen nucleus grounding latent variable Z_{H,ground} and the carbon nucleus grounding latent variable Z_{C,ground}.
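The fixed-capacity readout described in this section can be illustrated with a heavily simplified sketch. It keeps only the residual cross-attention step: single-head dot-product attention with no learned projections, no layer normalization, and no feedforward network, all of which the described block includes. The function names and the toy sizes are hypothetical.

```python
import math
import random


def softmax(scores):
    m = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, memory):
    """Each query reads a weighted average of the memory rows."""
    dim = len(queries[0])
    out = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, m)) / math.sqrt(dim) for m in memory]
        weights = softmax(scores)
        out.append([sum(w * m[d] for w, m in zip(weights, memory)) for d in range(dim)])
    return out


def resample(queries, dense_features, num_layers=3):
    """Residual cross-attention readout repeated num_layers times."""
    q = [row[:] for row in queries]
    for _ in range(num_layers):
        read = cross_attention(q, dense_features)
        q = [[a + b for a, b in zip(q_row, r_row)] for q_row, r_row in zip(q, read)]
    return q


random.seed(0)
dim = 4
dense = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(50)]   # stands in for 1250 x 512
queries = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(6)]  # stands in for 128 x 512
latents = resample(queries, dense)
print(len(latents), len(latents[0]))
```

The number of output rows equals the number of queries regardless of how long the dense sequence is, which is what makes the latent representation fixed-capacity.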

[0095] IV. Latent Variable Space Reasoning

[0096] After obtaining the dual-core grounding latent variables, this invention does not simply concatenate the two. Instead, it first performs intranuclear self-attention updates on each separately, and then performs cross-nuclear bidirectional cross-attention updates to obtain the dual-core inference latent variables. In the current implementation, the cross-nuclear interaction uses gated residual coefficients to control the information injection amplitude.

[0097] The process is as shown in equations (10) to (15):

[0098]

[0099] Where SelfBlock(·) denotes the intranuclear update module consisting of self-attention and a feedforward network, and α_H and α_C are the learnable gating parameters of the proton and carbon spectrum branches, respectively;

[0100] Z_{H,self} and Z_{C,self} denote the hydrogen nucleus and carbon nucleus grounding latent variables after the intranuclear self-attention update, respectively;

[0101] g_H and g_C denote the gating coefficients of the hydrogen and carbon nucleus branches, respectively, which control the intensity of cross-nuclear information injection;

[0102] Z_{H,cross} and Z_{C,cross} denote the intermediate latent variables after the cross-nuclear attention update;

[0103] Z_{H,reason} and Z_{C,reason} denote the hydrogen nucleus and carbon nucleus inference latent variables obtained after cross-nuclear inference, respectively;

[0104] clamp(·) denotes a truncation operation that restricts its input to a preset range, used to stabilize the range of the gating coefficients.

[0105] Equation (10) performs intranuclear self-attention updates on the hydrogen nucleus and carbon nucleus grounding latent variables separately, to enhance global dependency modeling within each nucleus type.

[0106] Equation (11) maps the learnable gating parameters to gating coefficients within a limited range, to adjust the intensity of cross-nuclear information injection.

[0107] Equation (12) performs the cross-nuclear cross-attention update in which the hydrogen nucleus latent variables read information from the carbon nucleus latent variables.

[0108] Equation (13) applies a feedforward network on top of that cross-nuclear update to complete the nonlinear feature transformation and obtain the hydrogen nucleus inference latent variables.

[0109] Equation (14) performs the cross-nuclear cross-attention update in which the carbon nucleus latent variables read information from the hydrogen nucleus latent variables.

[0110] Equation (15) applies a feedforward network on top of that cross-nuclear update to complete the nonlinear feature transformation and obtain the carbon nucleus inference latent variables.

[0111] In a preferred embodiment, the number of latent variable space inference layers is 2.

[0112] If a certain nucleus modality is missing, the corresponding latent variables and the subsequent memory positions can be masked based on the modality presence flag.
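The gated residual injection described for equations (11), (12), and (14) can be sketched as below. The clamp range of [0, 1] is an assumption (the text only says "a preset range"), any squashing applied to the gating parameter before clamping is unknown and omitted, and `cross_read` stands in for the output of the cross-attention that reads from the other nucleus. The function names are hypothetical.

```python
def clamp(value, low, high):
    """Restrict value to the closed interval [low, high]."""
    return max(low, min(high, value))


def gated_residual(z_self, cross_read, alpha, low=0.0, high=1.0):
    """Return z_self + g * cross_read, with g = clamp(alpha, low, high)."""
    g = clamp(alpha, low, high)  # gating coefficient from the learnable parameter
    return [
        [z + g * c for z, c in zip(z_row, c_row)]
        for z_row, c_row in zip(z_self, cross_read)
    ]


z_h_self = [[1.0, 2.0]]    # one hydrogen latent token after self-attention
read_c = [[0.5, -1.0]]     # what that token read from the carbon latents
print(gated_residual(z_h_self, read_c, alpha=0.25))  # [[1.125, 1.75]]
print(gated_residual(z_h_self, read_c, alpha=3.0))   # [[1.5, 1.0]]
```

A gate of zero leaves the intranuclear representation untouched, so a small gate limits how much one nucleus can overwrite the other's details.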

[0113] V. Chemical Shift-Assisted Supervision

[0114] Based on the actual chemical shift information, a peak presence target map, a peak density target map, and a position shift target map can be constructed, and the spectral representation can be further supervised by a dense chemical shift prediction head and a detailed chemical shift prediction head, respectively.

[0115] The process is as shown in equations (16) to (21):

[0116]

[0117] Where y_max denotes the peak presence target map, which characterizes whether a true peak exists at each position; y_sum denotes the peak density target map, which characterizes the superimposed distribution of true peak responses near each position; σ denotes the smoothing scale parameter used to construct the target response distribution; i denotes the index of a discrete position in the spectrum; and the k-indexed term denotes the center position of a true chemical shift in the token index space;

[0118] `logit_dense` denotes the peak presence prediction output by the dense chemical shift prediction head, and `delta_dense` denotes the position offset output by the dense prediction head; L_dense^nuc denotes the dense prediction loss for a given nucleus type, and L_detail^nuc denotes the detailed prediction loss for that nucleus type; L_peak(·) denotes the peak presence loss term, L_off(·) denotes the peak position offset loss term, L_multi(·) denotes the density supervision loss term based on the peak density target map y_sum, and L_cons^nuc denotes the consistency constraint loss between the dense and detailed prediction heads on the peak presence distribution.

[0119] L_ppm^nuc denotes the total chemical shift auxiliary supervision loss for a given nucleus type; L_ppm^H and L_ppm^C denote the chemical shift auxiliary supervision losses of the hydrogen nucleus and the carbon nucleus, respectively; L_ppm denotes the total dual-core chemical shift auxiliary supervision loss; w_dense, w_detail, and w_cons denote the weight coefficients of the dense prediction loss, the detailed prediction loss, and the consistency constraint term, respectively; nuc denotes the nucleus type identifier; and center_idx denotes the center index obtained by mapping the true peak position.

[0120] Equation (16) constructs the peak presence target map.

[0121] Equation (17) constructs the peak density target map.

[0122] Equation (18) defines the single-nucleus dense prediction loss.

[0123] Equation (19) defines the single-nucleus detailed prediction loss.

[0124] Equation (20) defines the total chemical shift auxiliary supervision loss for a single nucleus type.

[0125] Equation (21) sums the auxiliary supervision losses of the hydrogen and carbon nuclei to obtain the total dual-core auxiliary supervision loss.
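The two target maps of equations (16) and (17) can be sketched as follows. The names y_max and y_sum suggest a maximum and a sum over per-peak responses; the Gaussian response shape, the function name, and the example values are assumptions, and the position shift target map is not covered.

```python
import math


def peak_target_maps(centers, length, sigma=1.0):
    """Return (y_max, y_sum) over token positions 0 .. length - 1.

    centers holds true peak positions already mapped to token index space.
    A Gaussian response per peak is assumed here.
    """
    y_max, y_sum = [], []
    for i in range(length):
        responses = [math.exp(-((i - c) ** 2) / (2 * sigma ** 2)) for c in centers]
        y_max.append(max(responses, default=0.0))  # peak presence: strongest response
        y_sum.append(sum(responses))               # peak density: superimposed responses
    return y_max, y_sum


y_max, y_sum = peak_target_maps(centers=[3.0, 4.0], length=8, sigma=1.0)
print([round(v, 3) for v in y_max])
print([round(v, 3) for v in y_sum])
```

Where two peaks overlap, y_sum exceeds y_max, so the density map carries information about crowding that the presence map alone would lose.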

[0126] In a preferred embodiment, the dense chemical shift prediction head operates on the dense feature sequence, the detailed chemical shift prediction head operates on the latent variable representation, and generates peak presence and position shift predictions based on the mixing of position grids and latent variables; in a preferred embodiment, the detailed chemical shift prediction head reads the inference latent variables.

[0127] In chemical shift-assisted supervision tasks, existing prediction head forms can be used, including position-wise regression heads based on convolutional layers, classification or regression heads based on multilayer perceptrons, and sequence prediction heads based on attention mechanisms.

[0128] In this embodiment, to match the characteristics of one-dimensional NMR spectra, which simultaneously possess continuous spectral shape information and key peak position information, a dual-head structure of a dense chemical shift prediction head and a detailed chemical shift prediction head is adopted. Specifically, the dense chemical shift prediction head performs position-level prediction on the dense spectral feature sequence to maintain the local continuous structure and overall peak distribution information of the spectrum; the detailed chemical shift prediction head performs fine-grained peak presence and position shift prediction on the latent variable representation to further enhance the key peak position information in the compressed representation space. Through this dual-head structure, both continuous spectral contour representation and accurate modeling of key peak positions can be simultaneously considered, thereby improving the stability and effectiveness of chemical shift-assisted supervision.

[0129] VI. Unified Spectral Memory Construction

[0130] To facilitate use by downstream cross-modal networks, this invention constructs a unified spectrogram memory sequence from the dual-core inference latent variables, while simultaneously constructing a separate global spectrogram vector. The memory used for cross-modal cross-attention consists only of dual-core inference latent variable tokens, without inserting explicit global tokens.

[0131] The process is as shown in equations (22) to (26):

[0132]

[0133] Where:

[0134] E_type denotes the nucleus type embedding, used to distinguish proton spectrum tokens from carbon spectrum tokens;

[0135] E_qid denotes the query number embedding, used to identify each resampling slot;

[0136] Concat(·) denotes concatenation;

[0137] Score(·) denotes the learned attention scoring function;

[0138] softmax(·) denotes the softmax normalization function;

[0139] i and j denote the token indices in the hydrogen nucleus spectrum memory segment and the carbon nucleus spectrum memory segment, respectively; W_cls[p_H; p_C] indicates that the hydrogen nucleus global pooling vector p_H and the carbon nucleus global pooling vector p_C are concatenated and then linearly mapped to obtain the global spectrogram vector spec_cls.

[0140] spec_cls represents the global spectrum vector used by the spectrum-molecule comparison task.

[0141] The operation of equation (22) is to splice the hydrogen nucleus inference latent variable and the carbon nucleus inference latent variable along the sequence dimension to form a unified spectral memory sequence.

[0142] The operation of equation (23) is to inject the nucleus type embedding and the query number embedding into the unified spectral memory sequence.

[0143] Equation (24) involves calculating attention weights for the hydrogen nucleus spectrum memory segment and the carbon nucleus spectrum memory segment, respectively.

[0144] The operation of Equation (25) is to perform weighted pooling on the hydrogen nucleus spectrum memory segment and the carbon nucleus spectrum memory segment according to the attention weight, so as to obtain the hydrogen nucleus global pooling vector and the carbon nucleus global pooling vector.

[0145] Equation (26) involves concatenating and linearly mapping the dual-core global pooling vectors to obtain the global spectrogram vector spec_cls.

[0146] In a preferred embodiment, if the spectrum of one nucleus type is missing, the spectral memory segment corresponding to that nucleus is masked out, and the corresponding pooling result is set to zero.
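The memory construction, per-nucleus attention pooling, and missing-nucleus masking described in this section can be sketched in plain Python. This is a minimal illustration under stated assumptions: Score(·) is taken to be a single linear map, the embeddings are added element-wise, a missing nucleus is signalled by passing `None`, and all function and parameter names are hypothetical.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def add_vectors(*vectors):
    """Element-wise sum of equally sized vectors."""
    return [sum(vals) for vals in zip(*vectors)]

def attention_pool(tokens, score_w):
    """Weighted pooling of tokens; Score(.) is assumed to be a linear map."""
    if not tokens:                      # missing nucleus: pooled vector is zero
        return [0.0] * len(score_w)
    weights = softmax([sum(w * x for w, x in zip(score_w, t)) for t in tokens])
    return [sum(a * t[d] for a, t in zip(weights, tokens))
            for d in range(len(score_w))]

def build_memory(z_h, z_c, e_type, e_qid_h, e_qid_c):
    """Concatenate H and C latents and add type and query-number embeddings.
    A missing nucleus is passed as None; its segment is zeroed and masked out."""
    memory, mask = [], []
    for seg, type_vec, qids in ((z_h, e_type["H"], e_qid_h),
                                (z_c, e_type["C"], e_qid_c)):
        present = seg is not None
        for k, qid in enumerate(qids):
            token = (add_vectors(seg[k], type_vec, qid) if present
                     else [0.0] * len(type_vec))
            memory.append(token)
            mask.append(present)
    return memory, mask

def global_vector(memory, mask, n_h, score_w, w_cls):
    """Pool each segment, concatenate the two results, apply the map W_cls."""
    h_tokens = [t for t, keep in zip(memory[:n_h], mask[:n_h]) if keep]
    c_tokens = [t for t, keep in zip(memory[n_h:], mask[n_h:]) if keep]
    pooled = attention_pool(h_tokens, score_w) + attention_pool(c_tokens, score_w)
    return [sum(w * x for w, x in zip(row, pooled)) for row in w_cls]

# Toy example: 2 query slots per nucleus, feature dimension 2.
z_h = [[1.0, 0.0], [0.0, 1.0]]
z_c = [[1.0, 1.0], [2.0, 2.0]]
e_type = {"H": [0.1, 0.0], "C": [0.0, 0.1]}
e_qid = [[0.0, 0.0], [0.0, 0.0]]
score_w = [0.0, 0.0]                          # zero scores give uniform weights
w_cls = [[1.0, 0.0, 0.0, 0.0], [0.0, 0.0, 1.0, 0.0]]

memory, mask = build_memory(z_h, z_c, e_type, e_qid, e_qid)
spec_cls = global_vector(memory, mask, 2, score_w, w_cls)

# Carbon spectrum missing: its segment is masked and its pooled vector is zero.
memory_h, mask_h = build_memory(z_h, None, e_type, e_qid, e_qid)
spec_cls_h = global_vector(memory_h, mask_h, 2, score_w, w_cls)
```

Keeping the memory length fixed and flagging the missing segment in the mask means the downstream cross-attention sees the same layout whether one or both spectra are available.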

[0147] VII. Cross-modal molecular structure prediction network

[0148] After the unified spectral memory sequence and the global spectral vector are obtained, they are input into a cross-modal encoder-decoder network with shared parameters. The cross-modal encoder-decoder network includes a text encoding path, a spectrum-text fusion encoding path, and a conditional generation decoding path. The unified spectral memory sequence serves as the memory input for cross-modal cross-attention, injecting spectral conditional information into the molecular representation, while the global spectral vector spec_cls is used for global consistency modeling in the spectrum-molecule comparison task.

[0149] In the prior art, usable cross-modal encoder-decoder networks include Transformer-based encoder-decoder architectures, cross-modal generative networks based on vision-language pre-trained models, and multimodal generative architectures that use shared encoders and conditional decoders.

[0150] In this embodiment, to match the representation of a one-dimensional NMR spectrum after original spectrum encoding, grounded resampling, latent space inference, and unified spectral memory construction, a cross-modal encoder-decoder network with shared parameters is employed. Specifically, the unified spectral memory sequence serves as the memory input for cross-modal cross-attention, injecting spectral conditional information into the molecular representation; the global spectral vector spec_cls is used for global consistency modeling in the spectrum-molecule comparison task.

[0151] Building upon this foundation, the network further includes a spectrum-molecule comparison head, a spectrum-molecule matching classification head, and a structure generation branch. The spectrum-molecule comparison head measures the similarity between the spectrum representation and the molecular representation in the shared embedding space. The spectrum-molecule matching classification head determines whether the spectrum and molecule constitute a true pair. The structure generation branch generates SELFIES or SMILES sequences autoregressively under spectrum constraints, thereby outputting candidate molecular structures. Through these settings, a unified spectrum memory sequence and a global spectrum vector can simultaneously serve the comparison, matching, and generation tasks, enabling cross-modal joint structure prediction for one-dimensional NMR spectra.

[0152] Based on this, the cross-modal representations output by the encoder-decoder network are fed into the spectrum-molecule comparison head, the spectrum-molecule matching classification head, and the structure generation branch, respectively. The spectrum-molecule comparison head is used to measure the similarity between the spectrum representation and the molecular representation in the shared embedding space; the spectrum-molecule matching classification head is used to determine whether the spectrum and the molecule constitute a true pair; the structure generation branch is used to autoregressively generate SELFIES or SMILES sequences under the constraints of the spectrum, thereby outputting candidate molecular structures.

[0153] As shown in the overall structure of Figure 1, the output of the cross-modal molecular structure prediction network is further fed into the spectrum-molecule comparison head, the spectrum-molecule matching classification head, and the structure generation branch, respectively.

[0154] The spectrum-molecule comparison head performs contrastive learning based on the similarity between the global spectral vector spec_cls and the global molecular representation. This strengthens the global consistency constraint between the spectrum representation and the molecular representation in the shared embedding space, pulling true pairs closer together and pushing non-matching pairs further apart.
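One common way to realize such a contrastive objective is a symmetric InfoNCE loss over a batch of paired vectors. The sketch below assumes that formulation, cosine similarity, and a temperature of 0.07; none of these choices is stated in the text, and the function names are hypothetical.

```python
import math

def normalize(v):
    """Scale a vector to unit length (left unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]

def cross_entropy_row(logits, target):
    """Negative log-softmax of the target entry, computed stably."""
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_sum - logits[target]

def contrastive_loss(spec_vectors, mol_vectors, temperature=0.07):
    """Symmetric InfoNCE: item i in each list is assumed to form the true pair."""
    s = [normalize(v) for v in spec_vectors]
    m = [normalize(v) for v in mol_vectors]
    sim = [[sum(a * b for a, b in zip(si, mj)) / temperature for mj in m]
           for si in s]
    n = len(sim)
    # Spectrum -> molecule direction uses rows; the reverse uses columns.
    spec_to_mol = sum(cross_entropy_row(sim[i], i) for i in range(n)) / n
    mol_to_spec = sum(cross_entropy_row([sim[j][i] for j in range(n)], i)
                      for i in range(n)) / n
    return 0.5 * (spec_to_mol + mol_to_spec)

spec = [[1.0, 0.0], [0.0, 1.0]]
aligned = contrastive_loss(spec, [[1.0, 0.0], [0.0, 1.0]])
shuffled = contrastive_loss(spec, [[0.0, 1.0], [1.0, 0.0]])
```

The loss is small when each spectrum vector is most similar to its own molecule vector (`aligned`) and large when the pairing is swapped (`shuffled`).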

[0155] The spectrum-molecule matching classification head operates on the cross-modal fused representation conditioned on the spectrum. It classifies whether the input spectrum and a candidate molecule constitute a true pair, thereby improving the model's ability to distinguish positive pairings from negative ones.

[0156] The structure generation branch uses the conditional information provided by the unified spectral memory sequence and employs autoregressive decoding to generate a molecular structure string representation token by token. The string representation is preferably a SELFIES or SMILES sequence, and the branch outputs one or more candidate molecular structures.
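Autoregressive decoding under spectral conditioning can be sketched as a greedy loop. The decoder is replaced here by a hypothetical callable `next_token_scores(memory, prefix)`, and `toy_scores` is a stand-in that simply replays a stored token list; the real branch, its vocabulary, and its search strategy are not specified at this level of detail in the text.

```python
def greedy_decode(next_token_scores, memory, bos="<bos>", eos="<eos>", max_len=64):
    """Generate tokens one at a time, each conditioned on the memory and prefix."""
    tokens = [bos]
    for _ in range(max_len):
        scores = next_token_scores(memory, tokens)   # dict: token -> score
        best = max(scores, key=scores.get)
        if best == eos:
            break
        tokens.append(best)
    return "".join(tokens[1:])

def toy_scores(memory, prefix):
    """Stand-in decoder: replays the target tokens stored in `memory`."""
    target = memory["target_tokens"]
    step = len(prefix) - 1
    nxt = target[step] if step < len(target) else "<eos>"
    return {tok: (1.0 if tok == nxt else 0.0) for tok in set(target) | {"<eos>"}}

smiles = greedy_decode(toy_scores, {"target_tokens": ["C", "C", "O"]})
```

Greedy search returns a single string; producing several candidate structures would typically use beam search or sampling in place of the `max` step.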

[0157] Therefore, the spectrum-molecule comparison head mainly undertakes the function of global consistency modeling, the spectrum-molecule matching classification head mainly undertakes the function of pairing relationship discrimination, and the structure generation branch mainly undertakes the function of conditional molecular structure generation. Together, they constitute a cross-modal joint prediction mechanism for dual-core one-dimensional nuclear magnetic resonance raw spectra.

[0158] The loss function is expressed as:

[0159] $L = \lambda_{sg} L_{sg} + \lambda_{smc} L_{smc} + \lambda_{smm} L_{smm} + \lambda_{ppm} L_{ppm}$

[0160] where $L_{sg}$ is the structure generation loss, $L_{smc}$ is the spectrum-molecule contrast loss, $L_{smm}$ is the spectrum-molecule matching loss, and $L_{ppm}$ is the chemical shift-assisted supervision loss; $\lambda_{sg}$, $\lambda_{smc}$, $\lambda_{smm}$ and $\lambda_{ppm}$ are the weighting coefficients corresponding to the structure generation loss, the spectrum-molecule contrast loss, the spectrum-molecule matching loss, and the chemical shift-assisted supervision loss, respectively.
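Assuming the four losses are combined as a plain weighted sum using the stated weighting coefficients, the joint objective can be computed as in this sketch. The numeric values are placeholders chosen only to make the arithmetic visible, not values from the invention.

```python
def joint_loss(losses, weights):
    """Weighted sum of the four task losses, keyed by task abbreviation."""
    return sum(weights[name] * losses[name] for name in ("sg", "smc", "smm", "ppm"))

# Placeholder values for illustration only.
losses = {"sg": 2.0, "smc": 1.0, "smm": 0.5, "ppm": 0.25}
weights = {"sg": 1.0, "smc": 0.5, "smm": 0.5, "ppm": 0.1}
total = joint_loss(losses, weights)   # 2.0 + 0.5 + 0.25 + 0.025
```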

[0161] The prediction model is trained jointly, simultaneously optimizing four tasks: structure generation, spectrum-molecule comparison, spectrum-molecule matching, and chemical shift-assisted supervision.

[0162] VIII. Predicted Output

[0163] During the inference phase, the original hydrogen nucleus spectrum sequence and the original carbon nucleus spectrum sequence of the sample to be tested are input into the model of this invention. After original spectrum encoding, grounded resampling, latent space inference, and unified spectral memory construction, the structure generation branch is invoked to output the molecular structure string representation.

[0164] In a preferred embodiment, the molecular structure strings are represented as SELFIES (Self-Referencing Embedded Strings) or SMILES (Simplified Molecular Input Line Entry System), and can be further converted into molecular graph structures or one or more candidate molecular structures.

Claims

1. A method for predicting molecular structure based on dual-core one-dimensional nuclear magnetic resonance raw spectra, characterized in that it includes the following steps:
S1, obtain the original one-dimensional nuclear magnetic resonance (NMR) spectrum sequence of hydrogen nuclei and the original one-dimensional NMR spectrum sequence of carbon nuclei corresponding to the sample to be tested, and preprocess the original spectrum sequences;
S2, encode the original one-dimensional NMR spectrum sequences of hydrogen nuclei and carbon nuclei processed by S1 using independent hydrogen spectrum and carbon spectrum branches, respectively, to obtain the corresponding hydrogen spectrum dense feature sequence and carbon spectrum dense feature sequence;
S3, use a fixed number of learnable query vectors to perform grounded resampling on the hydrogen spectrum dense feature sequence and the carbon spectrum dense feature sequence, respectively, to obtain hydrogen nucleus grounded latent variables and carbon nucleus grounded latent variables;
S4, perform intranuclear self-attention updates and cross-nuclear bidirectional cross-attention updates on the hydrogen nucleus grounded latent variables and the carbon nucleus grounded latent variables to obtain the hydrogen nucleus inference latent variables and the carbon nucleus inference latent variables;
S5, construct chemical shift auxiliary supervision targets based on real chemical shift information, and perform chemical shift auxiliary supervision on the spectral representation based on the chemical shift auxiliary supervision targets;
S6, construct a unified spectral memory sequence from the hydrogen nucleus inference latent variables and the carbon nucleus inference latent variables, and construct a global spectral vector based on the hydrogen nucleus inference latent variables and the carbon nucleus inference latent variables;
S7, input the spectral memory sequence and the global spectral vector into the cross-modal molecular structure prediction network, and jointly perform spectrum-molecule comparison, spectrum-molecule matching, and molecular structure generation to output one or more candidate molecular structures.

2. The molecular structure prediction method based on dual-core one-dimensional nuclear magnetic resonance raw spectra according to claim 1, characterized in that, in step S1, the preprocessing includes: unifying the orientation of the original spectral sequence, median centering, and maximum-absolute-value normalization.

3. The molecular structure prediction method based on dual-core one-dimensional nuclear magnetic resonance raw spectra according to claim 1, characterized in that, in step S2, the original spectrum encoding includes: convolutional expansion, multi-scale convolutional blocks, positional embedding injection, and self-attention encoding; the set of convolution kernel sizes for the proton spectrum branch is 17, 65, and 129; the set of convolution kernel sizes for the carbon spectrum branch is 21 and 81; the convolution block stride is 8 for all branches; the lengths of the obtained proton and carbon dense feature sequences are both 1250, and the feature dimensions are both 512.

4. The molecular structure prediction method based on dual-core one-dimensional nuclear magnetic resonance raw spectra according to claim 1, characterized in that, in step S3, the fixed number of learnable query vectors are hydrogen spectrum query vectors and carbon spectrum query vectors; the grounded resampling reads information from the corresponding dense feature sequence through multi-layer cross-attention and feedforward networks, and outputs the hydrogen nucleus grounded latent variables and the carbon nucleus grounded latent variables after 3 layers of grounded resampling; the number of query vectors for the proton spectrum is 128, and the number of query vectors for the carbon spectrum is 128.

5. The molecular structure prediction method based on dual-core one-dimensional nuclear magnetic resonance raw spectra according to claim 1, characterized in that, in step S4, intranuclear self-attention updates are performed first, followed by cross-nuclear bidirectional cross-attention updates, to obtain the dual-core inference latent variables; the intranuclear self-attention updates include: performing self-attention and feedforward network updates on the grounded latent variables of hydrogen nuclei and carbon nuclei, respectively; the cross-nuclear bidirectional cross-attention updates include: a cross-attention update in which the hydrogen nucleus latent variables read information from the carbon nucleus latent variables, and a cross-attention update in which the carbon nucleus latent variables read information from the hydrogen nucleus latent variables; a gated residual coefficient is introduced to control the amplitude of cross-nuclear information injection; after two layers of latent space inference, the hydrogen nucleus inference latent variables and the carbon nucleus inference latent variables are output.

6. The molecular structure prediction method based on dual-core one-dimensional nuclear magnetic resonance raw spectra according to claim 1, characterized in that, in step S5, the chemical shift auxiliary supervision targets include: a peak presence target map, a peak density target map, and a position shift target map; a dense chemical shift prediction head and a detailed chemical shift prediction head are provided, respectively; the dense chemical shift prediction head is applied to the dense feature sequences, and the detailed chemical shift prediction head is applied to the latent variable representations; the detailed chemical shift prediction head generates peak presence and position shift predictions based on a hybrid latent variable representation and a position grid, and is applied to the hydrogen nucleus inference latent variables and the carbon nucleus inference latent variables.

7. The molecular structure prediction method based on dual-core one-dimensional nuclear magnetic resonance raw spectra according to claim 1, characterized in that, in step S6, the spectral memory sequence is formed by splicing the hydrogen nucleus inference latent variables and the carbon nucleus inference latent variables, and adding the nucleus type embedding and the query number embedding; the global spectral vector is obtained by performing attention pooling on the hydrogen nucleus inference latent variables and the carbon nucleus inference latent variables respectively, concatenating the results, and linearly mapping them.

8. The molecular structure prediction method based on dual-core one-dimensional nuclear magnetic resonance raw spectra according to claim 1, characterized in that, in step S7, the unified spectral memory sequence serves as the memory input for cross-modal cross-attention, used to inject spectral conditional information into the molecular representation; the global spectral vector is used for global consistency modeling in the spectrum-molecule comparison task; the cross-modal encoder-decoder network includes a text encoding path, a spectrum-text fusion encoding path, and a conditional generation decoding path; the spectrum-molecule comparison, spectrum-molecule matching, and molecular structure generation are respectively achieved by the spectrum-molecule comparison head, the spectrum-molecule matching classification head, and the structure generation branch; the cross-modal representations output by the encoder-decoder network are fed into the spectrum-molecule comparison head, the spectrum-molecule matching classification head, and the structure generation branch, respectively; the spectrum-molecule comparison head is used to measure the similarity between spectrum representations and molecular representations in the shared embedding space; the spectrum-molecule matching classification head is used to determine whether a spectrum and a molecule constitute a true pair; the structure generation branch is used to autoregressively generate SELFIES or SMILES sequences under spectral constraints, thereby outputting candidate molecular structures.