Device and method for generating crystal structure information on basis of artificial intelligence model
The AI model system efficiently integrates image and text data to enhance crystal structure generation, addressing the limitations of existing methods by improving accuracy and reducing experimental costs.
Patent Information
- Authority / Receiving Office
- WO · WO
- Patent Type
- Applications
- Current Assignee / Owner
- POSCO HLDG INC
- Filing Date
- 2025-12-17
- Publication Date
- 2026-06-25
Abstract
Description
Device and method for generating crystal structure information based on artificial intelligence models
[0001] The present disclosure relates to a technology for generating crystal structure information based on an artificial intelligence model.
[0002] Artificial intelligence technology is a technology that enables computers to perform various tasks, such as reasoning and problem-solving, by training them to learn from data; its usefulness is being newly highlighted alongside advancements in computer computational power and data processing technology.
[0003] An artificial intelligence model is the product of such AI learning, and among them, generative AI refers to a model equipped with the ability to generate new content, going beyond the analysis and prediction of existing data. In particular, language models have high applicability in relation to existing academic and knowledge systems, as they utilize human language for both input and output.
[0004] In addition, regarding crystal structure generation technology, since discovering new materials generally requires numerous experiments and a significant amount of time and cost, generating crystal structure information using artificial intelligence technology can increase efficiency in aspects such as discovering a large number of candidate materials or more easily predicting synthesis conditions.
[0005] In addition, since crystal structure generation technology involves various academic fields such as materials science, chemistry, and polymer engineering, it requires learning a high level of specialized knowledge and knowledge regarding complex three-dimensional structures. Therefore, text-based language models alone have limitations, and there is a need for the development of technology for artificial intelligence models that can generate crystal structure information by learning specialized knowledge that includes both images and text.
[0006] The present disclosure aims to provide an artificial intelligence model-based crystal structure information generation device capable of generating crystal structure information.
[0007] In one aspect, the present embodiments may provide an artificial intelligence model-based crystal structure information generation device comprising: a data preprocessing unit that constructs an interleaved dataset by arranging text and images so as to intersect when the contents of text and images included in a raw dataset correspond to each other; a visual encoder that generates image feature data from an image; a language model that generates text encoding data from text; a projector that converts the image feature data and the text encoding data into data in a shareable form; a learning unit that learns the relationship between the image and the text based on the converted data and that, after initializing the projector, performs fine-tuning of the language model and the projector based on the interleaved dataset by simultaneously performing condition learning, which generates information satisfying specific conditions, and filling learning, which fills in masked parts; and an information generation unit that generates at least one crystal structure information based on prompt input.
[0008] In another aspect, the present embodiments may provide a method for generating crystal structure information based on an artificial intelligence model, comprising: a data preprocessing step for constructing an interleaved dataset by arranging text and images so as to intersect when the contents of text and images included in a raw dataset correspond to each other; a learning step for performing fine-tuning by simultaneously learning condition learning to generate information satisfying specific conditions and filling learning to fill masked parts based on the interleaved dataset, and a step for generating at least one crystal structure information based on prompt input.
[0009] According to the present disclosure, an artificial intelligence model-based crystal structure information generating device for generating crystal structure information can be provided.
[0010] FIG. 1 is a block diagram of an artificial intelligence model-based crystal structure information generation device according to the present disclosure.
[0011] FIG. 2 is a graph for exemplarily illustrating dataset collection according to one embodiment.
[0012] FIG. 3 is a diagram illustrating, in an exemplary manner, the content of performing learning based on an interleaved dataset according to one embodiment.
[0013] FIG. 4 is a diagram illustrating, in an exemplary manner, the simultaneous performance of condition learning and filling learning derived from crystal structure data based on an interleaved dataset according to one embodiment.
[0014] FIG. 5 is a diagram illustrating, in an exemplary manner, the content of performing crystal structure information generation learning according to one embodiment.
[0015] FIG. 6 is a diagram illustrating, in an exemplary manner, a configuration for generating crystal structure information based on a condition prompt according to one embodiment.
[0016] FIG. 7 is a diagram illustrating, in an exemplary manner, a configuration for generating crystal structure information based on a fill prompt according to one embodiment.
[0017] FIG. 8 is a diagram illustrating, in an exemplary manner, the configuration of a raw dataset according to one embodiment.
[0018] FIG. 9 is a diagram illustrating, in an exemplary manner, a configuration for constructing a raw dataset into an interleaved dataset according to one embodiment.
[0019] FIG. 10 is a diagram illustrating, in an exemplary manner, a configuration for categorizing and classifying each image of an interleaved dataset according to one embodiment.
[0020] FIG. 11 is a diagram illustrating, by way of example, a configuration for constructing an interleaved dataset for sub-images according to one embodiment.
[0021] FIG. 12 is a flowchart relating to a method for generating crystal structure information based on an artificial intelligence model according to the present disclosure.
[0022] FIG. 13 is a flowchart relating to a dataset preprocessing step according to one embodiment.
[0023] FIG. 14 is a flowchart relating to a learning step according to one embodiment.
[0024] FIG. 15 is a flowchart relating to a multimodal learning step according to one embodiment.
[0025] Hereinafter, some embodiments of the present disclosure will be described in detail with reference to the exemplary drawings. In assigning reference numerals to the components of each drawing, the same components may have the same reference numeral as much as possible, even if they are shown in different drawings. Furthermore, in describing the embodiments, if it is determined that a detailed description of related known components or functions may obscure the essence of the technical concept, such detailed description may be omitted. Where terms such as "comprising," "having," or "consisting of" are used in this specification, other parts may be added unless "only" is used. Where a component is expressed in the singular, it may include a plural unless otherwise specified.
[0026] Additionally, terms such as first, second, A, B, (a), (b), etc., may be used to describe the components of the present disclosure. These terms are used merely to distinguish the components from other components, and the nature, order, sequence, or number of the components are not limited by such terms.
[0027] In describing the positional relationship of components, where it is stated that two or more components are "connected," "combined," or "joined," it should be understood that while the two or more components may be directly "connected," "combined," or "joined," they may also be "connected," "combined," or "joined" with other components interposed between them. Here, the other components may be included in one or more of the two or more components that are "connected," "combined," or "joined" with one another.
[0028] In describing the temporal flow relationship regarding components, methods of operation, or methods of production, for example, when the temporal or sequential relationship is described using "after," "following," "next," or "before," it may include cases where the relationship is not continuous unless "immediately" or "directly" is used.
[0029] Meanwhile, where numerical values or corresponding information regarding a component (e.g., levels, etc.) are mentioned, even without separate explicit notation, the numerical values or corresponding information may be interpreted as including a range of error that may occur due to various factors (e.g., process factors, internal or external shocks, noise, etc.).
[0030] FIG. 1 is a block diagram of an artificial intelligence model-based crystal structure information generation device according to the present disclosure.
[0031] Referring to FIG. 1, the artificial intelligence model-based crystal structure information generation device (100) according to the present disclosure may include a data preprocessing unit (110), a learning unit (120), and an information generation unit (130). The data preprocessing unit (110), the learning unit (120), and the information generation unit (130) may be connected to each other.
[0032] For example, an artificial intelligence model-based crystal structure information generation device (100) may include a data preprocessing unit (110) that constructs an interleaved dataset by arranging text and images so that they intersect when the contents of text and images included in a raw dataset correspond to each other, a visual encoder that generates image feature data from an image, a language model that generates text encoding data from text, a projector that converts image feature data and text encoding data into data that can be shared, a learning unit (120) that learns the relationship between an image and text based on the converted data, and an information generation unit (130) that generates at least one crystal structure information based on prompt input.
[0033] The data preprocessing unit (110) can be connected to other components and can transmit and receive data. The data preprocessing unit (110) can transmit and receive data at any point in time or at a fixed interval. The data transmitted and received by the data preprocessing unit (110) may include not only individual data but also a dataset, which is a set of data.
[0034] For example, the data preprocessing unit (110) may receive a raw dataset. In this case, the raw dataset may be collected in a form generated or transmitted based on user input, or it may be collected through a separate data collection unit.
[0035] For example, the raw dataset may include at least one of images and text. Additionally, the raw dataset may contain images and text mixed together, or may be collected separately. Such a raw dataset will be described in more detail below in FIG. 2.
[0036] For example, the data preprocessing unit (110) can preprocess the data included in the raw dataset and transmit the preprocessed data to at least one of the learning unit (120) and the information generation unit (130).
[0037] For example, the data preprocessing unit (110) can construct a new dataset by preprocessing the data contained in the raw dataset. For example, if the content of text and images included in the raw dataset corresponds to each other, an interleaved dataset can be constructed by arranging the text and images so that they intersect each other. In this case, the raw dataset may be in a form where text data and image data are separated.
[0038] As such, the preprocessing step of interleaving related text and images in the correct positions can provide several advantages in training language models such as LLMs (Large Language Models) or multimodal models such as LMMs (Large Multimodal Models).
[0039] For example, in the case of language models based on sequence modeling, interleaved preprocessed text and images can provide more natural sequences, which can help LLMs or LMMs learn not only the individual content of multimodal data but also its context.
[0040] Furthermore, interleaved datasets can also aid in jointly learning contextual relationships between text and images. For instance, given that text following an image is likely to contain a description of that image, and that if the content of adjacent or close images is related, the corresponding text is also likely to be related, preprocessing the raw dataset into an interleaved dataset can be helpful in learning concepts that encompass common content among multiple related images and their corresponding texts.
[0041] For example, the data preprocessing unit (110) can generate interleaved data by inserting an image based on the position where the image is first mentioned in the text. And through this method, the raw dataset can be constructed into an interleaved dataset.
[0042] For example, the data preprocessing unit (110) can separate the text into individual sentences and insert an image based on the position of a sentence. For example, interleaved preprocessing can be performed by inserting the image between the sentence where the image is first mentioned and the preceding sentence. Alternatively, the image can be inserted between the sentence where the image is first mentioned and the following sentence. This is explained in more detail below with reference to FIGS. 8 and 9.
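The insertion rule of paragraphs [0041] and [0042] can be sketched as follows. This is a minimal illustration and not the implementation of the disclosure: the function names `split_sentences` and `interleave`, the `<image:...>` placeholder token, the naive sentence splitter, and the matching of mentions by a literal figure label are all assumptions made for the example.

```python
import re

def split_sentences(text):
    # Naive splitter (assumption): break after '.', '!' or '?' followed by whitespace.
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

def interleave(text, images, before=False):
    """Return the sentences of `text` with image placeholders inserted.

    images: dict mapping a label as it appears in the text (e.g. "Figure 1")
            to an image identifier (hypothetical naming).
    before: if True, insert each image ahead of the sentence that first
            mentions it; otherwise insert it right after that sentence.
    """
    sentences = split_sentences(text)
    first_mention = {}  # sentence index -> image ids first mentioned there
    for label, image_id in images.items():
        # (?!\d) keeps "Figure 1" from matching inside "Figure 10".
        pattern = re.compile(re.escape(label) + r"(?!\d)")
        for idx, sentence in enumerate(sentences):
            if pattern.search(sentence):
                first_mention.setdefault(idx, []).append(image_id)
                break  # only the first mention matters
    out = []
    for idx, sentence in enumerate(sentences):
        tokens = [f"<image:{i}>" for i in first_mention.get(idx, [])]
        if before:
            out.extend(tokens)
        out.append(sentence)
        if not before:
            out.extend(tokens)
    return out
```

Calling `interleave(text, {"Figure 1": "img1"})` places the placeholder after the first sentence that mentions "Figure 1"; passing `before=True` selects the alternative placement between that sentence and the preceding one.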
[0043] For example, the data preprocessing unit (110) can label images based on pre-set image category classification criteria. In this case, image category classification for each image may be performed with the image and text separated from each other, or may be performed on an interleaved dataset in which corresponding images and text are arranged to intersect.
[0044] For example, the data preprocessing unit (110) can perform image category classification and image labeling using a pre-trained language model. Alternatively, it can perform image category classification and image labeling using a separate image category classification device.
[0045] For example, the image category classification criteria may include at least one of charts / graphs, diagrams, microscopic photographs, macroscopic photographs, simulation images, map images, and experimental images. In some cases, if an image is determined not to belong to any of the above image categories, a separate "Other" category may be additionally established and the image classified into it.
[0046] Generally, the content of each image and its meaning in relation to the associated text can vary depending on how the image is classified. Consequently, even text regarding the same subject may be interpreted differently to some extent if the corresponding image belongs to a different category.
[0047] In contrast, regarding the relationship between a specific image and the content of the corresponding text, even if the subject of the image and text match, accurate learning is difficult to perform if the category of the image does not match the content of the text.
[0048] In such cases, additional preprocessing can be performed during the training of the AI model, such as adding cases where image categories and text content do not match as data filtering conditions, adjusting the loss function regarding mismatched data, or modifying related weights. Through this, noisy data can be filtered out more accurately during the learning of image-text relationships, thereby improving the efficiency and accuracy of the training.
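One of the options mentioned in paragraph [0048], adjusting the weight of mismatched data in the loss, can be sketched as below. This is only an illustrative assumption: the disclosure does not specify how a mismatch is detected or how the loss is adjusted, so the per-sample `matches` flags are taken as given and the function name `weighted_loss` and the parameter `mismatch_weight` are hypothetical.

```python
def weighted_loss(losses, matches, mismatch_weight=0.0):
    """Average per-sample losses, down-weighting category/text mismatches.

    losses:  per-sample loss values
    matches: True where the image category agrees with the text content
    mismatch_weight: 0.0 filters mismatched samples out entirely;
                     values in (0, 1) only reduce their influence.
    """
    weights = [1.0 if ok else mismatch_weight for ok in matches]
    total = sum(weights)
    if total == 0:
        return 0.0  # nothing usable in this batch
    return sum(w * l for w, l in zip(weights, losses)) / total
```

With `mismatch_weight=0.0` the function behaves as a filter; a small positive value keeps noisy pairs in the batch but limits their contribution.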
[0049] Each category and its relevance to the training of artificial intelligence models are described below.
[0050] 1) Charts / Graphs
[0051] This category relates to images of charts and graphs, including dot graphs, bar graphs, and line graphs, that visualize quantitative data. These images can be utilized for learning the types and content of the data represented, as well as the relationships and rates of change between data.
[0052] 2) Diagrams
[0053] This relates to images that explain operation or configuration methods by simplifying or symbolically representing systems, processes, or structures. It can be utilized for learning data regarding the configuration and operation methods of specific systems or processes shown in the images.
[0054] 3) Microscopic photographs
[0055] This relates to images captured using devices such as microscopes that reveal details difficult to see with the naked eye. These images and the corresponding text can be utilized to learn relevant knowledge about the microscopic world.
[0056] 4) Macroscopic photographs
[0057] It concerns images of objects or scenes visible to the naked eye. It can be utilized for learning related to knowledge and visual analysis of images through the subjects appearing in the images and the content of the corresponding text.
[0058] 5) Simulation Images
[0059] This relates to images generated to model, predict, or explain theoretical scenarios, processes, or phenomena. It can be utilized to learn knowledge for understanding simulation content or results displayed in the images, and for analyzing their alignment with scenario models or correlations.
[0060] 6) Maps
[0061] This relates to images that visually represent data regarding the geographical area or environment of a specific region. It can be utilized for learning purposes, such as identifying geographical information presented in the images and analyzing their meaning based on context.
[0062] 7) Experimental Images (Experimental Results Visualizations)
[0063] Regarding scientific experiments, this relates to images that visualize the experimental process or its results. For example, it may include images related to X-ray diffraction analysis (XRD), solid-state reaction experiments, gel electrophoresis experiments, polymerase chain reaction (PCR) experiments, etc., and can be utilized for learning purposes, such as analyzing content related to the experimental results shown in the images.
[0064] 8) CIF String (Crystallographic Information File)
[0065] It is a string format used for training language models. For example, it is used to process input at the character level, enabling language models to learn patterns at the character level to improve accuracy, and can be utilized to resolve issues that arise when inputting text data into language models.
[0066] For example, the data preprocessing unit (110) can classify and label image categories for each sub-image when each image contains at least one sub-image. In this case, the image category of the entire image and the image category of the sub-image included therein may not be the same, and accordingly, image category labeling may be performed separately for each.
[0067] For example, if the image category of Figure001, which is the entire image, is classified as Type 2 (diagram) and the image category of Sub001, which is a sub-image of Figure001, is classified as Type 7 (experimental image), then the image category of Figure001 can be labeled as F001_Type02 and the image category of Sub001 can be labeled as F001_S001_Type07.
[0068] The configuration for classifying and labeling each image based on image category classification criteria is explained in more detail below with reference to FIG. 10.
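The labeling scheme in the Figure001 / Sub001 example above can be reproduced with a short helper. This is a sketch under assumptions: the function name `image_label` and the `CATEGORIES` table are hypothetical, and the zero-padding widths are inferred only from the two labels given in paragraph [0067].

```python
# Category numbers follow the order of the list in the description (assumed mapping).
CATEGORIES = {
    1: "chart/graph",
    2: "diagram",
    3: "microscopic photograph",
    4: "macroscopic photograph",
    5: "simulation image",
    6: "map",
    7: "experimental image",
}

def image_label(figure_no, category, sub_no=None):
    """Build a label such as F001_Type02 or F001_S001_Type07."""
    if category not in CATEGORIES:
        raise ValueError(f"unknown image category: {category}")
    if sub_no is None:
        return f"F{figure_no:03d}_Type{category:02d}"
    return f"F{figure_no:03d}_S{sub_no:03d}_Type{category:02d}"
```

Because the whole image and each sub-image are labeled by separate calls, their categories are free to differ, as in the example where a diagram contains an experimental sub-image.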
[0069] For example, the data preprocessing unit (110) can construct an interleaved dataset by arranging at least one sub-image included in the image and a sub-text corresponding to the sub-image so as to intersect.
[0070] For example, the data preprocessing unit (110) may arrange each sub-text corresponding to each sub-image so as to intersect each other when a specific image and a specific text correspond and the image contains at least one sub-image. In this case, the sub-image may be inserted based on the position where each sub-image is first mentioned within each sub-text.
[0071] For example, the data preprocessing unit (110) can identify an image index within the text using a pre-set regular expression matching function. In addition, the location of the identified image index within the text can be determined, and the corresponding image can be inserted based on the determined location.
[0072] For example, the data preprocessing unit (110) can identify sub-image indices within text using a regular expression matching function when at least one sub-image exists in a specific image. In this case, the regular expression matching function can identify image indices and sub-image indices based on different identification criteria.
[0073] The construction of an interleaved dataset by arranging corresponding sub-images and sub-texts so as to intersect each other is explained in more detail below with reference to FIG. 11.
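The regular expression matching of paragraphs [0071] and [0072] might look like the following sketch, in which image indices and sub-image indices are identified by different patterns. The citation conventions ("Fig. 3", "Figure 3", "Fig. 3(a)", "Figure 3b") and the names `MAIN_RE`, `SUB_RE`, and `find_indices` are assumptions for illustration; the disclosure does not state the actual expressions.

```python
import re

# Main image index: a figure number NOT followed by a sub-image marker.
MAIN_RE = re.compile(r"\bFig(?:ure|\.)?\s*(\d+)(?![\d(a-z])")
# Sub-image index: a figure number followed by "(a)" or a bare letter.
SUB_RE = re.compile(r"\bFig(?:ure|\.)?\s*(\d+)(?:\(([a-z])\)|([a-z])\b)")

def find_indices(text):
    """Return the first character offset of each image and sub-image index.

    main: {figure number: offset of first mention}
    sub:  {(figure number, letter): offset of first mention}
    """
    main, sub = {}, {}
    for m in MAIN_RE.finditer(text):
        main.setdefault(int(m.group(1)), m.start())
    for m in SUB_RE.finditer(text):
        letter = m.group(2) or m.group(3)
        sub.setdefault((int(m.group(1)), letter), m.start())
    return main, sub
```

The returned offsets give the positions at which the corresponding image or sub-image could then be inserted.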
[0074] For example, the data preprocessing unit (110) can construct an interleaved dataset based on two or more raw datasets. In this case, different interleaved datasets can be constructed based on each raw dataset, and as a result, there may also be two or more interleaved datasets.
[0075] For example, the data preprocessing unit (110) can construct a first interleaved dataset based on a first raw dataset and construct a second interleaved dataset based on a second raw dataset. The first interleaved dataset and the second interleaved dataset can then be used to perform first learning and second learning, respectively, in the learning unit (120). Through this, by performing learning with different scopes and purposes, learning can be performed that complements aspects that are difficult to achieve with each individual learning alone.
[0076] As a specific example, the raw dataset may include general domain data and scientific paper data. In some cases, the general domain data and scientific paper data may each take the form of independent datasets.
[0077] In this case, the data preprocessing unit (110) can construct a first interleaved dataset based on general domain data and a second interleaved dataset based on scientific paper data.
[0078] In some cases, general domain data can help improve the general reasoning and answer generation capabilities of artificial intelligence models across a wide range of fields, regardless of whether they are academic or non-academic. In contrast, scientific paper data can help improve scientific reasoning and answer generation capabilities by learning more specialized scientific knowledge and utilizing it.
[0079] Furthermore, when general domain data and scientific paper data each contain both image and text data, constructing them as distinct interleaved datasets can enable broader inference and answer generation in learning the relationships between images and text, and can reduce the possibility of errors occurring in the process.
[0080] For example, when interleaved preprocessing and training are performed solely based on scientific paper data, compared to when interleaved preprocessing and training are performed together with general domain data, the accuracy of the AI model's inference and answers may be relatively lower when questions or requests for explanation on prompts include content related to the general domain.
[0081] As another example, when interleaved preprocessing and training are performed solely based on general domain data, compared to when interleaved preprocessing and training are performed together with scientific paper data, the accuracy of the AI model's inference and answers may be relatively lower when questions or requests for explanation include content related to scientific knowledge.
[0082] In contrast, when interleaved preprocessing and learning based on general domain data and interleaved preprocessing and learning based on scientific paper data are performed respectively, more accurate inference and answers can be achieved even if questions or requests for explanation on prompts include content related to both the general domain and scientific knowledge.
[0083] For example, the data preprocessing unit (110) may generate training data necessary for performing training in the learning unit (120) based on an interleaved dataset. In this case, the training data may include image captioning training data and VQA (visual question answering) training data, and such training data may be generated within the artificial intelligence model-based crystal structure information generation device according to the present disclosure. Alternatively, training data may be generated using a separate training data generation device.
[0084] For example, image captioning training data may include specific images and their descriptions contained in an interleaved dataset as ground truth. For instance, if the interleaved dataset is constructed based on scientific paper data, ground truth may be generated based on the body text of the corresponding paper. In some cases, ground truth may be generated based on the abstract of the corresponding paper.
[0085] For example, VQA training data may include questions and answers regarding specific images contained in an interleaved dataset as ground truth. For instance, if the interleaved dataset is constructed based on scientific paper data, the ground truth data may be generated based on the captions for the corresponding images in the papers. In some cases, VQA training data may be generated based on main caption data for a specific main image, or based on sub-caption data for sub-images included within that specific main image.
[0086] And in some cases, the data preprocessing unit (110) can generate a benchmark dataset that can evaluate the inference and answering ability of an artificial intelligence model based on an interleaved dataset. In this case, the benchmark dataset can be generated within a range that includes the content of the interleaved dataset but does not overlap with the content of the training data.
[0087] For example, the benchmark dataset may also be used to evaluate the performance of artificial intelligence models other than the artificial intelligence model included in the artificial intelligence model-based crystal structure information generation device according to the present disclosure.
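The requirement in paragraph [0086] that the benchmark dataset not overlap with the training data can be met, for instance, by a deterministic split on item identifiers. The sketch below is one possible approach and not the method of the disclosure; the function name `split_benchmark`, the identifier format, and the hash-based bucketing are all assumptions.

```python
import hashlib

def split_benchmark(item_ids, benchmark_fraction=0.1):
    """Split item identifiers into disjoint training and benchmark lists.

    Each identifier is hashed to a value in [0, 1); items falling below
    `benchmark_fraction` go to the benchmark set. The split is stable
    across runs because it depends only on the identifier.
    """
    train, bench = [], []
    for item_id in item_ids:
        digest = hashlib.sha256(item_id.encode("utf-8")).digest()
        bucket = int.from_bytes(digest[:8], "big") / 2**64
        (bench if bucket < benchmark_fraction else train).append(item_id)
    return train, bench
```

Hashing on a document-level identifier (for example, one identifier per paper) would keep all images and text of a source document on the same side of the split.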
[0088] The learning unit (120) can be connected to at least one of the data preprocessing unit (110) and the information generation unit (130) and can perform a learning operation based on given data. For example, the learning unit (120) can perform a learning operation based on a raw dataset or an interleaved dataset.
[0089] For example, the learning unit (120) can produce new data based on given data, and can re-produce the produced data using other data or preset values. The produced data can then be used for learning.
[0090] For example, the learning unit (120) may perform either supervised learning, which learns using data in which the correct answer is predetermined, or unsupervised learning, which learns to extract meaningful information based on data in which the correct answer is not determined, and in some cases, may perform both types of learning.
[0091] For example, the learning unit (120) includes a visual encoder that generates image feature data from an image, a language model that generates text encoding data from text, and a projector that converts the image feature data and text encoding data into data of the same dimension, and can learn the relationship between the image and the text based on the converted data.
[0092] For example, the learning unit (120) can, after initializing the projector, perform fine-tuning to train the language model and the projector based on an interleaved dataset. For example, fine-tuning may include simultaneously performing condition learning to generate information satisfying specific conditions and filling learning to fill in masked parts. In some cases, the language model and the projector may be trained while the visual encoder remains in a fixed state and does not participate in the training.
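How condition learning and filling learning might be combined in a single fine-tuning batch can be illustrated with the sketch below. It is a hedged example only: the prompt wording, the `[MASK]` token, line-level masking, and the function names are assumptions, and the disclosure does not specify the sample format at this point.

```python
import random

MASK = "[MASK]"  # hypothetical mask token

def make_condition_example(conditions, cif):
    # Condition learning: the input states conditions, the target is the full structure.
    prompt = "Generate a crystal structure with " + ", ".join(
        f"{k}={v}" for k, v in conditions.items())
    return {"task": "condition", "input": prompt, "target": cif}

def make_filling_example(cif, rng):
    # Filling learning: one line is masked and becomes the target (assumed granularity).
    lines = cif.splitlines()
    i = rng.randrange(len(lines))
    masked = lines[:i] + [MASK] + lines[i + 1:]
    return {"task": "filling", "input": "\n".join(masked), "target": lines[i]}

def make_mixed_batch(records, seed=0):
    """Build one batch containing both task types for every record."""
    rng = random.Random(seed)
    batch = []
    for conditions, cif in records:
        batch.append(make_condition_example(conditions, cif))
        batch.append(make_filling_example(cif, rng))
    rng.shuffle(batch)  # mix the two tasks so they are learned together
    return batch
```

Mixing both sample types in the same batch is one simple reading of "simultaneously"; alternating batches or a weighted sum of two losses would be other possibilities.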
[0093] For example, a visual encoder can generate image feature data based on an input image. In this case, the image may be an image included in at least one of a raw dataset or an interleaved dataset.
[0094] For example, a visual encoder can extract image features including colors, textures, boundaries, objects, and backgrounds identified from an image, and convert the extracted features into numerical vectors.
[0095] For example, a visual encoder may include at least one of an encoder that extracts image features based on a Transformer, an encoder that extracts image features based on a Convolutional Neural Network (CNN), and an encoder that extracts image features based on a Graph Neural Network (GNN). In some cases, the visual encoder may include an encoder based on two or more artificial intelligence computation techniques.
[0096] For example, the visual encoder may include an encoder such as ConViT or CLIP (Contrastive Language-Image Pre-training), in which both CNN and Transformer technologies can be utilized. In some cases, the visual encoder of a specific CLIP model, such as the pre-trained CLIP ViT-L/14-336, may be primarily used.
[0097] For example, a visual encoder is not limited to a configuration that performs image encoding, but may also include a configuration that performs text encoding in some cases. In this case, image encoding and text encoding may each be performed by different encoders.
[0098] For example, a visual encoder is not limited to a configuration that performs data encoding, but may also include a configuration that learns input data in some cases. Additionally, a pre-trained visual encoder, in which such learning has already been performed in advance, can be used.
[0099] For example, a language model can generate text-encoded data based on input text. In this case, the text may be text included in at least one of a raw dataset or an interleaved dataset.
[0100] For example, a language model can perform a tokenization operation that separates input text into tokens, an embedding operation that converts each token into a numerical vector, and an encoding operation that concatenates the embedded data to convert them into a vector sequence.
[0101] For example, a language model can learn text associated with an image by referring to image feature data generated by a visual encoder. In this case, the image feature data and text encoding data are converted into data of the same dimension by a projector, and then used to learn the relationship between the image and the text based on the converted data.
[0102] For example, a projector can convert image feature data and text encoding data into a shareable data format. In this case, both image feature data and text encoding data may be converted, or only the image feature data may be converted.
[0103] For example, a projector can establish a common low-dimensional space where both images and text can be projected, and then transform image feature data and text encoding data, respectively, based on this space. As another example, the projector can transform image feature data based on the dimensions of the text encoding data.
[0104] For example, the projector may include a deep learning-based projector. For example, it may include a multi-layer perceptron (MLP) projector based on multiple hidden layers.
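The role of an MLP projector, mapping image feature vectors to the dimension of the text encoding data, can be shown with a dependency-free sketch. The layer sizes, the ReLU activation, the uniform initialization, and the names `make_mlp` and `project` are assumptions for illustration; a practical system would use a deep learning framework and much larger dimensions.

```python
import random

def make_mlp(dims, seed=0):
    """Initialize an MLP given layer sizes, e.g. [image_dim, hidden, hidden, text_dim]."""
    rng = random.Random(seed)
    layers = []
    for n_in, n_out in zip(dims, dims[1:]):
        weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
        layers.append((weights, [0.0] * n_out))
    return layers

def project(layers, feature):
    """Map one image feature vector into the text embedding dimension."""
    x = list(feature)
    for i, (weights, bias) in enumerate(layers):
        x = [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]
        if i < len(layers) - 1:
            x = [max(0.0, v) for v in x]  # ReLU on hidden layers only (assumed activation)
    return x
```

After projection, an image feature has the same length as a text token embedding, so the two can be placed in one sequence for the language model. Calling `make_mlp` again corresponds to re-initializing the projector before fine-tuning.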
[0105] For example, the learning unit (120) may include a configuration that performs specific actions only on some of the visual encoder, language model, and projector. For example, it may include a configuration that performs an initialization action only on the projector, or a configuration that performs a learning action only on the language model and projector while keeping the visual encoder fixed and not learning.
[0106] For example, the learning unit (120) can perform crystal structure information generation learning, in which the language model learns to generate crystal structure information in the form of a CIF string containing unit cell information and atomic information.
[0107] For example, the learning unit (120) can perform crystal structure information generation learning with at least one of a visual encoder, a language model, and a projector having been pre-trained. For example, the learning unit (120) can perform crystal structure information generation learning with the language model and the projector having been pre-trained based on at least one of general domain data and scientific paper data.
[0108] For example, the learning unit (120) can perform a first learning based on a first interleaved dataset built based on general domain data among raw datasets, and can perform a second learning based on a second interleaved dataset built based on scientific paper data. And in some cases, the learning unit (120) can perform a learning for generating crystal structure information after performing the first learning and the second learning.
[0109] For example, the learning unit (120) can perform crystal structure information generation learning to generate crystal structure information based on input data. In this case, the crystal structure information can be generated in a form that includes the structure of the crystal and location information of the included elements.
[0110] For example, crystal structure information can be generated in the form of a CIF string, which is a text-based representation of the crystal structure. For example, the crystal structure information may include the lateral lengths (l_1, l_2, l_3) and angles (θ_1, θ_2, θ_3) of a unit cell, together with the element symbol e_i and coordinate information (x_i, y_i, z_i) of each of the N atoms contained in the unit cell.
[0111] For example, the learning unit (120) can perform learning to generate crystal structure information based on input data including at least one of an image and text. In this case, the input data may include data in the form of an image or text containing information of the crystal structure to be generated.
[0112] For example, the learning unit (120) may simultaneously perform condition learning to generate crystal structure information regarding a crystal structure satisfying input crystal structure conditions, and filling learning to generate crystal structure information by filling in masked parts of an input crystal structure string. In this case, the crystal structure conditions may include at least one of information regarding various conditions of a crystal structure to be generated, such as lateral length and angle information of a unit cell, atomic number and coordinate information of an atom included in the unit cell, chemical formula of the crystal structure, composition conditions, space group, band gap, and energy above hull, and the crystal structure string may include information regarding lateral length and angle of a unit cell and element symbol and coordinate of an atom included in the unit cell, and some of the information may be provided in a masked form with some of the information remaining undetermined.
[0113] For example, the learning unit (120) may perform an evaluation operation that evaluates the crystal structure generation information generated according to the crystal structure information generation learning, and a feedback operation that feeds back the evaluation result to be reflected in subsequent learning. In this case, the evaluation operation may include at least one of an unconditional generation evaluation, a conditional generation evaluation, and a fill generation evaluation.
[0114] For example, the learning unit (120) may perform an unconditional generation evaluation, which evaluates the results of learning to generate crystal structure information without a crystal structure condition to be generated; a conditional generation evaluation, which evaluates the results of learning to generate crystal structure information based on a crystal structure condition to be generated; and a fill generation evaluation, which evaluates the results of learning to generate crystal structure information by filling in the masked parts of a crystal structure string.
[0115] For example, the learning unit (120) may perform an evaluation operation on the crystal structure generation information based on pre-set evaluation criteria. For example, the evaluation criteria may include evaluating the validity, coverage, property metrics, stability, etc. of each new material based on the crystal structure generation information.
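The disclosure does not define the individual evaluation criteria. As one possible reading of a structural validity check, the sketch below tests that the unit cell has positive lengths, angles between 0 and 180 degrees, a positive volume, and fractional coordinates within [0, 1]. These specific checks are assumptions; only the cell volume formula is standard crystallography.

```python
import math

def cell_volume(a, b, c, alpha, beta, gamma):
    """Unit cell volume from three lengths and three angles (degrees)."""
    ca, cb, cg = (math.cos(math.radians(t)) for t in (alpha, beta, gamma))
    radicand = 1.0 - ca * ca - cb * cb - cg * cg + 2.0 * ca * cb * cg
    if radicand <= 0.0:
        return 0.0  # the angles do not describe a physically possible cell
    return a * b * c * math.sqrt(radicand)

def is_structurally_valid(lengths, angles, frac_coords):
    """Assumed validity criteria; not taken from the disclosure."""
    if any(length <= 0.0 for length in lengths):
        return False
    if any(not 0.0 < angle < 180.0 for angle in angles):
        return False
    if cell_volume(*lengths, *angles) <= 0.0:
        return False
    return all(0.0 <= v <= 1.0 for xyz in frac_coords for v in xyz)

# Values taken from the example output shown for FIG. 5.
lengths, angles = (3.2, 3.2, 5.3), (90.0, 90.0, 120.0)
coords = [(0.50, 0.16, 0.30), (0.50, 0.04, 0.48), (0.00, 0.16, 0.44)]
volume = cell_volume(*lengths, *angles)
print(round(volume, 2), is_structurally_valid(lengths, angles, coords))
```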
[0116] For example, the learning unit (120) can perform image captioning learning to create annotations or descriptions for a given image. In this case, the data provided for image captioning learning may include only image data, or may include both image and text data.
[0117] For example, the learning unit (120) can perform VQA learning to answer visual questions related to a given image.
[0118] For example, Visual Question Answering (VQA) learning may include at least one of single-turn VQA learning, in which each interaction consists of one question and one answer, and multi-turn VQA learning, in which each interaction consists of two or more question-and-answer turns.
[0119] For example, VQA training may include learning to answer visual questions in a free-form manner, or it may include multiple-choice learning to select the option that best describes a given image from four or more pre-set options. In some cases, VQA training may be performed using an interleaved dataset built based on scientific paper data.
[0120] For example, multiple-choice VQA learning may include learning according to "Setting 1," which presents a specific main image, a description of that image (answer key), and descriptions of other main images within the same paper as options.
[0121] As another example, multiple-choice VQA learning may include learning according to "Setting 2," which presents a specific sub-image within a specific main image, and presents a description of that sub-image (answer key) and descriptions of sub-images of other main images within the same paper as options.
[0122] As another example, multiple-choice VQA learning may include learning according to "Setting 3," which presents a specific sub-image within a specific main image, a description of that sub-image (answer key), and descriptions of other sub-images within the same main image as options.
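The three settings differ only in which image is presented and where the distractor descriptions are drawn from. A minimal sketch of that selection logic is given below; the `MainImage` and `SubImage` records, the toy paper, and the default of four options are assumptions introduced for illustration.

```python
import random
from dataclasses import dataclass, field

@dataclass
class SubImage:
    image_id: str
    description: str

@dataclass
class MainImage:
    image_id: str
    description: str
    sub_images: list[SubImage] = field(default_factory=list)

def build_question(paper, setting, main_idx, sub_idx=0, n_options=4, seed=0):
    """Build one multiple-choice item from a paper given as a list of main images."""
    rng = random.Random(seed)
    main = paper[main_idx]
    others = [m for i, m in enumerate(paper) if i != main_idx]
    if setting == 1:    # main image vs. other main images in the same paper
        query, answer = main.image_id, main.description
        pool = [m.description for m in others]
    elif setting == 2:  # sub-image vs. sub-images of other main images
        sub = main.sub_images[sub_idx]
        query, answer = sub.image_id, sub.description
        pool = [s.description for m in others for s in m.sub_images]
    elif setting == 3:  # sub-image vs. other sub-images of the same main image
        sub = main.sub_images[sub_idx]
        query, answer = sub.image_id, sub.description
        pool = [s.description for i, s in enumerate(main.sub_images) if i != sub_idx]
    else:
        raise ValueError("setting must be 1, 2, or 3")
    if len(pool) < n_options - 1:
        raise ValueError("not enough distractor descriptions")
    options = rng.sample(pool, n_options - 1) + [answer]
    rng.shuffle(options)
    return {"image": query, "options": options, "answer_index": options.index(answer)}

# Toy paper: four main images, each with four sub-images.
paper = [
    MainImage(f"fig{i}", f"caption of figure {i}",
              [SubImage(f"fig{i}{p}", f"panel {p} of figure {i}") for p in "abcd"])
    for i in range(1, 5)
]
q1 = build_question(paper, setting=1, main_idx=0)
q2 = build_question(paper, setting=2, main_idx=0, sub_idx=1)
q3 = build_question(paper, setting=3, main_idx=0, sub_idx=1)
```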
[0123] For example, the learning unit (120) can perform 4-bit quantization on at least one of the visual encoder, language model, and projector. In some cases, 4-bit quantization can be performed on all of the visual encoder, language model, and projector. This allows the overall size of the artificial intelligence model to be reduced and the inference and answer speed to be increased.
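To illustrate why 4-bit quantization reduces model size, the sketch below maps floating-point weights to signed integers in the range [-7, 7] with a single scale factor and then reconstructs approximate weights. This simple absolute-maximum scheme is an assumption for illustration; the disclosure does not specify the quantization scheme, and practical schemes typically work block-wise with other code books.

```python
def quantize_4bit(weights):
    """Quantize floats to signed 4-bit codes in [-7, 7] using one shared scale."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 7.0 if max_abs > 0.0 else 1.0
    codes = [max(-7, min(7, round(w / scale))) for w in weights]
    return codes, scale

def dequantize_4bit(codes, scale):
    """Reconstruct approximate floats from the 4-bit codes."""
    return [code * scale for code in codes]

weights = [0.875, -0.5, 0.125, 0.0, 0.3]
codes, scale = quantize_4bit(weights)
restored = dequantize_4bit(codes, scale)
print(codes, scale, restored)
```

Each weight is stored as a 4-bit code plus a shared scale, and the reconstruction error per weight is bounded by half the scale.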
[0124] For example, the learning unit (120) may perform LoRA fine-tuning on at least one of a visual encoder, a language model, and a projector. Here, LoRA fine-tuning includes adding a low-rank adapter and fine-tuning by additionally training only the added low-rank adapter. In some cases, LoRA fine-tuning may be performed on all of the visual encoder, the language model, and the projector. Through this, customized fine-tuning for the field of generating crystal structure information is possible using relatively small amounts of data, preventing overfitting in the training of artificial intelligence models, and improving data adaptability.
[0125] For example, the learning unit (120) can perform a learning operation based on preset hyperparameters. For example, the learning unit (120) can perform a learning operation with a batch size of 1 for 1 epoch, and when fine-tuning using a low-rank adapter is also performed, the low-rank adapter rank can be set to 8 and the low-rank adapter alpha to 32.
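With a rank of 8 and an alpha of 32, the usual LoRA formulation scales the low-rank update by alpha / rank = 4. The sketch below applies that formulation to a single frozen weight matrix. The 64 x 64 layer size and the initialization values are assumptions for illustration; only the rank and alpha come from the text.

```python
import random

RANK, ALPHA = 8, 32    # values stated in the text
SCALE = ALPHA / RANK   # standard LoRA scaling factor

def matvec(matrix, vector):
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]

class LoRALinear:
    """Frozen weight W plus a trainable low-rank update SCALE * B @ A."""

    def __init__(self, weight, rank=RANK, alpha=ALPHA, seed=0):
        rng = random.Random(seed)
        d_out, d_in = len(weight), len(weight[0])
        self.weight = weight                          # kept fixed during fine-tuning
        self.a = [[rng.gauss(0.0, 0.02) for _ in range(d_in)] for _ in range(rank)]
        self.b = [[0.0] * rank for _ in range(d_out)]  # zero init: no change at start
        self.scale = alpha / rank

    def __call__(self, x):
        base = matvec(self.weight, x)
        update = matvec(self.b, matvec(self.a, x))
        return [y + self.scale * u for y, u in zip(base, update)]

    def num_trainable(self):
        return len(self.a) * len(self.a[0]) + len(self.b) * len(self.b[0])

rng = random.Random(1)
weight = [[rng.uniform(-1.0, 1.0) for _ in range(64)] for _ in range(64)]
layer = LoRALinear(weight)
x = [1.0] * 64
print(SCALE, layer.num_trainable(), 64 * 64)  # adapter parameters vs. full matrix
```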
[0126] In summary, when a raw dataset containing both images and text is collected as disclosed herein and constructed into an interleaved dataset, and then the relationship between images and text is learned using a visual encoder, a language model, and a projector, the artificial intelligence model can acquire the ability to read the text included in the training dataset more effectively and, in addition, interpret images more accurately.
[0127] Accordingly, when using an artificial intelligence model according to the present disclosure, performance can be improved in terms of the validity and stability of the generated crystal structure information, compared to a configuration that generates a crystal structure using an existing text-based language model.
[0128] In addition, when condition learning and filling learning are performed simultaneously according to the present disclosure, the noise of the learning unit (120) can be reduced compared to the case where condition learning and filling learning are considered separately, and performance and efficiency can be improved in terms of diversity and flexibility of crystal structure information.
[0129] Hereinafter, one of the embodiments in which the learning unit (120) performs a learning operation is described as an example.
[0130] For example, the learning unit (120) may utilize a large-scale language and vision model to inject knowledge based on a multimodal dataset containing both images and text into the language model. For example, the learning may be performed using the LLaVA (Large Language and Vision Assistant) architecture.
[0131] For example, the learning unit (120) can perform the learning operation based on the LLaVA architecture, using CLIP ViT-L/14-336 as the visual encoder, LLaMA2 as the language model, and a two-layer MLP projector.
[0132] For example, the learning unit (120) may include a configuration that randomly initializes only the MLP projector while leaving the language model and visual encoder as they are. In this case, the initialization of the MLP projector may include initializing the weights between the layers included in the MLP projector to random values.
[0133] Then, the learning unit (120) may perform learning only on LLaMA2 and the MLP projector, without performing additional learning on CLIP ViT-L/14-336. In this case, the learning operation may be performed using both the first interleaved dataset built based on general domain data and the second interleaved dataset built based on scientific paper data through the data preprocessing unit (110).
[0134] For example, the learning unit (120) can perform a first learning on LLaMA2 and the MLP projector using the first interleaved dataset based on general domain data as learning data, and perform a second learning on LLaMA2 and the MLP projector using the second interleaved dataset based on scientific paper data as learning data.
[0135] The information generation unit (130) can be connected to at least one of the data preprocessing unit (110) and the learning unit (120), and can generate at least one piece of crystal structure information based on prompt input.
[0136] For example, the information generation unit (130) can generate crystal structure information regarding a crystal structure that satisfies the conditions included in the crystal structure condition information. In this case, the prompt is a condition prompt, and the crystal structure condition information may include information regarding conditions regarding the structure or properties of the crystal to be generated.
[0137] For example, if the information generation unit (130) determines that there are two or more crystal structures satisfying the conditions included in the crystal structure condition information, it can generate crystal structure information such that all of the information regarding the two or more crystal structures is included.
[0138] For example, a condition prompt can be used to input a crystal structure condition to be generated, and crystal structure information can be generated and output based on the input condition. For example, the crystal structure condition information may include information on at least one of the chemical formula, composition condition, space group, band gap, and energy above hull of the crystal structure.
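A condition prompt of this kind could be assembled from whichever conditions are supplied, as in the following sketch. The prompt wording, the function name, and the units (eV and eV/atom) are hypothetical; the disclosure does not fix a prompt template.

```python
def build_condition_prompt(formula=None, space_group=None,
                           band_gap_ev=None, e_above_hull_ev=None):
    """Assemble a text prompt from the crystal structure conditions that are given."""
    conditions = []
    if formula is not None:
        conditions.append(f"chemical formula {formula}")
    if space_group is not None:
        conditions.append(f"space group {space_group}")
    if band_gap_ev is not None:
        conditions.append(f"band gap {band_gap_ev} eV")
    if e_above_hull_ev is not None:
        conditions.append(f"energy above hull {e_above_hull_ev} eV/atom")
    if not conditions:
        # No condition given: corresponds to unconditional generation.
        return "Generate a crystal structure as a CIF string."
    return ("Generate a crystal structure as a CIF string satisfying: "
            + "; ".join(conditions) + ".")

prompt = build_condition_prompt(formula="NaCl", space_group="Fm-3m")
print(prompt)
```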
[0139] For example, the information generation unit (130) can generate the crystal structure information by filling in the masked parts of input crystal structure information. In this case, the prompt may be a fill prompt, through which crystal structure information expressed as a crystal structure string with a part masked can be input.
[0140] For example, if the information generation unit (130) determines that there are two or more crystal structures that can be generated by filling in the masked parts of the crystal structure information, it can generate crystal structure information such that all of the information regarding the two or more crystal structures is included.
[0141] Details regarding the generation of such fill prompts and the crystal structure information based thereon are explained in more detail below with reference to FIG. 7.
[0142] For example, the crystal structure information may include information regarding a crystal structure satisfying the conditions included in the crystal structure condition information, expressed in the form of a CIF string. For example, the crystal structure information may include information regarding the length and angle of lattice vectors, and the element type and coordinates of each atom within the lattice. Details regarding the generation of such condition prompts and crystal structure information based thereon are explained in more detail below in FIG. 6.
[0143] FIG. 2 is a graph for exemplarily illustrating dataset collection according to one embodiment.
[0144] Referring to FIG. 2, a raw dataset according to one embodiment may be collected to include papers from various scientific fields. In the illustrated graph, the Y-axis represents the number of papers by field (210) and the X-axis represents each sub-field (220); the graph shows the number of papers in the top 30 of the 72 fields included in the raw dataset. In some cases, the collection of such a raw dataset may be carried out through a separate data collection unit.
[0145] For example, the raw dataset can be collected from papers published in Nature Communications, which publishes only peer-reviewed papers. According to the graph in FIG. 2, the largest number of papers were collected from fields directly related to the creation of new materials, such as materials science and chemistry.
[0146] Generally, there are cases where the knowledge contained in scientific papers cannot be definitively categorized into a single field, and since learning indirectly related knowledge together can enable more stable and comprehensive learning in the training of artificial intelligence models, papers in scientific fields not directly related to the creation of new materials can also assist in the training of artificial intelligence models according to the present disclosure.
[0147] For example, raw datasets can be collected from literature containing both text and image data. For instance, raw datasets can be collected from papers in scientific fields directly or indirectly related to crystal structure generation that include various forms of images related to the content, such as graphs, diagrams, microscopic images, simulation images, and experimental results.
[0148] In the field of crystal structure generation, a significant portion of related knowledge, such as the structural and compositional characteristics of materials and the principles of material generation, involves three-dimensional information and may include a significant amount of specialized knowledge that is difficult for the average person to easily understand.
[0149] Considering these points, in order to enhance the expertise of AI models and generate crystal structures more stably and accurately when generating crystal structure information using AI models, it is necessary to construct training data in the form of multimodal data combining text and images in terms of data format, and to collect literature containing high-quality expertise whenever possible in terms of data content.
[0150] For example, raw datasets may be collected using pre-prepared internal data, or using publicly available or commercial data. In some cases, they may also be collected through web crawling based on pre-configured collection targets and scopes.
[0151] For example, a raw dataset can be collected by performing web crawling based on a pre-set collection range to collect papers related to a certain range of scientific fields.
[0152] For example, raw datasets can be collected by limiting the scope to peer-reviewed papers. To this end, the collection can be restricted to papers published on websites that exclusively handle peer-reviewed papers or allow searching for such papers. For instance, the collection can be limited to sites like Nature Communications, which deals exclusively with peer-reviewed papers, or Google Scholar, which enables searching for papers with a "peer-reviewed" option. Since the expertise and accuracy of the knowledge presented in peer-reviewed papers are guaranteed, they can be helpful in training artificial intelligence models more professionally.
[0153] FIG. 3 is a diagram illustrating, in an exemplary manner, the content of performing learning based on an interleaved dataset according to one embodiment.
[0154]
[0155] Referring to FIG. 3, the learning unit (300) may include a visual encoder (310) that generates image feature data from an image, a language model (320) that generates text encoding data from text, and a projector (330) that converts the image feature data and text encoding data into data that can be shared.
[0156] For example, a visual encoder (310) can generate image feature data based on an input image. In this case, the image may be an image included in at least one of a raw dataset or an interleaved dataset. In some cases, a visual encoder that processes video images as well as still images may be used. And, the image feature data may include data expressed in a vector form that is relatively higher in dimension than text encoding data.
[0157] For example, a visual encoder (310) can extract features of an image, including colors, textures, boundaries, objects, and backgrounds identified from the image, and convert the extracted features into a vector in the form of numbers.
[0158] In some cases, the visual encoder (310) can extract image features by distinguishing between low-level features such as color, texture, edge, and gradient, mid-level features such as partial objects, geometric shapes, and positional relationships between partial objects, and high-level features such as whole objects, scenes, and semantic features.
[0159] For example, the visual encoder (310) may include an encoder that extracts image features based on a Transformer. For example, it may include a visual encoder such as a Vision Transformer (ViT), a Swin Transformer, etc.
[0160] As another example, the visual encoder (310) may include an encoder that extracts image features based on a convolutional neural network (CNN). For example, it may include a visual encoder such as AlexNet, VGG, ResNet, Inception, EfficientNet, etc.
[0161] As another example, the visual encoder (310) may include an encoder that extracts image features based on a graph neural network (GNN). For example, it may include a visual encoder such as a Graph Convolutional Network (GCN), a Graph Attention Network (GAT), or a Graph Recurrent Network (GRN).
[0162] As another example, the visual encoder (310) may include an encoder based on two or more artificial intelligence computational technologies. For example, it may include a visual encoder such as ConViT or CLIP (Contrastive Language-Image Pre-training), in which both CNN and Transformer technologies can be utilized. In some cases, a specific pre-trained model, such as CLIP ViT-L/14-336, may be primarily utilized.
[0163] For example, the visual encoder (310) is not limited to a configuration that performs image encoding, but may also include a configuration that performs text encoding depending on the case. In this case, image encoding and text encoding may each be performed by different encoders.
[0164] For example, the visual encoder (310) is not limited to a configuration that performs data encoding, but may also include a configuration that learns input data in some cases. And a pre-trained visual encoder in which such learning has already been performed in advance may be used.
[0165] For example, a visual encoder (310) can generate image feature data by performing image encoding for each image based on two or more interleaved datasets. Then, learning can be performed by evaluating the image encoding results and modifying or supplementing related variables or weights.
[0166] For example, the visual encoder (310) can perform a first learning using a first interleaved dataset built based on general domain data and a second learning using a second interleaved dataset built based on scientific paper data.
[0167] For example, the language model (320) can generate text encoding data based on input text. In this case, the text may be text included in at least one of a raw dataset or an interleaved dataset.
[0168] For example, the language model (320) can perform a tokenization operation to separate the input text into tokens. In this case, the tokens can be separated based on whether they are language units having at least one meaning.
[0169] For example, the language model (320) can perform an embedding operation that converts each token into a vector in the form of a number. In this case, the positions of the vectors can be determined such that the distance between tokens with similar meanings is relatively close, and the distance between tokens with different meanings is relatively far.
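The notion that tokens with similar meanings lie close together can be made concrete with a similarity measure between embedding vectors. The sketch below uses cosine similarity on hand-written three-dimensional vectors; both the vectors and the choice of cosine similarity are assumptions for illustration.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

# Hypothetical embeddings: two crystal-system terms and one unrelated word.
embeddings = {
    "cubic": [0.9, 0.1, 0.0],
    "tetragonal": [0.8, 0.2, 0.1],
    "author": [0.0, 0.1, 0.9],
}
near = cosine_similarity(embeddings["cubic"], embeddings["tetragonal"])
far = cosine_similarity(embeddings["cubic"], embeddings["author"])
print(near > far)
```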
[0170] In some cases, the language model (320) may perform embedding operations using known techniques. For example, the language model (320) may perform embedding operations using at least one of the following related techniques: Word2Vec based on surrounding words, GloVe based on the co-occurrence probability of words, FastText based on subwords, BERT based on context judgment, GPT, etc.
[0171] Additionally, the language model (320) may include a language model that is publicly available for free use or is permitted for use by contract. For example, language models such as the GPT series, Gemini, LaMDA, PaLM, T5, LLaMA series, BERT-based RoBERTa, ALBERT, XLNet, ELECTRA, SpanBERT, etc. may be used.
[0172] For example, the language model (320) can perform an encoding operation that connects the embedded data and converts them into a vector sequence. Then, through the vector sequence, it can learn the relationships between words within a sentence, the meaning and structure of the entire sentence, and contextual information regarding the relationship with other sentences.
[0173] In some cases, the language model (320) may perform an encoding operation using known techniques. For example, the language model (320) may perform an encoding operation using at least one of the following techniques: a Recurrent Neural Network (RNN) based on previous word information, a Long Short-Term Memory (LSTM) considering long-term dependencies, or a Transformer based on an attention mechanism.
[0174] For example, a language model (320) can learn text related to an image by referring to image feature data generated by a visual encoder. In this case, the image feature data and text encoding data are converted into data of the same dimension by a projector, and then the language model (320) can use the converted data to learn the relationship between the image and the text.
[0175] For example, the language model (320) may receive text data for performing a learning operation in the form of a file or in the form of a prompt, and depending on the case, both the file input method and the prompt input method may be used. For example, large-capacity data for large-scale learning may be received in the form of a text file, and small-capacity data for fine-tuning may be received in the form of a prompt.
[0176] For example, a language model (320) can generate text encoding data for each text based on two or more interleaved datasets. Then, learning can be performed by evaluating the text encoding results and modifying or supplementing related variables or weights.
[0177] For example, the language model (320) can perform a first learning using a first interleaved dataset built based on general domain data and a second learning using a second interleaved dataset built based on scientific paper data.
[0178] For example, the projector (330) can convert image feature data and text encoding data into data that can be shared. In this case, both image feature data and text encoding data may be converted, or only image feature data may be converted.
[0179] For example, the projector (330) can set up a common low-dimensional space where both images and text can be projected, and then convert image feature data and text encoding data based on this.
[0180] Here, the projector (330) may include any configuration that maps high-dimensional vector information to low-dimensional vector information. Depending on the case, the projector (330) may extract only a portion of the high-dimensional vector information and map it to low-dimensional vector information, or it may be used to determine the relationship between each piece of information by measuring the similarity between the low-dimensional vector information.
[0181] As another example, the projector (330) can convert image feature data based on the dimensions of the text encoding data. In this case, the dimensions of the image feature data can be reduced.
[0182] For example, the projector (330) may include a linear transformation-based projector. For example, it may include a projector based on a model such as data variance-based PCA (Principal Component Analysis), topic distribution-based LDA (Latent Dirichlet Allocation), or variable observation-based FA (Factor Analysis). In this case, the projector (330) may be used to reduce the dimensionality of the given data based on matrix multiplication operations.
[0183] For example, the projector (330) may include a deep learning-based projector. For example, a projector based on a model such as a Multi-Layer Perceptron (MLP) based on multiple hidden layers, an Autoencoder with both an encoder and a decoder, a Variational Autoencoder (VAE) based on latent variable learning, or a Generative Adversarial Network (GAN) based on competitive learning may be used.
[0184] For example, the learning unit (300) may include a configuration that performs specific actions only on some of the visual encoder, language model, and projector. For example, it may include a configuration that performs an initialization action only on the projector, or a configuration that performs a learning action only on the language model and projector while the visual encoder is fixed and not learned.
[0185] For example, the learning unit (300) can perform crystal structure information generation learning with at least one of a visual encoder, a language model, and a projector having been pre-trained. For example, the learning unit (300) can perform crystal structure information generation learning with the language model and the projector having been pre-trained based on at least one of general domain data and scientific paper data.
[0186] For example, the learning unit (300) can perform a first learning based on a first interleaved dataset built based on general domain data among raw datasets, and can perform a second learning based on a second interleaved dataset built based on scientific paper data. And in some cases, the learning unit (300) can perform conditional learning and filling learning simultaneously after performing the first learning and the second learning.
[0187] FIG. 4 is a diagram illustrating, in an exemplary manner, the simultaneous performance of condition learning and filling learning according to one embodiment.
[0188] Referring to FIG. 4, a learning unit (400) according to one embodiment may include a visual encoder (410) that maintains a fixed state and does not participate in learning, a language model (420), and an initialized projector (430).
[0189] For example, the learning unit (400) can simultaneously perform condition learning (440) and filling learning (450) on the language model (420) and the initialized projector (430) using crystal structure tuning data.
[0190] Here, the condition learning (440) can perform learning to generate crystal structure information regarding a crystal structure that satisfies input crystal structure conditions.
[0191] Here, the filling learning (450) can perform learning to generate crystal structure information by filling in the masked parts of the input crystal structure string.
[0192] FIG. 5 is a diagram illustrating, in an exemplary manner, the content of performing crystal structure information generation learning according to one embodiment.
[0193] Referring to FIG. 5, an artificial intelligence model-based crystal structure information generation device (100) according to one embodiment can perform crystal structure information generation learning in which a language model learns to generate crystal structure information in the form of a CIF string including unit cell information and atomic information.
[0194]
[0195] For example, the learning for generating crystal structure information may be performed based on input data (510) including at least one of an image and text. In this case, the input data (510) may include at least one of image and text data containing information of the crystal structure to be generated.
[0196] For example, the crystal structure information generation learning may be performed by generating crystal structure information based on input data (510) and learning to output the generated information. In this case, the output information (520) may include crystal structure information regarding the structure of the crystal structure and the position information of the included elements.
[0197] For example, the output information (520) may include crystal structure information generated in the form of a CIF string, which is a text-form representation of the crystal structure. For example, the crystal structure information may include the lateral lengths (l_1, l_2, l_3) and angles (θ_1, θ_2, θ_3) of the unit cell, together with the element symbol e_i and coordinate information (x_i, y_i, z_i) of each of the N atoms included in the unit cell. In this case, the crystal structure information C can be expressed in the following form.
[0198] C = (l_1, l_2, l_3, θ_1, θ_2, θ_3, e_1, x_1, y_1, z_1, ..., e_N, x_N, y_N, z_N)
[0199] Here, "l_1, l_2, l_3" are the lateral lengths of the unit cell, "θ_1, θ_2, θ_3" are the angles of the unit cell, and "e_1, x_1, y_1, z_1, ..., e_N, x_N, y_N, z_N" are the element symbol (e_i) and coordinate information (x_i, y_i, z_i) of each of the N atoms contained in the unit cell.
[0200] For example, the crystal structure information generation learning may include condition learning to generate crystal structure information regarding a crystal structure that satisfies input crystal structure conditions. In this case, the crystal structure conditions may include at least one of the information regarding various conditions of the crystal structure to be generated, such as lateral length and angle information of a unit cell, atomic number and coordinate information of an atom included in the unit cell, chemical formula of the crystal structure, composition conditions, space group, band gap, and energy above hull.
[0201] For example, as illustrated in FIG. 5, an artificial intelligence model-based crystal structure information generation device (100) can receive input data (510) that includes a text part such as "H2O(s)" and an image part representing the crystal structure, and can perform crystal structure information generation learning based on the input data (510). In this case, although H2O(s) is not a crystal structure academically, it can be used as training data for a new material for the artificial intelligence model.
[0202] For example, an artificial intelligence model-based crystal structure information generation device (100) can generate and output information such as output information (520) based on input data (510). In this case, the output information (520) may include information regarding a unit cell of H2O(s).
[0203] For example, among the output information (520), "3.2 3.2 5.3" may include length information of the unit cell, "90 90 120" may include angle information of the unit cell, and "H 0.50 0.16 0.30 H 0.50 0.04 0.48 O 0.00 0.16 0.44" may include the element symbol and coordinate information of each of the atoms H, H, and O included in the unit cell.
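The example output can be read back into its parts with a short parser. The sketch below assumes the three fragments are joined into one whitespace-separated string (six unit cell values, then groups of four tokens per atom); that layout is inferred from the example, and a complete CIF file carries more fields than this. The function name `parse_crystal_string` is hypothetical.

```python
from typing import Dict, List, Tuple


def parse_crystal_string(text: str) -> Dict[str, object]:
    """Split a flat crystal string into lengths, angles, and atoms."""
    tokens = text.split()
    if len(tokens) < 6 or (len(tokens) - 6) % 4 != 0:
        raise ValueError("expected 6 cell values followed by groups of 4 atom tokens")
    lengths = tuple(float(t) for t in tokens[0:3])
    angles = tuple(float(t) for t in tokens[3:6])
    atoms: List[Tuple[str, float, float, float]] = []
    for i in range(6, len(tokens), 4):
        element = tokens[i]
        x, y, z = (float(t) for t in tokens[i + 1:i + 4])
        atoms.append((element, x, y, z))
    return {"lengths": lengths, "angles": angles, "atoms": atoms}


# The example fragments from the output information (520), joined into one string
parsed = parse_crystal_string(
    "3.2 3.2 5.3 90 90 120 "
    "H 0.50 0.16 0.30 H 0.50 0.04 0.48 O 0.00 0.16 0.44"
)
```

A parser of this kind could also serve when comparing generated output with ground truth, since it rejects strings whose token count does not fit the layout.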
[0204] In addition, the artificial intelligence model-based crystal structure information generation device (100) can perform crystal structure information generation learning by comparing the content of such output information (520) with ground truth, evaluating it, and providing feedback when performing supervised learning.
[0205] As another example, the crystal structure information generation learning may include filling learning, in which the model learns to generate crystal structure information by filling in masked parts of an input crystal structure string. In this case, the crystal structure string may include information regarding the lateral lengths and angles of a unit cell and the element symbols and coordinates of the atoms contained in the unit cell, some of which may be provided in a masked, undetermined state.
[0206] FIG. 6 is a diagram illustrating, in an exemplary manner, a configuration for generating crystal structure information based on a condition prompt according to one embodiment.
[0207] Referring to FIG. 6, an artificial intelligence model-based crystal structure information generation device (100) can generate crystal structure information regarding a crystal structure satisfying an input condition (610) and can output condition output information (620) based on the generated information. In this case, the prompt is a condition prompt (600), and the crystal structure condition information may include information regarding conditions regarding the structure or properties of the crystal structure to be generated.
[0208] For example, the conditions for the crystal structure to be generated can be entered as the input condition (610) of the condition prompt (600), and crystal structure information can be generated based on the input condition (610) and output as condition output information (620).
[0209] For example, crystal structure condition information may include information on at least one of the chemical formula of the crystal structure, composition conditions, space group, band gap, and energy above hull.
[0210] For example, the crystal structure information may include information regarding a crystal structure satisfying the conditions included in the crystal structure condition information, expressed in the form of a CIF string. For example, the crystal structure information may include information regarding the length and angle of lattice vectors, and the element type and coordinates of each atom within the lattice.
[0211] For example, if it is determined that there are two or more crystal structures satisfying the conditions included in the crystal structure condition information, the crystal structure information may be generated to include all information regarding such two or more crystal structures.
[0212] For example, as shown in FIG. 6, when "The chemical is Li2MnO2. The formation energy per atom is -2.0221." is entered into the input condition (610) of the condition prompt (600), crystal structure information can be generated based on the input content, and condition output information (620) can be output based on the generated information.
[0213] For example, among the condition output information (620), "3.2 3.2 5.3" may contain length information of the unit cell, "90 90 120" may contain angle information of the unit cell, and "Li 0.05 0.08 0.30 Li 0.72 0.41 0.57 Mn 0.39 0.75 0.94 O 0.72 0.41 0.18 O 0.05 0.08 0.69" may contain the element symbol and coordinate information of each of the atoms Li, Li, Mn, O, and O included in the unit cell.
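One way to check whether generated output satisfies a chemical formula condition is to compare element counts. The following sketch is an assumption about how such a check could look, not a procedure stated in the present disclosure; it handles only simple formulas without brackets or charges, and it accepts a unit cell that contains a whole number of formula units. The function names are hypothetical.

```python
import re
from collections import Counter
from fractions import Fraction
from typing import Dict


def formula_counts(formula: str) -> Dict[str, int]:
    """Count atoms in a simple formula such as 'Li2MnO2'."""
    counts: Counter = Counter()
    for symbol, number in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        counts[symbol] += int(number) if number else 1
    return dict(counts)


def atom_counts(crystal_string: str) -> Dict[str, int]:
    """Count element symbols in a flat crystal string (6 cell values, then atoms)."""
    tokens = crystal_string.split()
    return dict(Counter(tokens[6::4]))


def matches_formula(formula: str, crystal_string: str) -> bool:
    """True if the unit cell holds a whole number of formula units."""
    want = formula_counts(formula)
    have = atom_counts(crystal_string)
    if set(want) != set(have):
        return False
    ratios = {Fraction(have[e], want[e]) for e in want}
    return len(ratios) == 1 and next(iter(ratios)).denominator == 1


generated = (
    "3.2 3.2 5.3 90 90 120 "
    "Li 0.05 0.08 0.30 Li 0.72 0.41 0.57 Mn 0.39 0.75 0.94 "
    "O 0.72 0.41 0.18 O 0.05 0.08 0.69"
)
ok = matches_formula("Li2MnO2", generated)
```

For the example above, the unit cell contains two Li, one Mn, and two O atoms, which corresponds to exactly one formula unit of Li2MnO2.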
[0214] FIG. 7 is a diagram illustrating, in an exemplary manner, a configuration for generating crystal structure information based on a fill prompt according to one embodiment.
[0215] Referring to FIG. 7, an artificial intelligence model-based crystal structure information generation device (100) can generate crystal structure information by filling in a masked portion of crystal structure information (710) input to a prompt, and can output fill output information (720) based on the generated information. In this case, the prompt is a fill prompt (700), and the crystal structure information (710) may include information expressed in a format where a part of the crystal structure string is masked.
[0216] For example, if it is determined that there are two or more crystal structures that can be generated by filling in the masked parts of the crystal structure information, the crystal structure information may be generated to include all of such two or more crystal structures.
[0217] For example, as illustrated in FIG. 7, if a crystal structure string with parts of it masked, such as "[Crystal string with [MASK]s]", is input as the crystal structure information (710) of the fill prompt (700), crystal structure information can be generated based on the input content, and fill output information (720) can be output based on the generated information.
[0218] In this case, the crystal structure information can be generated in a form such as "[Crystal string with [ Masked element ] ]", in which the masked part of the crystal structure information (710) is filled with specific element information or the like, and the fill output information (720) can be output based on the generated information.
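The masking and filling of a crystal structure string can be illustrated with a short sketch. It assumes whitespace-separated tokens and a literal `[MASK]` placeholder; the masking ratio, the random selection of positions, and the function names are assumptions made for illustration. In the device, the fill values would come from the trained language model, whereas here the hidden tokens themselves stand in for the model's predictions.

```python
import random
from typing import List, Tuple

MASK = "[MASK]"


def mask_crystal_string(text: str, ratio: float, seed: int = 0) -> Tuple[str, List[str]]:
    """Replace a fraction of tokens with [MASK].

    Returns the masked string and the hidden tokens in left-to-right order.
    """
    tokens = text.split()
    rng = random.Random(seed)
    n_mask = min(len(tokens), max(1, round(len(tokens) * ratio)))
    positions = sorted(rng.sample(range(len(tokens)), n_mask))
    targets = [tokens[p] for p in positions]
    for p in positions:
        tokens[p] = MASK
    return " ".join(tokens), targets


def fill_masks(masked: str, fills: List[str]) -> str:
    """Substitute the fills for the [MASK] tokens from left to right."""
    tokens = masked.split()
    if tokens.count(MASK) != len(fills):
        raise ValueError("number of fills must equal number of masks")
    remaining = iter(fills)
    return " ".join(next(remaining) if t == MASK else t for t in tokens)


original = "3.2 3.2 5.3 90 90 120 H 0.50 0.16 0.30 H 0.50 0.04 0.48 O 0.00 0.16 0.44"
masked, targets = mask_crystal_string(original, ratio=0.25, seed=1)
restored = fill_masks(masked, targets)
```

During filling learning, a pair such as (`masked`, `targets`) could serve as one training example, with `masked` as the input and `targets` as the expected completion.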
[0219] FIGS. 8 to 11 are diagrams illustrating, in an exemplary manner, the data preprocessing process of an artificial intelligence model-based crystal structure information generation device according to one embodiment.
[0220] FIG. 8 is a diagram illustrating, in an exemplary manner, the configuration of a raw dataset according to one embodiment.
[0221] Referring to FIG. 8, an artificial intelligence model-based crystal structure information generation device (100) according to one embodiment can collect a raw dataset (800). In some cases, the raw dataset (800) may have text (810) and images (820) collected separately. In such cases, even when collecting the same paper data, the text portion and the image portion of the paper may be collected separately. In this case, as shown in FIG. 8, the raw dataset (800) may be in a form where the text (810) and images (820) are separated.
[0222] For example, in the raw dataset (800), the text (810) may include text 001 (812), text 002 (814), and text 003 (816), which are the parts that first mention each image, and the images (820) may include image 001 (822), image 002 (824), and image 003 (826).
[0223] For example, in text (810), a sentence containing text 001 (812) may describe image 001 (822), a sentence containing text 002 (814) may describe image 002 (824), and a sentence containing text 003 (816) may describe image 003 (826).
[0224] For example, if the text (810) is in the form of separate sentences, the image can be inserted based on the position of the sentence where the image is first mentioned. For example, interleaved preprocessing can be performed by inserting the image between the sentence where the image is first mentioned and the preceding sentence.
[0225] An artificial intelligence model-based crystal structure generation device according to one embodiment can insert Image 001 (822) based on the location of Text 001 (812) where Image 001 (822) is first mentioned, insert Image 002 (824) based on the location of Text 002 (814) where Image 002 (824) is first mentioned, and insert Image 003 (826) based on the location of Text 003 (816) where Image 003 (826) is first mentioned. In this way, the raw dataset (800) can be constructed into an interleaved dataset in which images and text are arranged to intersect.
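The insertion rule described above, in which each image is placed between the sentence that first mentions it and the preceding sentence, can be sketched as follows. The sentence list, the mention keys such as "Figure 1", and the image placeholders are hypothetical, and the function name `interleave` is an assumption.

```python
import re
from typing import Dict, List


def interleave(sentences: List[str], images: Dict[str, str]) -> List[str]:
    """Insert each image just before the sentence that first mentions it.

    `images` maps a mention key (e.g. "Figure 1") to an image placeholder.
    """
    out: List[str] = []
    placed = set()
    for sentence in sentences:
        for key, image in images.items():
            # (?!\d) keeps "Figure 1" from matching inside "Figure 12"
            pattern = rf"\b{re.escape(key)}(?!\d)"
            if key not in placed and re.search(pattern, sentence):
                out.append(image)
                placed.add(key)
        out.append(sentence)
    return out


sentences = [
    "Ice was prepared.",
    "Figure 1 shows the lattice.",
    "Figure 2 shows the spectrum, compared with Figure 1.",
]
images = {"Figure 1": "<image001>", "Figure 2": "<image002>"}
interleaved = interleave(sentences, images)
```

Only the first mention triggers an insertion, so the later reference to "Figure 1" in the third sentence does not place the image a second time.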
[0226] FIG. 9 is a diagram illustrating, in an exemplary manner, a configuration for constructing a raw dataset into an interleaved dataset according to one embodiment.
[0227] Referring to FIG. 9, an artificial intelligence model-based crystal structure information generation device (100) according to one embodiment can perform preprocessing to convert a raw dataset into an interleaved dataset (900). This preprocessing can be performed through a data preprocessing unit (110).
[0228] For example, the interleaved dataset (900) can be arranged so that image 001 (912) and text 001 (922) intersect, image 002 (914) and text 002 (924) intersect, and image 003 (916) and text 003 (926) intersect. And such interleaved preprocessing can be performed by the data preprocessing unit (110).
[0229] As such, the preprocessing step of interleaving related text and images in the correct positions can provide several advantages in training language models such as LLMs (Large Language Models) or multimodal models such as LMMs (Large Multimodal Models).
[0230] For example, in the case of language models based on sequence modeling, text and images preprocessed from raw datasets into interleaved datasets can provide more natural sequences, which can help LLMs or LMMs learn not only the individual content of multimodal data but also its context.
[0231] For example, preprocessing the raw dataset into an interleaved dataset (900) can also help in learning contextual relationships between text and images jointly. For example, text 001 (922) located after image 001 (912), text 002 (924) located after image 002 (914), text 003 (926) located after image 003 (916), etc., can help in learning that each likely contains a description of the corresponding image.
[0232] For example, preprocessing the raw dataset into an interleaved dataset (900) can help in learning that if the contents of adjacent or close images, such as image001 (912), image002 (914) and image003 (916), are related to each other, the contents of the corresponding texts, text001 (922), text002 (924) and text003 (926), are also likely to be related to each other.
[0233] For example, when image 001 (912), image 002 (914), and image 003 (916) representing related content and the corresponding text 001 (922), text 002 (924), and text 003 (926) share common content, preprocessing the raw dataset into an interleaved dataset (900) can help in learning the concept that encompasses that common content.
[0234] FIG. 10 is a diagram illustrating, in an exemplary manner, a configuration for categorizing and classifying each image of an interleaved dataset according to one embodiment.
[0235] Referring to FIG. 10, an artificial intelligence model-based crystal structure information generation device (100) according to one embodiment can classify and label the category of each image based on preset image category classification criteria. And this preprocessing can be performed through a data preprocessing unit (110).
[0236] In this case, image category classification for each image may be performed on a raw dataset in which the images and text are separated from each other, or on an interleaved dataset in which corresponding images and text are arranged to intersect. Below, an embodiment for performing image category classification and labeling on an interleaved dataset (1000) is described.
[0237] For example, an artificial intelligence model-based crystal structure information generation device (100) can perform image category classification and image labeling using a pre-trained language model. Alternatively, image category classification and image labeling can be performed using a separate image category classification device.
[0238] For example, the image category classification criteria may include at least one of charts / graphs (Type 1), diagrams (Type 2), microscopic images (Type 3), macroscopic images (Type 4), simulation images (Type 5), map images (Type 6), and experimental images (Type 7).
[0239] For example, an artificial intelligence model-based crystal structure information generation device (100) can classify each image included in an interleaved dataset (1000) according to image category classification criteria and label it according to the classification result.
[0240] For example, image 001 (1012) can be classified as "microscopic image" when classified according to image category classification criteria. Accordingly, image 001 (1012) can be labeled as "Type 3".
[0241] For example, image 002 (1014) can be classified as "graph" when classified according to image category classification criteria. Accordingly, image 002 (1014) can be labeled as "Type 1".
[0242] For example, image 003 (1016) can be classified as an "experimental image" when classified according to image category classification criteria. Accordingly, image 003 (1016) can be labeled as "Type 7".
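The seven categories and their "Type" labels can be written down as a small lookup. In the sketch below, the classification result for each image is supplied directly; the classifier itself (for example, a pre-trained language model) is outside the scope of the example. The enum member names and the function name `label_images` are assumptions.

```python
from enum import Enum
from typing import Dict


class ImageCategory(Enum):
    CHART_GRAPH = 1
    DIAGRAM = 2
    MICROSCOPIC = 3
    MACROSCOPIC = 4
    SIMULATION = 5
    MAP = 6
    EXPERIMENTAL = 7

    @property
    def label(self) -> str:
        """Label text in the 'Type N' form used for the images."""
        return f"Type {self.value}"


def label_images(classified: Dict[str, ImageCategory]) -> Dict[str, str]:
    """Turn per-image classification results into 'Type N' labels."""
    return {image_id: category.label for image_id, category in classified.items()}


labels = label_images({
    "image001": ImageCategory.MICROSCOPIC,
    "image002": ImageCategory.CHART_GRAPH,
    "image003": ImageCategory.EXPERIMENTAL,
})
```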
[0243] FIG. 11 is a diagram illustrating, by way of example, a configuration for constructing an interleaved dataset for sub-images according to one embodiment.
[0244] Referring to FIG. 11, an artificial intelligence model-based crystal structure information generation device (100) according to one embodiment can construct an interleaved dataset by arranging subtexts corresponding to subimages so that they intersect when at least one subimage is included in one image. In this case, each subtext corresponding to each subimage can be arranged so that they intersect each other based on the position where each subimage is first mentioned within each subtext. And this preprocessing can be performed through a data preprocessing unit (110).
[0245] For example, an artificial intelligence model-based crystal structure information generation device (100) can perform interleaved preprocessing on the sub-image and sub-text of an image 001 (1110) included in an interleaved dataset.
[0246] For example, an artificial intelligence model-based crystal structure information generation device (100) can identify an image index within text using a preset regular expression matching function. Additionally, through this, the location of the identified image index within the text can be determined, and the corresponding image can be inserted based on the determined location.
[0247] For example, an artificial intelligence model-based crystal structure information generation device (100) can identify a sub-image index within a text using a regular expression matching function when at least one sub-image exists in a specific image. In this case, the regular expression matching function can identify the image index and the sub-image index based on different identification criteria.
[0248] For example, a regular expression matching function may include a function that reads text and identifies parts that are image indices when they satisfy regular expression matching criteria. For instance, when reading a section in the text of an English paper where "Figure" and "number" appear together, the function may determine that this refers to a specific image inserted in the paper and extract "Figure+number" as the index of that image.
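A regular expression of the kind described can be sketched in a few lines. The pattern below matches only the literal word "Figure" followed by a number; handling of abbreviations such as "Fig." or of other languages would need further patterns and is not covered. The names `FIGURE_INDEX` and `first_mentions` are hypothetical.

```python
import re
from typing import Dict

# "Figure" followed by whitespace and a number, e.g. "Figure 3"
FIGURE_INDEX = re.compile(r"\bFigure\s+(\d+)")


def first_mentions(text: str) -> Dict[int, int]:
    """Map each figure number to the character offset of its first mention."""
    positions: Dict[int, int] = {}
    for match in FIGURE_INDEX.finditer(text):
        positions.setdefault(int(match.group(1)), match.start())
    return positions


text = "Figure 2 shows the cell. See Figure 1 and Figure 2."
mentions = first_mentions(text)
```

The returned offsets give the location of each image index within the text, which is the information needed to decide where the corresponding image is inserted.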
[0249] As another example, a regular expression matching function may be used to extract sub-image indices from a text portion corresponding to a main image. For example, if i) a main image and a text portion corresponding thereto are specified, ii) sub-images of "A", "B", "C", and "D" exist in the main image, iii) there are parts within the text portion where "A", "B", "C", and "D" are each mentioned as a single character, iv) the parts are included in adjacent sentences, and v) they are mentioned in alphabetical order, the regular expression matching function can determine whether all of the above conditions i) through v) are satisfied, and if they are satisfied, extract each "A", "B", "C", and "D" as sub-image indices.
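Conditions iii) through v) can be checked with a short function once the main image, its text portion, and the sub-image labels are known. The sketch below assumes that sub-images are cited in parentheses, such as "(A)", to avoid confusing a label with an ordinary word; that citation style, the sentence splitting rule, and the function name are assumptions rather than details given in the present disclosure.

```python
import re
from typing import Dict, List, Optional


def extract_sub_image_indices(text: str, labels: List[str]) -> Optional[List[str]]:
    """Return the sub-image indices if conditions iii) to v) hold, else None."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    first_seen: Dict[str, int] = {}   # label -> index of the sentence that first cites it
    order: List[str] = []             # labels in order of first mention
    for i, sentence in enumerate(sentences):
        for label in re.findall(r"\(([A-Z])\)", sentence):
            if label in labels and label not in first_seen:
                first_seen[label] = i
                order.append(label)
    if set(first_seen) != set(labels):        # iii) every label is mentioned
        return None
    if order != sorted(labels):               # v) mentioned in alphabetical order
        return None
    hits = sorted(set(first_seen.values()))
    if any(b - a > 1 for a, b in zip(hits, hits[1:])):   # iv) adjacent sentences
        return None
    return order


caption = (
    "Panel (A) shows the lattice. Panel (B) shows the spectrum. "
    "Panels (C) and (D) compare both."
)
indices = extract_sub_image_indices(caption, ["A", "B", "C", "D"])
```

Returning `None` when any condition fails keeps the function from guessing at sub-image indices in text that merely happens to contain isolated letters.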
[0251] For example, in an interleaved dataset, the interleaved dataset image 001 (1100) may include a portion representing content related to image 001 (1110). In this case, image 001 (1110) and text 001 (1120) may include corresponding content.
[0252] For example, image001 (1110) may include sub-images. For example, it may include four sub-images such as image001_subA (1112), image001_subB (1114), image001_subC (1116) and image001_subD (1118).
[0253] For example, text001 (1120) may include subtexts. For example, it may include four subtexts such as text001_subA (1122), text001_subB (1124), text001_subC (1126), and text001_subD (1128).
[0254] For example, each sub-image and sub-text may contain content corresponding to one another. For example, image001_subA (1112) may contain content corresponding to text001_subA (1122), image001_subB (1114) may contain content corresponding to text001_subB (1124), image001_subC (1116) may contain content corresponding to text001_subC (1126), and image001_subD (1118) may contain content corresponding to text001_subD (1128).
[0255] In this case, the artificial intelligence model-based crystal structure information generation device (100) according to one embodiment can extract Text001_SubA (1122) from Text001 (1120) as a sub-image index for Image001_SubA (1112), Text001_SubB (1124) as a sub-image index for Image001_SubB (1114), Text001_SubC (1126) as a sub-image index for Image001_SubC (1116), and Text001_SubD (1128) as a sub-image index for Image001_SubD (1118). Here, the extraction of each sub-image index can be performed using a regular expression matching function.
[0256] Below, a method for generating new material information based on an artificial intelligence model using an artificial intelligence model-based new material information generation device (100) capable of performing all the aforementioned contents in the present disclosure is described. Content that overlaps with the above description may be omitted depending on the case, but all of the following methods may also be applicable.
[0257] FIG. 12 is a flowchart relating to a method for generating new material information based on an artificial intelligence model according to the present disclosure.
[0258] Referring to FIG. 12, a method for generating new material information based on an artificial intelligence model according to one embodiment may include a data preprocessing step (S1210), a learning step (S1220), and an information generation step (S1230).
[0259] For example, a method for generating new material information based on an artificial intelligence model may include: a data preprocessing step (S1210) for constructing an interleaved dataset by arranging text and images so that they intersect when the content of the text and images included in a raw dataset corresponds to each other; a learning step (S1220) that uses a visual encoder that generates image feature data from images, a language model that generates text encoding data from text, and a projector that converts the image feature data and the text encoding data into data of the same dimension, and that learns the relationship between images and text based on the converted data; and an information generation step (S1230) for generating at least one piece of new material information based on a prompt input to the language model.
[0260] In the data preprocessing step (S1210), a new dataset can be constructed by preprocessing the data included in the raw dataset. For example, if the content of text and images included in the raw dataset corresponds to each other, an interleaved dataset can be constructed by arranging the text and images so that they intersect.
[0261] For example, the data preprocessing step (S1210) may include constructing an interleaved dataset by inserting images based on the position where an image is first mentioned in the text.
[0262] For example, the data preprocessing step (S1210) may include labeling images based on preset image category classification criteria. Image category classification and image labeling may be performed using a pre-trained language model.
[0263] For example, in the data preprocessing step (S1210), an image index within the text can be identified using a pre-set regular expression matching function. Additionally, the location of the identified image index within the text can be determined, and the corresponding image can be inserted based on the determined location.
[0264] For example, the data preprocessing step (S1210) may include constructing an interleaved dataset by arranging at least one sub-image included in the image and a sub-text corresponding to the sub-image so as to intersect.
[0265] For example, in the data preprocessing step (S1210), a first interleaved dataset can be constructed based on general domain data, and a second interleaved dataset can be constructed based on scientific paper data.
[0266] For example, in the data preprocessing step (S1210), training data necessary for performing training in the training unit (120) can be generated based on an interleaved dataset. In this case, the training data may include image captioning training data and VQA training data.
[0267] In addition, depending on the case, in the data preprocessing step (S1210), a benchmark dataset can be generated to evaluate the inference and answering capabilities of an artificial intelligence model based on an interleaved dataset. In this case, the benchmark dataset can be generated within a scope that includes the content of the interleaved dataset but does not overlap with the content of the training data.
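The requirement that the benchmark dataset not overlap with the training data can be met by partitioning the interleaved samples once, before any training data is generated. The sketch below assumes each sample has a unique identifier; the split fraction, the seed, and the function name are illustrative assumptions.

```python
import random
from typing import List, Sequence, Tuple


def split_train_benchmark(
    sample_ids: Sequence[str], benchmark_fraction: float, seed: int = 0
) -> Tuple[List[str], List[str]]:
    """Partition sample ids into disjoint training and benchmark sets."""
    ids = list(dict.fromkeys(sample_ids))   # drop duplicates, keep order
    rng = random.Random(seed)
    rng.shuffle(ids)
    n_bench = int(len(ids) * benchmark_fraction)
    return ids[n_bench:], ids[:n_bench]


samples = [f"doc{i:03d}" for i in range(10)]
train_ids, benchmark_ids = split_train_benchmark(samples, benchmark_fraction=0.2)
```

Because both sets are slices of the same shuffled list, every benchmark sample comes from the interleaved dataset and none of them appears in the training set.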
[0268] In the learning step (S1220), using a visual encoder that generates image feature data from an image, a language model that generates text encoding data from text, and a projector that converts the image feature data and the text encoding data into data of the same dimension, the relationship between the image and the text can be learned based on the converted data.
[0269] For example, in the learning step (S1220), after initializing the projector, fine-tuning can be performed to train the language model and the projector based on an interleaved dataset. In some cases, the language model and the projector may be trained while the visual encoder remains in a fixed state and does not participate in the training.
[0270] For example, in the learning step (S1220), crystal structure information generation learning can be performed, in which the language model is trained to generate crystal structure information in the form of a CIF string containing unit cell information and atomic information. In this case, the crystal structure information can be generated in a form that includes the structure of the crystal and location information of the included elements.
[0271] For example, the learning step (S1220) may include performing a first learning step to learn the relationship between images and text regarding content in a general domain based on a first interleaved dataset, and then performing a second learning step to learn the relationship between images and text regarding content in a scientific paper based on a second interleaved dataset.
[0272] For example, in the learning step (S1220), condition learning to generate crystal structure information regarding a crystal structure that satisfies the input crystal structure condition, and filling learning to generate a crystal structure by filling in the masked parts of the input crystal structure string can be performed simultaneously.
[0273] In the information generation step (S1230), at least one crystal structure information can be generated based on prompt input.
[0275] For example, in the information generation step (S1230), crystal structure information regarding a crystal structure satisfying a condition included in the crystal structure condition information can be generated. In this case, the prompt may be a condition prompt in which crystal structure condition information regarding at least one of the chemical formula, composition condition, space group, band gap, and energy above hull of the crystal structure can be input.
[0276] For example, in the information generation step (S1230), new material information can be generated by filling in the masked parts of the crystal structure information. In this case, the prompt may be a fill prompt in which crystal structure information expressed in a format where a part of the crystal structure string is masked can be input.
[0277] FIG. 13 is a flowchart relating to a dataset preprocessing step according to one embodiment.
[0278] Referring to FIG. 13, a data preprocessing step according to one embodiment may include a main image interleave step (S1310), a sub-image interleave step (S1320), and an image category labeling step (S1330).
[0279] The main image interleave step (S1310) may include arranging each main image included in the raw dataset and the corresponding text so as to intersect. In this case, it may include constructing an interleaved dataset by inserting the image based on the position where the main image is first mentioned within the text.
[0280] In the sub-image interleave step (S1320), if each main image contains at least one sub-image, each sub-image and its corresponding sub-text may be arranged to intersect.
[0281] For example, in the sub-image interleave step (S1320), a sub-image index can be extracted for each sub-image using a regular expression matching function, and each sub-image and its corresponding sub-text can be arranged to intersect using the sub-image index.
[0282] In the image category labeling step (S1330), images can be labeled based on pre-set image category classification criteria. In this case, image category classification for each image can be performed on an interleaved dataset in which corresponding images and text are arranged to intersect.
[0283] For example, in the image category labeling step (S1330), image category classification and image labeling can be performed using a pre-trained language model. In addition, the image category classification criteria may include at least one of a chart / graph, a diagram, a microscopic photograph, a macroscopic photograph, a simulation image, a map image, and an experimental image.
[0284] FIG. 14 is a flowchart relating to a learning step according to one embodiment.
[0285] Referring to FIG. 14, a learning step according to one embodiment may include a projector initialization step (S1410), a visual encoder fixing step (S1420), and a multimodal learning step (S1430).
[0286] In the projector initialization step (S1410), the configuration may include randomly initializing only the projector while leaving the language model and visual encoder as they are. In this case, the initialization of the projector may include initializing the weights between the layers included in the projector to random values. This prevents the projector from performing training in an excessively biased state or overfitting during the subsequent training process.
[0287] The visual encoder fixing step (S1420) may include fixing the visual encoder so that no additional learning is performed on the visual encoder. This prevents the visual encoder from being overfitted.
[0288] The multimodal learning step (S1430) may include performing learning based on a multimodal dataset. In this case, since the visual encoder is fixed in the previous step, learning may be performed only on the language model and the projector. In this case, the multimodal dataset may include data in a form that contains both image and text data.
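The order of the three steps can be illustrated without a deep learning framework. In the sketch below, each model part is reduced to a list of weights and a `trainable` flag; the `Component` class, the Gaussian initialization with standard deviation 0.02, and the function name are assumptions made only to show which parts are reset, which are fixed, and which receive updates.

```python
import random
from dataclasses import dataclass
from typing import List


@dataclass
class Component:
    name: str
    weights: List[float]
    trainable: bool = True


def prepare_for_multimodal_learning(
    visual_encoder: Component,
    language_model: Component,
    projector: Component,
    seed: int = 0,
) -> List[Component]:
    """Apply S1410 and S1420, then return the parts that S1430 will train."""
    rng = random.Random(seed)
    # S1410: randomly initialize only the projector weights
    projector.weights = [rng.gauss(0.0, 0.02) for _ in projector.weights]
    # S1420: fix the visual encoder so it receives no further updates
    visual_encoder.trainable = False
    # S1430: learning is performed only on the remaining trainable parts
    return [c for c in (visual_encoder, language_model, projector) if c.trainable]


encoder = Component("visual_encoder", [0.1, 0.2])
lm = Component("language_model", [0.3, 0.4])
proj = Component("projector", [0.0, 0.0, 0.0])
to_train = prepare_for_multimodal_learning(encoder, lm, proj)
```

In an actual framework, the same effect would typically be obtained by excluding the visual encoder's parameters from gradient computation and from the optimizer.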
[0289] For example, in the multimodal learning step (S1430), a learning operation can be performed using both a first multimodal dataset built based on general domain data and a second multimodal dataset built based on scientific paper data. In this case, the first multimodal dataset and the second multimodal dataset can be constructed in the form of interleaved datasets in which the text and images included in each are arranged to intersect each other.
[0290] For example, in the multimodal learning step (S1430), a first learning step may be performed on the language model and the projector using a first multimodal dataset based on general domain data as learning data, and a second learning step may then be performed on the language model and the projector using a second multimodal dataset based on scientific paper data as learning data.
[0291] FIG. 15 is a flowchart relating to a multimodal learning step according to one embodiment.
[0292] Referring to FIG. 15, a multimodal learning step according to one embodiment may include a visual encoding step (S1510), a language model input step (S1520), a projection step (S1530), an image-text relationship learning step (S1540), and a crystal structure information generation learning step (S1550).
[0293] The visual encoding step (S1510) may include training the visual encoder to generate image feature data based on the input image. For example, the visual encoding step (S1510) may include extracting image features, such as color, texture, boundaries, objects, and background, identified from the image, and converting the extracted features into a vector in the form of numbers.
[0294] The language model input step (S1520) may include training the language model to generate text encoding data based on the input text. For example, the language model input step (S1520) may include performing a tokenization operation to separate the input text into tokens, an embedding operation to convert each token into a numerical vector, and an encoding operation to concatenate the embedded data into a vector sequence.
[0295] The projection step (S1530) may include learning the operation of converting image feature data and text encoding data into data of a shareable form. For example, the projection step (S1530) may include establishing a common low-dimensional space where both images and text can be projected, and then converting the image feature data and text encoding data, respectively, based on this space.
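The idea of converting image feature data and text encoding data into a common space can be shown with two small linear maps. This is a sketch under stated assumptions: the dimensions, the random weights, and the use of plain linear maps are illustrative, whereas a trained projector would learn its weights during the multimodal learning step.

```python
import random
from typing import List

Vector = List[float]
Matrix = List[Vector]


def make_projector(in_dim: int, out_dim: int, rng: random.Random) -> Matrix:
    """Randomly initialized linear map with out_dim rows and in_dim columns."""
    return [[rng.gauss(0.0, 0.02) for _ in range(in_dim)] for _ in range(out_dim)]


def project(weights: Matrix, vector: Vector) -> Vector:
    """Apply the linear map to one feature vector."""
    return [sum(w * v for w, v in zip(row, vector)) for row in weights]


rng = random.Random(0)
shared_dim = 2
image_features = [0.3, -1.2, 0.8, 0.5]                # hypothetical 4-dimensional encoder output
text_encoding = [[0.1, 0.0, 0.2], [0.4, 0.1, -0.3]]   # hypothetical 3-dimensional token vectors

image_projector = make_projector(4, shared_dim, rng)
text_projector = make_projector(3, shared_dim, rng)

# After projection, the image and every text token have the same dimension
sequence = [project(image_projector, image_features)]
sequence += [project(text_projector, token) for token in text_encoding]
```

Once both modalities share one dimension, the image vector and the text vectors can be placed in a single sequence, which is what allows the relationship between them to be learned in the next step.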
[0296] The image-text relationship learning step (S1540) may include learning the relationship between the features appearing in the image and the content of the text based on the data converted in the projection step (S1530).
[0297] The crystal structure information generation learning step (S1550) may include learning to generate crystal structure information based on input data using a pre-trained visual encoder, language model, and projector. In this case, the crystal structure information may be generated in the form of a CIF string containing the structure of the crystal and location information of the included elements.
[0298] For example, the crystal structure information generation learning step (S1550) may include simultaneously performing condition learning to generate new material information regarding a crystal structure that satisfies input crystal structure conditions, and filling learning to generate crystal structure information by filling in masked parts in an input crystal structure string.
[0299] As explained above, the present disclosure can provide an artificial intelligence model-based crystal structure information generation device capable of generating crystal structure information.
[0300] The above description is merely an illustrative explanation of the technical concept of the present disclosure, and those skilled in the art to which the present disclosure pertains may make various modifications and variations without departing from the essential features of the technical concept. Furthermore, since these embodiments are intended to explain rather than limit the technical concept of the present disclosure, the scope of the technical concept is not limited by these embodiments. The scope of protection of the present disclosure shall be interpreted according to the claims below, and all technical concepts within an identical or equivalent scope shall be interpreted as being included within the scope of rights of the present disclosure.
[0301]
[0302] CROSS-REFERENCE TO RELATED APPLICATION
[0303] This patent application claims priority pursuant to Section 119(a) of the U.S. Patent Act (35 USC § 119(a)) to Korean Patent Application No. 10-2024-0189829 filed on December 18, 2024, the entire contents of which are incorporated by reference into this patent application. Additionally, this patent application claims priority in countries other than the United States on the same basis, and the entire contents of the priority application are likewise incorporated by reference into this patent application.
Claims
1. An artificial intelligence model-based crystal structure information generation device comprising: a data preprocessing unit that, when the contents of text and images included in a raw dataset correspond to each other, constructs an interleaved dataset by arranging the text and the images so that they alternate; a learning unit that includes a visual encoder that generates image feature data from the image, a language model that generates text encoding data from the text, and a projector that converts the image feature data and the text encoding data into data in a shareable form, that learns the relationship between the image and the text based on the converted data, and that, after initializing the projector, performs fine-tuning to simultaneously train the language model and the projector based on the interleaved dataset so as to perform condition learning that generates information satisfying specific conditions and filling learning that fills in masked parts; and an information generation unit that generates at least one piece of crystal structure information regarding a crystal structure based on a prompt input.
2. The device of claim 1, wherein the condition learning is learning to generate crystal structure information regarding a crystal structure that satisfies input crystal structure conditions.
3. The device of claim 1, wherein the filling learning is learning to generate crystal structure information by filling in masked parts of an input crystal structure string.
4. The device of claim 1, wherein the prompt is a condition prompt into which crystal structure condition information regarding at least one of a chemical formula, a composition condition, a space group, a band gap, and an unstable energy (energy above hull) of the crystal structure can be input, and the information generation unit generates crystal structure information regarding a crystal structure corresponding to the crystal structure condition information.
5. The device of claim 1, wherein the prompt is a fill prompt into which crystal structure information, expressed in a format in which part of a crystal structure string is masked, can be input, and the information generation unit generates crystal structure information by filling in the masked parts of the input crystal structure information.
6. The device of claim 1, wherein the data preprocessing unit constructs the interleaved dataset by inserting the image based on the position at which the image is first mentioned within the text.
7. The device of claim 1, wherein the data preprocessing unit constructs the interleaved dataset by arranging at least one sub-image included in the image and a sub-text corresponding to the sub-image so that they alternate.
8. The device of claim 1, wherein the data preprocessing unit labels the image based on preset image category classification criteria.
9. The device of claim 1, wherein the raw dataset includes general domain data and scientific paper data, and the data preprocessing unit constructs a first interleaved dataset based on the general domain data and constructs a second interleaved dataset based on the scientific paper data.
10. The device of claim 9, wherein the learning unit performs a first learning to learn the relationship between the image and the text regarding content in a general domain based on the first interleaved dataset, and then performs a second learning to learn the relationship between the image and the text regarding the content of a scientific paper based on the second interleaved dataset.
11. The device of claim 1, wherein the learning unit performs crystal structure information generation learning in which the language model is trained to generate the crystal structure information in the form of a CIF string including unit cell information and atomic information.
12. A method for generating crystal structure information based on an artificial intelligence model, the method comprising: a data preprocessing step of, when the contents of text and images included in a raw dataset correspond to each other, constructing an interleaved dataset by arranging the text and the images so that they alternate; a learning step that uses a visual encoder that generates image feature data from the image, a language model that generates text encoding data from the text, and a projector that converts the image feature data and the text encoding data into data in a shareable form, that learns the relationship between the image and the text based on the converted data, and that comprises, after initializing the projector, a fine-tuning step that simultaneously performs, for the language model and the projector and based on the interleaved dataset, a condition learning step for generating information satisfying specific conditions and a filling learning step for filling in masked parts; and an information generation step of generating at least one piece of crystal structure information regarding a crystal structure based on a prompt input.
13. The method of claim 12, wherein the condition learning step performs learning to generate crystal structure information regarding a crystal structure that satisfies input crystal structure conditions.
14. The method of claim 12, wherein the filling learning step performs learning to generate crystal structure information by filling in masked parts of an input crystal structure string.
15. The method of claim 12, wherein the prompt is a condition prompt into which crystal structure condition information regarding at least one of a chemical formula, a composition condition, a space group, a band gap, and an unstable energy (energy above hull) of the crystal structure can be input, and the information generation step generates crystal structure information regarding a crystal structure that satisfies the conditions included in the crystal structure condition information.
16. The method of claim 12, wherein the prompt is a fill prompt into which crystal structure information, expressed in a format in which part of a crystal structure string is masked, can be input, and the information generation step generates crystal structure information by filling in the masked parts of the input crystal structure information.
17. The method of claim 12, wherein the data preprocessing step comprises constructing the interleaved dataset by inserting the image based on the position at which the image is first mentioned within the text.
18. The method of claim 12, wherein the data preprocessing step comprises constructing the interleaved dataset by arranging at least one sub-image included in the image and a sub-text corresponding to the sub-image so that they alternate.
19. The method of claim 12, wherein the data preprocessing step comprises labeling the image based on preset image category classification criteria.
20. The method of claim 12, wherein the raw dataset includes general domain data and scientific paper data, the data preprocessing step constructs a first interleaved dataset based on the general domain data and a second interleaved dataset based on the scientific paper data, and the learning step performs a first learning to learn the relationship between the image and the text regarding content in a general domain based on the first interleaved dataset, and a second learning to learn the relationship between the image and the text regarding content in a scientific paper based on the second interleaved dataset.