Method and device for predicting gastric cancer using optimal combination in cross attention mechanism

The AI-based gastric cancer diagnostic technology optimizes the interaction between biopsy images and patient data using a cross-attention mechanism, addressing subjective biases and enhancing diagnostic accuracy and personalization.

WO2026142413A1 · PCT designated stage · Publication Date: 2026-07-02 · URBAN DATA LAB


Patent Information

Authority / Receiving Office: WO · WO
Patent Type: Applications
Current Assignee / Owner: URBAN DATA LAB
Filing Date: 2025-01-17
Publication Date: 2026-07-02

Application Information

Patent Timeline

17 Jan 2025: Application
02 Jul 2026: Publication (WO2026142413A1)

IPC: G16H50/70; G16H10/20; G16H10/60; G16H30/40; G16H50/20; G16H30/20; G06T7/00; G06T7/11; G06N3/045; G06N3/0985

AI Tagging

Technology Topics

Cancers diagnosis; Heat map




AI Technical Summary

Technical Problem

Conventional gastric cancer diagnosis methods rely heavily on subjective interpretation of endoscopic biopsies and fail to integrate non-medical data like lifestyle habits, leading to inaccurate and non-personalized risk assessments.

Method used

An AI-based diagnostic technology that optimizes the interaction between gastric biopsy tissue images and patient survey data using a cross-attention mechanism, dynamically selecting optimal Query, Key, and Value combinations to enhance diagnostic performance and personalization.

Benefits of technology

Improves the accuracy and reliability of gastric cancer risk assessment by integrating diverse data types, enabling personalized prevention and management strategies through enhanced model interpretability and performance optimization.

✦ Generated by Eureka AI based on patent content.


Patent Text Reader

Abstract

A method performed by a device operated by a processor according to one embodiment may comprise the operations of: acquiring training data including a patient's gastric biopsy tissue image and survey text; extracting image features from the image by using an image encoder; extracting text features from the text by using a text encoder; generating an attention weight-based heatmap for each combination of query, key, and value that is applicable to a cross-attention mechanism based on the image features and the text features, and selecting, as a first combination, the combination whose heatmap shows the highest similarity to a bounding box labeled in the gastric biopsy tissue image; generating combined features on the basis of the first combination by using a cross-attention layer; and training an MLP model in which input and output are set to predict the correct answer class for gastric cancer diagnosis from the combined features.


Description

Gastric cancer prediction method and device using optimal combination of cross-attention mechanisms

[0001] The present invention relates to an artificial intelligence-based data analysis technology, and more specifically, to a technology that improves the prediction accuracy of gastric cancer diagnosis by determining the optimal Q, K, and V combination based on a cross-attention mechanism.

[0002]

[0003] Gastric cancer is one of the cancers with a high mortality rate worldwide, with particularly high incidence and lethality reported in East Asia. Because gastric cancer is difficult to detect in its early stages, it is often diagnosed at an advanced stage, which significantly impacts survival rates. The pathological mechanisms of gastric cancer are complexly influenced by various environmental factors, infections, genetic elements, and lifestyle factors.

[0004] Among these risk factors, infection with Helicobacter pylori (hereinafter referred to as 'H. pylori') is known to be one of the major causes of gastric cancer, and research has revealed the process by which the infection progresses to gastric cancer through precancerous lesions such as atrophic gastritis, intestinal metaplasia, and dysplasia, depending on the stage of infection progression.

[0005] Conventional gastric cancer diagnosis and risk assessment have primarily relied on gastric endoscopic biopsy and pathological analysis. While these methods are effective for evaluating the morphological characteristics of lesions, they carry the potential for subjective elements because the interpretation process heavily depends on the experience and judgment of experts. Furthermore, non-medical data, such as lifestyle habits, are often not sufficiently considered or quantitatively analyzed during the diagnostic process, which can undermine the comprehensiveness and precision of the diagnosis.

[0006] Such existing diagnostic methods sometimes fail to clearly identify precancerous lesions in the early stages of gastric cancer. This stems from the uncertainty of subjective judgment, as well as the lack of tools capable of quantitatively evaluating the severity of H. pylori infection or atrophic changes.

[0007] Furthermore, there is a problem in that patient-specific risk assessment is limited because patients' lifestyle data is not adequately reflected in medical judgments. For example, while lifestyle habits such as smoking, drinking, and dietary habits significantly influence the risk of gastric cancer, existing diagnostic methods fail to structurally integrate these factors.

[0008] Under these circumstances, to enhance the accuracy and reliability of gastric cancer diagnosis and risk assessment, technology is required to integrate and correlate heterogeneous data, such as pathological analysis from endoscopic biopsies and patient lifestyle data. Through this, it will be possible to go beyond merely confirming the presence or absence of lesions to precisely assess the individual patient's risk and propose customized prevention and management strategies.

[0009] Therefore, the present invention aims to propose an AI-based technology capable of more precisely analyzing gastric cancer risk factors and achieving optimal diagnostic and predictive performance by fusing image and text data.

[0010]

[0011] The objective of the present invention is to propose an AI-based diagnostic technology capable of effectively fusing and analyzing gastric biopsy tissue images and patient survey data to improve the accuracy and reliability of gastric cancer risk assessment. Through this, the aim is to establish a technological foundation for precisely evaluating individual patient risk factors and presenting customized prevention and management strategies.

[0012] In particular, the present invention aims to provide a technology that optimizes the interaction between heterogeneous data by applying the cross-attention mechanism of a transformer model and maximizes diagnostic performance through the optimization of Query, Key, Value (Q, K, V) combinations as follows.

[0013] Specifically, the present invention proposes a technique for finding optimal Q, K, and V settings by analyzing the concentration of attention similarity-based heatmaps for each Q, K, and V combination and the correlation with gastric cancer lesion sites. To this end, the present invention utilizes attention similarity-based heatmaps to evaluate how effectively a model focuses on pathological features (e.g., gastric cancer lesions) and derives the Q, K, and V combinations that achieve the most meaningful data interactions. Through this, the interpretability of the model is enhanced, and reliable diagnostic results can be provided.
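The heatmap-based selection described in [0013] can be sketched as follows. The text does not name the similarity measure, so this sketch assumes one plausible choice: the fraction of total attention mass that falls inside the labeled lesion box. The function names `box_attention_score` and `select_best_combination`, the half-open `(row0, col0, row1, col1)` box format, and the toy heatmaps are all hypothetical.

```python
import numpy as np

def box_attention_score(heatmap, box):
    """Fraction of total attention mass that falls inside the bounding box.

    heatmap: 2D array of non-negative attention weights over image patches.
    box: (row0, col0, row1, col1) in half-open patch coordinates (assumed format).
    """
    heatmap = np.asarray(heatmap, dtype=float)
    r0, c0, r1, c1 = box
    total = heatmap.sum()
    if total == 0:
        return 0.0
    return float(heatmap[r0:r1, c0:c1].sum() / total)

def select_best_combination(heatmaps, box):
    """Return the combination whose heatmap scores highest, plus all scores."""
    scores = {name: box_attention_score(h, box) for name, h in heatmaps.items()}
    best = max(scores, key=scores.get)
    return best, scores

# Toy 4x4 patch grid with the lesion labeled in the top-left 2x2 block.
lesion_box = (0, 0, 2, 2)
focused = np.zeros((4, 4))
focused[0:2, 0:2] = 1.0          # all attention on the lesion
diffuse = np.ones((4, 4))        # attention spread uniformly

best, scores = select_best_combination(
    {"Q=text,K=image,V=image": focused, "Q=image,K=text,V=text": diffuse},
    lesion_box,
)
```

Other measures, such as an intersection-over-union between a thresholded heatmap and the box, would fit the same interface.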

[0014] Furthermore, the present invention aims to provide a technology that evaluates the performance of Q, K, and V combinations based on quantitative indicators and selects the optimal combination. To this end, the present invention proposes a method for dynamically selecting the Q, K, and V combination exhibiting the highest performance during the learning process by comparing and analyzing the diagnostic performance of each combination based on performance indicators such as precision, recall, and F1 score. This enhances the efficiency of data learning and supports the automatic derivation of optimal settings without the need to experiment with various combinations.
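As a minimal illustration of the metric-based comparison in [0014], the sketch below computes precision, recall, and F1 score for the predictions obtained under each combination and keeps the one with the highest F1. Binary labels, ranking by F1 alone, and the names `precision_recall_f1` and `best_by_f1` are assumptions made for the example; the text lists these metrics without fixing how they are weighed.

```python
def precision_recall_f1(y_true, y_pred, positive=1):
    """Precision, recall and F1 for one positive class, from paired label lists."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1

def best_by_f1(y_true, predictions):
    """Pick the combination whose predictions give the highest F1 score."""
    f1_scores = {
        name: precision_recall_f1(y_true, y_pred)[2]
        for name, y_pred in predictions.items()
    }
    return max(f1_scores, key=f1_scores.get), f1_scores

# Toy validation labels (1 = gastric cancer) and predictions per combination.
y_true = [1, 0, 1, 1, 0, 0]
predictions = {
    "Q=image,K=text,V=text": [1, 0, 1, 0, 0, 1],
    "Q=text,K=image,V=image": [1, 0, 1, 1, 0, 1],
}
best_combo, f1_scores = best_by_f1(y_true, predictions)
```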

[0015] Furthermore, the present invention aims to provide a technology that groups patients by clustering statistical distributions based on patient survey data and applies optimized Q, K, and V settings to each group. To this end, the present invention divides patients into groups of different characteristics based on lifestyle data (e.g., smoking, drinking, etc.) and realizes personalized risk assessment by applying Q, K, and V combinations specialized for each group. Through this, it is possible to provide differentiated diagnostic results reflecting group characteristics and suggest prevention and management strategies optimized for each individual patient.
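The grouping step in [0015] can be sketched as below. The text does not name a clustering algorithm, so plain k-means stands in here purely as an example. The numeric survey encoding, the seeding of centers from the first k patients, and the `group_to_combination` mapping are all hypothetical.

```python
import numpy as np

def kmeans(points, k, n_iter=10):
    """Plain Lloyd's k-means; the first k points seed the centers (an assumption)."""
    points = np.asarray(points, dtype=float)
    centers = points[:k].copy()
    labels = np.zeros(len(points), dtype=int)
    for _ in range(n_iter):
        # Distance from every patient to every center, shape (n_patients, k).
        dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        for j in range(k):
            members = points[labels == j]
            if len(members):
                centers[j] = members.mean(axis=0)
    return labels, centers

# Hypothetical numeric survey encoding: [cigarettes per day, drinks per week].
survey = [[0, 1], [20, 14], [1, 0], [18, 12], [0, 2], [22, 15]]
labels, centers = kmeans(survey, k=2)

# Hypothetical per-group choice; in the text it would come from per-group evaluation.
group_to_combination = {0: "Q=image,K=text,V=text", 1: "Q=text,K=image,V=image"}
per_patient = [group_to_combination[int(g)] for g in labels]
```

Each patient is then processed with the Q, K, and V assignment attached to that patient's group.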

[0016] Through these solutions, the present invention aims to implement an innovative AI-based diagnostic system that simultaneously achieves precision and personalization in gastric cancer risk assessment and maximizes the effects of data fusion.

[0017] Meanwhile, the technical problems of the present invention are not limited to those mentioned above, and other unmentioned technical problems will be clearly understood by a person skilled in the art from the description below.

[0018]

[0019] A method performed by a device operated by a processor according to one embodiment may include: acquiring training data including a patient’s gastric biopsy tissue image and survey text; generating image features from the image using an image encoder; generating text features from the text using a text encoder; generating an attention weight-based heatmap for each combination of Query, Key, and Value applicable to a cross-attention mechanism based on the image features and the text features, and determining, as a first combination, the combination whose heatmap has the highest similarity to a bounding box labeled on the gastric biopsy tissue image; generating combined features based on the first combination using a cross-attention layer; and training an MLP model with inputs and outputs configured to predict a correct answer class for gastric cancer diagnosis from the combined features.

[0020] A method performed by a device operated by a processor according to one embodiment may include: acquiring training data including a patient’s gastric biopsy tissue image and lifestyle questionnaire text; classifying into a plurality of groups such that each group has a statistical distribution of uniform characteristics based on a statistical distribution of patient characteristics included in the training data; generating image features from the image based on an image encoder; generating text features from the text based on a text encoder; applying different combinations among each combination of Query, Key, and Value applicable to a cross-attention mechanism to each group based on the image features and the text features; generating combination features based on the combination applied to each group using the cross-attention layer; training each test MLP model with input and output set to predict a correct answer class for gastric cancer diagnosis from the combination features for each group; and training a final MLP model by extracting combination features for all training data based on a first combination of the group applied to the training of the test MLP model exhibiting the highest performance among each test MLP model.

[0021] A method performed by a device operated by a processor according to one embodiment may include: acquiring training data including a patient’s gastric biopsy tissue image and lifestyle questionnaire text; classifying into a plurality of groups such that each group has a statistical distribution of different characteristics based on a statistical distribution of patient characteristics included in the training data; generating image features from the image based on an image encoder; generating text features from the text based on a text encoder; determining a first combination with the highest performance for each group among each combination of Query, Key, and Value applicable to a cross-attention mechanism based on the image features and the text features; generating combined features for each group based on the first combination determined for each group using the cross-attention layer; and training an MLP model with inputs and outputs set to predict a correct answer class for gastric cancer diagnosis from the combined features of each group.

[0022]

[0023] The present invention provides an AI-based diagnostic technology for assessing gastric cancer risk, which can precisely evaluate and predict the risk of individual patients by fusing gastric biopsy tissue images and survey data using the cross-attention mechanism of a Transformer model.

[0024] Specifically, the present invention can provide the effect of enhancing the interpretability of a model through a technique that selects the optimal Q, K, V combination by analyzing the correlation between the concentration of attention similarity-based heatmaps for each Q, K, V combination and the gastric cancer lesion site. To this end, the present invention utilizes attention similarity-based heatmaps to evaluate how effectively a Transformer model focuses on pathological features, such as gastric cancer lesions, thereby increasing the reliability of diagnostic results and enabling medical professionals to understand the results more intuitively. Through this, the location and characteristics of the lesion can be identified more accurately, and the potential for early diagnosis and preventive treatment is increased.

[0025] Furthermore, the present invention can provide the effect of maximizing the diagnostic performance of an AI model through a technology that dynamically evaluates Q, K, and V combinations based on performance indicators (precision, recall, F1 score, etc.) and selects the optimal combination. To this end, the present invention generates various Q, K, and V combinations during the training phase and derives the optimal Q, K, and V settings by quantitatively comparing the performance of each combination, thereby simultaneously improving the efficiency of data training and the accuracy of diagnosis. This contributes to continuously improving the model's predictive ability by providing combinations optimized for data characteristics and the training environment in real time.

[0026] Furthermore, the present invention provides a technology that groups patients by clustering statistical distributions based on patient survey data and applies optimized Q, K, and V combinations for each group, thereby enabling personalized diagnosis and risk assessment. To this end, the present invention subdivides patient groups based on lifestyle data such as smoking, drinking, and exercise habits, and provides patient-tailored diagnostic results by applying optimal Q, K, and V settings suited to the characteristics of each group. Through this, personalized prevention and management strategies can be effectively presented, and the quality of medical services can be improved and patient satisfaction increased.

[0027] Thus, the present invention can provide an innovative AI-based diagnostic system that simultaneously realizes precision and personalization in gastric cancer risk assessment through the cross-attention mechanism of a transformer model and optimized Q, K, and V setting technology. This enhances the possibility of early diagnosis and treatment, and demonstrates significant technical effects capable of presenting a new paradigm in medical data analysis.

[0028] Meanwhile, the effects of the present invention are not limited to those mentioned above, and other unmentioned technical effects will be clearly understood by a person skilled in the art from the description below.

[0029]

[0030] FIG. 1 is a configuration diagram of a gastric cancer prediction device according to one embodiment.

[0031] FIG. 2 is a flowchart showing the steps of an operation performed by the gastric cancer prediction device of the present invention according to a first embodiment.

[0032] FIG. 3 is a conceptual diagram showing the overall architecture of the gastric cancer prediction device of the present invention, including an encoder and a neural network model controlled according to the operation of FIG. 2, and the flow of data processed according to the operation of FIG. 2.

[0033] FIG. 4 is a flowchart showing the steps of an operation performed by the gastric cancer prediction device of the present invention according to a second embodiment.

[0034] FIG. 5 is a conceptual diagram showing the overall architecture of the gastric cancer prediction device of the present invention, including an encoder and a neural network model controlled according to the operation of FIG. 4, and the flow of data processed according to the operation of FIG. 4.

[0035] FIG. 6 is a flowchart showing the steps of an operation performed by the gastric cancer prediction device of the present invention according to a third embodiment.

[0036] FIG. 7 is a conceptual diagram showing the overall architecture of the gastric cancer prediction device of the present invention, including an encoder and a neural network model controlled according to the operation of FIG. 6, and the flow of data processed according to the operation of FIG. 6.

[0037]

[0038] Detailed information regarding the purpose, technical configuration, and resulting effects of the present invention will be more clearly understood through the following detailed description based on the drawings attached to the specification of the present invention. An embodiment according to the present invention will be described in detail with reference to the attached drawings.

[0039] The embodiments disclosed herein should not be interpreted or used to limit the scope of the invention. It is obvious to those skilled in the art that the description including the embodiments herein has various applications. Accordingly, any embodiments described in the detailed description of the invention are illustrative for better explaining the invention and are not intended to limit the scope of the invention to the embodiments.

[0040] The functional blocks shown in the drawings and described below are merely examples of possible implementations. In other implementations, other functional blocks may be used without departing from the spirit and scope of the detailed description. Additionally, while one or more functional blocks of the present invention are shown as individual blocks, one or more of the functional blocks of the present invention may be a combination of various hardware and software configurations that perform the same function.

[0041] Furthermore, the expression that it includes certain components is an “open-ended” expression that merely refers to the existence of such components and should not be understood as excluding additional components.

[0042] Furthermore, when it is stated that one component is “connected” or “joined” to another component, it should be understood that while it may be directly connected or joined to that other component, there may also be other components present in between.

[0043] Hereinafter, various embodiments of the present invention are described with reference to the accompanying drawings. However, this is not intended to limit the present invention to specific embodiments and should be understood to include various modifications, equivalents, and / or alternatives of the embodiments of the present invention.

[0044] The present invention proposes a gastric cancer prediction device (100) that implements an AI-based diagnostic technology capable of fusing and analyzing gastric biopsy tissue images and patient survey data to improve the accuracy and reliability of gastric cancer risk assessment.

[0045] Hereinafter, we will examine the configuration of the gastric cancer prediction device (100) of the present invention and the operation of each configuration.

[0046] FIG. 1 is a configuration diagram of a gastric cancer prediction device (100) (hereinafter referred to as 'device (100)') according to one embodiment.

[0047] Referring to FIG. 1, a device (100) according to one embodiment may include a memory (110), a processor (120), an input / output interface (130), and a communication interface (140).

[0048] The memory (110) can store data obtained from an external device or data generated by itself. The memory (110) can store instructions that can perform operations of the processor (120). For example, the memory (110) can store training data including a patient’s gastric biopsy image and survey text appearing in the content of the operation to be described later, an image encoder, a text encoder, a cross-attention layer, an MLP model, etc.

[0049] The processor (120) is a computational device that controls overall operation. The processor (120) can execute instructions stored in memory (110). The operation of the user terminal (10) and the device (100) according to the embodiment of the present document can be understood as an operation performed by the processor (120).

[0050] The input / output interface (130) may include a hardware interface or a software interface for inputting or outputting information.

[0051] The communication interface (140) enables the transmission and reception of information through a communication network. To this end, the communication interface (140) may include a wireless communication module or a wired communication module.

[0052] The device (100) can be implemented in various forms of devices capable of performing calculations through a processor (120) and transmitting and receiving information through a network. For example, it can be implemented in the form of a server, a computer device, a portable communication device, a smartphone, a portable multimedia device, a laptop, a tablet PC, etc., but is not limited to these examples.

[0053] The embodiments performed by the device (100) of the present invention can be broadly divided into three embodiments: a first embodiment, a second embodiment, and a third embodiment.

[0054] The first embodiment is an embodiment that improves the prediction accuracy of gastric cancer diagnosis by determining the optimal Q, K, V combination using a heatmap generated through attention similarity based on a cross-attention mechanism during the analysis of complex data including images and text. The first embodiment is described together with FIGS. 2 and 3.

[0055] The second embodiment is an embodiment that improves the prediction accuracy of gastric cancer diagnosis by applying the optimal Q, K, and V combination used in the cross-attention mechanism based on dynamic performance analysis of a test model during complex data analysis including images and text. The second embodiment is described together with FIGS. 4 and 5.

[0056] The third embodiment is an embodiment that improves the prediction accuracy of gastric cancer diagnosis by clustering training data based on the statistical distribution of patients and applying the optimal combination of Q, K, and V used in the cross-attention mechanism to each group. The third embodiment is described together with FIGS. 6 and 7.

[0057] Meanwhile, it goes without saying that the contents described in each of the first to third embodiments may be applied to other embodiments.

[0058] FIG. 2 is a flowchart showing the steps of an operation performed by the device (100) of the present invention according to a first embodiment. FIG. 3 is a conceptual diagram showing the overall architecture, including an encoder and a neural network model controlled by the device (100) according to the operation of FIG. 2, and the flow of data processed according to the operation of FIG. 2. The operation of the device (100) according to the embodiments of FIG. 2 and FIG. 3 can be understood as an operation performed by the processor (120).

[0059] Meanwhile, the overall architecture illustrated in FIG. 3 represents the configuration of multiple encoders and neural network models used in the first embodiment. That is, the neural network model used in the present invention is not composed of a single model, but is designed as a structure that operates through the interaction of multiple encoders, layers, and neural network models. Some of the encoders, layers, and neural network models included in the overall architecture may be pre-trained models. Additionally, some of the encoders, layers, and neural network models included in the overall architecture may be untrained models. In this case, the untrained models included in the overall architecture may be trained by supervised learning using an end-to-end learning method, and the parameters of the pre-trained models may be fine-tuned through end-to-end learning. End-to-end learning refers to a method of training parameters within each encoder and neural network through a single integrated supervised learning process for all layers and neural network models from the input data to the final output data. The encoders, layers, and neural network models of the entire architecture trained in this manner will be collectively referred to as the 'gastric cancer prediction model'.

[0060] Although the encoder, layer, and neural network model of the first embodiment are described by exemplifying that a pre-trained model or an initial model may be applied to each, the encoder and neural network models appearing in the embodiments of the present invention are not necessarily limited to the pre-trained model or initial model exemplified in the description, and neural networks in various learning states may be applied depending on the designer's choice.

[0061] Meanwhile, the operation of FIG. 2 describes the process of determining the optimal combination of Q, K, and V using a heatmap generated through attention similarity based on a cross-attention mechanism during the process of training a neural network model included in the overall architecture of FIG. 3.

[0062] Meanwhile, each step disclosed in FIGS. 2 and 3 is merely a preferred embodiment for achieving the purpose of the present invention, and some steps may be added or deleted as needed, and any one step may be included in another step. The order of each operation disclosed in FIGS. 2 and 3 is arranged only for convenience of understanding and is not limited to a chronological order, and the order may be changed according to the designer's choice.

[0063] Referring to FIG. 2 and FIG. 3 together, in step S1010, the device (100) can acquire training data including an image of the patient's gastric biopsy tissue and a survey text.

[0064] For example, a gastric biopsy tissue image is a high-resolution digital image obtained by biopsying a patient's gastric tissue, and it may include a bounding box for the region containing the tumor to identify the presence and location of the tumor. Additionally, this bounding box is mapped to a ground truth class representing a gastric cancer diagnosis result, enabling the gastric cancer prediction model to learn accurate diagnostic criteria during the training process.

[0065] For example, the survey text includes question-and-answer data regarding the patient's lifestyle habits (e.g., smoking, drinking, dietary habits, etc.), physical information (e.g., age, body mass index, blood pressure, etc.), and family history (e.g., cancer history of parents or siblings, etc.). This survey text is mapped to gastric cancer diagnostic results corresponding to each patient, enabling the gastric cancer prediction model to learn diagnostic criteria that reflect the patient's environmental and genetic factors during the learning process.

[0066] The device (100) uses this training data to generate combined features that reflect the association between different types of data, such as gastric biopsy tissue images and survey text, according to the operation to be described later, thereby training a gastric cancer prediction model that can simultaneously consider the impact of each different type of data on gastric cancer diagnosis.

[0067] In step S1020, the device (100) can generate image features from the gastric biopsy tissue image using an image encoder.

[0068] For example, the image encoder includes a Vision Transformer model, and the Vision Transformer model may be a model pre-trained to generate image features composed of multidimensional vectors from the gastric biopsy tissue image.

[0069] In this case, the pre-trained Vision Transformer model can be optimized during the pre-training process to effectively extract pathological image feature vectors, such as tumors, from gastric biopsy tissue images using a large-scale pathology image dataset. Furthermore, the pre-trained Vision Transformer model adopts a patch-based processing method capable of precisely analyzing the fine structural patterns of gastric tissue and the location of lesions, thereby enabling the extraction of features that simultaneously reflect global and local information from the image data.
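The patch-based processing mentioned in [0069] can be illustrated with a short sketch of the first step of a Vision Transformer: cutting the image into non-overlapping square patches and flattening each one. The function name `image_to_patches`, the toy 8x8 image, and the patch size are hypothetical, and the encoder's actual settings are not given in the text.

```python
import numpy as np

def image_to_patches(image, patch_size):
    """Split an (H, W, C) image into flattened, non-overlapping square patches.

    Returns an array of shape (num_patches, patch_size * patch_size * C),
    with patches ordered row by row across the image.
    """
    h, w, c = image.shape
    if h % patch_size or w % patch_size:
        raise ValueError("image height and width must be divisible by patch_size")
    gh, gw = h // patch_size, w // patch_size
    patches = image.reshape(gh, patch_size, gw, patch_size, c)
    patches = patches.transpose(0, 2, 1, 3, 4)  # (gh, gw, p, p, c)
    return patches.reshape(gh * gw, patch_size * patch_size * c)

# Toy 8x8 RGB "image" split into four 4x4 patches.
image = np.arange(8 * 8 * 3, dtype=float).reshape(8, 8, 3)
patches = image_to_patches(image, patch_size=4)
```

In a Vision Transformer, each flattened patch is then linearly projected to an embedding and combined with a position embedding before the attention layers. This is how the model keeps track of where each patch sits in the tissue image.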

[0070] In step S1030, the device (100) can generate text features from survey text using a text encoder.

[0071] For example, a text encoder includes a Transformer model, and the Transformer model may be a pre-trained model designed to generate text features consisting of multidimensional vectors from survey text.

[0072] In this case, the pre-trained Transformer model can be optimized during the pre-training process based on a large-scale text dataset to extract text feature vectors that accurately reflect information such as the patient's lifestyle habits, physical characteristics, and family history contained in the survey text. For example, the Transformer model can generate feature vectors by utilizing a self-attention mechanism to learn contextual information from the text data and identifying correlations between important words and sentences within the survey text. For instance, the Transformer model can generate multidimensional vectors by reflecting the interdependencies between items such as the patient's smoking status, exercise frequency, and family history. These text features provide meaningful information for gastric cancer diagnosis and enable the training of a more accurate gastric cancer prediction model through subsequent combination with image features.

[0073] In step S1040, the device (100) can generate each combination of Query (hereinafter referred to as 'Q'), Key (hereinafter referred to as 'K'), and Value (hereinafter referred to as 'V') applicable to a cross-attention mechanism based on the image features generated in S1020 and the text features generated in S1030.

[0074] Here, the cross-attention mechanism is a mechanism that learns the interaction between two different types of datasets (e.g., gastric biopsy images, survey text) and generates combined features by reflecting this interaction. It calculates the similarity between the data through the inner product of Q and K, and generates the final combined features by weighting V with weights normalized by the Softmax function. Here, Q is the central data of the analysis, representing the region of interest and acting as a query for learning the associations between the data. K is the reference data used to measure similarity with Q, and its similarity can be calculated through the inner product of Q and K. V is the data reflecting the interaction results of Q and K and is used as the input for the weighted sum to generate the final combined features.
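
The computation described above can be sketched as follows. This is a minimal illustration, not the implementation of the invention: the function names are hypothetical, the scaling by the square root of the feature dimension is the usual Transformer convention and is an assumption not stated in the description, and the image and text features are assumed to have already been projected to the same dimension (and, where K and V come from different modalities, to the same number of tokens).

```python
import math

def softmax(row):
    """Normalize a list of scores into weights that sum to 1."""
    m = max(row)  # subtract the maximum for numerical stability
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(Q, K, V):
    """Dot-product attention. Q is (n, d), K is (m, d), V is (m, dv)."""
    d = len(Q[0])
    weights = []
    for q in Q:
        # Similarity of one query with every key: inner product,
        # scaled by sqrt(d) (assumed convention).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in K]
        weights.append(softmax(scores))
    # Each output row is the weighted sum of the rows of V.
    combined = [
        [sum(w * v[j] for w, v in zip(row, V)) for j in range(len(V[0]))]
        for row in weights
    ]
    return combined, weights

# Hypothetical toy features: one query token, two key/value tokens.
Q = [[1.0, 0.0]]
K = [[1.0, 0.0], [0.0, 1.0]]
V = [[1.0, 0.0], [0.0, 1.0]]
combined, weights = cross_attention(Q, K, V)
```

The returned `weights` are the attention weights that the later heatmap analysis visualizes, and `combined` corresponds to the combined features.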

[0075] The cross-attention mechanism can be implemented through the cross-attention layer included in the Transformer model. A cross-attention layer is a layer that implements the cross-attention mechanism; it receives feature vectors set as Q, K, and V, respectively, calculates attention weights, and generates combined features. The cross-attention layer is a key component that selectively focuses on and integrates important information from different datasets. For example, if a specific keyword in text data has a high correlation with a specific region in image data, the cross-attention layer reflects this correlation in the combined features.

[0076] Accordingly, the device (100) can generate six combinations of Query, Key, and Value composed of either an image feature or a text feature, such as "Q: image vector, K: image vector, V: text vector", "Q: image vector, K: text vector, V: image vector", "Q: text vector, K: image vector, V: image vector", "Q: text vector, K: text vector, V: image vector", "Q: text vector, K: image vector, V: text vector", "Q: image vector, K: text vector, V: text vector".
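
As an illustrative check of the count, the six combinations can be enumerated by taking every assignment of the two modalities to Q, K, and V and discarding the two assignments that use a single modality. The function name and the dictionary representation below are hypothetical.

```python
from itertools import product

MODALITIES = ("image", "text")

def qkv_combinations():
    """Return every (Q, K, V) assignment that uses both modalities."""
    combos = []
    for q, k, v in product(MODALITIES, repeat=3):
        # Drop all-image and all-text, which would not combine the two data types.
        if len({q, k, v}) == 2:
            combos.append({"Q": q, "K": k, "V": v})
    return combos

combos = qkv_combinations()
```

Of the 2 x 2 x 2 = 8 possible assignments, 2 are single-modality, which leaves the 6 mixed combinations listed above.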

[0077] Meanwhile, since the performance of the MLP model and the overall gastric cancer prediction model described later may vary depending on which of these six combinations is selected to generate combined features, the process of determining the optimal combination is very important. This is because the Q, K, and V combinations are key factors in determining which data the model focuses on and which data it uses secondarily when learning the interaction between images and text.

[0078] Furthermore, since the optimal combination of Q, K, and V may differ depending on the training data, it is necessary to determine the optimal combination based on the characteristics and circumstances of the given training data, rather than always using a single fixed combination.

[0079] In one embodiment, the device (100) can determine the optimal combination of Q, K, and V suitable for the training data given in step S1010 through the following steps S1041 to S1044 based on attention weight-based heatmap analysis.

[0080] In S1041 to S1044, the device (100) generates an attention weight-based heatmap for each applicable combination of Query, Key, and Value, compares the heatmap of each combination with the bounding box labeled on the gastric biopsy tissue image of the training data used to generate that heatmap, and can determine, as a first combination, the combination whose heatmap has the highest similarity to the bounding box. This is described step by step as follows.

[0081] In step S1041, the device (100) can generate an attention weight-based heatmap for each QKV combination based on training data. For example, the device (100) sets a Query, Key, and Value, each composed of either image features or text features extracted from the training data. Additionally, the device (100) can calculate the similarity between the Query and Key based on the Multi-Head Attention algorithm of the Transformer model, normalize the calculated similarity with the Softmax function, and visualize the resulting attention weights. Accordingly, the device (100) can generate a heatmap by combining the gastric biopsy image of the training data used with the visualized attention information.

[0082] In this case, the generated heatmap represents an attention score (e.g., expressed as a color spectrum) indicating how much the attention mechanism focused on specific areas of the gastric biopsy image. For instance, areas with high color intensity in the heatmap signify regions with high attention weights, indicating that the model regards those areas as important information. This attention score can highlight diagnosis-related areas, such as specific lesions in the image (e.g., tumor regions) or specific keywords in the text data (e.g., smoking, drinking status).

[0083] In step S1042, the device (100) can extract, from the heatmap of each QKV combination, an attention region in which the attention score is greater than or equal to a predetermined threshold (e.g., a specific score value or a specific reference point on the color spectrum). In this case, the attention region specifies an area where the attention mechanism has focused on a specific part of the data.

[0084] In step S1043, the device (100) can calculate an Intersection Over Union (IOU) value between the attention region of each combination's heatmap and the bounding box labeled on the gastric biopsy tissue image of the training data used to generate the heatmap. Here, the IOU value is obtained by dividing the area of the intersection of the attention region and the bounding box by the area of their union, and indicates how closely the two areas overlap. That is, a large IOU value means that the attention mechanism is effectively learning the features in the training data that are highly correlated with the tumor area.
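
Steps S1042 and S1043 can be sketched on a pixel grid as follows. This is only an illustration under stated assumptions: the heatmap is represented as a 2D list of attention scores, the bounding box as end-exclusive pixel coordinates `(x0, y0, x1, y1)`, and all function names, the threshold, and the sample values are hypothetical.

```python
def attention_mask(heatmap, threshold):
    """Mark pixels whose attention score is at or above the threshold."""
    return [[score >= threshold for score in row] for row in heatmap]

def box_mask(height, width, box):
    """Rasterize an end-exclusive bounding box (x0, y0, x1, y1) into a mask."""
    x0, y0, x1, y1 = box
    return [[x0 <= x < x1 and y0 <= y < y1 for x in range(width)]
            for y in range(height)]

def iou(mask_a, mask_b):
    """Intersection area divided by union area of two boolean masks."""
    inter = union = 0
    for row_a, row_b in zip(mask_a, mask_b):
        for a, b in zip(row_a, row_b):
            inter += a and b   # pixel belongs to both regions
            union += a or b    # pixel belongs to at least one region
    return inter / union if union else 0.0

# Hypothetical 3x3 heatmap: three pixels reach the 0.5 threshold.
heatmap = [[0.9, 0.8, 0.1],
           [0.7, 0.2, 0.1],
           [0.1, 0.1, 0.1]]
region = attention_mask(heatmap, 0.5)
box = box_mask(3, 3, (0, 0, 2, 2))   # 2x2 box in the top-left corner
print(iou(region, box))              # 0.75 (3 shared pixels / 4 in the union)
```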

[0085] In step S1044, the device (100) may determine, as the first combination, the combination whose heatmap has the largest IOU value among the IOU values calculated for each combination. The first combination represents a QKV combination that most effectively reflects the interaction between images and text in the corresponding training data, and this combination enables the cross-attention mechanism to combine pathological and environmental information of the training data in an optimized manner.

[0086] Meanwhile, for ease of understanding, the description of S1041 to S1044 above explained an example in which a heatmap is generated and similarity is compared based on a single item of training data. When the heatmap generation and similarity comparison of S1041 to S1044 are performed based on multiple items of training data, the process is as follows.

[0087] The device (100) generates multiple attention weight-based heatmaps for each combination based on multiple items of training data, and can extract, from each heatmap of each combination, the attention region in which the attention score is greater than or equal to a predetermined threshold. The device (100) can then calculate the Intersection Over Union (IOU) value between the attention region of each heatmap and the bounding box labeled on the gastric biopsy tissue image of the training data used to generate that heatmap, and average the IOU values for each combination. Accordingly, the device (100) can determine the combination with the largest average IOU value as the first combination.
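
The averaging and selection described above amount to a mean followed by an arg-max. The sketch below assumes the per-sample IOU values have already been computed; the function name, the combination labels, and the numbers are hypothetical and are not results from the invention.

```python
from statistics import mean

def select_first_combination(iou_by_combination):
    """Pick the combination whose mean IOU over all training samples is largest."""
    averages = {name: mean(values) for name, values in iou_by_combination.items()}
    best = max(averages, key=averages.get)
    return best, averages

# Hypothetical IOU values for two of the six combinations over two samples.
iou_by_combination = {
    "Q:image,K:text,V:text": [0.6, 0.8],
    "Q:text,K:image,V:image": [0.3, 0.5],
}
best, averages = select_first_combination(iou_by_combination)
```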

[0088] In another embodiment, the device (100) can determine the optimal combination of Q, K, and V suitable for the training data given in step S1010 through the following steps S1045 to S1048 based on attention weight-based heatmap analysis.

[0089] In step S1045, the device (100) can generate an attention weight-based heatmap for each combination based on the training data. Since S1045 is identical to the method of step S1041 described above, the explanation of redundant content is omitted.

[0090] In step S1046, the device (100) can extract an attention region in each combination heatmap where the attention score is greater than or equal to a predetermined threshold. Since S1046 is identical to the method of step S1042 described above, the description of redundant content is omitted.

[0091] In step S1047, the device (100) can calculate the inner product between a first feature vector obtained by vectorizing the attention region of each combination's heatmap and a second feature vector obtained by vectorizing the bounding box labeled on the gastric biopsy tissue image of the training data used to generate the heatmap.

[0092] For example, a first feature vector can be generated by extracting attention weight values in pixel units from the attention areas of each heatmap and converting them into a vector form. For example, the device (100) can generate the first feature vector by arranging the weight values of pixels corresponding to the attention areas in the heatmap in a specific order or by arranging them into a multidimensional vector that reflects the spatial structure. This first feature vector numerically expresses the degree to which the attention mechanism focuses on a specific data area and consists of information that includes the size, density, and intensity of the attention area.

[0093] For example, a second feature vector is generated by vectorizing bounding boxes labeled on the gastric biopsy tissue image. The vectorization of the bounding boxes can be designed to include information about the image within the bounding box region and the ground truth class (e.g., presence and stage of a tumor).

[0094] The device (100) can quantitatively evaluate the similarity between the two data by calculating the inner product between the first feature vector and the second feature vector generated in this way. The higher the inner product value, the stronger the correlation between the attention region and the bounding box, which may indicate that the attention mechanism is successfully reflecting important pathological information in the training data. In other words, a large inner product value means that the attention mechanism is effectively learning features in the training data that have a high correlation with the tumor region.
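
One simple way to realize this comparison is sketched below. It is a deliberately reduced illustration: the first feature vector keeps the attention scores inside the attention region in row-major order, and the second feature vector is simplified to a 0/1 indicator of the bounding box, whereas the description also allows it to carry image content and the ground truth class. All names, the threshold, and the values are hypothetical.

```python
def flatten_attention(heatmap, threshold):
    """Row-major vector: scores inside the attention region, 0 elsewhere."""
    return [s if s >= threshold else 0.0 for row in heatmap for s in row]

def flatten_box(height, width, box):
    """Row-major 0/1 vector for an end-exclusive box (x0, y0, x1, y1)."""
    x0, y0, x1, y1 = box
    return [1.0 if (x0 <= x < x1 and y0 <= y < y1) else 0.0
            for y in range(height) for x in range(width)]

def inner_product(u, v):
    """Sum of element-wise products of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

# Hypothetical 2x2 heatmap and a box covering the left column.
heatmap = [[0.9, 0.1],
           [0.6, 0.2]]
first = flatten_attention(heatmap, 0.5)    # [0.9, 0.0, 0.6, 0.0]
second = flatten_box(2, 2, (0, 0, 1, 2))   # [1.0, 0.0, 1.0, 0.0]
score = inner_product(first, second)
```

Because a raw inner product grows with the size of the attention region, a practical design might normalize the vectors before comparing combinations; the description does not specify this, so it is noted here only as a possible design choice.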

[0095] In step S1048, the device (100) may determine, as the first combination, the combination whose heatmap has the largest inner product value among the inner product values calculated for each combination. The first combination represents a QKV combination that most effectively reflects the interaction between images and text in the corresponding training data, and this combination enables the cross-attention mechanism to combine pathological and environmental information of the training data in an optimized manner.

[0096] Meanwhile, for ease of understanding, the description of S1045 to S1048 above explained an example in which the comparison is based on a single item of training data. When the heatmap generation and similarity comparison of S1045 to S1048 are performed based on multiple items of training data, the process is as follows.

[0097] The device (100) can generate a plurality of attention weight-based heatmaps for each combination based on a plurality of items of training data, and extract, from each heatmap of each combination, the attention region where the attention score is greater than or equal to a predetermined threshold.

[0098] At this time, the device (100) can calculate, for each combination, the inner product between a first feature vector formed by combining the attention regions of the plurality of heatmaps and a second feature vector formed by vectorizing the bounding boxes labeled on the gastric biopsy tissue images of the training data used to generate the heatmaps.

[0099] At this time, during the process of combining the first feature vectors, the device (100) can apply a Principal Component Analysis (PCA) technique to reduce the dimensionality of each vector of the attention regions of multiple heatmaps and combine them, thereby optimizing the combined features to effectively reflect common or differentiating information between the training data. For example, the PCA technique is designed to reduce the number of dimensions of the attention region vectors while preserving key information so that the combined first feature vector represents the main learning pattern of the attention mechanism. Additionally, the correlation between the attention regions is analyzed during the dimensionality reduction process to further emphasize information that appears consistently in specific combinations (e.g., pathological patterns related to tumor location).

[0100] Accordingly, the device (100) can determine, as the first combination, the combination with the largest inner product value among the values calculated for each combination over the multiple heatmaps.

[0101] In step S1050, the device (100) can generate a first combination-based combination feature using a cross-attention layer.

[0102] For example, the first combination is the optimal combination of Q, K, and V determined in steps S1041 to S1044 or S1045 to S1048, that is, the combination capable of most effectively learning the interaction between image features and text features. The device (100) inputs Q, K, and V, set according to the first combination, into a cross-attention layer to calculate the similarity between Q and K, and can generate attention weights by normalizing the calculated similarity with a Softmax function. The device (100) can generate combined features containing interaction information between images and text by computing a weighted sum of V using the attention weights generated through the cross-attention layer.

[0103] The combined features generated in this way integrally reflect pathological information from image data (e.g., location and morphology of tumors) and environmental / genetic factors from text data (e.g., smoking, family history), providing a foundation for learning the complex relationships between the two datasets. For instance, if the information "has smoking experience" in the text data is associated with a specific lesion location in the image data, the cross-attention layer can improve the accuracy of gastric cancer diagnosis by reflecting this interaction in the combined features. Furthermore, the combined features are represented as multidimensional vectors and are utilized as primary input data for training the MLP model in subsequent stages.

[0104] In step S1060, the device (100) can train an MLP model with inputs and outputs configured to predict a correct class for gastric cancer diagnosis from combined features.

[0105] An MLP model is an artificial neural network consisting of an input layer, one or more hidden layers, and an output layer. An MLP model processes multidimensional input vectors to learn complex non-linear relationships and, based on this, derives the desired prediction values in the output layer. An MLP model transforms data by applying an activation function in each layer; generally, non-linear activation functions such as ReLU (Rectified Linear Unit) or Sigmoid are used. Additionally, the neurons of adjacent layers are fully connected, so that every input feature can contribute to learning throughout the model. In the present invention, the MLP model can be configured to receive combined features as input and predict the correct answer class (e.g., "benign," "malignant," "high-risk group," etc.) for gastric cancer diagnosis.
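
A forward pass through such a model can be sketched as follows, assuming a single hidden layer with ReLU and a Softmax output over the diagnosis classes. The layer sizes, weights, and function names are hypothetical and serve only to show the data flow from a combined feature vector to class probabilities.

```python
import math

def dense(x, weights, bias):
    """Fully connected layer: one output per weight row."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]

def relu(x):
    """Element-wise ReLU activation."""
    return [max(0.0, v) for v in x]

def softmax(z):
    """Convert logits into class probabilities."""
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def mlp_forward(x, W1, b1, W2, b2):
    """Input -> hidden layer (ReLU) -> output layer (Softmax)."""
    hidden = relu(dense(x, W1, b1))
    return softmax(dense(hidden, W2, b2))

# Hypothetical 2-dimensional combined feature and two output classes.
x = [1.0, -1.0]
W1, b1 = [[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0]
W2, b2 = [[2.0, 0.0], [0.0, 2.0]], [0.0, 0.0]
probs = mlp_forward(x, W1, b1, W2, b2)
```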

[0106] For example, the device (100) may set a predetermined objective function in the MLP model. For example, the objective function may include a cross-entropy loss designed to maximize the prediction accuracy of the MLP model for gastric cancer diagnosis, and the objective function may quantitatively calculate the discrepancy between the correct class of the training data and the predicted value.

[0107] Using this objective function, the device (100) can train the parameters of the MLP model so that the loss between the predicted value of the MLP model for the combined features generated based on the first combination and the correct class of the training data is minimized. For example, the training of the MLP model can be performed through a gradient descent algorithm, and the parameters of the MLP model can be optimized by calculating the gradient of the loss function.
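
The training loop can be illustrated with one gradient-descent update of a Softmax output layer under cross-entropy loss. The sketch is restricted to a single linear layer so that the gradient stays explicit; a full MLP would backpropagate the same quantity through its hidden layers. The function names, learning rate, and data are hypothetical.

```python
import math

def softmax(z):
    """Convert logits into class probabilities."""
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(probs, target):
    """Negative log-probability assigned to the correct class index."""
    return -math.log(probs[target])

def gradient_step(W, b, x, target, lr):
    """Update W and b in place; return the loss measured before the update."""
    logits = [sum(w * v for w, v in zip(row, x)) + bi for row, bi in zip(W, b)]
    probs = softmax(logits)
    for c, row in enumerate(W):
        # For Softmax + cross-entropy, dLoss/dlogit_c = p_c - 1[c == target].
        grad_logit = probs[c] - (1.0 if c == target else 0.0)
        for j in range(len(row)):
            row[j] -= lr * grad_logit * x[j]
        b[c] -= lr * grad_logit
    return cross_entropy(probs, target)

# Hypothetical combined feature with correct class index 0.
W, b = [[0.0, 0.0], [0.0, 0.0]], [0.0, 0.0]
losses = [gradient_step(W, b, [1.0, 2.0], 0, lr=0.1) for _ in range(5)]
```

With zero-initialized parameters and two classes, the first loss equals ln 2, and the loss decreases over the following steps as the parameters move toward the correct class.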

[0108] Meanwhile, the device (100) may train only the parameters of the MLP model, but according to the embodiment, end-to-end training may be performed so that each parameter included in the image encoder of S1020, the text encoder of S1030, the cross-attention layer of S1050, and the MLP model is updated. In this case, the parameters of the pre-trained image encoder of S1020, the text encoder of S1030, and the cross-attention layer of S1050 are fine-tuned in a direction that minimizes the loss of the objective function of the MLP model.

[0109] For example, an image encoder can be tuned to more precisely extract pathological features, such as the location or morphology of tumors, from gastric biopsy tissue images, and a text encoder can be trained to better reflect keywords within survey text that have a direct correlation with the diagnosis. In the case of the cross-attention layer, attention weights can be optimized to emphasize important relationships in the interaction between images and text. Such end-to-end learning can contribute to improving the overall predictive performance of the model by integrally reflecting the diverse characteristics of the training data.

[0110] FIG. 4 is a flowchart showing the steps of an operation performed by the gastric cancer prediction device of the present invention according to a second embodiment. FIG. 5 is a conceptual diagram showing the overall architecture including an encoder and a neural network model controlled by the gastric cancer prediction device of the present invention according to the operation of FIG. 4, and the flow of data processed according to the operation of FIG. 4. The operation of the device (100) according to the embodiments of FIG. 4 and FIG. 5 can be understood as an operation performed by a processor (120).

[0111] Meanwhile, the overall architecture illustrated in FIG. 5 represents the configuration of multiple encoders and neural network models used in the second embodiment. That is, the neural network model used in the present invention is not composed of a single model, but is designed as a structure that operates through the interaction of multiple encoders, layers, and neural network models. Some of the encoders, layers, and neural network models included in the overall architecture may be pre-trained models. Additionally, some of the encoders, layers, and neural network models included in the overall architecture may be untrained models. In this case, the untrained models included in the overall architecture may be trained by supervised learning using an end-to-end learning method, and the parameters of the pre-trained models may be fine-tuned through end-to-end learning. End-to-end learning refers to a method of training the parameters within each encoder and neural network through a single integrated supervised learning process covering all layers and neural network models from the input data to the final output data.

[0112] Although the second embodiment is described with examples in which a pre-trained model or an initial model is applied to each encoder, layer, and neural network model, the encoders and neural network models appearing in the embodiments of the present invention are not necessarily limited to the pre-trained or initial models exemplified in the description, and neural networks in various learning states may be applied depending on the designer's choice.

[0113] Meanwhile, the operation of FIG. 4 describes the process of determining the optimal combination of Q, K, and V used in the cross-attention mechanism based on dynamic performance analysis of test models, and of training the neural network models included in the overall architecture of FIG. 5 by applying that combination.

[0114] Meanwhile, each step disclosed in FIGS. 4 and 5 is merely a preferred embodiment for achieving the purpose of the present invention; some steps may be added or deleted as needed, and any one step may be included in another step. The order of the operations disclosed in FIGS. 4 and 5 is arranged only for convenience of understanding and is not limited to a chronological order, and the order may be changed according to the designer's choice.

[0115] Referring to FIGS. 4 and 5 together, in step S2010, the device (100) can acquire training data including a gastric biopsy tissue image of the patient and a survey text.

[0116] For example, a gastric biopsy tissue image is a high-resolution digital image obtained by biopsying a patient's gastric tissue, and it may include a bounding box for the region containing the tumor to identify the presence and location of the tumor. Additionally, this bounding box is mapped to a ground truth class representing a gastric cancer diagnosis result, enabling the gastric cancer prediction model to learn accurate diagnostic criteria during the training process.

[0117] For example, the survey text includes question-and-answer data regarding the patient's lifestyle habits (e.g., smoking, drinking, dietary habits, etc.), physical information (e.g., age, body mass index, blood pressure, etc.), and family history (e.g., cancer history of parents or siblings, etc.). This survey text is mapped to gastric cancer diagnostic results corresponding to each patient, enabling the gastric cancer prediction model to learn diagnostic criteria that reflect the patient's environmental and genetic factors during the learning process.

[0118] The device (100) uses this training data to generate combined features that reflect the association between different types of data, such as gastric biopsy tissue images and survey text, according to the operation to be described later, thereby training a gastric cancer prediction model that can simultaneously consider the impact of each different type of data on gastric cancer diagnosis.

[0119] In step S2011, the device (100) may classify the training data into multiple groups based on the statistical distribution of the patient characteristics included in the training data, such that each group has a uniform statistical distribution of characteristics. This is intended to prevent the selection of the optimal combination of Q, K, and V in the second embodiment from being distorted by imbalance in the training data, and thereby to prevent a specific data pattern or group from being excessively learned during the model training process.

[0120] For example, the device (100) can classify patient training data into k clusters (k is a natural number greater than or equal to 2) by applying a k-clustering algorithm based on at least one feature among the patient's lifestyle habits (e.g., smoking, drinking, exercise frequency, etc.), physical information (e.g., age, body mass index, blood pressure, etc.) and family history (e.g., history of specific diseases) included in the survey text. The k-clustering algorithm calculates the similarity between data based on each patient's multidimensional vector data and classifies patient data with similar characteristics into the same cluster.

[0121] Accordingly, the device (100) can sequentially extract a predetermined number of items of training data from each of the k clusters to create multiple groups, each of which evenly includes the training data classified into each cluster.

[0122] For example, assume that three clusters (Cluster A, Cluster B, and Cluster C) were generated as a result of applying a k-clustering algorithm based on patients' lifestyle habits, physical information, and family history. Cluster A is classified as a group of patients with high frequency of smoking and drinking and low frequency of exercise; Cluster B is classified as a group of patients with healthy lifestyle habits who exercise frequently, have a normal BMI, and have no history of smoking or drinking; and Cluster C is classified as a group of patients with a family history of cancer.

[0123] The device (100) can create groups by sequentially selecting data from each cluster in a round-robin manner. For example, the device (100) can form multiple groups (e.g., 6 groups) that each equally include data from patients with smoking and drinking experience in Cluster A, patients who exercise regularly in Cluster B, and patients with a family history of cancer in Cluster C. Through this, each group evenly includes the statistical distribution of all clusters, thereby preventing misjudgment due to data bias when determining the optimal combination of Q, K, and V in the operation to be described later in the second embodiment.
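
The round-robin grouping can be sketched as follows, assuming the clustering has already been performed and each cluster is available as a list of patient identifiers. The function name, the cluster labels, and the identifiers are hypothetical.

```python
def round_robin_groups(clusters, num_groups):
    """Deal the samples of each cluster across the groups in turn,
    so that every group receives an even share of every cluster."""
    groups = [[] for _ in range(num_groups)]
    for members in clusters.values():
        for i, sample in enumerate(members):
            groups[i % num_groups].append(sample)
    return groups

# Hypothetical clusters with four patients each, dealt into two groups.
clusters = {
    "A": ["a1", "a2", "a3", "a4"],   # e.g., frequent smoking and drinking
    "B": ["b1", "b2", "b3", "b4"],   # e.g., healthy lifestyle habits
    "C": ["c1", "c2", "c3", "c4"],   # e.g., family history of cancer
}
groups = round_robin_groups(clusters, 2)
```

In this toy example each group ends up with two patients from each of the three clusters, which mirrors the even distribution described above.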

[0124] In step S2020, the device (100) can generate image features from the gastric biopsy tissue image using an image encoder.

[0125] For example, the image encoder includes a Vision Transformer model, and the Vision Transformer model may be a model pre-trained to generate image features composed of multidimensional vectors from the gastric biopsy tissue image.

[0126] In this case, the pre-trained Vision Transformer model can be optimized during the pre-training process, using a large-scale pathology image dataset, to effectively extract feature vectors of pathological findings, such as tumors, from gastric biopsy tissue images. Furthermore, the pre-trained Vision Transformer model adopts a patch-based processing method capable of precisely analyzing the fine structural patterns of gastric tissue and the location of lesions, thereby enabling the extraction of features that simultaneously reflect global and local information from the image data.

[0127] In step S2030, the device (100) can generate text features from survey text using a text encoder.

[0128] For example, a text encoder includes a Transformer model, and the Transformer model may be a pre-trained model designed to generate text features consisting of multidimensional vectors from survey text.

[0129] In this case, the pre-trained Transformer model can be optimized during the pre-training process based on a large-scale text dataset to extract text feature vectors that accurately reflect information such as the patient's lifestyle habits, physical characteristics, and family history contained in the survey text. For example, the Transformer model can generate feature vectors by utilizing a self-attention mechanism to learn contextual information from the text data and identifying correlations between important words and sentences within the survey text. For instance, the Transformer model can generate multidimensional vectors by reflecting the interdependencies between items such as the patient's smoking status, exercise frequency, and family history. These text features provide meaningful information for gastric cancer diagnosis and enable the training of a more accurate gastric cancer prediction model through subsequent combination with image features.

[0130] In step S2040, the device (100) can apply different combinations among combinations of Query (hereinafter referred to as 'Q'), Key (hereinafter referred to as 'K'), and Value (hereinafter referred to as 'V') applicable to the cross-attention mechanism to each group based on image features generated in S2020 and text features generated in S2030. That is, since there are 6 applicable combinations of Q, K, and V, the device (100) can apply different combinations to multiple groups (e.g., 6 groups) generated in step S2011.

[0131] Here, the cross-attention mechanism is a mechanism that learns the interaction between two different types of datasets (e.g., gastric biopsy images, survey text) and generates combined features by reflecting this interaction. It calculates the similarity between the data through the inner product of Q and K, and generates the final combined features by weighting V with weights normalized by the Softmax function. Here, Q is the central data of the analysis, representing the region of interest and acting as a query for learning the associations between the data. K is the reference data used to measure similarity with Q, and its similarity can be calculated through the inner product of Q and K. V is the data reflecting the interaction results of Q and K and is used as the input for the weighted sum to generate the final combined features.

[0132] The cross-attention mechanism can be implemented through the cross-attention layer included in the Transformer model. A cross-attention layer is a layer that implements the cross-attention mechanism; it receives feature vectors set as Q, K, and V, respectively, calculates attention weights, and generates combined features. The cross-attention layer is a key component that selectively focuses on and integrates important information from different datasets. For example, if a specific keyword in text data has a high correlation with a specific region in image data, the cross-attention layer reflects this correlation in the combined features.

[0133] Accordingly, the device (100) can generate six combinations of Query, Key, and Value, each composed of either an image feature or a text feature: "Q: image vector, K: image vector, V: text vector", "Q: image vector, K: text vector, V: image vector", "Q: text vector, K: image vector, V: image vector", "Q: text vector, K: text vector, V: image vector", "Q: text vector, K: image vector, V: text vector", and "Q: image vector, K: text vector, V: text vector". The device (100) can then apply a different combination of Q, K, and V to each of the multiple groups (e.g., six groups) generated in step S2011.

[0134] Meanwhile, since the performance of the final MLP model and the entire gastric cancer prediction model described later may vary depending on which of these six combinations is selected to generate combined features, the process of determining the optimal combination is very important. This is because the Q, K, and V combinations are key factors in determining which data the model focuses on and which data it uses secondarily when learning the interaction between images and text.

[0135] In step S2050, the device (100) can use a cross-attention layer to generate combined features based on the combination of Q, K, and V applied to each group generated in step S2011.

[0136] Using the image features and text features generated from the training data classified into each group, the device (100) inputs Q, K, and V, set according to the combination applied to that group, into a cross-attention layer, calculates the similarity between Q and K, and can generate attention weights by normalizing the calculated similarity with a Softmax function. The device (100) can generate combined features containing interaction information between images and text by computing a weighted sum of V using the attention weights generated through the cross-attention layer.

[0137] The combined features generated in this way integrally reflect pathological information of image data (e.g., location and shape of tumors) and environmental / genetic factors of text data (e.g., smoking, family history, etc.), and provide a basis for learning complex relationships between the two types of data. For example, if the information "has smoking experience" in the text data is associated with a specific lesion location in the image data, the cross-attention layer can improve the accuracy of gastric cancer diagnosis by reflecting such interactions in the combined features. Additionally, the combined features are represented as multidimensional vectors and are utilized as key input data for training an MLP model in a subsequent step. In the second embodiment, the device (100) determines, as follows, which of the combined features generated from the combinations of Q, K, and V applied to each group best reflect the interaction between the image data and the text data.

[0138] In step S2060, the device (100) trains each test MLP model, with inputs and outputs configured to predict the correct answer class for gastric cancer diagnosis from the combined features of each group, and can determine the performance of each test MLP model. Unlike the final MLP model that appears in step S2070, the test MLP model is a neural network model used only in the training phase to test the performance of the Q, K, and V combinations applied to each group.

[0139] For example, the device (100) can generate a test MLP model for each group with inputs and outputs configured to predict a correct class for gastric cancer diagnosis from combined features.

[0140] Next, the device (100) can perform supervised learning to update the parameters included in each test MLP model so that the loss between the predicted value of the test MLP model for each group's combined feature and the correct class mapped to the training data for each group is minimized based on a predetermined objective function. For example, the objective function may include a cross-entropy loss designed to maximize the prediction accuracy of the test MLP model for gastric cancer diagnosis, and the objective function may quantitatively calculate the discrepancy between the correct class of the training data and the predicted value. The training of the test MLP model may be performed through a gradient descent algorithm, and the parameters of each test MLP model may be optimized by calculating the gradient of the loss function.
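As an illustration of the supervised learning in paragraph [0140], the following sketch trains a small MLP with a cross-entropy loss and full-batch gradient descent. The layer sizes, learning rate, number of steps, and the synthetic stand-in for combined features and correct classes are hypothetical choices, not values taken from the embodiment.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def cross_entropy(probs, labels):
    # Mean negative log-likelihood of the correct class.
    n = labels.shape[0]
    return -np.log(probs[np.arange(n), labels] + 1e-12).mean()

def train_mlp(x, y, hidden=16, classes=2, lr=0.2, steps=300, seed=0):
    rng = np.random.default_rng(seed)
    w1 = rng.normal(scale=0.1, size=(x.shape[1], hidden))
    b1 = np.zeros(hidden)
    w2 = rng.normal(scale=0.1, size=(hidden, classes))
    b2 = np.zeros(classes)
    n = x.shape[0]
    losses = []
    for _ in range(steps):
        # Forward pass: one ReLU hidden layer followed by a Softmax output.
        h_pre = x @ w1 + b1
        h = np.maximum(h_pre, 0.0)
        probs = softmax(h @ w2 + b2)
        losses.append(cross_entropy(probs, y))
        # Backward pass: gradient of the mean cross-entropy loss.
        grad_logits = probs.copy()
        grad_logits[np.arange(n), y] -= 1.0
        grad_logits /= n
        grad_w2 = h.T @ grad_logits
        grad_b2 = grad_logits.sum(axis=0)
        grad_h = grad_logits @ w2.T
        grad_h[h_pre <= 0] = 0.0
        grad_w1 = x.T @ grad_h
        grad_b1 = grad_h.sum(axis=0)
        # Gradient descent update of every parameter.
        w1 -= lr * grad_w1
        b1 -= lr * grad_b1
        w2 -= lr * grad_w2
        b2 -= lr * grad_b2
    return (w1, b1, w2, b2), losses

# Synthetic stand-in for one group's combined features and correct classes.
rng = np.random.default_rng(1)
x = rng.normal(size=(64, 8))
y = (x[:, 0] + x[:, 1] > 0).astype(int)
params, losses = train_mlp(x, y)
```

The recorded `losses` list makes it possible to check that the loss decreases over training, which is the behavior the objective function is meant to produce.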

[0141] Accordingly, the device (100) can evaluate the diagnostic performance of each combination by measuring the precision, recall, and F1 score of each test MLP model that has been trained.

[0142] In addition, the device (100) can evaluate the performance of each test MLP model as follows and determine the first combination, which is the optimal combination of Q, K, and V.

[0143] For example, the device (100) can calculate a performance score as a weighted sum of the precision, recall, and F1 score of each test MLP model. In this case, the weights applied to the precision, recall, and F1 score may be set in advance to different values depending on the importance the designer assigns to each metric.

[0144] Next, the device (100) can identify the test MLP model with the highest performance score, that is, the highest weighted sum of precision, recall, and F1 score, and determine the combination applied to the group (the first group) used to train that test MLP model as the first combination.
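The evaluation in paragraphs [0141] to [0144] can be sketched as follows. The metric weights (0.3, 0.4, 0.3), the combination labels, and the toy labels and predictions are hypothetical; the embodiment only states that the weights are preset by the designer.

```python
def precision_recall_f1(y_true, y_pred, positive=1):
    # Count true positives, false positives, and false negatives for the positive class.
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

def performance_score(metrics, weights):
    # Weighted sum of (precision, recall, F1 score).
    return sum(w * m for w, m in zip(weights, metrics))

def select_first_combination(results, weights=(0.3, 0.4, 0.3)):
    """results maps a combination label to (correct classes, test MLP predictions)."""
    scores = {
        name: performance_score(precision_recall_f1(y_true, y_pred), weights)
        for name, (y_true, y_pred) in results.items()
    }
    best = max(scores, key=scores.get)
    return best, scores

# Toy outputs of two test MLP models, one per group and combination.
results = {
    "Q=text,K=image,V=image": ([1, 1, 1, 0, 0, 0], [1, 1, 0, 0, 0, 0]),
    "Q=image,K=text,V=text": ([1, 1, 1, 0, 0, 0], [1, 1, 1, 1, 0, 0]),
}
best, scores = select_first_combination(results)
```

With these toy values the second combination is selected, because recall carries the largest weight and its test MLP model misses no positive case.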

[0145] Accordingly, in the next operation, the device (100) trains the final MLP model described below using combined features generated for all training data based on the first combination.

[0146] In step S2070, the device (100) can extract combined features for all training data based on the first combination, that is, the combination applied to the group used to train the best-performing test MLP model, and can train the final MLP model on those combined features.

[0147] For example, the device (100) may set a predetermined objective function in the final MLP model. For example, the objective function may include a cross-entropy loss designed to maximize the prediction accuracy of the final MLP model for gastric cancer diagnosis, and the objective function may quantitatively calculate the discrepancy between the correct class and the predicted value of the training data.

[0148] Using this objective function, the device (100) can train the parameters of the final MLP model such that the loss between the predicted value of the final MLP model for the combined features generated based on the first combination for all training data of S2010 and the correct class of the training data is minimized. For example, the training of the final MLP model can be performed through a gradient descent algorithm, and the parameters of the MLP model can be optimized by calculating the gradient of the loss function.

[0149] Meanwhile, the device (100) may train only the parameters of the final MLP model, but according to the embodiment, end-to-end training may be performed so that each parameter included in the image encoder of S2020, the text encoder of S2030, the cross-attention layer of S2050, and the final MLP model is updated. In this case, the parameters of the pre-trained image encoder of S2020, the text encoder of S2030, and the cross-attention layer of S2050 are fine-tuned in a direction that minimizes the loss of the objective function of the final MLP model.

[0150] For example, an image encoder can be tuned to more precisely extract pathological features, such as the location or morphology of tumors, from gastric biopsy tissue images, and a text encoder can be trained to better reflect keywords within survey text that have a direct correlation with the diagnosis. In the case of the cross-attention layer, attention weights can be optimized to emphasize important relationships in the interaction between images and text. Such end-to-end learning can contribute to improving the overall predictive performance of the model by integrally reflecting the diverse characteristics of the training data.

[0151] FIG. 6 is a flowchart showing the steps of an operation performed by the gastric cancer prediction device of the present invention according to a third embodiment. FIG. 7 is a conceptual diagram showing the overall architecture including an encoder and a neural network model controlled by the gastric cancer prediction device of the present invention according to the operation of FIG. 6, and the flow of data processed according to the operation of FIG. 6. The operation of the device (100) according to the embodiments of FIG. 6 and FIG. 7 can be understood as an operation performed by a processor (120).

[0152] Meanwhile, the overall architecture illustrated in FIG. 7 represents the configuration of multiple encoders and neural network models used in the third embodiment. That is, the neural network model used in the present invention is not composed of a single model, but is designed as a structure that operates through the interaction of multiple encoders, layers, and neural network models. Some of the encoders, layers, and neural network models included in the overall architecture may be pre-trained models. Additionally, some of the encoders, layers, and neural network models included in the overall architecture may be untrained models. In this case, the untrained models included in the overall architecture may be trained by supervised learning using an end-to-end learning method, and the parameters of the pre-trained models may be fine-tuned through end-to-end learning. End-to-end learning refers to a method of training parameters within each encoder and neural network through a single integrated supervised learning process for all layers and neural network models from the input data to the final output data.

[0153] Although the encoder, layer, and neural network models of the third embodiment are described by exemplifying that a pre-trained model or an initial model may be applied to each, the encoder and neural network models appearing in the embodiments of the present invention are not necessarily limited to the pre-trained model or initial model exemplified in the description, and neural networks in various learning states may be applied depending on the designer's choice.

[0154] Meanwhile, the operation of FIG. 6 describes a process of clustering the training data based on the statistical distribution of patients while training the neural network models included in the overall architecture of FIG. 7, and of training with the optimal combination of Q, K, and V used in the cross-attention mechanism applied to each group.

[0155] Meanwhile, each step disclosed in FIGS. 6 and 7 is merely a preferred embodiment for achieving the purpose of the present invention, and some steps may be added or deleted as needed, and any one step may be included in another step. The order of each operation disclosed in FIGS. 6 and 7 is arranged only for convenience of understanding and is not limited to a chronological order, and the order may be changed and operated differently according to the designer's choice.

[0156] Referring to FIGS. 6 and 7 together, in step S3010, the device (100) can acquire training data including an image of the patient's gastric biopsy tissue and a survey text.

[0157] For example, a gastric biopsy tissue image is a high-resolution digital image obtained by biopsying a patient's gastric tissue, and it may include a bounding box for the region containing the tumor to identify the presence and location of the tumor. Additionally, this bounding box is mapped to a ground truth class representing a gastric cancer diagnosis result, enabling the gastric cancer prediction model to learn accurate diagnostic criteria during the training process.

[0158] For example, the survey text includes question-and-answer data regarding the patient's lifestyle habits (e.g., smoking, drinking, dietary habits, etc.), physical information (e.g., age, body mass index, blood pressure, etc.), and family history (e.g., cancer history of parents or siblings, etc.). This survey text is mapped to gastric cancer diagnostic results corresponding to each patient, enabling the gastric cancer prediction model to learn diagnostic criteria that reflect the patient's environmental and genetic factors during the learning process.

[0159] The device (100) uses this training data to generate combined features that reflect the association between different types of data, such as gastric biopsy tissue images and survey text, according to the operation to be described later, thereby training a gastric cancer prediction model that can simultaneously consider the impact of each different type of data on gastric cancer diagnosis.

[0160] In step S3011, the device (100) can classify the training data into multiple groups, based on the statistical distribution of patient characteristics included in the training data, such that each group has a statistical distribution of different characteristics.

[0161] For example, the device (100) can classify patient training data into k clusters (k is a natural number greater than or equal to 2) by applying a k-clustering algorithm based on at least one feature among the patient's lifestyle habits (e.g., smoking, drinking, exercise frequency, etc.), physical information (e.g., age, body mass index, blood pressure, etc.) and family history (e.g., history of specific diseases) included in the survey text. The k-clustering algorithm calculates the similarity between data based on each patient's multidimensional vector data and classifies patient data with similar characteristics into the same cluster.

[0162] Accordingly, the device (100) can group the training data of the patients classified into each of the k clusters by cluster, thereby generating multiple groups in which each group has a statistical distribution of different characteristics.

[0163] For example, assume that three clusters (Cluster A, Cluster B, and Cluster C) were generated as a result of applying a k-clustering algorithm based on patients' lifestyle habits, physical information, and family history. Cluster A is classified as a group of patients with high frequency of smoking and drinking and low frequency of exercise; Cluster B is classified as a group of patients with healthy lifestyle habits who exercise frequently, have a normal BMI, and have no history of smoking or drinking; and Cluster C is classified as a group of patients with a family history of cancer.

[0164] The device (100) can group the training data of patients who have a history of smoking and drinking into Cluster A, patients who exercise regularly into Cluster B, and patients who have a family history of cancer into Cluster C, thereby creating three groups corresponding to Cluster A, Cluster B, and Cluster C.
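The grouping in paragraphs [0161] to [0164] can be sketched with a plain k-means loop, used here as one possible instance of the k-clustering algorithm; the embodiment does not name a specific algorithm. The four survey-derived feature columns (smoking, drinking, exercise frequency, family history), their 0 to 1 scaling, and the synthetic patients are hypothetical.

```python
import numpy as np

def assign(x, centers):
    # Index of the nearest center for each patient vector (Euclidean distance).
    dists = np.linalg.norm(x[:, None, :] - centers[None, :, :], axis=2)
    return dists.argmin(axis=1)

def k_means(x, k, iters=50, seed=0):
    rng = np.random.default_rng(seed)
    # Initialize the centers with k distinct patients.
    centers = x[rng.choice(len(x), size=k, replace=False)].copy()
    for _ in range(iters):
        labels = assign(x, centers)
        # Move each center to the mean of its cluster; keep it if the cluster is empty.
        new_centers = np.array([
            x[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return assign(x, centers), centers

# Synthetic survey vectors: [smoking, drinking, exercise frequency, family history].
rng = np.random.default_rng(0)
prototypes = np.array([
    [1.0, 1.0, 0.0, 0.0],   # frequent smoking and drinking, little exercise
    [0.0, 0.0, 1.0, 0.0],   # healthy lifestyle
    [0.0, 0.0, 0.5, 1.0],   # family history of cancer
])
features = np.vstack([p + rng.normal(scale=0.05, size=(10, 4)) for p in prototypes])

labels, centers = k_means(features, k=3)
# Group the training-data indices of the patients by cluster.
groups = {j: np.where(labels == j)[0].tolist() for j in range(3)}
```

Each entry of `groups` lists the patients whose training data would form one group. In practice, categorical survey answers would need to be encoded and scaled before distances between patients are meaningful.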

[0165] The training data is classified into multiple groups with different statistical distributions so that, in the third embodiment, the optimal combination of Q, K, and V can be determined for each group. That is, in the third embodiment, a combination of Q, K, and V determined separately for each group is applied to that group to generate combined features.

[0166] In step S3020, the device (100) can generate image features from the gastric biopsy tissue image using an image encoder.

[0167] For example, the image encoder includes a Vision Transformer model, and the Vision Transformer model may be a model pre-trained to generate image features composed of multidimensional vectors from the gastric biopsy tissue image.

[0168] In this case, the pre-trained Vision Transformer model can be optimized during the pre-training process to effectively extract pathological image feature vectors, such as tumors, from gastric biopsy tissue images using a large-scale pathology image dataset. Furthermore, the pre-trained Vision Transformer model adopts a patch-based processing method capable of precisely analyzing the fine structural patterns of gastric tissue and the location of lesions, thereby enabling the extraction of features that simultaneously reflect global and local information from the image data.

[0169] In step S3030, the device (100) can generate text features from survey text using a text encoder.

[0170] For example, a text encoder includes a Transformer model, and the Transformer model may be a pre-trained model designed to generate text features consisting of multidimensional vectors from survey text.

[0171] In this case, the pre-trained Transformer model can be optimized during the pre-training process based on a large-scale text dataset to extract text feature vectors that accurately reflect information such as the patient's lifestyle habits, physical characteristics, and family history contained in the survey text. For example, the Transformer model can generate feature vectors by utilizing a self-attention mechanism to learn contextual information from the text data and identifying correlations between important words and sentences within the survey text. For instance, the Transformer model can generate multidimensional vectors by reflecting the interdependencies between items such as the patient's smoking status, exercise frequency, and family history. These text features provide meaningful information for gastric cancer diagnosis and enable the training of a more accurate gastric cancer prediction model through subsequent combination with image features.

[0172] In step S3040, the device (100) can determine the first combination with the highest performance for each group among combinations of Query (hereinafter referred to as 'Q'), Key (hereinafter referred to as 'K'), and Value (hereinafter referred to as 'V') applicable to the cross-attention mechanism, based on image features generated in S3020 and text features generated in S3030.

[0173] For example, in step S3040, the device (100) can determine the first combination with the highest performance for each group classified in step S3011 through the methods of S1041 to S1044 of the first embodiment described above. At this time, the first combinations for each group determined may be the same or different from each other. A redundant description of S1041 to S1044 is omitted.

[0174] For example, in step S3040, the device (100) can determine the first combination with the highest performance for each group classified in step S3011 through the methods of S1045 to S1048 of the first embodiment described above. At this time, the first combinations for each group determined may be the same or different from each other. A redundant description of S1045 to S1048 is omitted.

[0175] Accordingly, in the third embodiment, the optimal combination of Q, K, and V is determined separately for each group according to the statistical distribution of that group's training data, which makes it possible to generate combined features that effectively reflect the characteristics of each group.

[0176] In step S3050, the device (100) can use a cross-attention layer to generate combined features for each group based on the first combination determined for that group. Because the first combination may differ from group to group, the cross-attention mechanism can generate combined features optimized for each group by applying that group's own first combination.

[0177] For example, in a specific group (e.g., a patient group with high age and BMI), a first combination corresponding to "Q: text features, K: image features, V: image features" can be applied to generate combined features that learn the interaction between image features and text features. On the other hand, in another group (e.g., a patient group with high smoking and drinking frequency), a first combination corresponding to "Q: image features, K: text features, V: text features" can be applied to generate combined features that learn the associations of text data centered on image data.

[0178] The device (100) can input the Q, K, and V set according to the first combination applied to each group into the cross-attention layer, using the image features and text features generated from the training data classified into that group. The cross-attention layer calculates the similarity between Q and K and normalizes the calculated similarity with a Softmax function to generate attention weights. The device (100) can then generate combined features containing interaction information between images and text by applying the attention weights to V as a weighted sum.
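The per-group application of the first combination in paragraphs [0176] to [0178] can be sketched as a lookup followed by the attention computation. The group names, the mapping from group to combination, the feature shapes, and the mean pooling that turns the attention output into a fixed-length vector are assumptions for illustration only.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(q, k, v):
    # Similarity of Q and K, Softmax normalization, then a weighted sum of V.
    weights = softmax(q @ k.T / np.sqrt(q.shape[-1]), axis=-1)
    return weights @ v

# Hypothetical first combinations (sources of Q, K, V), one per group.
FIRST_COMBINATION = {
    "high_age_bmi": ("text", "image", "image"),
    "smoking_drinking": ("image", "text", "text"),
}

def combined_feature(group, image_features, text_features):
    sources = {"image": image_features, "text": text_features}
    q_name, k_name, v_name = FIRST_COMBINATION[group]
    attended = cross_attention(sources[q_name], sources[k_name], sources[v_name])
    # Mean pooling over query positions (an assumption) yields a fixed-length MLP input.
    return attended.mean(axis=0)

rng = np.random.default_rng(0)
image_features = rng.normal(size=(16, 8))  # placeholder patch features
text_features = rng.normal(size=(5, 8))    # placeholder token features

feature_a = combined_feature("high_age_bmi", image_features, text_features)
feature_b = combined_feature("smoking_drinking", image_features, text_features)
```

Both groups yield vectors of the same length even though their Q inputs contain different numbers of tokens, which is why some pooling step is assumed before the MLP model.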

[0179] The combined features generated in this way integrally reflect pathological information from image data (e.g., location and morphology of tumors) and environmental / genetic factors from text data (e.g., smoking, family history, etc.), providing a foundation for learning complex relationships between the two datasets.

[0180] In step S3060, the device (100) can train each MLP model with inputs and outputs set to predict a correct class for gastric cancer diagnosis from the combined features of each group.

[0181] For example, the device (100) may set a predetermined objective function in the MLP model. For example, the objective function may include a cross-entropy loss designed to maximize the prediction accuracy of the MLP model for gastric cancer diagnosis, and the objective function may quantitatively calculate the discrepancy between the correct class of the training data and the predicted value.

[0182] Using this objective function, the device (100) can train the parameters of the MLP model so that the loss between the predicted value of the MLP model for each group’s combined features and the correct class of the training data is minimized. For example, the training of the MLP model can be performed through a gradient descent algorithm, and the parameters of the MLP model can be optimized by calculating the gradient of the loss function.

[0183] Meanwhile, the device (100) may train only the parameters of the MLP model, but according to the embodiment, end-to-end training may be performed so that each parameter included in the image encoder of S3020, the text encoder of S3030, the cross-attention layer of S3050, and the MLP model is updated. In this case, the parameters of the pre-trained image encoder of S3020, the text encoder of S3030, and the cross-attention layer of S3050 are fine-tuned in a direction that minimizes the loss of the objective function of the MLP model.

[0184] For example, an image encoder can be tuned to more precisely extract pathological features, such as the location or morphology of tumors, from gastric biopsy tissue images, and a text encoder can be trained to better reflect keywords within survey text that have a direct correlation with the diagnosis. In the case of the cross-attention layer, attention weights can be optimized to emphasize important relationships in the interaction between images and text. Such end-to-end learning can contribute to improving the overall predictive performance of the model by integrally reflecting the diverse characteristics of the training data.

[0185] According to the above-described embodiment, the present invention provides an AI-based diagnostic technology for gastric cancer risk assessment, which can precisely evaluate and predict the risk of individual patients by fusing gastric biopsy tissue images and survey data using the cross-attention mechanism of a Transformer model.

[0186] Specifically, the present invention can provide the effect of enhancing the interpretability of a model through a technique that selects the optimal Q, K, V combination by analyzing the correlation between the concentration of attention similarity-based heatmaps for each Q, K, V combination and the gastric cancer lesion site. To this end, the present invention utilizes attention similarity-based heatmaps to evaluate how effectively a Transformer model focuses on pathological features, such as gastric cancer lesions, thereby increasing the reliability of diagnostic results and enabling medical professionals to understand the results more intuitively. Through this, the location and characteristics of the lesion can be identified more accurately, and the potential for early diagnosis and preventive treatment is increased.
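The heatmap-based selection summarized in paragraph [0186], and recited in the claims as a threshold followed by an IOU comparison with the labeled bounding box, can be sketched as follows. The threshold of 0.5, the 4 x 4 heatmap size, the bounding-box format (x0, y0, x1, y1) with exclusive upper bounds, and the combination labels are hypothetical.

```python
import numpy as np

def attention_region(heatmap, threshold):
    # Cells whose attention score is greater than or equal to the threshold.
    return heatmap >= threshold

def box_mask(shape, box):
    # Boolean mask of the bounding box; x1 and y1 are exclusive.
    x0, y0, x1, y1 = box
    mask = np.zeros(shape, dtype=bool)
    mask[y0:y1, x0:x1] = True
    return mask

def iou(region, box):
    inter = np.logical_and(region, box).sum()
    union = np.logical_or(region, box).sum()
    return float(inter / union) if union else 0.0

def select_by_mean_iou(heatmaps_by_combination, boxes, threshold=0.5):
    """Average the IOU over each combination's heatmaps and pick the largest."""
    mean_iou = {}
    for name, heatmaps in heatmaps_by_combination.items():
        values = [
            iou(attention_region(h, threshold), box_mask(h.shape, b))
            for h, b in zip(heatmaps, boxes)
        ]
        mean_iou[name] = float(np.mean(values))
    best = max(mean_iou, key=mean_iou.get)
    return best, mean_iou

# One training image with a labeled box covering rows 1-2 and columns 1-2.
boxes = [(1, 1, 3, 3)]
focused = np.full((4, 4), 0.1)
focused[1:3, 1:3] = 0.9            # attention concentrated on the lesion box
diffuse = np.full((4, 4), 0.1)
diffuse[0, :] = 0.9                # attention concentrated elsewhere

best, mean_iou = select_by_mean_iou(
    {"Q=text,K=image,V=image": [focused], "Q=image,K=text,V=text": [diffuse]},
    boxes,
)
```

With a single heatmap per combination the average reduces to the single IOU value; with several training images per combination the same function computes the average over all of them.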

[0187] Furthermore, the present invention can provide the effect of maximizing the diagnostic performance of an AI model through a technology that dynamically evaluates Q, K, and V combinations based on performance indicators (precision, recall, F1 score, etc.) and selects the optimal combination. To this end, the present invention generates various Q, K, and V combinations during the training phase and derives the optimal Q, K, and V settings by quantitatively comparing the performance of each combination, thereby simultaneously improving the efficiency of data training and the accuracy of diagnosis. This contributes to continuously improving the model's predictive ability by providing combinations optimized for data characteristics and the training environment in real time.

[0188] Furthermore, the present invention provides a technology that groups patients by clustering statistical distributions based on patient survey data and applies optimized Q, K, and V combinations for each group, thereby enabling personalized diagnosis and risk assessment. To this end, the present invention subdivides patient groups based on lifestyle data such as smoking, drinking, and exercise habits, and provides patient-tailored diagnostic results by applying optimal Q, K, and V settings suited to the characteristics of each group. Through this, personalized prevention and management strategies can be effectively presented, and the quality of medical services can be improved and patient satisfaction increased.

[0189] Thus, the present invention can provide an innovative AI-based diagnostic system that simultaneously realizes precision and personalization in gastric cancer risk assessment through the cross-attention mechanism of a transformer model and optimized Q, K, and V setting technology. This enhances the possibility of early diagnosis and treatment, and demonstrates significant technical effects capable of presenting a new paradigm in medical data analysis.

[0190] The various embodiments of this document and the terms used therein are not intended to limit the technical features described in this document to specific embodiments, and should be understood to include various modifications, equivalents, or substitutions of said embodiments. In connection with the description of the drawings, similar reference numerals may be used for similar or related components. The singular form of a noun corresponding to an item may include one or more items unless the relevant context clearly indicates otherwise.

[0191] In this document, each of the phrases such as “A or B,” “at least one of A and B,” “at least one of A or B,” “A, B or C,” “at least one of A, B and C,” and “at least one of A, B, or C” may include all possible combinations of items listed together in the corresponding phrase. Terms such as “1st,” “2nd,” “first,” or “second” may be used simply to distinguish a component from another component and do not limit the components in any other aspect (e.g., importance or order). Where any (e.g., 1st) component is referred to as “coupled” or “connected” to another (e.g., 2nd) component, with or without the terms “functionally” or “communicatively,” it means that the component may be connected to the other component directly (e.g., by wire), wirelessly, or through a third component.

[0192] As used in this document, the term "module" may include a unit implemented in hardware, software, or firmware, and may be used interchangeably with terms such as logic, logic block, component, or circuit. A module may be a component formed integrally, or a minimum unit of a component or part thereof that performs one or more functions. For example, according to one embodiment, a module may be implemented in the form of an application-specific integrated circuit (ASIC).

[0193] Various embodiments of this document may be implemented as software (e.g., a program) comprising one or more instructions stored in a storage medium (e.g., memory) that can be read by a device (e.g., an electronic device). The storage medium may include random access memory (RAM), a memory buffer, a hard drive, a database, erasable programmable read-only memory (EPROM), electrically erasable programmable read-only memory (EEPROM), read-only memory (ROM), and / or the like.

[0194] Additionally, the processor of the embodiments of this document may call at least one instruction among the one or more instructions stored in the storage medium and execute it. This enables the device to perform at least one function according to the at least one called instruction. Such one or more instructions may include code generated by a compiler or code that can be executed by an interpreter. The processor may be a general-purpose processor, a Field Programmable Gate Array (FPGA), an Application Specific Integrated Circuit (ASIC), a Digital Signal Processor (DSP), and / or the like.

[0195] A device-readable storage medium may be provided in the form of a non-transitory storage medium. Here, 'non-transitory' simply means that the storage medium is a tangible device and does not contain signals (e.g., electromagnetic waves), and this term does not distinguish between cases where data is stored semi-permanently and cases where it is stored temporarily.

[0196] Methods according to the various embodiments disclosed in this document may be provided as part of a computer program product. The computer program product may be traded between a seller and a buyer as a product. The computer program product may be distributed in the form of a device-readable storage medium (e.g., compact disc read-only memory (CD-ROM)), or distributed online (e.g., download or upload) through an application store (e.g., Play Store) or directly between two user devices (e.g., smartphones). In the case of online distribution, at least a portion of the computer program product may be temporarily stored or temporarily created on a device-readable storage medium, such as a manufacturer's server, an application store's server, or the server's memory.

[0197] According to various embodiments, each component (e.g., module or program) of the described components may include a single entity or multiple entities. According to various embodiments, one or more of the aforementioned components or operations may be omitted, or one or more other components or operations may be added. Alternatively or additionally, multiple components (e.g., modules or programs) may be integrated into a single component. In this case, the integrated component may perform one or more functions of each of the multiple components in the same or similar manner as they were performed by the corresponding component prior to integration. According to various embodiments, operations performed by a module, program, or other component may be executed sequentially, in parallel, iteratively, or heuristically; one or more of the operations may be executed in a different order or omitted; or one or more other operations may be added.

Claims

1. A method performed by a device operated by a processor, the method comprising: acquiring training data including an image of a patient's gastric biopsy tissue and survey text; generating image features from the image using an image encoder; generating text features from the text using a text encoder; determining a first combination, the first combination being the combination applied to the heatmap having the highest similarity between (i) an attention weight-based heatmap generated for each combination of Query, Key, and Value applicable to a cross-attention mechanism based on the image features and the text features and (ii) a bounding box labeled on the gastric biopsy tissue image; generating combined features based on the first combination using a cross-attention layer; and training an MLP model with inputs and outputs configured to predict a correct answer class for gastric cancer diagnosis from the combined features.

2. The method of claim 1, wherein the gastric biopsy tissue image is an image obtained by biopsying the patient's gastric tissue, includes a bounding box specifying a region containing a tumor within the gastric tissue, and is mapped to a correct answer class specifying gastric cancer diagnostic information for the bounding box, and wherein the survey text includes text of questions and answers regarding the patient's lifestyle habits, physical information, and family history, and is mapped to a correct answer class specifying gastric cancer diagnostic information for the patient who provided the answers.

3. The method of claim 1, wherein the image encoder includes a Vision Transformer model and is pre-trained to generate image features composed of multidimensional vectors from the gastric biopsy tissue image, and wherein the text encoder includes a Transformer model and is pre-trained to generate text features composed of multidimensional vectors from the survey text.

4. The method of claim 1, wherein the cross-attention layer sets a Query, a Key, and a Value, each composed of either the image features or the text features, calculates the similarity between the Query and the Key based on the Multi-Head Attention algorithm of the Transformer model, generates an attention weight-based heatmap by normalizing the calculated similarity with a Softmax function, and generates the combined features by applying the attention weights to the Value as a weighted sum.

5. The method of claim 4, wherein determining the first combination comprises: generating an attention weight-based heatmap for each combination based on the training data; extracting, from the heatmap for each combination, an attention region in which the attention score is greater than or equal to a predetermined threshold; calculating an IOU (Intersection Over Union) value between the attention region of the heatmap for each combination and the bounding box labeled on the gastric biopsy tissue image of the training data used to generate that heatmap; and determining, as the first combination, the combination applied to the heatmap with the largest calculated IOU value.

6. The method of claim 5, wherein, when multiple heatmaps are generated for each combination using multiple training data, determining the first combination comprises: generating multiple attention weight-based heatmaps for each combination based on the multiple training data; extracting, from the multiple heatmaps for each combination, attention regions in which the attention score is greater than or equal to a predetermined threshold; calculating, for each combination, the average of the IOU (Intersection Over Union) values between the attention regions of the multiple heatmaps and the bounding boxes labeled on the gastric biopsy tissue images of the training data used to generate each heatmap; and determining, as the first combination, the combination applied to the heatmaps with the largest average IOU value.

7. The method of claim 4, wherein determining the first combination comprises: generating an attention weight-based heatmap for each of the combinations based on the training data; extracting, from the heatmap for each combination, attention regions in which the attention score is greater than or equal to a predetermined threshold; calculating the inner product between a first feature vector obtained by vectorizing the attention region of the heatmap for each combination and a second feature vector obtained by vectorizing the bounding box labeled on the gastric biopsy tissue image of the training data used to generate the heatmap; and determining, as the first combination, the combination applied to the heatmap with the largest calculated inner product value.
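
The inner-product variant in this claim can be sketched in the same way. The claim does not say how the attention region or the bounding box is vectorized, so this sketch assumes flattened binary masks over the heatmap grid; the box format `(row0, col0, row1, col1)` with exclusive end indices, the threshold, and all names are likewise hypothetical.

```python
def region_mask(heatmap, threshold):
    """Flatten the heatmap into a 0/1 vector marking cells with score >= threshold."""
    return [1 if score >= threshold else 0 for row in heatmap for score in row]

def box_mask(shape, box):
    """Flatten a bounding box (row0, col0, row1, col1), end-exclusive, into a 0/1 vector."""
    rows, cols = shape
    r0, c0, r1, c1 = box
    return [
        1 if r0 <= r < r1 and c0 <= c < c1 else 0
        for r in range(rows)
        for c in range(cols)
    ]

def inner_product(u, v):
    """Plain dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def best_by_inner_product(heatmaps, box, threshold):
    """Return the combination whose attention-region vector best matches the box vector."""
    scores = {}
    for name, h in heatmaps.items():
        first = region_mask(h, threshold)            # first feature vector
        second = box_mask((len(h), len(h[0])), box)  # second feature vector
        scores[name] = inner_product(first, second)
    return max(scores, key=scores.get), scores

# Toy heatmaps for two Query/Key/Value combinations (assumed data).
heatmaps = {
    "Q=text,K=image,V=image": [[0.9, 0.8, 0.1], [0.7, 0.6, 0.1], [0.1, 0.1, 0.1]],
    "Q=image,K=text,V=text": [[0.9, 0.1, 0.1], [0.1, 0.1, 0.1], [0.1, 0.1, 0.6]],
}
best, scores = best_by_inner_product(heatmaps, box=(0, 0, 2, 2), threshold=0.5)
```

With binary masks, the inner product equals the number of cells shared by the attention region and the bounding box.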

8. The method of claim 7, wherein, when multiple heatmaps are generated for each combination using multiple training data, determining the first combination comprises: generating multiple attention weight-based heatmaps for each of the combinations based on the multiple training data; extracting, from the multiple heatmaps for each combination, attention regions in which the attention score is greater than or equal to a predetermined threshold; generating a vector for each of the attention regions of the multiple heatmaps for each combination, and then calculating the inner product between a first feature vector formed by combining those vectors and a second feature vector formed by vectorizing the bounding boxes labeled on the gastric biopsy tissue images of the training data used to generate the corresponding heatmaps; and determining, as the first combination, the combination with the largest inner product value calculated over its multiple heatmaps.

9. The method of claim 8, wherein the first feature vector is generated by combining the vectors based on the Principal Component Analysis (PCA) technique.

10. The method of claim 1, wherein training the MLP model comprises: setting a predetermined objective function in the MLP model; and performing end-to-end learning such that the parameters included in the image encoder, the text encoder, the cross-attention layer, and the MLP model are updated based on the objective function so as to minimize the loss between the correct class and the predicted value of the MLP model for the combined feature based on the first combination.

11. A method performed by a device operated by a processor, the method comprising: acquiring training data including a gastric biopsy tissue image and lifestyle questionnaire text of a patient; classifying the training data into multiple groups such that each group has a uniform statistical distribution of characteristics, based on the statistical distribution of patient characteristics included in the training data; generating image features from the image based on an image encoder; generating text features from the text based on a text encoder; applying a different combination, among the combinations of Query, Key, and Value applicable to the cross-attention mechanism, to each group based on the image features and the text features; generating, using the cross-attention layer, combined features based on the combination applied to each group; training a test MLP model for each group, with inputs and outputs configured to predict the correct class for gastric cancer diagnosis from the combined features of that group; and extracting combined features for all of the training data based on a first combination, which is the combination applied to the group used to train the test MLP model exhibiting the highest performance among the test MLP models, and training a final MLP model.
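
The step of applying a different Query/Key/Value combination to each group can be illustrated with a short enumeration. The sketch assumes that each of Query, Key, and Value is taken from either the image features or the text features, which gives 2 × 2 × 2 = 8 candidate combinations. The claim does not state which of these are actually used, and the group identifiers and function names are hypothetical.

```python
from itertools import product

SOURCES = ("image", "text")

def qkv_combinations():
    """All assignments of the image or text feature to Query, Key, and Value."""
    return [
        dict(zip(("query", "key", "value"), choice))
        for choice in product(SOURCES, repeat=3)
    ]

def assign_to_groups(group_ids, combos):
    """Give each group a different combination (needs at least as many combos as groups)."""
    if len(group_ids) > len(combos):
        raise ValueError("more groups than available combinations")
    return dict(zip(group_ids, combos))

combos = qkv_combinations()
assignment = assign_to_groups(["group_1", "group_2", "group_3"], combos)
```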

12. The method of claim 11, wherein the gastric biopsy tissue image is an image obtained by biopsy of the patient's stomach tissue, includes a bounding box specifying an area containing a tumor within the stomach tissue, and is mapped to a correct class specifying gastric cancer diagnosis information for the bounding box; and the questionnaire text includes text of questions and answers about the patient's lifestyle habits, physical information, and family history, and is mapped to a correct class specifying gastric cancer diagnosis information for the patient who answered the questions.

13. The method of claim 12, wherein classifying into the multiple groups comprises: classifying the patients' training data into k clusters (where k is a natural number greater than or equal to 2) by applying a k-clustering algorithm based on at least one of the questions and answers regarding the patient's lifestyle habits, physical information, and family history; and sequentially extracting a preset number of training data from each of the k clusters to generate a plurality of groups in which the training data of each cluster is evenly included.
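
The second step of this claim, building groups that draw evenly from every cluster, can be sketched as below. The sketch assumes the clustering has already been done and is supplied as a mapping from cluster id to an ordered list of samples. The names `build_balanced_groups`, `per_cluster`, and `num_groups`, and the sample labels, are hypothetical.

```python
def build_balanced_groups(clusters, per_cluster, num_groups):
    """Take `per_cluster` samples from every cluster, in order, for each group.

    clusters: dict mapping cluster id -> list of training samples
    Returns a list of groups; each group holds the same number of samples
    from every cluster (as long as the clusters are large enough).
    """
    groups = [[] for _ in range(num_groups)]
    for samples in clusters.values():
        for g in range(num_groups):
            start = g * per_cluster
            # Sequentially extract the next block of samples from this cluster.
            groups[g].extend(samples[start:start + per_cluster])
    return groups

# Toy clusters (assumed data): k = 2 clusters of four patients each.
clusters = {
    0: ["a1", "a2", "a3", "a4"],
    1: ["b1", "b2", "b3", "b4"],
}
groups = build_balanced_groups(clusters, per_cluster=2, num_groups=2)
```

Each resulting group contains two samples from cluster 0 and two from cluster 1, so every group mirrors the overall mix of clusters.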

14. The method of claim 11, wherein training the test MLP model comprises: creating, for each group, a test MLP model with inputs and outputs configured to predict the correct class for gastric cancer diagnosis from combined features; performing supervised learning such that the parameters included in each test MLP model are updated, based on a predetermined objective function, to minimize the loss between the predicted value of the test MLP model for the combined features of each group and the correct class mapped to the training data of that group; and evaluating the diagnostic performance of each combination by measuring the precision, recall, and F1 score of each trained test MLP model.

15. The method of claim 14, wherein training the test MLP model comprises: calculating a performance score as a weighted sum of the precision, recall, and F1 score of each of the test MLP models; and determining, as the first combination, the combination applied to a first group, the first group being the group used to train the test MLP model with the highest weighted sum.
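
The weighted-sum scoring in this claim can be sketched as follows. The claim does not give the weights, so equal weights are assumed here; the confusion counts, combination labels, and function names are illustrative only.

```python
def prf1(tp, fp, fn):
    """Precision, recall, and F1 score from true-positive, false-positive, false-negative counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if precision + recall
        else 0.0
    )
    return precision, recall, f1

def performance_score(metrics, weights=(1 / 3, 1 / 3, 1 / 3)):
    """Weighted sum of (precision, recall, F1); equal weights are an assumption."""
    return sum(w * m for w, m in zip(weights, metrics))

def select_first_combination(results, weights=(1 / 3, 1 / 3, 1 / 3)):
    """Pick the combination whose test model has the highest weighted score.

    results: dict mapping combination label -> (tp, fp, fn) of its test MLP model
    """
    scores = {
        combo: performance_score(prf1(*counts), weights)
        for combo, counts in results.items()
    }
    return max(scores, key=scores.get), scores

# Toy evaluation counts per combination (assumed data).
results = {
    "Q=text,K=image,V=image": (8, 2, 2),
    "Q=image,K=text,V=text": (6, 4, 2),
}
first_combination, scores = select_first_combination(results)
```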

16. The method of claim 15, wherein generating the final MLP model comprises: setting a predetermined objective function in the final MLP model, which has inputs and outputs configured to predict the correct class for gastric cancer diagnosis from combined features; and performing end-to-end learning such that the parameters included in the image encoder, the text encoder, the cross-attention layer, and the final MLP model are updated based on the objective function so as to minimize the loss between the correct class and the predicted value of the final MLP model for the combined features based on the first combination.

17. The method of claim 11, wherein the image encoder includes a Vision Transformer model and is pre-trained to generate image features composed of multidimensional vectors from the gastric biopsy tissue image; and the text encoder includes a Transformer model and is pre-trained to generate text features composed of multidimensional vectors from the questionnaire text.

18. The method of claim 11, wherein the cross-attention layer: sets a Query, a Key, and a Value, each composed of either the image feature or the text feature; calculates the similarity between the Query and the Key based on the Multi-Head Attention algorithm of the Transformer model; generates an attention weight-based heatmap by normalizing the calculated similarity with a Softmax function; and generates a combined feature by applying the attention weights to the Value.

19. A method performed by a device operated by a processor, the method comprising: acquiring training data including a gastric biopsy tissue image and lifestyle questionnaire text of a patient; classifying the training data into multiple groups such that the groups have statistical distributions of different characteristics, based on the statistical distribution of patient characteristics included in the training data; generating image features from the image based on an image encoder; generating text features from the text based on a text encoder; determining, for each group, the first combination with the highest performance among the combinations of Query, Key, and Value applicable to the cross-attention mechanism based on the image features and the text features; generating, using the cross-attention layer, a combined feature for each group based on the first combination determined for that group; and training an MLP model with inputs and outputs configured to predict the correct class for gastric cancer diagnosis from the combined features of each group.

20. The method of claim 19, wherein the gastric biopsy tissue image is an image obtained by biopsy of the patient's stomach tissue, includes a bounding box specifying an area containing a tumor within the stomach tissue, and is mapped to a correct class specifying gastric cancer diagnosis information for the bounding box; and the questionnaire text includes text of questions and answers about the patient's lifestyle habits, physical information, and family history, and is mapped to a correct class specifying gastric cancer diagnosis information for the patient who answered the questions.

21. The method of claim 20, wherein classifying into the multiple groups comprises: classifying the patients' training data into k clusters (where k is a natural number greater than or equal to 2) by applying a k-clustering algorithm based on at least one of the questions and answers regarding the patient's lifestyle habits, physical information, and family history; and grouping the training data of the patients classified into each of the k clusters to generate multiple groups, each group having a statistical distribution of different characteristics.

22. The method of claim 19, wherein determining the first combination with the highest performance for each group comprises: generating attention weight-based heatmaps for each combination of Query, Key, and Value applicable to the cross-attention mechanism, based on the image features and text features of the training data belonging to a first group among the groups; and determining, as the first combination to be applied to the first group, the combination applied to the heatmap having the highest similarity to the bounding box labeled on the gastric biopsy tissue image.

23. The method of claim 22, wherein determining the first combination to be applied to the first group comprises: generating an attention weight-based heatmap for each combination based on the training data belonging to the first group among the groups; extracting, from the heatmap for each combination, attention regions in which the attention score is greater than or equal to a predetermined threshold; calculating the Intersection over Union (IoU) value between the attention region of the heatmap for each combination and the bounding box labeled on the gastric biopsy tissue image of the training data used to generate the corresponding heatmap; and determining, as the first combination to be applied to the first group, the combination applied to the heatmap with the largest calculated IoU value.

24. The method of claim 23, wherein, when multiple heatmaps are generated for each combination using multiple training data, determining the first combination to be applied to the first group comprises: generating a plurality of attention weight-based heatmaps for each combination based on a plurality of training data belonging to the first group among the groups; extracting, from the plurality of heatmaps for each combination, attention regions in which the attention score is greater than or equal to a predetermined threshold; calculating the average of the Intersection over Union (IoU) values between the attention regions of the plurality of heatmaps for each combination and the bounding boxes labeled on the gastric biopsy tissue images of the respective training data used to generate each heatmap; and determining, as the first combination to be applied to the first group, the combination with the largest average of the IoU values calculated over its plurality of heatmaps.

25. The method of claim 22, wherein determining the first combination to be applied to the first group comprises: generating an attention weight-based heatmap for each combination based on the training data belonging to the first group among the groups; extracting, from the heatmap for each combination, attention regions in which the attention score is greater than or equal to a predetermined threshold; calculating the inner product between a first feature vector obtained by vectorizing the attention region of the heatmap for each combination and a second feature vector obtained by vectorizing the bounding box labeled on the gastric biopsy tissue image of the training data used to generate the heatmap; and determining, as the first combination to be applied to the first group, the combination applied to the heatmap with the largest calculated inner product value.

26. The method of claim 25, wherein, when multiple heatmaps are generated for each combination using multiple training data, determining the first combination to be applied to the first group comprises: generating a plurality of attention weight-based heatmaps for each combination based on a plurality of training data belonging to the first group among the groups; extracting, from the plurality of heatmaps for each combination, attention regions in which the attention score is greater than or equal to a predetermined threshold; generating a vector for each of the attention regions of the plurality of heatmaps for each combination, and then calculating the inner product between a first feature vector formed by combining those vectors and a second feature vector formed by vectorizing the bounding boxes labeled on the gastric biopsy tissue images of the training data used to generate the corresponding heatmaps; and determining, as the first combination to be applied to the first group, the combination with the largest inner product value calculated over its plurality of heatmaps.

27. The method of claim 26, wherein the first feature vector is generated by combining the vectors based on the Principal Component Analysis (PCA) technique.

28. The method of claim 19, wherein training the MLP model comprises: setting a predetermined objective function in the MLP model; and performing end-to-end learning such that the parameters included in the image encoder, the text encoder, the cross-attention layer, and the MLP model are updated based on the objective function so as to minimize the loss between the correct class and the predicted value of the MLP model for the combined features generated from the training data belonging to each group based on the first combination determined for that group.