Analysis device, analysis method, and analysis program

The analysis device facilitates the analysis and visualization of Pareto fronts in discrete data spaces by clustering and identifying overlaps in linear discriminant models, addressing the challenges of singularities and enabling effective identification of optimal solutions.

WO2026126566A1 · PCT designated stage · Publication Date: 2026-06-18 · HITACHI LTD


Patent Information

Authority / Receiving Office: WO · WO
Patent Type: Applications
Current Assignee / Owner: HITACHI LTD
Filing Date: 2025-07-29
Publication Date: 2026-06-18

IPC: G06F18/2451; G06F18/23; G06F30/20; G06N99/00; G06F111/06

Abstract

This analysis device including a processor that executes a program and a storage device that stores the program executes: an acquisition process for acquiring, for each of a plurality of characteristics related to an analysis object expressed by discrete data having a plurality of discrete values, coefficient data of a linear discrimination model that discriminates the characteristics for each of the discrete data; a classification process for classifying the coefficient data acquired for each of the discrete data by the acquisition process into a plurality of clusters for each of the characteristics; an identification process for identifying the discrete data corresponding to the discrimination model of the coefficient data belonging to two or more clusters differing in characteristics classified by the classification process; and an output process for outputting the discrete data identified by the identification process.


Description

Analysis device, analysis method, and analysis program

Incorporation by reference

[0001] This application claims priority to Japanese Patent Application No. 2024-218721, filed on December 13, 2024, and incorporates its contents by reference.

[0002] The present invention relates to an analysis device, an analysis method, and an analysis program for analyzing data.

[0003] In industrial product development, such as drug discovery, it is necessary to find candidates that meet the criteria for multiple evaluation items. Taking antibody development as an example, these include binding affinity, solubility, hydrophobicity, and immunogenicity, and even in early development, the number of evaluation items typically reaches several dozen.

[0004] The search process to find candidates that satisfy multiple evaluation criteria is called multi-objective optimization, but it is generally known to become difficult when the number of evaluation criteria exceeds four. In multi-objective optimization, a Pareto front is searched for evaluation criteria that have a trade-off relationship, and understanding its geometric structure is useful for search strategies and manufacturing parameter design strategies (such as how much the conditions can be relaxed in consideration of adaptation). However, when the dimensionality of the evaluation criteria exceeds four, there is a challenge in identifying and visualizing this structure.

[0005] Patent Document 1 below discloses an output program that assists in the selection of a solution. This output program causes a computer to perform the following processes: acquisition, fitting, generation, acceptance, and output. The acquisition process acquires first graph data relating to the solution on the Pareto front and second graph data relating to a plurality of evaluation values corresponding to the solution. The fitting process fits a Bezier figure to each of the acquired graph data. The generation process generates a plurality of data points on the Bezier figures for each of the first and second graph data. The acceptance process accepts input conditions indicating the range of each of the plurality of evaluation values. The output process outputs the solution shown on the data points on the Bezier figure fitted to the first graph data, which correspond to the data points on the Bezier figure fitted to the second graph data selected based on the accepted conditions.

[0006] Patent Document 2 below discloses a knee-point visualization device that visualizes a sample containing noise under conditions close to those of a true, noise-free state. The knee-point visualization device identifies a weight region in which a specific knee point continuously corresponds to a knee point. The knee-point visualization device calculates the stability of each weight from among multiple weights in which a specific knee point corresponds to a knee point, based on the width of the weight region. Based on the stability, the knee-point visualization device identifies the optimal weight for visualization from among the multiple weights, and visualizes the specific knee point based on the identified weight.

[0007] Patent Document 1: Japanese Patent Publication No. 2024-062751; Patent Document 2: Japanese Patent Publication No. 2019-153012

[0008] According to Morse theory, if the objective function, or a function obtained by perturbing the objective function, is a Morse function, then the geometric structure of the Pareto front is a simplex. The above methods implicitly assume these favorable mathematical conditions. Furthermore, some search methods require obtaining multiple Pareto solutions beforehand, and the quality of the estimation depends on how these solutions are obtained.

[0009] On the other hand, antibody design uses amino acid sequences as input for multi-objective optimization, and amino acid sequences are discrete-space data: multidimensional sequences over a 20-character alphabet. Therefore, the domain cannot be assumed to be a smooth manifold. Nor can assumptions be made about the objective function, such as its being a Morse function without singularities, which makes it difficult to estimate the Pareto front by interpolation and extrapolation using a Bézier simplex or the like. For example, if some amino acids in an amino acid sequence are swapped, the sequences before and after the swap remain similar as sequences, but the three-dimensional structure can change significantly and the physical properties can change drastically, so singularities exist. In addition, as the number of evaluation items increases, it becomes difficult to obtain Pareto solutions beforehand.

[0010] This type of problem is not limited to amino acid sequences; it can occur with other data in discrete space as well.

[0011] The present invention aims to facilitate the analysis of Pareto fronts.

[0012] An analysis device according to one aspect of the invention disclosed in this application is an analysis device having a processor that executes a program and a storage device that stores the program, wherein the processor performs: an acquisition process for acquiring, for each of a plurality of characteristics relating to an object to be analyzed that is represented by discrete data having a plurality of discrete values, coefficient data of a linear discriminant model that discriminates the characteristic for each of the discrete data; a classification process for classifying the coefficient data acquired for each of the discrete data by the acquisition process into a plurality of clusters for each of the characteristics; an identification process for identifying the discrete data corresponding to the discriminant model of the coefficient data belonging to two or more clusters with different characteristics classified by the classification process; and an output process for outputting the discrete data identified by the identification process.

[0013] According to a typical embodiment of the present invention, the analysis of the Pareto front can be facilitated. Problems, configurations, and effects other than those described above will be clarified by the following description of the embodiments.

[0014] Figure 1 is an explanatory diagram showing an example of the system configuration of the analysis system. Figure 2 is a block diagram showing an example of the hardware configuration of the computer. Figure 3 is an explanatory diagram showing an example of a sample DB. Figure 4 is an explanatory diagram showing an example of a cluster DB. Figure 5 is an explanatory diagram showing an example of an overlap DB. Figure 6 is a block diagram showing an example of the functional configuration of the analysis device. Figure 7 is a flowchart showing an example of the analysis processing procedure by the analysis device. Figure 8 is a flowchart showing a detailed example of the processing procedure for the clustering process of the discriminant model (step S704). Figure 9 is an explanatory diagram showing an example of the generation of a representative discriminant model. Figure 10 is an explanatory diagram showing a representative discriminant model. Figure 11 is an explanatory diagram showing an example of singularity determination in step S706. Figure 12 is an explanatory diagram showing an example of independence determination in step S706. Figure 13 is an explanatory diagram showing an example of the output result in the output processing (step S707). Figure 14 is an explanatory diagram showing an example of the rearrangement of coefficient data. Figure 15 is an explanatory diagram showing an example of the structure of a neural network according to Embodiment 2. Figure 16 is a block diagram showing an example of the functional configuration of the acquisition unit according to Embodiment 2.

[0015] <Figure 1 Analysis System> Figure 1 is an explanatory diagram showing an example of the system configuration of the analysis system. The analysis system 100 includes an analysis device 101 and a terminal device 102. The analysis device 101 and the terminal device 102 are connected to each other via a network 103 such as the Internet, LAN (Local Area Network), or WAN (Wide Area Network).

[0016] The analysis device 101 is a computer that performs structural analysis and visualization near the Pareto front. The analysis device 101 includes a sample DB 104, a cluster DB 105, and an overlap DB 106. The sample DB 104, cluster DB 105, and overlap DB 106 may be provided in a computer (not shown) that can communicate with the analysis device 101 via a network 103, rather than in the analysis device 101.

[0017] Sample DB104 is a database that stores data about samples. Cluster DB105 is a database that stores data about clusters, which are sets of sample groups. Overlap DB106 is a database that stores information about the overlap between clusters.

[0018] The terminal device 102 is a computer that communicates with the analysis device 101 via the network 103. The terminal device 102 inputs data and instructions to the analysis device 101 and displays output data from the analysis device 101.

[0019] <Figure 2 Example of Hardware Configuration of Computer (Analysis Device 101, Terminal Device 102)> Figure 2 is a block diagram showing an example of the hardware configuration of a computer. Computer 200 includes a processor 201, a storage device 202, an input device 203, an output device 204, and a communication interface (communication IF) 205. The processor 201, storage device 202, input device 203, output device 204, and communication IF 205 are connected by a bus 206. The processor 201 controls computer 200. The storage device 202 serves as the work area for the processor 201. The storage device 202 is also a non-transitory or transitory recording medium that stores various programs and data. Examples of the storage device 202 include ROM (Read Only Memory), RAM (Random Access Memory), HDD (Hard Disk Drive), and flash memory. The input device 203 inputs data. Examples of the input device 203 include keyboards, mice, touch panels, numeric keypads, scanners, microphones, and sensors. The output device 204 outputs data. Examples of the output device 204 include displays, printers, and speakers. The communication IF 205 connects to network 103 and sends and receives data.

[0020] <Figure 3 Sample DB104> Figure 3 is an explanatory diagram showing an example of Sample DB104. Sample DB104 has, as fields, Sample ID 301, amino acid sequence 302, and evaluation item group 303.

[0021] Sample ID 301 is identification information that uniquely identifies the sample. The sample is the target of structural analysis near the Pareto front, and in this example, it is amino acid sequence 302. However, any sample other than amino acid sequence 302 may be used, as long as it is a sequence of multiple discrete values.

[0022] The amino acid sequence 302 is a sequence of amino acids including alanine (A), cysteine (C), aspartic acid (D), glutamic acid (E), phenylalanine (F), glycine (G), histidine (H), isoleucine (I), lysine (K), leucine (L), methionine (M), asparagine (N), proline (P), glutamine (Q), arginine (R), serine (S), threonine (T), valine (V), tryptophan (W), and tyrosine (Y). In Embodiment 1, the amino acid sequence 302 is the sequence of 20 amino acids contained in the CDR3 region of the heavy chain.

[0023] Furthermore, the entries in Sample DB104 may include samples that have been perturbed from a given sample. Specifically, for example, an amino acid sequence Aj (i≠j) obtained by replacing some amino acids in an amino acid sequence Ai may be included in the entries of Sample DB104.

[0024] The evaluation item group 303 is a set of evaluation items. An evaluation item is a sample characteristic used to search for candidate samples in multi-objective optimization. In Figure 3, as an example of evaluation items, binding affinity information 303a, solubility information 303b, and hydrophobicity information 303c are used. There should be at least two evaluation items.

[0025] The binding affinity information 303a includes binding affinity 331a and coefficient data 332a. Binding affinity 331a is an evaluation value indicating the strength of the binding interaction between the target amino acid sequence 302 and the ligand (e.g., a drug or inhibitor). A smaller evaluation value indicates that the target and ligand bind more easily.

[0026] The coefficient data 332a is a set of one or more coefficients in a discrimination model that determines the binding affinity 331a of amino acid sequence 302, and is represented by a vector.

[0027] The discrimination model for binding affinity 331a is a linear model whose explanatory variables are the discrete values Xk unique to each amino acid in the amino acid sequence 302 (where k is a number that identifies the position of the amino acid from the beginning, and k = 1 to K; in this example, K = 20) and whose dependent variable is Ya (Equation (1) below). The model determines whether the amino acid sequence 302 has binding affinity 331a based on whether Ya is equal to or greater than a predetermined threshold for discriminating binding affinity 331a.

[0028] $Y_a = \sum_{k=1}^{K} a_k X_k$ (1)

[0029] ak is the k-th coefficient from the beginning in the coefficient data 332a. That is, the coefficient data 332a is (a1, a2, ..., aK).
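The linear discrimination model described above can be illustrated with a minimal Python sketch. It assumes the model is a plain weighted sum of the position-wise discrete values with no intercept, and it assumes a hypothetical encoding in which each amino acid letter maps to its index (1 to 20) in a fixed alphabet; the original text specifies neither the encoding nor the threshold values. The names `ENCODING`, `linear_score`, and `has_characteristic` are illustrative only.

```python
# Hypothetical encoding (an assumption, not from the source):
# each amino acid letter maps to its 1-based index in this alphabet.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
ENCODING = {aa: i + 1 for i, aa in enumerate(AMINO_ACIDS)}


def encode(sequence):
    """Map an amino acid sequence to the discrete values X_1..X_K."""
    return [ENCODING[aa] for aa in sequence]


def linear_score(coeffs, sequence):
    """Return Y = sum_k a_k * X_k for coefficient data (a_1..a_K)."""
    xs = encode(sequence)
    if len(xs) != len(coeffs):
        raise ValueError("coefficient data and sequence must both have length K")
    return sum(a * x for a, x in zip(coeffs, xs))


def has_characteristic(coeffs, sequence, threshold):
    """Discriminate the characteristic: True when Y >= threshold."""
    return linear_score(coeffs, sequence) >= threshold


# Toy example with K = 3 instead of K = 20.
coeffs = [1.0, 0.5, -1.0]
print(linear_score(coeffs, "ACD"))             # 1*1 + 0.5*2 - 1*3 = -1.0
print(has_characteristic(coeffs, "ACD", 0.0))  # False
```

The same functions apply to every evaluation item; only the coefficient data and the threshold differ per item.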

[0030] Solubility information 303b includes solubility 331b and coefficient data 332b. Solubility 331b is an evaluation value that indicates the limit amount of amino acid sequence 302 that dissolves in a certain amount of water at a certain temperature; the smaller the evaluation value, the more easily the amino acid sequence 302 dissolves.

[0031] The coefficient data 332b is a set of one or more coefficients in a discrimination model that determines the solubility 331b of amino acid sequence 302, and is represented by a vector.

[0032] The discrimination model for solubility 331b is a linear model whose explanatory variables are the discrete values Xk unique to each amino acid in the amino acid sequence 302 and whose dependent variable is Yb (see Equation (2) below). The model determines whether the amino acid sequence 302 has solubility 331b based on whether Yb is greater than or equal to a predetermined threshold for discriminating solubility 331b.

[0033] $Y_b = \sum_{k=1}^{K} b_k X_k$ (2)

[0034] bk is the k-th coefficient from the beginning in the coefficient data 332b. That is, the coefficient data 332b is (b1, b2, ..., bK).

[0035] The hydrophobicity information 303c includes hydrophobicity 331c and coefficient data 332c. Hydrophobicity 331c is an evaluation value indicating the degree to which the amino acid sequence 302 is poorly soluble in water or poorly miscible with water. A larger evaluation value indicates that the amino acid sequence 302 is less soluble in water or less miscible with water.

[0036] The coefficient data 332c is a set of one or more coefficients in a discrimination model that distinguishes the hydrophobicity 331c of amino acid sequence 302, and is represented by a vector.

[0037] The discrimination model for hydrophobicity 331c is a linear model whose explanatory variables are the discrete values Xk unique to each amino acid in the amino acid sequence 302 and whose dependent variable is Yc (see Equation (3) below). The model determines whether the amino acid sequence 302 has hydrophobicity 331c based on whether Yc is greater than or equal to a predetermined threshold for discriminating hydrophobicity 331c.

[0038] $Y_c = \sum_{k=1}^{K} c_k X_k$ (3)

[0039] ck is the k-th coefficient from the beginning in the coefficient data 332c. That is, the coefficient data 332c is (c1, c2, ..., cK).

[0040] In equations (1) to (3) above, when the dependent variables Ya to Yc are not distinguished, they are referred to as the dependent variable Y. Similarly, when the coefficient data 332a to 332c are not distinguished, they are referred to as coefficient data 332. Coefficient data 332 may be given in advance or derived by a specific process. In Embodiment 1, coefficient data 332 is assumed to be given in advance. The derivation of coefficient data 332 will be described later in Embodiment 2.

[0041] <Figure 4 Cluster DB 105> Figure 4 is an explanatory diagram showing an example of the cluster DB 105. The cluster DB 105 is a database created by clustering the coefficient data 332 of the discrimination model for each of the multiple samples for each evaluation item. The cluster DB 105 has a sample ID 301 and cluster information 401 as fields.

[0042] The cluster information 401 includes a binding affinity cluster 401a whose evaluation item is binding affinity 331a, a solubility cluster 401b whose evaluation item is solubility 331b, and a hydrophobic cluster 401c whose evaluation item is hydrophobicity 331c. In the binding affinity cluster 401a, two or more clusters Ca1, Ca2, ... are generated; in the solubility cluster 401b, two or more clusters Cb1, Cb2, ... are generated; and in the hydrophobic cluster 401c, two or more clusters Cc1, Cc2, ... are generated.

[0043] The entry for each sample identified by sample ID 301 in cluster DB105 indicates which cluster the coefficient data 332 of the discrimination model for that sample belongs to. "1" indicates that it belongs to that cluster, and "0" indicates that it does not belong to that cluster. For example, the coefficient data 332 of the discrimination model for amino acid sequence 302 with sample ID 301 "S1" can be seen to belong to clusters Ca1 and Ca2, which are binding affinity clusters 401a, to cluster Cb1, which is solubility cluster 401b, and to cluster Cc2, which is hydrophobic cluster 401c.

[0044] <Figure 5 Overlap DB 106> Figure 5 is an explanatory diagram showing an example of the overlap DB 106. The overlap DB 106 is a database created by a process that identifies the overlap between clusters of different evaluation items in the cluster DB 105. The overlap DB 106 has the following fields: cluster combination 501, member sample ID 502, and sample number 503.

[0045] Cluster combination 501 is an exhaustive (round-robin) combination of clusters between different evaluation items. In this example, there are three evaluation items: binding affinity 331a, solubility 331b, and hydrophobicity 331c. Therefore, cluster combination 501 can be a combination of a cluster of binding affinity 331a, a cluster of solubility 331b, and a cluster of hydrophobicity 331c; a combination of a cluster of binding affinity 331a and a cluster of solubility 331b; a combination of a cluster of binding affinity 331a and a cluster of hydrophobicity 331c; or a combination of a cluster of solubility 331b and a cluster of hydrophobicity 331c.

[0046] The member sample ID 502 is the sample ID 301 of the amino acid sequence 302 whose coefficient data 332 belongs to every cluster in the cluster combination 501. For example, the member sample ID 502 for the combination of cluster Ca1 for binding affinity 331a, cluster Cb1 for solubility 331b, and cluster Cc2 for hydrophobicity 331c in the second row entry is "S1".

[0047] This indicates that the amino acid sequence 302 "A1" with sample ID 301 "S1" belongs to all of cluster Ca1 for binding affinity 331a, cluster Cb1 for solubility 331b, and cluster Cc2 for hydrophobicity 331c. In other words, in the entry for sample ID 301 "S1" in cluster DB105, the values for cluster Ca1 for binding affinity 331a, cluster Cb1 for solubility 331b, and cluster Cc2 for hydrophobicity 331c are all set to "1".

[0048] If no member sample ID 502 exists, there is no overlapping coefficient data 332 in that cluster combination 501. The sample number 503 is the number of samples listed in the member sample ID 502.
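The construction of the overlap DB 106 from the cluster DB 105 can be sketched in Python as follows. This is an illustrative sketch under stated assumptions, not the patented implementation: the cluster DB is modeled as a dictionary from sample ID to the set of clusters the sample belongs to per evaluation item, the function name `build_overlap_db` is hypothetical, and sample "S2" is invented for the example (only the memberships of "S1" come from the text).

```python
from itertools import combinations, product


def build_overlap_db(cluster_db, items):
    """Enumerate cluster combinations across two or more evaluation items.

    cluster_db: {sample_id: {item: set of cluster names}}
    items:      evaluation items in display order
    Returns {cluster combination (tuple): sorted list of member sample IDs}.
    """
    # All cluster names that occur for each evaluation item.
    clusters = {
        item: sorted({c for m in cluster_db.values() for c in m.get(item, ())})
        for item in items
    }
    overlap_db = {}
    for size in range(2, len(items) + 1):
        for item_subset in combinations(items, size):
            for combo in product(*(clusters[i] for i in item_subset)):
                # A sample is a member only if it belongs to every cluster in combo.
                overlap_db[combo] = sorted(
                    sid
                    for sid, m in cluster_db.items()
                    if all(c in m.get(i, ()) for i, c in zip(item_subset, combo))
                )
    return overlap_db


ITEMS = ["binding_affinity", "solubility", "hydrophobicity"]
CLUSTER_DB = {
    # Memberships of S1 follow the text; S2 is a made-up sample.
    "S1": {"binding_affinity": {"Ca1", "Ca2"}, "solubility": {"Cb1"}, "hydrophobicity": {"Cc2"}},
    "S2": {"binding_affinity": {"Ca2"}, "solubility": {"Cb1"}, "hydrophobicity": {"Cc1"}},
}

overlap = build_overlap_db(CLUSTER_DB, ITEMS)
for combo, members in overlap.items():
    print(combo, members, len(members))  # combination, member sample IDs, sample number
```

Combinations whose member list is empty correspond to entries with no member sample ID; the length of each list plays the role of the sample number 503.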

[0049] <Figure 6 Example of Functional Configuration of Analysis Device 101> Figure 6 is a block diagram showing an example of the functional configuration of the analysis device 101. The analysis device 101 includes an acquisition unit 601, a classification unit 602, an identification unit 603, a determination unit 604, and an output unit 605. Specifically, the acquisition unit 601, the classification unit 602, the identification unit 603, the determination unit 604, and the output unit 605 are realized, for example, by having the processor 201 execute a program stored in the storage device 202 shown in Figure 2.

[0050] The acquisition unit 601 acquires the amino acid sequence 302 and its coefficient data 332a, 332b, and 332c from the sample DB 104.

[0051] The classification unit 602 clusters the coefficient data 332 of each discrimination model for the multiple amino acid sequences 302 acquired by the acquisition unit 601 for each evaluation item. That is, the classification unit 602 classifies the coefficient data 332 of the multiple discrimination models into multiple clusters for each evaluation item. The classification unit 602 identifies the sample (amino acid sequence 302) of coefficient data 332 belonging to each cluster and generates cluster information 401.

[0052] The classification unit 602 clusters the set of coefficient data 332, for each evaluation item, in a feature space whose features are the K = 20 coefficients. Therefore, in this example, a group of binding affinity clusters 401a, a group of solubility clusters 401b, and a group of hydrophobic clusters 401c are generated.

[0053] The identification unit 603 identifies overlaps between clusters of different evaluation items. Specifically, for example, the identification unit 603 creates an overlap DB 106 by referring to the cluster DB 105 and identifying samples of coefficient data 332 where clusters of different evaluation items overlap.

[0054] The determination unit 604 determines the characteristics of the geometric structure near the Pareto front. Specifically, for example, the determination unit 604 performs singularity possibility determination and independence determination.

[0055] Singularity possibility determination is the process of determining whether or not a singularity may exist between two overlapping clusters. Specifically, for example, the determination unit 604 determines that a singularity may exist between the representative discriminant models if the signs of the coefficient data 332 of the representative discriminant model of one cluster are inverted relative to the signs of the coefficient data 332 of the representative discriminant model of the other cluster. A representative discriminant model is a discriminant model that uses coefficient data obtained by statistically processing (for example, averaging) the coefficient data 332 within the cluster.

[0056] This result makes it easier to search for singularities. In other words, it indicates that the optimal point for at least one of the evaluation items in both overlapping clusters is likely to be near a singularity. Singularities are critical points where the phenotype undergoes a phase transition and are biologically distinctive.
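The sign comparison used in singularity possibility determination can be sketched in Python as follows. The text does not state whether every coefficient, or only some, must have an inverted sign, so the `min_fraction` parameter below is an assumption introduced for illustration (the default of 1.0 requires inversion at every position). The function names are hypothetical.

```python
def sign(value, eps=1e-12):
    """Return 1, -1, or 0 (for values within eps of zero)."""
    if value > eps:
        return 1
    if value < -eps:
        return -1
    return 0


def inverted_positions(rep_a, rep_b):
    """1-based positions k where the two representative coefficients have opposite signs."""
    if len(rep_a) != len(rep_b) or not rep_a:
        raise ValueError("representative coefficient data must be non-empty and of equal length")
    return [
        k
        for k, (a, b) in enumerate(zip(rep_a, rep_b), start=1)
        if sign(a) * sign(b) == -1
    ]


def may_have_singularity(rep_a, rep_b, min_fraction=1.0):
    """Flag a possible singularity when enough positions are sign-inverted.

    min_fraction is an assumed tuning knob, not taken from the source.
    """
    flipped = inverted_positions(rep_a, rep_b)
    return len(flipped) >= min_fraction * len(rep_a)


rep_affinity = [1.0, -2.0, 0.5]     # toy representative coefficient data
rep_solubility = [-0.3, 1.0, 0.2]
print(inverted_positions(rep_affinity, rep_solubility))                      # [1, 2]
print(may_have_singularity(rep_affinity, rep_solubility))                    # False
print(may_have_singularity(rep_affinity, rep_solubility, min_fraction=0.5))  # True
```

Returning the inverted positions, rather than only a flag, also shows which sequence positions drive the disagreement between the two evaluation items.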

[0057] Furthermore, independence determination is the process of determining the independence of the representative discriminant models for two overlapping clusters. Independence of representative discriminant models means that the overlap between the two clusters is maintained even if the coefficient at a certain sequence position k=p of one representative discriminant model, or the coefficient at a certain sequence position k=q (≠p) of the other representative discriminant model, is increased or decreased. This determination result makes it easier to estimate evaluation items that may be subject to simultaneous optimization.

[0058] The output unit 605 outputs various types of data. Specifically, for example, the output unit 605 outputs the cluster groups classified by the classification unit 602 for each evaluation item, the identification results by the identification unit 603, and the determination results by the determination unit 604 in a displayable format. The identification results indicate regions where a Pareto solution may exist. Specifically, for example, the identification results are sample IDs 301 corresponding to the coefficient data 332 in regions where clusters between different evaluation items overlap. Sample IDs 301 are displayed as points in the space spanned by the overlapping evaluation items (hereinafter referred to as the characteristic space).

[0059] For example, if the sample with sample ID 301 "S1", whose amino acid sequence 302 is "A1", lies in the overlap between a cluster for binding affinity 331a and a cluster for solubility 331b, the sample is displayed as a point in the characteristic space spanned by the axis for binding affinity 331a and the axis for solubility 331b, at the position where the value of binding affinity 331a is Fa1 and the value of solubility 331b is Fb1. The value S1 of sample ID 301, the value A1 of amino acid sequence 302, the value Fa1 of binding affinity 331a, and the value Fb1 of solubility 331b are associated with this point, and this information can be displayed, for example, by hovering the mouse pointer over the point.

[0060] Here, the combination of overlapping characteristics is binding affinity 331a and solubility 331b, but any combination of at least two of binding affinity 331a, solubility 331b, and hydrophobicity 331c is acceptable.

[0061] The output unit 605 may display this output data on a display, which is an example of an output device 204, or it may transmit it from the communication IF 205 to the terminal device 102 via the network 103 and have it displayed on a display, which is an example of an output device 204 of the terminal device 102.

[0062] <Figure 7 Example of analysis processing procedure by analysis device 101> Figure 7 is a flowchart showing an example of the analysis processing procedure by analysis device 101.

[0063] (Step S701) The analysis device 101 determines whether or not there are any unselected evaluation items. If there are any unselected evaluation items (Step S701: Yes), proceed to step S702. If there are no unselected evaluation items (Step S701: No), proceed to step S705.

[0064] (Step S702) The analysis device 101 selects one unselected evaluation item. The selected evaluation item is referred to as the selected evaluation item.

[0065] (Step S703) The analysis device 101 uses the acquisition unit 601 to acquire coefficient data 332 of the discrimination model for the selected evaluation item from the sample DB 104 for each sample.

[0066] (Step S704) The analysis device 101 performs clustering processing of the discrimination model using the classification unit 602, and returns to step S701. Details of the clustering processing of the discrimination model will be described later.

[0067] (Step S705) The analysis device 101 uses the identification unit 603 to identify samples of coefficient data 332 where clusters between evaluation items overlap, and creates an overlap DB 106.

[0068] (Step S706) The analysis device 101, using the determination unit 604, determines the possibility of a singularity and the independence based on the coefficient data of the representative discriminant models of both overlapping clusters.

[0069] (Step S707) The analysis device 101 outputs the amino acid sequence 302, which is the sample, the cluster groups classified by the classification unit 602 according to the evaluation items, the identification results by the identification unit 603, and the determination results by the determination unit 604 in a displayable format via the output unit 605. The identification results by the identification unit 603 indicate the region where a Pareto solution may exist. This completes the series of processes.

[0070] <Figure 8 Clustering process of the discrimination model (step S704)> Figure 8 is a flowchart showing a detailed example of the processing procedure for the clustering process of the discrimination model (step S704). Step S704 is performed for the selected evaluation item chosen in step S702.

[0071] (Step S801) The analysis device 101 sets an initial value for the number of clusters. The number of clusters that has been set is referred to as the set number of clusters P. The initial value of the set number of clusters P is, for example, P = 2. The upper limit of the set number of clusters P is called the upper limit number of clusters Pmax. The upper limit number of clusters Pmax is an integer of 2 or more and is set in advance.

[0072] (Step S802) The analysis device 101 clusters the sample group of the selected evaluation item into the set number of clusters P based on the coefficient data 332 of the discrimination model for each sample (amino acid sequence 302).

[0073] (Step S803) For each cluster in the cluster group generated in step S802, the analysis device 101 calculates the average value for the coefficients at the same position of the plurality of coefficient data included in the cluster, and generates a representative discrimination model representing the cluster for each cluster.
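Steps S802 and S803 can be illustrated with a minimal Python sketch. The text does not name a clustering algorithm, so k-means (Lloyd's algorithm) is used here purely as an assumption; the function names, the iteration limit, and the seed are likewise illustrative. How the set number of clusters P is varied up to Pmax is not covered by this sketch.

```python
import random


def squared_distance(u, v):
    """Squared Euclidean distance between two coefficient vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def mean_vector(vectors):
    """Position-wise average of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def kmeans(coeff_data, num_clusters, iterations=50, seed=0):
    """Cluster coefficient vectors into num_clusters groups.

    Returns (labels, centroids); labels[i] is the cluster index of coeff_data[i].
    """
    rng = random.Random(seed)
    # Initialize centroids from randomly chosen samples.
    centroids = [list(v) for v in rng.sample(coeff_data, num_clusters)]
    labels = None
    for _ in range(iterations):
        # Assignment step: nearest centroid for every coefficient vector.
        new_labels = [
            min(range(num_clusters), key=lambda c: squared_distance(v, centroids[c]))
            for v in coeff_data
        ]
        # Update step: recompute each centroid as the mean of its members.
        for c in range(num_clusters):
            members = [v for v, label in zip(coeff_data, new_labels) if label == c]
            if members:  # keep the previous centroid if a cluster is empty
                centroids[c] = mean_vector(members)
        if new_labels == labels:
            break  # assignments are stable
        labels = new_labels
    return labels, centroids


# Toy coefficient data with K = 2 and two well-separated groups.
data = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
        [100.0, 100.0], [100.0, 101.0], [101.0, 100.0]]
labels, centroids = kmeans(data, num_clusters=2)
print(labels)
print(centroids)
```

With k-means, each final centroid is already the position-wise average of the coefficient data in its cluster, so it can serve directly as the representative coefficient data generated in step S803.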

[0074] [FIG. 9 Generation Example of Representative Discrimination Model] FIG. 9 is an explanatory diagram showing a generation example of a representative discrimination model. In FIG. 9, as an example, the selection evaluation item is the binding affinity. The coefficient data group 900 is a set of coefficient data 332. In the example of FIG. 9, since the selection evaluation item is the binding affinity, the coefficient data group 900 is a set of coefficient data 332a.

[0075] Assume that in step S802 the coefficient data group 900 is classified into P clusters CL1 to CLP, P being the set number of clusters. The sample ID 301, amino acid sequence 302, and coefficient data 332a of the binding affinity information 303a of the samples belonging to a certain cluster CLp among the clusters CL1 to CLP are as shown in Table 901. Table 901 shows part of the entries extracted from the sample DB 104 for the samples belonging to cluster CLp. p is an integer satisfying 1 ≤ p ≤ P.

[0076] As shown in Table 901, the sample IDs 301 of the samples belonging to cluster CLp are, for example, six of S1, S2, S15, S28, S33, and S47. The analysis device 101 calculates the average value of the coefficients at the same position of the coefficient data 332a of the samples belonging to cluster CLp, which are Va1, Va2, Va15, Va28, Va33, and Va47.

[0077] Specifically, for example, take the first-position coefficients a^1_1, a^1_2, a^1_15, a^1_28, a^1_33, and a^1_47 of the coefficient data Va1, Va2, Va15, Va28, Va33, and Va47 (superscript: coefficient position; subscript: sample). The analysis device 101 calculates their average value a^1_av.

[0078] Similarly, for the coefficients at the 2nd to K-th positions, the analysis device 101 calculates the average values a^2_av to a^K_av. The average values a^1_av to a^K_av at the respective positions constitute the representative coefficient data Va_av(p) of cluster CLp. The linear model in the K-dimensional feature space (K = 20 for amino acid sequence 302) that uses the coefficients a^1_av to a^K_av of the representative coefficient data Va_av(p) is the representative discriminant model for cluster CLp.
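The position-wise averaging described for steps S802 and S803 can be sketched as follows. This is a minimal illustration only: the function name `representative_coefficients` and the three-sample, K = 3 cluster are hypothetical and do not come from Table 901.

```python
def representative_coefficients(coefficient_vectors):
    """Average the coefficients position by position across one cluster."""
    if not coefficient_vectors:
        raise ValueError("cluster must contain at least one coefficient vector")
    k = len(coefficient_vectors[0])
    if any(len(v) != k for v in coefficient_vectors):
        raise ValueError("all coefficient vectors must have the same length")
    n = len(coefficient_vectors)
    # One average per coefficient position, taken over all samples in the cluster.
    return [sum(v[pos] for v in coefficient_vectors) / n for pos in range(k)]


# Hypothetical cluster CLp: three samples, K = 3 coefficient positions.
cluster_clp = [
    [1.0, -2.0, 0.5],
    [3.0, 0.0, 1.5],
    [2.0, 2.0, 1.0],
]
va_av = representative_coefficients(cluster_clp)
print(va_av)  # [2.0, 0.0, 1.0]
```

The returned list plays the role of the representative coefficient data of the cluster; the representative discriminant model is the linear model that uses these averaged coefficients.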

[0079] [Figure 10 Representative Discriminant Model] Figure 10 is an explanatory diagram showing the representative discriminant model. In Figure 10, as an example, the selection evaluation item is binding affinity. In Figure 10, a condition-satisfying sample is a sample that satisfies the condition (in the case of binding affinity, binding affinity 331a is below the threshold), and a condition-violating sample is a sample that violates the condition (in the case of binding affinity, binding affinity 331a is greater than the threshold). As shown in Figure 9, calculating the representative coefficient data Va_av(p) yields the representative discriminant model RMp for cluster CLp. The same applies to the representative discriminant model RM1 for cluster CL1 and the representative discriminant model RMP for cluster CLP.

[0080] (Step S804) Returning to Figure 8, the analysis device 101 calculates the discrimination accuracy and variance of the representative discriminant model for each cluster. Specifically, for example, the analysis device 101 obtains all sample IDs 301 belonging to cluster CLp using cluster DB 105. The analysis device 101 extracts the amino acid sequence 302 of each sample ID 301 belonging to cluster CLp from sample DB 104. The analysis device 101 divides the number of positive values obtained when the discrete value Xk of the amino acid sequence 302 of each extracted sample ID 301 is substituted into the representative discriminant model RMp of cluster CLp by the number of sample IDs 301 belonging to cluster CLp. This division result is the discrimination accuracy. In Example 1, a sample is judged as correct if the representative discriminant model RMp is a positive value.

[0081] A higher representative discriminant model accuracy indicates better performance in classifying the evaluation item.

[0082] The variance D is calculated by the following equation (4). In the following equation (4), y on the right side is the result of substituting the discrete values Xk of each amino acid in the amino acid sequence 302 of a sample within cluster CLp into a discriminant model represented by the coefficient data 332 of that sample. yp on the right side is the result of substituting the discrete values Xk of each amino acid in the amino acid sequence 302 of a sample within cluster CLp into the representative discriminant model RMp of cluster CLp. t is the number of samples within cluster CLp. A smaller variance D indicates less variability in the coefficient data 332 from the representative discriminant model RMp.

[0083]
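Equation (4) is not reproduced in this text. A form consistent with the definitions of y, yp, and t in paragraph [0082], given here as a reconstruction rather than the published formula, would be the mean squared difference over the samples of cluster CLp (the sample index j is an assumed notation):

```latex
D = \frac{1}{t} \sum_{j=1}^{t} \left( y_j - y_{p,j} \right)^2 \tag{4}
```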

[0084] (Step S805) The analysis device 101 calculates the information standard quantity for each cluster and calculates the average information standard quantity, which is the average value of the information standard quantities. The information standard quantity is the sum of the discrimination accuracy and the variance D.
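Steps S804 and S805 can be sketched for a single cluster as below. The function names and the two-sample data are hypothetical. The variance is computed as a mean squared difference, which is an assumption, since equation (4) is not reproduced in this text. The per-cluster quantity is the sum of accuracy and variance, as stated above.

```python
def linear_model(coefficients, x):
    """Value of a first-order linear model for the discrete values x."""
    return sum(a * xk for a, xk in zip(coefficients, x))


def discrimination_accuracy(rep_coefficients, samples):
    """Share of samples for which the representative model is positive."""
    positives = sum(1 for x in samples if linear_model(rep_coefficients, x) > 0)
    return positives / len(samples)


def variance_d(rep_coefficients, samples, sample_coefficients):
    """Assumed form of D: mean squared gap between each sample's own
    model output y and the representative model output yp."""
    t = len(samples)
    total = 0.0
    for x, own in zip(samples, sample_coefficients):
        y = linear_model(own, x)
        yp = linear_model(rep_coefficients, x)
        total += (y - yp) ** 2
    return total / t


# Hypothetical cluster with two samples and K = 2.
samples = [[1.0, 0.0], [0.0, 1.0]]
sample_coefficients = [[1.0, 1.0], [1.0, -1.0]]
rep = [1.0, 0.0]  # position-wise average of the two coefficient vectors

accuracy = discrimination_accuracy(rep, samples)
variance = variance_d(rep, samples, sample_coefficients)
quantity = accuracy + variance  # per-cluster quantity of step S805
print(accuracy, variance, quantity)  # 0.5 0.5 1.0
```

Averaging `quantity` over all clusters of one clustering run would give the average value used to compare different numbers of clusters.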

[0085] (Step S806) The analysis device 101 determines whether the set number of clusters P is equal to the upper limit number of clusters Pmax. If the set number of clusters P is not equal to the upper limit number of clusters Pmax (Step S806: No), the process proceeds to step S807. If the set number of clusters P is equal to the upper limit number of clusters Pmax (Step S806: Yes), the process proceeds to step S808.

[0086] (Step S807) The analysis device 101 increments the set cluster number P and returns to step S802. As a result, steps S802 to S805 are repeatedly executed until the set cluster number P reaches the upper limit cluster number Pmax.

[0087] (Step S808) The analysis device 101 adopts the clustering result with the minimum number of clusters among those that minimize the average information standard quantity, and returns to step S701. The adopted clustering result is used in steps S705 to S707.
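The loop of steps S801 to S808 can be sketched as follows. The callable `evaluate` stands in for steps S802 to S805 and is hypothetical, as are the score values. The sketch covers only the control flow and the selection rule.

```python
def sweep_cluster_counts(evaluate, p_max, p_init=2):
    """Try P = p_init .. p_max and pick the smallest P with the lowest score.

    `evaluate(p)` is a hypothetical callable returning the average
    information standard quantity for a clustering into p clusters.
    """
    if p_max < p_init:
        raise ValueError("p_max must be at least p_init")
    scores = {}
    p = p_init                      # S801: initial number of clusters
    while True:
        scores[p] = evaluate(p)     # S802 to S805
        if p == p_max:              # S806: upper limit reached
            break
        p += 1                      # S807: increment and repeat
    best = min(scores.values())
    # S808: among the minimizers, adopt the smallest number of clusters.
    chosen = min(q for q, s in scores.items() if s == best)
    return chosen, scores


# Hypothetical averages for P = 2..5; P = 3 and P = 4 tie for the minimum.
table = {2: 0.9, 3: 0.4, 4: 0.4, 5: 0.7}
chosen_p, all_scores = sweep_cluster_counts(table.__getitem__, p_max=5)
print(chosen_p)  # 3
```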

[0088] <Figure 11 Example of singularity determination> Figure 11 is an explanatory diagram showing an example of singularity determination in step S706. In Figure 11, cluster CLa is one cluster of the cluster group whose evaluation item is binding affinity, and representative discriminant model RMa is the representative discriminant model for cluster CLa. Cluster CLb is one cluster of another cluster group, whose evaluation item is solubility, and representative discriminant model RMb is the representative discriminant model for cluster CLb. To simplify the explanation, let K = 3. In this case, representative discriminant models RMa and RMb are expressed as first-order linear models with K = 3 terms (the same applies to Figure 12).

[0089] The analysis device 101 determines whether the signs of all coefficients at the same position are reversed between the representative discriminant models RMa and RMb. If the signs are all reversed, it is determined that there is a possibility that a singularity exists in the overlapping region 1100 of clusters CLa and CLb. The overlapping region 1100 of clusters CLa and CLb is a collection of coefficient data 332 belonging to both clusters CLa and CLb.
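The sign test described above can be sketched as below. The function name and coefficient values are hypothetical. Treating a zero coefficient as "not reversed" is an assumption, since the text does not address that case.

```python
def signs_all_reversed(rma, rmb):
    """True if every pair of same-position coefficients has opposite signs."""
    if len(rma) != len(rmb):
        raise ValueError("models must have the same number of coefficients")
    # A strictly negative product means the two coefficients differ in sign.
    # Assumption: a zero coefficient does not count as a reversed sign.
    return all(a * b < 0 for a, b in zip(rma, rmb))


# Hypothetical representative models with K = 3.
rma = [1.0, -2.0, 3.0]
rmb_reversed = [-0.5, 4.0, -1.0]
rmb_partial = [-0.5, 4.0, 1.0]

print(signs_all_reversed(rma, rmb_reversed))  # True
print(signs_all_reversed(rma, rmb_partial))   # False
```

A `True` result corresponds to the case in which a singularity may exist in the overlapping region of the two clusters.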

[0090] <Figure 12 Example of Independence Determination> Figure 12 is an explanatory diagram showing an example of independence determination in step S706. Independence means that changing the coefficient of one representative discriminant model RMa does not affect the other representative discriminant model RMb, and changing the coefficient of the other representative discriminant model RMb does not affect the one representative discriminant model RMa, that is, the coefficients of the representative discriminant models RMa and RMb can be changed independently. If independence is present, it indicates that the trade-off relationship between evaluation items is weak, and it is possible to treat each evaluation item independently. For this reason, it becomes easier to estimate evaluation items that have the potential for simultaneous optimization.

[0091] Specifically, for example, representative discriminant models RMa and RMb can be said to be independent if they are approximately orthogonal. For example, if RMa = a1x1 + a2x2 + a3x3 and RMb = b1x1 + b2x2 + b3x3, the analysis device 101 will determine that representative discriminant models RMa and RMb are approximately orthogonal if coefficient a3 ≈ 0 and coefficient b3 ≈ 0, and if a1 × b1 + a2 × b2 ≈ 0.
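The approximate-orthogonality check can be sketched as follows. The text gives the K = 3 case with a3 ≈ 0 and b3 ≈ 0. The sketch generalizes this to the full inner product, normalized by the vector lengths, with a tolerance `tol`. Both the normalization and the tolerance value are assumptions made for illustration.

```python
import math


def approximately_orthogonal(rma, rmb, tol=1e-2):
    """True if the cosine of the angle between the coefficient vectors is near 0."""
    if len(rma) != len(rmb):
        raise ValueError("models must have the same number of coefficients")
    dot = sum(a * b for a, b in zip(rma, rmb))
    norm = math.sqrt(sum(a * a for a in rma)) * math.sqrt(sum(b * b for b in rmb))
    if norm == 0.0:
        raise ValueError("coefficient vectors must be non-zero")
    return abs(dot / norm) <= tol


# Hypothetical models matching the pattern in the text: a3 = b3 = 0 and
# a1*b1 + a2*b2 = 1*2 + 2*(-1) = 0.
rma = [1.0, 2.0, 0.0]
rmb = [2.0, -1.0, 0.0]
print(approximately_orthogonal(rma, rmb))               # True
print(approximately_orthogonal(rma, [1.0, 1.0, 0.0]))   # False
```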

[0092] <Figure 13 Output Results> Figure 13 is an explanatory diagram showing an example of the output results in the output processing (step S707). The output results, as described above, are the cluster groups classified by the classification unit 602 for each evaluation item, the identification results by the identification unit 603, and the determination results by the determination unit 604.

[0093] The example shown in Figure 13 illustrates how the clustering results 1300, classified by evaluation item by the classification unit 602, and the identification results 1301 to 1303, performed by the identification unit 603, are output as output results. The identification results 1301 to 1303 are sample sets belonging to a region where a Pareto solution may exist.

[0094] The clustering result 1300 includes cluster groups CLa1 to CLa4 related to binding affinity, cluster groups CLb1 and CLb2 related to solubility, and cluster groups CLc1 and CLc2 related to hydrophobicity.

[0095] The analysis device 101 accepts the selection of multiple evaluation items from the input device 203 of the analysis device 101 or the input device 203 of the terminal device 102. When binding affinity, solubility, and hydrophobicity are selected, an identification result 1301 is obtained that exists on the characteristic space 1310 spanned by the three axes of binding affinity 331a, solubility 331b, and hydrophobicity 331c. The identification result 1301 is a sample population corresponding to coefficient data 332 belonging to the region where cluster CLa1 related to binding affinity, cluster CLb2 related to solubility, and cluster CLc1 related to hydrophobicity overlap.

[0096] Furthermore, if binding affinity and hydrophobicity are selected, an identification result 1302 is obtained that exists on the characteristic space 1320 spanned by the two axes of binding affinity 331a and hydrophobicity 331c. The identification result 1302 is a sample population corresponding to the coefficient data 332 belonging to the region where cluster CLa1 related to binding affinity and cluster CLc1 related to hydrophobicity overlap.

[0097] Furthermore, if solubility and hydrophobicity are selected, identification results 1303 and 1304 are obtained that exist on the characteristic space 1310 spanned by the two axes of solubility 331b and hydrophobicity 331c. Identification result 1303 is a combination of a sample population corresponding to coefficient data 332 belonging to the region where the solubility cluster CLb2 and the hydrophobicity cluster CLc1 overlap, and a sample population corresponding to coefficient data 332 belonging to the region where the solubility cluster CLb2 and the hydrophobicity cluster CLc2 overlap.
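Because each overlap region is the set of samples whose coefficient data belong to every selected cluster, it can be sketched as a set intersection over sample IDs. The cluster memberships below are hypothetical and are not taken from Figure 13.

```python
def overlap_region(*clusters):
    """Sample IDs that belong to every one of the given clusters."""
    sets = [set(c) for c in clusters]
    if not sets:
        return set()
    return set.intersection(*sets)


# Hypothetical memberships, one cluster per evaluation item.
cla1 = {"S1", "S2", "S15", "S28"}   # binding affinity
clb2 = {"S2", "S15", "S33"}         # solubility
clc1 = {"S2", "S15", "S47"}         # hydrophobicity

three_items = overlap_region(cla1, clb2, clc1)  # three evaluation items selected
two_items = overlap_region(cla1, clc1)          # two evaluation items selected
```

Selecting fewer evaluation items can only keep or enlarge the overlap, which is why `two_items` contains `three_items` in this example.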

[0098] Furthermore, regarding the results of the singularity determination, the output unit 605 may output callouts 1311 to 1313 indicating this fact for the samples corresponding to the coefficient data 332 within the overlapping region 1100.

[0099] Furthermore, regarding the independence determination results, as shown in Figure 12, the output unit 605 may output the representative discrimination models RMa and RMb, which are substantially orthogonal to each other, along with the independently controllable coefficient a1 of the representative discrimination model RMa and the independently controllable coefficient b2 of the representative discrimination model RMb, in a manner that can be indicated by the callouts 1201 and 1202.

[0100] Thus, Example 1 has the effect of facilitating structural analysis near the Pareto front. In particular, it has the effect of facilitating structural analysis near the Pareto front when the domain is a discrete space containing data such as amino acid sequence 302. Furthermore, since clustering is performed using the coefficient data 332 of the discriminant model, even if singularities exist, the coefficient data 332 can be obtained by tangent approximation of the geodesic, and by clustering the coefficient data 332, the vicinity of the Pareto front where singularities exist can be easily searched.

[0101] Furthermore, by evaluating the discrimination accuracy with a representative discriminant model, it is possible to capture the global structure near the Pareto front. In addition, by increasing the number of clusters and decreasing the size of each cluster, a representative discriminant model closer to a linear model for each sample within a cluster can be derived, improving discrimination accuracy. Moreover, by controlling the number of clusters to reduce the variance D of the coefficient data 332 within each cluster, it becomes easier to capture the global structure near the Pareto front.

[0102] Example 2 shows an example of obtaining the discrimination model in Example 1. Note that components identical to those in Example 1 are denoted by the same reference numerals, and their descriptions may be omitted.

[0103] Artificial Intelligence (AI) has the ability to solve linearly inseparable problems, but it is unclear why the AI made such a decision. In particular, machine learning methods such as deep learning have high predictive accuracy but low explanatory power. For example, if the AI outputs a binding affinity of "0.03" for amino acid sequence 302, the user cannot explain why the AI obtained that binding affinity 331a. If the AI could determine the reason, the user could understand why the binding affinity 331a for amino acid sequence 302 is "0.03".

[0104] <Figure 14 Example of rearrangement of coefficient data 332> Figure 14 is an explanatory diagram showing an example of rearrangement of coefficient data 332. (A) In the feature space SP1, each piece of coefficient data 332 is treated as a coefficient vector x_i. Each coefficient vector x_i is identified, for example, as either correct label La or Lb by the nonlinear discriminant model PM1. (B) In the feature space SP2, too, the coefficient vectors x_i exist as samples and are identified, for example, as either correct label La or Lb by a linear discriminant model PM2.

[0105] In (A), machine learning methods such as deep learning learn a new linear regression to explain the nonlinear discriminant model PM1 obtained as the classification result. Specifically, for example, such a method performs post-hoc processing that locally linearly approximates the nonlinear discriminant model PM1 after it has been obtained. However, with such post-hoc processing, it is unclear whether the locally linearly approximated part of the nonlinear discriminant model PM1 correctly explains the coefficient vector x_i. Furthermore, and most importantly, because a logistic regression is performed for the linear approximation, machine learning must be run twice.

[0106] Since the discriminant model PM2 in (B) is linear, referring to its slope shows which parameters in the feature space SP2 weight the coefficient vector x_i, so the coefficient vector x_i can be explained correctly. In Example 1, without obtaining a nonlinear discriminant model PM1 as in (A), the multiple coefficient vectors x_i in the feature space SP1 are rearranged into another feature space SP2. This yields a linear discriminant model PM2, so it becomes clear which parameters in the feature space SP2 weight the coefficient vector x_i, and the coefficient vector x_i can be explained correctly according to their importance.

[0107] In other words, for each sample having a coefficient vector x_i, the user can see which factors (features) contained in the coefficient vector x_i contribute to the classification result, so it becomes easier to explain why that particular result was obtained. Therefore, it is possible to improve the explanatory power of machine learning. In the example above, it becomes possible to understand why the AI outputted a binding affinity of "0.03" for amino acid sequence 302 of a certain sample (for example, because glycine G is located at a specific position). In addition, since it is not necessary to run machine learning twice as in (A), the efficiency of machine learning can also be improved. Therefore, explanations like the one above can be provided quickly.

[0108] <Figure 15 Example of Neural Network Structure> Figure 15 is an explanatory diagram showing an example of the structure of a neural network according to Embodiment 2. The neural network 1500 is provided in the acquisition unit 601. The neural network 1500 includes a data unit group DU, a reporting unit group RU, a harmonizing unit group HU, an attention unit AU, a reallocation unit RAU, a unifying unit UU, a decision unit DCU, and an importance unit IU.

[0109] The data unit group DU is configured by connecting multiple data units DUl (where l is the layer number, 1 ≤ l ≤ L; L is the layer number of the lowest layer, and in Figure 15, L = 4) in series. The topmost data unit DU1 (l = 1) is the input layer 1501 of the neural network 1500, and the data units DUl with l ≥ 2 correspond to the intermediate layers (also called hidden layers) of the neural network 1500. Each data unit DUl is a perceptron that takes the output data from the preceding data unit DU(l-1) as input, calculates output data using its own learning parameters, and outputs the result.

[0110] However, the data unit DU1 holds the training data during training. Here, the training data is, for example, sample data composed of a combination {x_i, t_i} of a coefficient vector x_i and correct data t_i, which is the value of the evaluation item, in this case the value of binding affinity 331a as an example. The coefficient vector x_i is a K-dimensional vector.

[0111] The correct data t_i is a one-dimensional vector that represents, for the coefficient vector x_i, the presence or absence of binding affinity 331a of the amino acid sequence 302 in one-hot representation.

[0112] For the coefficient vector x_i ∈ R^K (R^K is the K-dimensional real space), the function h_D^(l+1) representing the data unit DU(l+1) is expressed by the following equation (5).

[0113]
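Equation (5) is not reproduced in this text. A form consistent with the definitions in paragraphs [0112] and [0114], given here as a reconstruction rather than the published formula, would be:

```latex
h_D^{l+1} = f_D^{l}\left( W_D^{l}\, h_D^{l} \right), \qquad h_D^{1} = x_i \tag{5}
```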

[0114] In equation (5), the index l (an integer satisfying 1 ≤ l ≤ L) indicates the hierarchy number (the same applies to the following equations). L is an integer greater than or equal to 1 and indicates the deepest hierarchy number. f_D^l on the right side is the activation function. Various activation functions can be used, such as the sigmoid function, the hyperbolic tangent function (tanh function), or the ReLU (Rectified Linear Unit) function. The matrix W_D^l is the learning parameter of the data unit DUl. The vector h_D^l on the right side is the input vector that is input to the data unit DU(l+1), i.e., the output vector from the preceding data unit DUl. Note that when the hierarchy number l = 1, the output vector h_D^1 from the data unit DU1 is h_D^1 = x_i.

[0115] Furthermore, the data unit DU1 holds the coefficient vector x_i as test data.

[0116] The reporting unit RUl (2 ≤ l ≤ L) receives the output vector h_D^l from the data unit DUl at the same hierarchical level and reduces the number of dimensions of the output vector h_D^l. The function h_R^l representing the reporting unit RUl is expressed by the following equation (6).

[0117]
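Equation (6) is not reproduced in this text. A form consistent with the definitions in paragraphs [0116] and [0118], given here as a reconstruction rather than the published formula, would be:

```latex
h_R^{l} = \sigma\left( W_R^{l}\, h_D^{l} \right) \tag{6}
```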

[0118] In equation (6), the matrix W_R^l is the learning parameter of the reporting unit RUl. By equation (6), the K-dimensional output vector h_D^l from the data unit DUl is reduced to the m-dimensional output vector h_R^l. σ is the sigmoid function.

[0119] A harmonizing unit HUl (2 ≤ l ≤ L) is provided between each intermediate-layer data unit DUl and the reallocation unit RAU. The harmonizing units HUl convert the output data from the intermediate-layer data units DUl to the same number of dimensions. Therefore, the reallocation unit RAU receives output data whose dimensionality has been unified by the harmonizing units HUl.

[0120] In other words, the harmonizing unit HUl receives the output vector h_D^l from the data unit DUl at the same hierarchical level and converts the number of dimensions of the output vector h_D^l to a common number of dimensions. The function h_H^l representing the harmonizing unit HUl is expressed by the following equation (7).

[0121]
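Equation (7) is not reproduced in this text. A form consistent with the definitions in paragraphs [0120] and [0122], given here as a reconstruction rather than the published formula, would be:

```latex
h_H^{l} = f_H\left( W_H^{l}\, h_D^{l} \right) \tag{7}
```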

[0122] In equation (7), the matrix W_H^l is the learning parameter of the harmonizing unit HUl. As a result, the K-dimensional output vector h_D^l from the data unit DUl is converted into the m-dimensional output vector h_H^l. m is a hyperparameter that determines the number of dimensions. K and m may differ from the K and m of the reporting unit RUl. f_H is an activation function.

[0123] The attention unit AU calculates the weight α of each data unit DUl using the output vector h_R^l from each reporting unit RUl. The function α representing the attention unit AU is expressed by the following equation (8).

[0124]
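Equation (8) is not reproduced in this text. A form consistent with the definitions in paragraphs [0123] and [0125], given here as a reconstruction rather than the published formula, would be:

```latex
\alpha = \mathrm{softmax}\left( W_A\, h_R \right) \tag{8}
```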

[0125] In equation (8), the matrix W_A is the learning parameter of the attention unit AU. The softmax function, which is one of the activation functions, calculates a vector with the same dimension as the number of layers L (L = 4 in the example of the following equation (9)). As shown in the following equation (9), the vector h_R on the right side of equation (8) is the vector obtained by stacking the h_R^l vertically.

[0126]
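Equation (9) is not reproduced in this text. From paragraph [0125], it stacks the reporting-unit outputs vertically. A reconstruction would be the following, where the index range 2 to L follows the definition of the reporting units in paragraph [0116] and is an assumption:

```latex
h_R = \begin{bmatrix} h_R^{2} \\ \vdots \\ h_R^{L} \end{bmatrix} \tag{9}
```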

[0127] Therefore, the matrix W_A is a matrix with L rows and M columns (M is the number of elements of the vector h_R). By adopting the softmax function for the attention unit AU, each element of the L-dimensional output vector α (the sum of all elements is 1) represents the weight of the corresponding data unit DUl.

[0128] The reallocation unit RAU rearranges the coefficient vectors x_i of a certain feature space into another feature space. Specifically, for example, as shown in FIG. 14, the discriminant model obtained from the coefficient vectors x_i in the feature space SP1 may be nonlinear, so the reallocation unit RAU relocates the coefficient vectors x_i into the feature space SP2 so that a linear discriminant model is obtained. The function h_T^l representing the reallocation unit RAU is expressed by the following equation (10).

[0129]
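Equation (10) is not reproduced in this text. A form consistent with the definitions in paragraphs [0128] and [0130], given here as a reconstruction rather than the published formula, would be:

```latex
h_T^{l} = f_T\left( h_H^{l},\, x_i \right) \tag{10}
```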

[0130] For the function f_T, the Hadamard product of vectors, element-wise addition, or the like can be used. In this embodiment, the Hadamard product is used (see equation (11) below). In equation (11), the Hadamard product of the output vector h_H^l from the harmonizing unit HUl and the coefficient vector x_i is taken.

[0131]
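Equation (11) is not reproduced in this text. With the Hadamard (element-wise) product written as ⊙, a form consistent with paragraph [0130], given here as a reconstruction rather than the published formula, would be:

```latex
h_T^{l} = h_H^{l} \odot x_i \tag{11}
```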

[0132] The unifying unit UU integrates the output vector h_T^l from the reallocation unit RAU with the output vector α from the attention unit AU. That is, the unifying unit UU weights the output vector h_T^l from the reallocation unit RAU by the output vector α from the attention unit AU. The function h_U representing the unifying unit UU is expressed by the following equation (12).

[0133]
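Equation (12) is not reproduced in this text. A form consistent with the definitions in paragraphs [0132] and [0134], given here as a reconstruction rather than the published formula, would be a weighted sum in which g runs over the layers that have a reallocated output (the exact summation range is an assumption):

```latex
h_U = \sum_{g} \alpha[g]\, h_T^{g} \tag{12}
```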

[0134] In equation (12), α[g] on the right-hand side represents the g-th element (weight) of the output vector α in equation (8).

[0135] The decision unit DCU determines the discriminant value y_i and outputs it to the output layer 1503. Specifically, for example, the decision unit DCU weights the output vector h_U from the unifying unit UU by the weight vector w_o, which is one of the learning parameters, and applies the sigmoid function σ to obtain the discriminant value y_i. The function y_i representing the decision unit DCU is expressed by the following equation (13). In equation (13), the t in w_o^t denotes transposition.

[0136]
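Equation (13) is not reproduced in this text. A form consistent with the definitions in paragraph [0135], given here as a reconstruction rather than the published formula, would be:

```latex
y_i = \sigma\left( w_o^{t}\, h_U \right) \tag{13}
```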

[0137] The importance unit IU calculates an importance vector s_i^l that indicates the importance of the features in each layer of the neural network, and outputs it to the output layer 1503. The function s_i^l representing the importance unit IU is expressed by the following equation (14).

[0138]
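Equation (14) is not reproduced in this text. A form consistent with the definitions in paragraphs [0137] and [0139], given here as a reconstruction rather than the published formula, would be:

```latex
s_i^{l} = \alpha[l]\, f_T\left( w_o,\, h_H^{l} \right) \tag{14}
```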

[0139] In equation (14), α[l] on the right-hand side represents the l-th element (weight) of the output vector α. For the function f_T, as in equation (10), the Hadamard product of vectors, element-wise addition, or the like can be used. In Example 1, the Hadamard product is used. In equation (14), the importance vector s_i^l is obtained from the Hadamard product of the weight vector w_o and the output vector h_H^l from the harmonizing unit HUl. The importance vector s_i^l represents the importance of the coefficient vector x_i in hierarchy l.
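The flow through the units described above can be sketched as a single forward pass. This is an illustration under several assumptions. Every weight value, the choice of tanh for the activation functions, and the helper names are hypothetical. All layers are given the same width as x_i so that the Hadamard product is defined. Reporting units are attached to each intermediate layer.

```python
import math


def matvec(w, v):
    """Product of a matrix (list of rows) and a vector."""
    return [sum(wij * vj for wij, vj in zip(row, v)) for row in w]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def softmax(v):
    m = max(v)
    e = [math.exp(z - m) for z in v]
    total = sum(e)
    return [z / total for z in e]


def hadamard(u, v):
    """Element-wise product of two vectors of equal length."""
    return [a * b for a, b in zip(u, v)]


def forward(x, w_d, w_r, w_h, w_a, w_o):
    """Returns (y, alpha, importance); one list entry per intermediate layer."""
    h_d = x                                  # output of the input layer DU1
    h_r_stack, h_h_list, h_t_list = [], [], []
    for wd, wr, wh in zip(w_d, w_r, w_h):
        h_d = [math.tanh(z) for z in matvec(wd, h_d)]        # data unit
        h_r_stack += [sigmoid(z) for z in matvec(wr, h_d)]   # reporting unit
        h_h = [math.tanh(z) for z in matvec(wh, h_d)]        # harmonizing unit
        h_h_list.append(h_h)
        h_t_list.append(hadamard(h_h, x))                    # reallocation unit
    alpha = softmax(matvec(w_a, h_r_stack))                  # attention unit
    h_u = [sum(a * h_t[k] for a, h_t in zip(alpha, h_t_list))
           for k in range(len(x))]                           # unifying unit
    y = sigmoid(sum(wo * hu for wo, hu in zip(w_o, h_u)))    # decision unit
    importance = [[a * s for s in hadamard(w_o, h_h)]
                  for a, h_h in zip(alpha, h_h_list)]        # importance unit
    return y, alpha, importance


# Hypothetical parameters: K = 2, two intermediate layers, m = 1.
identity = [[1.0, 0.0], [0.0, 1.0]]
x_i = [0.5, -0.5]
y_i, alpha, importance = forward(
    x_i,
    w_d=[identity, identity],
    w_r=[[[1.0, 1.0]], [[1.0, -1.0]]],
    w_h=[identity, identity],
    w_a=[[1.0, 0.0], [0.0, 1.0]],
    w_o=[1.0, 1.0],
)
```

With the Hadamard product, the value passed to the final sigmoid equals the sum over layers of the inner product of each importance vector with x_i. This is the sense in which the output is linear in the rearranged features and each importance entry can be read as a per-feature weight.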

[0140] <Example of Functional Configuration of Acquisition Unit 601> Figure 16 is a block diagram showing an example of the functional configuration of the acquisition unit 601 according to Embodiment 2. The acquisition unit 601 includes an input layer 1501, an intermediate layer 1502, an output layer 1503, a conversion unit 1601, a rearrangement unit 1602, a discrimination data calculation unit 1603, an importance calculation unit 1604, a setting unit 1605, an integration unit 1606, and a reduction unit 1607. These are examples of the internal configuration of the learning unit 412 and the discrimination unit 414.

[0141] The conversion unit 1601, as shown in equation (7), reduces the number of dimensions K of the output vector h_D^l from each intermediate layer DUl (l ≥ 2) based on the output vector h_D^l and the matrix W_H^l, and outputs the converted output vector h_H^l. The conversion unit 1601 is the harmonizing unit group HU described above.

[0142] The rearrangement unit 1602, as shown in equations (10) and (11), rearranges the coefficient vector x_i of the feature space SP1 into the feature space SP2 based on the converted output vector h_H^l from the conversion unit 1601 and the coefficient vector x_i of the feature space SP1 given to the input layer 1501. The rearrangement unit 1602 is the reallocation unit RAU described above.

[0143] The discrimination data calculation unit 1603, as shown in equation (13), calculates the discriminant vector y_i for the coefficient vector x_i of the feature space SP1 based on the rearrangement result h_T^l from the rearrangement unit 1602 and the weight vector w_o. The discrimination data calculation unit 1603 is the decision unit DCU described above.

[0144] The importance calculation unit 1604, as shown in equation (14), calculates the importance vector s_i^l of the coefficient vector x_i in hierarchy l of the intermediate layer 1502 based on the converted output vector h_H^l and the weight vector w_o. The importance calculation unit 1604 is the importance unit IU described above.

[0145] For example, for a coefficient vector x_i, suppose that the output vector h_D^la of a certain hierarchy la is a feature quantity indicating the degree of influence of hydrogen bonding, and that the output vector h_D^lb of another hierarchy lb (≠ la) is a feature quantity indicating the degree of influence of van der Waals forces. In this case, by referring to the corresponding importance vectors s_i^la and s_i^lb, the user can explain which factors the acquisition unit 601 took into account when determining that the binding affinity 331a of the amino acid sequence 302 with the coefficient vector x_i is 0.1 or less.

[0146] For example, if the importance vector s_i^la is low but the importance vector s_i^lb is high, the user can explain that the acquisition unit 601 determined the binding affinity 331a of the amino acid sequence 302 with the coefficient vector x_i to be 0.1 or less by taking into account the influence of van der Waals forces. The importance vectors s_i^l are extracted by the importance calculation unit 1604.

[0147] The setting unit 1605, as shown in equations (8) and (9), sets the weight α of the intermediate layer 1502 based on the output vector h_D^l from the intermediate layer 1502 and the matrix W_A. The setting unit 1605 is the attention unit AU described above.

[0148] The integration unit 1606, as shown in equation (12), integrates the rearrangement result h_T^l with the weight α set by the setting unit 1605. The integration unit 1606 is the unifying unit UU described above. In this case, the discrimination data calculation unit 1603 calculates the discriminant vector y_i based on the integration result h_U from the integration unit 1606 and the weight vector w_o. The importance calculation unit 1604 calculates the importance vector s_i^l based on the weight α set by the setting unit 1605, the converted output vector h_H^l, and the weight vector w_o.

[0149] The reduction unit 1607, as shown in equation (6), reduces the number of dimensions K of the output vector h_D^l from the intermediate layer 1502 based on the output vector h_D^l and the matrix W_R^l, and outputs the reduced output vector h_R^l. The reduction unit 1607 is the reporting unit group RU described above. In this case, the setting unit 1605 sets the weight α of the intermediate layer 1502 based on the reduced output vector h_R^l from the reduction unit 1607 and the matrix W_A.

[0150] When given training data including the coefficient vector x_i of the feature space SP1 and the correct label t_i for the discriminant vector y_i, the acquisition unit 601 uses the discriminant vector y_i and the correct label t_i to optimize the matrix W_D^l as the first learning parameter, the matrix W_H^l as the second learning parameter, the weight vector w_o as the third learning parameter, the matrix W_A as the fourth learning parameter, and the matrix W_R^l as the fifth learning parameter, for example so as to minimize the cross-entropy between the correct label t_i and the discriminant value y_i.

[0151] The discrimination unit 414 sets the optimized learning parameters in the neural network 1500 and gives the coefficient vector x'_i as test data to the input layer 1501, thereby finally causing the discrimination data calculation unit 1603 to calculate the discriminant vector y'_i.

[0152] Thus, in Example 2, rearranging the coefficient vectors x_i of the samples in advance makes it possible to calculate the importance of each feature even when the neural network is multi-layered, so that an explanation can be given easily, accurately, and efficiently for each sample (coefficient vector x_i). In addition, because rearranging the samples (coefficient vectors x_i) in advance yields a linear discrimination model, discriminant values can be calculated with high accuracy and low overhead.

[0153] Furthermore, the acquisition unit 601 of Embodiment 2 includes a conversion unit 1601, a rearrangement unit 1602, and an importance calculation unit 1604. Therefore, rearranging the coefficient vectors (x_i, x'_i) in advance yields a linear discriminant model PM2, allowing discriminant values to be calculated with high accuracy and low overhead.

[0154] Also, which features the coefficient vectors (x_i, x'_i) have can be understood from the importance of each hierarchy l calculated by the importance calculation unit 1604. This makes it possible to explain the coefficient vectors (x_i, x'_i) given to the neural network 1500 as the analysis target easily, accurately, and efficiently.

[0155] Furthermore, since the acquisition unit 601 has a discrimination data calculation unit 1603, explaining why the neural network 1500 produced the discrimination results (y_i, y'_i) for the coefficient vectors (x_i, x'_i) can be simplified with high accuracy and efficiency.

[0156] Furthermore, since the acquisition unit 601 includes a setting unit 1605 and an integration unit 1606, the discrimination data calculation unit 1603 can calculate discrimination results based on the rearrangement results with high accuracy.

[0157] Furthermore, the acquisition unit 601 has a reduction unit 1607, which allows for more efficient data analysis through dimensionality reduction. In addition, the acquisition unit 601 can construct a highly accurate discrimination model by learning the learning parameters.

[0158] It should be noted that the present invention is not limited to the embodiments described above, but includes various modifications and equivalent configurations within the spirit of the appended claims. For example, the embodiments described above are described in detail to make the present invention easier to understand, and the present invention is not necessarily limited to having all the configurations described above. Furthermore, some of the configurations of one embodiment may be replaced with those of another embodiment. Furthermore, some of the configurations of one embodiment may be added to those of another embodiment. Furthermore, some of the configurations of each embodiment may be added, deleted, or replaced with other configurations.

[0159] Furthermore, each of the aforementioned configurations, functions, processing units, and processing means may be implemented in hardware, for example, by designing them as integrated circuits, or they may be implemented in software by having a processor interpret and execute programs that realize each function.

[0160] Information such as programs, tables, and files that implement each function can be stored in memory, hard disks, SSDs (Solid State Drives), or on recording media such as IC (Integrated Circuit) cards, SD cards, and DVDs (Digital Versatile Discs).

[0161] Furthermore, the control lines and information lines shown are those deemed necessary for explanation purposes and do not necessarily represent all control lines and information lines required for implementation. In reality, it can be assumed that almost all components are interconnected.

Claims

1. An analysis apparatus comprising a processor for executing a program and a storage device for storing the program, wherein the processor performs: an acquisition process for acquiring coefficient data of a linear discriminant model for determining each of a plurality of characteristics relating to an object for analysis, which are represented by discrete data having a plurality of discrete values; a classification process for classifying the coefficient data acquired for each of the discrete data by the acquisition process into a plurality of clusters for each of the characteristics; a specification process for identifying the discrete data corresponding to the discriminant model of the coefficient data belonging to two or more clusters with different characteristics classified by the classification process; and an output process for outputting the discrete data identified by the specification process.

2. The analysis apparatus according to claim 1, wherein in the classification process, the processor controls the number of clusters based on a plurality of coefficient data belonging to the cluster for each characteristic.

3. The analysis apparatus according to claim 2, wherein in the classification process, the processor generates, for each of the plurality of clusters and based on a plurality of coefficient data belonging to the cluster, representative coefficient data for a representative discriminant model that represents a plurality of discriminant models of the plurality of coefficient data, and controls the number of the plurality of clusters based on the plurality of representative coefficient data.

4. The analysis apparatus according to claim 3, wherein the processor performs a determination process for determining the features of the geometric structure near the Pareto front based on a first representative discriminant model for a first cluster of a first characteristic and a second representative discriminant model for a second cluster of a second characteristic different from the first characteristic, and in the output process, the processor outputs the determination result determined by the determination process.

5. The analysis apparatus according to claim 4, wherein in the determination process, the processor determines the possibility of singularities with respect to the coefficient data belonging to the first cluster and the second cluster.

6. The analysis apparatus according to claim 4, wherein in the determination process, the processor determines the independence of the first coefficient of the first representative discriminant model and the second coefficient of the second representative discriminant model.

7. The analysis apparatus according to claim 1, wherein in the output process, the processor accepts the selection of two or more characteristics from the plurality of characteristics, and outputs discrete data corresponding to the coefficient data belonging to each cluster of the two or more selected characteristics.

8. The analysis apparatus according to claim 7, wherein in the output process, the processor outputs the discrete data on a space with each of the two or more selected characteristics as an axis.

9. The analysis apparatus according to claim 1, wherein in the acquisition process, the processor uses a neural network comprising an input layer, an output layer, and two or more intermediate layers between the input layer and the output layer, each intermediate layer performing a calculation by supplying data from a preceding layer and a first learning parameter to an activation function and outputting the calculation result to a subsequent layer, and the processor performs: a conversion process that converts each output data to the same number of dimensions based on the output data from each intermediate layer and a second learning parameter, and outputs the converted output data; and a rearrangement process that rearranges the coefficient data in a second space based on the converted output data from the conversion process and the coefficient data in a first space represented by the number of coefficients given to the input layer.

10. An analysis method to be performed by an analysis apparatus having a processor for executing a program and a storage device for storing the program, wherein the processor performs: an acquisition process for acquiring coefficient data of a linear discriminant model for determining each of a plurality of characteristics relating to an object of analysis, which is represented by discrete data having a plurality of discrete values; a classification process for classifying the coefficient data acquired for each of the discrete data by the acquisition process into a plurality of clusters for each of the characteristics; a specification process for identifying the discrete data corresponding to the discriminant model of the coefficient data belonging to two or more clusters with different characteristics classified by the classification process; and an output process for outputting the discrete data identified by the specification process.

11. An analysis program characterized by causing a processor to execute: an acquisition process for acquiring coefficient data of a linear discriminant model for determining each of several characteristics of an object to be analyzed, which are represented by discrete data having multiple discrete values; a classification process for classifying the coefficient data acquired for each discrete data by the acquisition process into multiple clusters for each characteristic; an identification process for identifying the discrete data corresponding to the discriminant model of the coefficient data belonging to two or more clusters with different characteristics classified by the classification process; and an output process for outputting the discrete data identified by the identification process.
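As an informal illustration only, and not part of the claims, the flow recited in claims 1, 10, and 11 (acquire coefficient data, cluster it per characteristic, identify discrete data shared by clusters of different characteristics, output) might be sketched in Python as follows. The claims do not name a clustering algorithm, so the simple distance-threshold "leader" clustering used here is an assumption, as are the function names, the `radius` parameter, the characteristic names "strength" and "cost", and the toy coefficient values. The sketch also reflects one reading of the identification step: a discrete data point is reported for a pair of clusters when its coefficient data belongs to both.

```python
import math
from itertools import combinations


def leader_cluster(coeffs, radius):
    # Classification step (assumed algorithm): each coefficient vector joins the
    # first cluster whose leader vector lies within `radius`, else starts a new one.
    leaders, clusters = [], []
    for key, vec in coeffs.items():
        for idx, leader in enumerate(leaders):
            if math.dist(vec, leader) <= radius:
                clusters[idx].add(key)
                break
        else:
            leaders.append(vec)
            clusters.append({key})
    return clusters


def find_overlaps(coeffs_by_characteristic, radius):
    # Cluster the coefficient data separately for each characteristic.
    clustered = {name: leader_cluster(coeffs, radius)
                 for name, coeffs in coeffs_by_characteristic.items()}
    # Identification step: for every pair of clusters taken from two different
    # characteristics, collect the discrete data belonging to both clusters.
    overlaps = {}
    for a, b in combinations(clustered, 2):
        for i, cluster_a in enumerate(clustered[a]):
            for j, cluster_b in enumerate(clustered[b]):
                shared = cluster_a & cluster_b
                if shared:
                    overlaps[(a, i, b, j)] = shared
    return clustered, overlaps


# Toy input (assumed): discrete data are tuples of discrete values; each maps to
# the coefficient vector of its linear discriminant model, per characteristic.
coeffs_by_characteristic = {
    "strength": {(0, 0): (1.0, 0.0), (0, 1): (1.1, 0.1),
                 (1, 0): (-1.0, 0.5), (1, 1): (-0.9, 0.4)},
    "cost": {(0, 0): (0.0, 1.0), (0, 1): (0.0, -1.0),
             (1, 0): (0.1, 1.1), (1, 1): (0.1, -0.9)},
}

clustered, overlaps = find_overlaps(coeffs_by_characteristic, radius=0.5)

# Output step: print the discrete data identified for each cluster pair.
for (a, i, b, j), shared in overlaps.items():
    print(f"{a} cluster {i} and {b} cluster {j}: {sorted(shared)}")
```

In this toy input each characteristic splits into two clusters, and each pair of clusters shares exactly one discrete data point. A real implementation would also need the cluster-count control and representative discriminant models of claims 2 and 3, which this sketch omits.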