Analysis device, analysis method, and analysis program

The analysis device and method facilitate the analysis of Pareto fronts in discrete data spaces by clustering and determining singularities, addressing the challenges of visualizing geometric structures in multi-objective optimization with discrete data.

JP2026104136A · Pending · Publication Date: 2026-06-25 · HITACHI LTD

Patent Information

Authority / Receiving Office
JP · JP
Patent Type
Applications
Current Assignee / Owner
HITACHI LTD
Filing Date
2024-12-13
Publication Date
2026-06-25

AI Technical Summary

Technical Problem

Existing methods struggle to analyze Pareto fronts in multi-objective optimization, particularly when dealing with discrete data such as amino acid sequences, due to the presence of singularities and challenges in visualizing the geometric structure, which becomes difficult as the number of evaluation criteria increases.

Method used

An analysis device and method that utilizes a processor to perform data acquisition, classification, and output processes to identify and visualize Pareto fronts by clustering coefficient data from discriminant models, facilitating the analysis of discrete data through singularity and independence determination.

Benefits of technology

Enables effective analysis of Pareto fronts in discrete data spaces, allowing for easier identification of optimal solutions and understanding of geometric structures, even in the presence of singularities, by using clustering and discriminant models to facilitate structural analysis.

✦ Generated by Eureka AI based on patent content.


Abstract

[Problem] To facilitate the analysis of Pareto fronts. [Solution] An analysis device has a processor that executes a program and a storage device that stores the program. The processor executes: an acquisition process that acquires, for each item of discrete data, coefficient data of a discriminant model that discriminates a characteristic, for each of a plurality of characteristics of an analysis target represented by discrete data having a plurality of discrete values; a classification process that classifies the coefficient data acquired for each item of discrete data by the acquisition process into a plurality of clusters for each of the characteristics; an identification process that identifies the discrete data corresponding to the discriminant model of the coefficient data belonging to two or more clusters of different characteristics classified by the classification process; and an output process that outputs the discrete data identified by the identification process.

Description

[Technical Field]

[0001] The present invention relates to an analysis device, an analysis method, and an analysis program for analyzing data.

[Background technology]

[0002] In industrial product development, such as drug discovery, it is necessary to find candidates that meet the criteria for multiple evaluation items. Taking antibody development as an example, these include binding affinity, solubility, hydrophobicity, and immunogenicity, and even in early development, the number of evaluation items typically reaches several dozen.

[0003] The search process to find candidates that satisfy multiple evaluation criteria is called multi-objective optimization, but it is generally known to become difficult when the number of evaluation criteria exceeds four. In multi-objective optimization, a Pareto front is searched for evaluation criteria that have a trade-off relationship, and understanding its geometric structure is useful for search strategies and manufacturing parameter design strategies (such as how much the conditions can be relaxed in consideration of adaptation). However, when the dimensionality of the evaluation criteria exceeds four, there is a challenge in identifying and visualizing this structure.

[0004] Patent Document 1 below discloses an output program that assists in the selection of a solution. This output program causes a computer to perform the following processes: acquisition, fitting, generation, acceptance, and output. The acquisition process acquires first graph data relating to the solution on the Pareto front and second graph data relating to a plurality of evaluation values corresponding to the solution. The fitting process fits a Bézier figure to each of the acquired graph data. The generation process generates a plurality of data points on the Bézier figures for each of the first and second graph data. The acceptance process accepts input conditions indicating the range of each of the plurality of evaluation values. The output process outputs the solution shown on the data points on the Bézier figure fitted to the first graph data, which correspond to the data points on the Bézier figure fitted to the second graph data selected based on the accepted conditions.

[0005] Patent Document 2 below discloses a knee-point visualization device that visualizes a sample containing noise under conditions close to those of a true, noise-free state. The knee-point visualization device identifies a weight region in which a specific knee point continuously corresponds to a knee point. The knee-point visualization device calculates the stability of each weight from among multiple weights in which a specific knee point corresponds to a knee point, based on the width of the weight region. Based on the stability, the knee-point visualization device identifies the optimal weight for visualization from among the multiple weights, and visualizes the specific knee point based on the identified weight.

[Prior art documents]

[Patent Documents]

[0006] [Patent Document 1] Japanese Patent Publication No. 2024-062751
[Patent Document 2] Japanese Patent Publication No. 2019-153012

[Summary of the Invention]

[Problems that the invention aims to solve]

[0007] According to Morse theory, if the objective function, or a function obtained by perturbing the objective function, is a Morse function, then the geometric structure of the Pareto front is a simplex. The above methods implicitly assume these favorable mathematical conditions. Furthermore, some search methods require obtaining multiple Pareto solutions beforehand, and the quality of the estimation depends on how these solutions are obtained.

[0008] On the other hand, antibody design uses amino acid sequences as input for multi-objective optimization, but amino acid sequences are discrete space data relating to a 20-character multidimensional sequence. Therefore, it is not possible to assume that the domain is a smooth manifold. Furthermore, it is not possible to make assumptions about the objective function, such as a Morse function without singularities, making Pareto front estimation by interpolation and extrapolation of Bézier simplex difficult. For example, if some amino acids in an amino acid sequence are swapped, the amino acid sequence before and after the swap will be similar in sequence, but the 3D structure will change significantly, the physical properties will change drastically, and singularities will exist. In addition, as the number of evaluation items increases, it becomes difficult to obtain a Pareto solution beforehand.

[0009] This type of problem is not limited to amino acid sequences; it can occur with other data in discrete space as well.

[0010] The present invention aims to facilitate the analysis of Pareto fronts.

[Means for solving the problem]

[0011] An analysis device according to one aspect of the invention disclosed in this application comprises a processor that executes a program and a storage device that stores the program, wherein the processor performs: an acquisition process for acquiring, for each item of discrete data, coefficient data of a linear discriminant model that discriminates a characteristic, for each of a plurality of characteristics relating to an object to be analyzed, which is represented by discrete data having a plurality of discrete values; a classification process for classifying the coefficient data acquired for each item of discrete data by the acquisition process into a plurality of clusters for each characteristic; an identification process for identifying the discrete data corresponding to the discriminant model of the coefficient data belonging to two or more clusters of different characteristics classified by the classification process; and an output process for outputting the discrete data identified by the identification process.

[Effects of the Invention]

[0012] According to a typical embodiment of the present invention, the analysis of the Pareto front can be facilitated. Problems, configurations, and effects other than those described above will be clarified by the following description of the embodiments.

[Brief explanation of the drawings]

[0013]
[Figure 1] Figure 1 is an explanatory diagram showing an example of the system configuration of the analysis system.
[Figure 2] Figure 2 is a block diagram showing an example of a computer hardware configuration.
[Figure 3] Figure 3 is an explanatory diagram showing an example of a sample database.
[Figure 4] Figure 4 is an explanatory diagram showing an example of a cluster database.
[Figure 5] Figure 5 is an explanatory diagram showing an example of an overlap database.
[Figure 6] Figure 6 is a block diagram showing an example of the functional configuration of the analysis device.
[Figure 7] Figure 7 is a flowchart showing an example of the analysis processing procedure performed by the analysis device.
[Figure 8] Figure 8 is a flowchart showing a detailed example of the clustering process (step S704) of the discrimination model.
[Figure 9] Figure 9 is an explanatory diagram showing an example of generating a representative discriminant model.
[Figure 10] Figure 10 is an explanatory diagram illustrating the representative discriminant model.
[Figure 11] Figure 11 is an explanatory diagram showing an example of determining the possibility of a singularity in step S706.
[Figure 12] Figure 12 is an explanatory diagram showing an example of independence determination in step S706.
[Figure 13] Figure 13 is an explanatory diagram showing an example of the output result in the output processing (step S707).
[Figure 14] Figure 14 is an explanatory diagram showing an example of rearranging coefficient data.
[Figure 15] Figure 15 is an explanatory diagram showing an example of the neural network structure according to Example 2.
[Figure 16] Figure 16 is a block diagram showing an example of the functional configuration of the acquisition unit according to Example 2.

[Modes for carrying out the invention]

[Examples]

[0014] <Figure 1: Analysis System> Figure 1 is an explanatory diagram showing an example of the system configuration of an analysis system. The analysis system 100 includes an analysis device 101 and a terminal device 102. The analysis device 101 and the terminal device 102 are connected via a network 103 such as the Internet, LAN (Local Area Network), or WAN (Wide Area Network) to enable communication.

[0015] The analysis device 101 is a computer that performs structural analysis and visualization of the vicinity of the Pareto front. The analysis device 101 includes a sample DB 104, a cluster DB 105, and an overlap DB 106. The sample DB 104, cluster DB 105, and overlap DB 106 may be provided in a computer (not shown) that can communicate with the analysis device 101 via a network 103, rather than in the analysis device 101.

[0016] Sample DB104 is a database that stores data about samples. Cluster DB105 is a database that stores data about clusters, which are sets of sample groups. Overlap DB106 is a database that stores information about the overlap between clusters.

[0017] Terminal device 102 is a computer that communicates with analysis device 101 via network 103. Terminal device 102 inputs data and instructions to analysis device 101 and displays output data from analysis device 101.

[0018] <Figure 2: Example of hardware configuration of the computer (analysis device 101, terminal device 102)> Figure 2 is a block diagram showing an example of the hardware configuration of a computer. Computer 200 includes a processor 201, a storage device 202, an input device 203, an output device 204, and a communication interface (communication IF) 205. The processor 201, storage device 202, input device 203, output device 204, and communication IF 205 are connected by a bus 206. The processor 201 controls computer 200. The storage device 202 serves as the work area for the processor 201. The storage device 202 is a non-transitory or transitory recording medium that stores various programs and data. Examples of the storage device 202 include ROM (Read Only Memory), RAM (Random Access Memory), HDD (Hard Disk Drive), and flash memory. The input device 203 takes data in. Examples of the input device 203 include a keyboard, mouse, touch panel, numeric keypad, scanner, microphone, and sensor. The output device 204 outputs data. Examples of the output device 204 include displays, printers, and speakers. The communication IF 205 connects to the network 103 and sends and receives data.

[0019] <Figure 3 Sample DB104> Figure 3 is an explanatory diagram showing an example of sample DB104. Sample DB104 has the following fields: sample ID 301, amino acid sequence 302, and evaluation item group 303.

[0020] Sample ID 301 is an identifier that uniquely identifies the sample. A sample is the target of structural analysis near the Pareto front, and in this example, it is amino acid sequence 302. However, any sample other than amino acid sequence 302 may be used, as long as it is a sequence of multiple discrete values.

[0021] Amino acid sequence 302 is a sequence of amino acids including alanine (A), cysteine (C), aspartic acid (D), glutamic acid (E), phenylalanine (F), glycine (G), histidine (H), isoleucine (I), lysine (K), leucine (L), methionine (M), asparagine (N), proline (P), glutamine (Q), arginine (R), serine (S), threonine (T), valine (V), tryptophan (W), and tyrosine (Y). In Example 1, amino acid sequence 302 is the sequence of 20 amino acids contained in the CDR3 region of the heavy chain.

[0022] Furthermore, the entries in sample DB104 may include samples that have been perturbed from a given sample. Specifically, for example, an amino acid sequence Aj (i≠j) obtained by replacing some amino acids in an amino acid sequence Ai may be included in the entries of sample DB104.
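As a minimal sketch of how such perturbed samples could be produced, the following Python snippet enumerates the sequences obtained by replacing the amino acid at one position. Restricting the perturbation to a single position, and the function name `single_substitutions`, are assumptions made for illustration; the source only says that "some amino acids" are replaced.

```python
# The 20 one-letter amino acid codes listed in paragraph [0021].
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def single_substitutions(sequence, position):
    """Return every sequence that differs from `sequence` only at `position` (0-based).

    Hypothetical helper: the source does not specify how perturbed samples are generated.
    """
    original = sequence[position]
    return [
        sequence[:position] + residue + sequence[position + 1:]
        for residue in AMINO_ACIDS
        if residue != original
    ]


# Perturb the second residue of a short illustrative sequence Ai to obtain candidates Aj.
variants = single_substitutions("ACDE", 1)
print(len(variants))  # 19 alternatives, one per remaining amino acid
```

Each returned sequence is a candidate Aj that is close to Ai in sequence but may differ sharply in its evaluation values.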

[0023] Evaluation item group 303 is a set of evaluation items. Evaluation items are the characteristics of a sample used to search for candidate samples in multi-objective optimization. Figure 3 shows examples of evaluation items, including binding affinity information 303a, solubility information 303b, and hydrophobicity information 303c. There should be at least two evaluation items.

[0024] The binding affinity information 303a includes binding affinity 331a and coefficient data 332a. Binding affinity 331a is an evaluation value indicating the strength of the binding interaction between the target amino acid sequence 302 and the ligand (e.g., a drug or inhibitor). A smaller evaluation value indicates that the target and ligand bind more easily.

[0025] The coefficient data 332a is a set of one or more coefficients in a discrimination model that discriminates the binding affinity 331a of amino acid sequence 302, and is represented by a vector.

[0026] The discrimination model for binding affinity 331a is a linear model whose explanatory variables are the discrete values Xk unique to each amino acid in amino acid sequence 302 (k identifies the position of the amino acid counted from the beginning, k = 1 to K; in this example, K = 20) and whose dependent variable is Ya. Whether amino acid sequence 302 has binding affinity 331a is determined by whether Ya is equal to or greater than a predetermined threshold for discriminating binding affinity 331a (see equation (1) below).

[0027]

$$Y_a = \sum_{k=1}^{K} a_k X_k \qquad (1)$$

[0028] ak is the k-th coefficient from the beginning in the coefficient data 332a. That is, the coefficient data 332a is (a1, a2, ..., aK).
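The linear discrimination model of paragraphs [0026] to [0028] can be sketched in Python as a weighted sum followed by a threshold test. Several details below are assumptions, not taken from the source: the numeric encoding of each amino acid (here its 1-based index in the alphabet), the absence of an intercept term, and the example coefficients and thresholds.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# Hypothetical encoding: the source only says each amino acid has a unique discrete value Xk.
ENCODING = {residue: index + 1 for index, residue in enumerate(AMINO_ACIDS)}


def discriminant_value(coefficients, sequence):
    """Compute Y as the sum over positions k of a_k * X_k (no intercept assumed)."""
    if len(coefficients) != len(sequence):
        raise ValueError("coefficient data and sequence must have the same length K")
    return sum(a * ENCODING[residue] for a, residue in zip(coefficients, sequence))


def discriminate(coefficients, sequence, threshold):
    """Return True when Y is equal to or greater than the threshold."""
    return discriminant_value(coefficients, sequence) >= threshold


# Illustrative coefficient data (a1, a2, a3) for a sequence of length K = 3.
coefficient_data = [0.5, -1.0, 2.0]
y = discriminant_value(coefficient_data, "ACD")  # A=1, C=2, D=3
print(y)  # 4.5
print(discriminate(coefficient_data, "ACD", threshold=4.0))  # True
```

In Example 1 the sequence length would be K = 20 and the coefficient data would be read from sample DB104 rather than written inline.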

[0029] Solubility information 303b includes solubility 331b and coefficient data 332b. Solubility 331b is an evaluation value that indicates the limit amount of amino acid sequence 302 that dissolves in a certain amount of water at a certain temperature; a smaller evaluation value indicates that amino acid sequence 302 is more easily soluble.

[0030] The coefficient data 332b is a set of one or more coefficients in a discrimination model that discriminates the solubility 331b of amino acid sequence 302, and is represented by a vector.

[0031] The discrimination model for solubility 331b is a linear model whose explanatory variables are the discrete values Xk specific to each amino acid in amino acid sequence 302 and whose dependent variable is Yb. Whether amino acid sequence 302 has solubility 331b is determined by whether Yb is greater than or equal to a predetermined threshold for discriminating solubility 331b (see equation (2) below).

[0032]

$$Y_b = \sum_{k=1}^{K} b_k X_k \qquad (2)$$

[0033] bk is the k-th coefficient from the beginning in the coefficient data 332b. That is, the coefficient data 332b is (b1, b2, ..., bK).

[0034] The hydrophobicity information 303c includes hydrophobicity 331c and coefficient data 332c. Hydrophobicity 331c is an evaluation value indicating how poorly amino acid sequence 302 dissolves in, or mixes with, water. A higher evaluation value indicates that amino acid sequence 302 is less soluble in, or less miscible with, water.

[0035] The coefficient data 332c is a set of one or more coefficients in a discrimination model that discriminates the hydrophobicity 331c of amino acid sequence 302, and is represented by a vector.

[0036] The discrimination model for hydrophobicity 331c is a linear model whose explanatory variables are the discrete values Xk unique to each amino acid in amino acid sequence 302 and whose dependent variable is Yc. Whether amino acid sequence 302 has hydrophobicity 331c is determined by whether Yc is greater than or equal to a predetermined threshold for discriminating hydrophobicity 331c (see equation (3) below).

[0037]

$$Y_c = \sum_{k=1}^{K} c_k X_k \qquad (3)$$

[0038] ck is the k-th coefficient from the beginning in the coefficient data 332c. That is, the coefficient data 332c is (c1, c2, ..., cK).

[0039] In equations (1) to (3) above, if the dependent variables Ya to Yc are not distinguished, they are referred to as the dependent variable Y. Similarly, if the coefficient data 332a to 332c are not distinguished, they are referred to as coefficient data 332. Coefficient data 332 may be given in advance or derived by a specific process. In Example 1, coefficient data 332 is assumed to be given in advance. The derivation of coefficient data 332 will be described later in Example 2.

[0040] <Figure 4 Cluster DB105> Figure 4 is an explanatory diagram showing an example of cluster DB105. Cluster DB105 is a database created by clustering the coefficient data 332 of each discrimination model for multiple samples for each evaluation item. Cluster DB105 has the following fields: sample ID 301 and cluster information 401.

[0041] Cluster information 401 includes a binding affinity cluster 401a whose evaluation item is binding affinity 331a, a solubility cluster 401b whose evaluation item is solubility 331b, and a hydrophobic cluster 401c whose evaluation item is hydrophobicity 331c. In the binding affinity cluster 401a, two or more clusters Ca1, Ca2, ... are generated; in the solubility cluster 401b, two or more clusters Cb1, Cb2, ... are generated; and in the hydrophobic cluster 401c, two or more clusters Cc1, Cc2, ... are generated.

[0042] The entry for each sample identified by sample ID 301 in cluster DB105 indicates which cluster the coefficient data 332 of the discrimination model for that sample belongs to. "1" indicates that it belongs to that cluster, and "0" indicates that it does not belong to that cluster. For example, the coefficient data 332 of the discrimination model for amino acid sequence 302 with sample ID 301 "S1" can be seen to belong to clusters Ca1 and Ca2, which are binding affinity clusters 401a, to cluster Cb1, which is solubility cluster 401b, and to cluster Cc2, which is hydrophobic cluster 401c.

[0043] <Figure 5 Overlap DB106> Figure 5 is an explanatory diagram showing an example of overlap DB106. Overlap DB106 is a database created by the process of identifying overlaps between clusters of different evaluation items in cluster DB105. Overlap DB106 has the following fields: cluster combination 501, member sample ID 502, and sample number 503.

[0044] Cluster combination 501 is an exhaustive combination of clusters across different evaluation items. In this example, there are three evaluation items: binding affinity 331a, solubility 331b, and hydrophobicity 331c. Therefore, cluster combination 501 includes combinations of clusters for binding affinity 331a, solubility 331b, and hydrophobicity 331c; combinations of clusters for binding affinity 331a and solubility 331b; combinations of clusters for binding affinity 331a and hydrophobicity 331c; and combinations of clusters for solubility 331b and hydrophobicity 331c.

[0045] The member sample ID 502 is the sample ID 301 of the amino acid sequence 302 whose coefficient data 332 belong to every cluster in the cluster combination 501. For example, in the entry in the second row, the member sample ID 502 for the combination of cluster Ca1 for binding affinity 331a, cluster Cb1 for solubility 331b, and cluster Cc2 for hydrophobicity 331c is "S1".

[0046] This indicates that the coefficient data 332 of amino acid sequence 302 "A1", with sample ID 301 "S1", belong to all of cluster Ca1 for binding affinity 331a, cluster Cb1 for solubility 331b, and cluster Cc2 for hydrophobicity 331c. In other words, in the entry for sample ID 301 "S1" in cluster DB105, the values for cluster Ca1 for binding affinity 331a, cluster Cb1 for solubility 331b, and cluster Cc2 for hydrophobicity 331c are all set to "1".

[0047] If no member sample ID 502 exists, there are no overlapping coefficient data 332 in that cluster combination 501. The sample number 503 is the number of samples listed in member sample ID 502.
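The construction of overlap DB106 from cluster DB105 can be sketched as follows. The entry for "S1" mirrors the example in paragraph [0042]; the entries for "S2" and "S3", the item names, and the function name `build_overlap_db` are hypothetical and serve only to make the sketch runnable.

```python
from itertools import combinations, product

# Evaluation items in a fixed order (names are illustrative).
ITEMS = ["binding_affinity", "solubility", "hydrophobicity"]

# Cluster DB: sample ID -> evaluation item -> clusters whose flag is "1".
cluster_db = {
    "S1": {"binding_affinity": {"Ca1", "Ca2"}, "solubility": {"Cb1"}, "hydrophobicity": {"Cc2"}},
    "S2": {"binding_affinity": {"Ca1"}, "solubility": {"Cb2"}, "hydrophobicity": {"Cc2"}},  # hypothetical
    "S3": {"binding_affinity": {"Ca2"}, "solubility": {"Cb1"}, "hydrophobicity": {"Cc1"}},  # hypothetical
}


def build_overlap_db(cluster_db, items):
    """Map each cluster combination across two or more items to its member sample IDs."""
    clusters = {
        item: sorted({c for row in cluster_db.values() for c in row.get(item, ())})
        for item in items
    }
    overlap = {}
    for size in range(2, len(items) + 1):
        for chosen in combinations(items, size):
            for combo in product(*(clusters[item] for item in chosen)):
                # A sample is a member only if it belongs to every cluster in the combination.
                overlap[combo] = sorted(
                    sample_id
                    for sample_id, row in cluster_db.items()
                    if all(c in row.get(item, ()) for item, c in zip(chosen, combo))
                )
    return overlap


overlap_db = build_overlap_db(cluster_db, ITEMS)
members = overlap_db[("Ca1", "Cb1", "Cc2")]
print(members, len(members))  # member sample IDs and the sample number
```

Combinations that map to an empty list correspond to entries with no member sample ID, that is, cluster combinations with no overlapping coefficient data.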

[0048] <Figure 6: Example of Functional Configuration of Analysis Device 101> Figure 6 is a block diagram showing an example of the functional configuration of the analysis device 101. The analysis device 101 includes an acquisition unit 601, a classification unit 602, an identification unit 603, a determination unit 604, and an output unit 605. Specifically, the acquisition unit 601, classification unit 602, identification unit 603, determination unit 604, and output unit 605 are realized, for example, by causing the processor 201 to execute a program stored in the storage device 202 shown in Figure 2.

[0049] The acquisition unit 601 acquires the amino acid sequence 302 and its coefficient data 332a, 332b, and 332c from the sample DB 104.

[0050] The classification unit 602 clusters the coefficient data 332 of each discrimination model for the multiple amino acid sequences 302 acquired by the acquisition unit 601 for each evaluation item. That is, the classification unit 602 classifies the coefficient data 332 of the multiple discrimination models into multiple clusters for each evaluation item. The classification unit 602 identifies the sample (amino acid sequence 302) of coefficient data 332 belonging to each cluster and generates cluster information 401.

[0051] The classification unit 602 clusters the set of coefficient data 332, for each evaluation item, in the feature space whose features are these K = 20 coefficients. Therefore, in this example, a group of binding affinity clusters 401a, a group of solubility clusters 401b, and a group of hydrophobic clusters 401c are generated.

[0052] The identification unit 603 identifies overlaps between clusters of different evaluation items. Specifically, for example, the identification unit 603 creates an overlap DB 106 by referring to the cluster DB 105 and identifying samples of coefficient data 332 where clusters of different evaluation items overlap.

[0053] The determination unit 604 determines the characteristics of the geometric structure near the Pareto front. Specifically, for example, the determination unit 604 performs singularity possibility determination and independence determination.

[0054] Singularity possibility determination is the process of determining whether or not there is a possibility of a singularity between two overlapping clusters. Specifically, for example, the determination unit 604 determines that there is a possibility of a singularity between representative discriminant models if the signs of the coefficient data 332 of the representative discriminant model of one cluster and the signs of the coefficient data 332 of the representative discriminant model of the other cluster are inverted. A representative discriminant model is a discriminant model that uses coefficient data obtained by statistically processing (for example, averaging) the coefficient data 332 within the cluster.

[0055] This result makes it easier to search for singularities. In other words, it indicates that the optimal point for at least one of the evaluation items in both overlapping clusters is likely to be near a singularity. Singularities are critical points where the phenotype undergoes a phase transition and are biologically distinctive.
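The sign-inversion test described for singularity possibility determination can be sketched as below. The source does not state whether every coefficient or only some coefficients must change sign, so the `min_fraction` parameter is an assumption introduced here, as are the function names and the example vectors.

```python
def sign(value):
    """Return -1, 0, or 1 according to the sign of value."""
    return (value > 0) - (value < 0)


def inverted_positions(rep_a, rep_b):
    """1-based positions k where the two representative coefficients have opposite signs."""
    if len(rep_a) != len(rep_b) or not rep_a:
        raise ValueError("representative coefficient data must be non-empty and equal in length")
    return [
        k
        for k, (a, b) in enumerate(zip(rep_a, rep_b), start=1)
        if sign(a) * sign(b) < 0
    ]


def singularity_possible(rep_a, rep_b, min_fraction=1.0):
    """Flag a possible singularity when enough positions are sign-inverted.

    min_fraction is a hypothetical knob: 1.0 requires inversion at every position.
    """
    return len(inverted_positions(rep_a, rep_b)) >= min_fraction * len(rep_a)


# Illustrative representative coefficient data for two overlapping clusters.
rep_affinity = [0.8, -0.3, 1.2]
rep_solubility = [-0.5, 0.4, -0.9]
print(inverted_positions(rep_affinity, rep_solubility))    # [1, 2, 3]
print(singularity_possible(rep_affinity, rep_solubility))  # True
```

Reporting the inverted positions alongside the yes/no result may help when interpreting which sequence positions drive the opposing behavior.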

[0056] Furthermore, independence determination is the process of determining the independence of the representative discriminant models for two overlapping clusters. Independence of representative discriminant models means that the overlap between the two clusters is maintained even if the coefficient at a certain sequence position k=p of one representative discriminant model, or the coefficient at a certain sequence position k=q (≠p) of the other representative discriminant model, is increased or decreased. This determination result makes it easier to estimate evaluation items that may be subject to simultaneous optimization.

[0057] The output unit 605 outputs various types of data. Specifically, for example, the output unit 605 outputs the cluster groups classified by the classification unit 602 for each evaluation item, the identification results from the identification unit 603, and the determination results from the determination unit 604 in a displayable format. The identification results indicate regions where a Pareto solution may exist. Specifically, for example, the identification results are the sample IDs 301 corresponding to the coefficient data 332 in regions where clusters of different evaluation items overlap. Each sample ID 301 is displayed as a point in the space spanned by the overlapping evaluation items (hereinafter referred to as the characteristic space).

[0058] For example, if amino acid sequence 302 "A1" (sample ID 301 "S1") lies in the overlap between clusters for binding affinity 331a and solubility 331b, then sample ID 301 "S1" is displayed as a point in the characteristic space spanned by the axes of binding affinity 331a and solubility 331b, at the position where the value of binding affinity 331a is Fa1 and the value of solubility 331b is Fb1. The value S1 of sample ID 301, the value A1 of amino acid sequence 302, the value Fa1 of binding affinity 331a, and the value Fb1 of solubility 331b are associated with this point, and this information can be displayed, for example, by hovering the mouse pointer over the point.

[0059] Here, the overlapping characteristics are taken to be binding affinity 331a and solubility 331b, but any combination of at least two of binding affinity 331a, solubility 331b, and hydrophobicity 331c is acceptable.

[0060] The output unit 605 may display this output data on a display, which is an example of an output device 204, or it may transmit it from the communication IF 205 to the terminal device 102 via the network 103 and have it displayed on a display, which is an example of an output device 204 of the terminal device 102.

[0061] <Figure 7: Example of analysis processing procedure by analysis device 101> Figure 7 is a flowchart showing an example of the analysis processing procedure performed by the analysis device 101.

[0062] (Step S701) The analysis device 101 determines whether or not there are any unselected evaluation items. If there are any unselected evaluation items (step S701: Yes), the process proceeds to step S702. If there are no unselected evaluation items (step S701: No), the process proceeds to step S705.

[0063] (Step S702) The analysis device 101 selects one unselected evaluation item. The selected evaluation item is referred to as the selected evaluation item.

[0064] (Step S703) The analysis device 101 uses the acquisition unit 601 to acquire coefficient data 332 of the discrimination model for the selected evaluation items from the sample DB 104 for each sample.

[0065] (Step S704) The analysis device 101 performs clustering of the discrimination model using the classification unit 602 and returns to step S701. Details of the clustering process of the discrimination model will be described later.

[0066] (Step S705) The analysis device 101 uses the identification unit 603 to identify samples of coefficient data 332 where clusters between evaluation items overlap, and creates an overlap DB 106.

[0067] (Step S706) The analysis device 101 uses the determination unit 604 to perform the singularity possibility determination and the independence determination, based on the coefficient data of the representative discriminant models of the two overlapping clusters.

[0068] (Step S707) The analysis device 101 uses the output unit 605 to output, in a displayable format, the sample amino acid sequence 302, the cluster groups classified by the classification unit 602 for each evaluation item, the identification results from the identification unit 603, and the determination results from the determination unit 604. The identification results from the identification unit 603 indicate the region where a Pareto solution may exist. This completes the series of processes.

[0069] <Figure 8: Clustering process of the discrimination model (step S704)> Figure 8 is a flowchart showing a detailed example of the clustering process (step S704) of the discrimination model. Step S704 is performed on the evaluation item selected in step S702.

[0070] (Step S801) The analysis device 101 sets an initial value for the number of clusters. The number of clusters thus set is referred to as the set number of clusters P. The initial value of the set number of clusters P is, for example, P = 2. The upper limit of the set number of clusters P is referred to as the upper limit Pmax. Pmax is an integer of 2 or more and is set in advance.

[0071] (Step S802) The analysis device 101 clusters the sample group for the selected evaluation item into the set number of clusters P, based on the coefficient data 332 of the discrimination model for each sample (amino acid sequence 302).

[0072] (Step S803) The analysis device 101 calculates the average value for each coefficient at the same position in the multiple coefficient data included in the cluster for each cluster in the cluster group generated in step S802, and generates a representative discriminant model for each cluster that represents the cluster.
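Steps S802 and S803 can be sketched for a fixed number of clusters P as follows. The source does not name a clustering algorithm, so k-means is assumed here purely for illustration; the function names, the random seed, and the toy two-dimensional coefficient data are likewise hypothetical. With k-means, each final centroid is the per-position average of its members, which matches the representative coefficient data of step S803.

```python
import random


def squared_distance(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))


def mean_vector(vectors):
    """Per-position average of coefficient vectors (step S803)."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def cluster_coefficients(vectors, p, max_iterations=100, seed=0):
    """Assumed k-means stand-in for step S802: split `vectors` into `p` clusters."""
    rng = random.Random(seed)
    centroids = [list(v) for v in rng.sample(vectors, p)]
    labels = None
    for _ in range(max_iterations):
        new_labels = [
            min(range(p), key=lambda c: squared_distance(v, centroids[c]))
            for v in vectors
        ]
        if new_labels == labels:
            break  # assignments are stable
        labels = new_labels
        for c in range(p):
            members = [v for v, label in zip(vectors, labels) if label == c]
            if members:  # keep the previous centroid if a cluster becomes empty
                centroids[c] = mean_vector(members)
    return labels, centroids


# Toy coefficient data with K = 2 for four samples (illustrative values).
coefficient_data = [[1.0, 1.0], [1.2, 0.9], [-1.0, -1.0], [-0.9, -1.1]]
labels, representatives = cluster_coefficients(coefficient_data, p=2)
print(labels)
print(representatives)
```

How the device moves from the initial P = 2 toward Pmax, and which value of P it finally adopts, is not covered by this sketch.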

[0073] <Figure 9: Example of generating a representative discriminant model> Figure 9 is an explanatory diagram illustrating an example of representative discriminant model generation. In Figure 9, the selected evaluation item is assumed to be binding affinity. The coefficient data group 900 is the set of coefficient data 332. In the example in Figure 9, since the selected evaluation item is binding affinity, the coefficient data group 900 is the set of coefficient data 332a.

[0074] Assume that in step S802, the coefficient data group 900 was classified into P clusters, CL1 to CLP. The sample ID 301, amino acid sequence 302, and coefficient data 332a for binding affinity information 303a of the samples belonging to cluster CLp among clusters CL1 to CLP are as shown in Table 901. Table 901 is an excerpt of information from sample DB 104, showing some of the entries for samples belonging to cluster CLp. p is an integer satisfying 1 ≤ p ≤ P.

[0075] As shown in Table 901, the sample IDs 301 of the samples belonging to cluster CLp are, for example, six of S1, S2, S15, S28, S33, and S47. The analysis device 101 calculates the average value of the coefficients at the same positions of Va1, Va2, Va15, Va28, Va33, and Va47, which are the coefficient data 332a of the samples belonging to cluster CLp.

[0076] Specifically, for example, taking the coefficients a11, a12, a115, a128, a133, and a147 at the leading positions of the coefficient data Va1, Va2, Va15, Va28, Va33, and Va47, the analysis device 101 calculates their average value a_av1.

[0077] Similarly, for the coefficients from the second position to the K-th position, the analysis device 101 calculates the average values a_av2, ..., a_avK. The average values a_av1, a_av2, ..., a_avK at each position constitute the representative coefficient data Va_av(p) of cluster CLp. The linear model in the K-dimensional feature space (K = 20 in the case of the amino acid sequence 302) whose coefficients are a_av1, a_av2, ..., a_avK becomes the representative discrimination model of cluster CLp.
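The position-wise averaging of step S803 can be illustrated with a short sketch. The function name and the sample values below are hypothetical and only show the arithmetic; they are not taken from Table 901.

```python
import numpy as np

def representative_coefficients(coeff_rows):
    """Average coefficient data position by position (step S803).

    coeff_rows: one row of K coefficients per sample in the cluster.
    Returns the K averaged coefficients (the representative coefficient data).
    """
    coeffs = np.asarray(coeff_rows, dtype=float)  # shape: (samples, K)
    if coeffs.ndim != 2 or coeffs.shape[0] == 0:
        raise ValueError("expected a non-empty 2-D array of coefficient data")
    return coeffs.mean(axis=0)

# Hypothetical coefficient data for three samples of one cluster, K = 3.
cluster_coeffs = [[1.0, -2.0, 0.5],
                  [3.0, 0.0, 0.5],
                  [2.0, -1.0, 2.0]]
rep = representative_coefficients(cluster_coeffs)
```

Each element of `rep` is the mean of the coefficients found at the same position across the samples of the cluster.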

[0078] [Figure 10 Representative Discrimination Model] Figure 10 is an explanatory diagram illustrating a representative discriminant model. In Figure 10, binding affinity is used as an example of the selection evaluation item. In Figure 10, a condition-satisfying sample is a sample that satisfies the condition (in the case of binding affinity, binding affinity 331a is below the threshold), and a condition-violating sample is a sample that violates the condition (in the case of binding affinity, binding affinity 331a is greater than the threshold). As shown in Figure 9, by calculating the representative coefficient data Va_av(p), the representative discriminant model RMp for cluster CLp is obtained. The same applies to the representative discriminant model RM1 for cluster CL1 and the representative discriminant model RMP for cluster CLP.

[0079] (Step S804) Returning to Figure 8, the analysis device 101 calculates the discrimination accuracy and variance of the representative discriminant model for each cluster. Specifically, for example, the analysis device 101 obtains all sample IDs 301 belonging to cluster CLp using cluster DB 105. The analysis device 101 extracts the amino acid sequence 302 of each sample ID 301 belonging to cluster CLp from sample DB 104. The analysis device 101 substitutes the discrete values Xk of the amino acid sequence 302 of each extracted sample ID 301 into the representative discriminant model RMp for cluster CLp, counts the positive results, and divides that count by the number of sample IDs 301 belonging to cluster CLp. This division result is the discrimination accuracy. In Example 1, a sample is classified as correct if the representative discriminant model RMp is positive.
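As a minimal sketch of this accuracy calculation, assuming the representative discriminant model is a linear form without an intercept and that each sample is encoded as K numeric values (both are assumptions; the names are hypothetical):

```python
import numpy as np

def discrimination_accuracy(rep_coeffs, features):
    """Fraction of samples for which the representative model is positive.

    rep_coeffs: K coefficients of the representative discriminant model.
    features:   one row of K values Xk per sample in the cluster.
    """
    rep = np.asarray(rep_coeffs, dtype=float)
    x = np.asarray(features, dtype=float)      # shape: (t, K)
    if x.ndim != 2 or x.shape[0] == 0:
        raise ValueError("expected at least one sample")
    scores = x @ rep                           # model output per sample
    return np.count_nonzero(scores > 0) / x.shape[0]

# Hypothetical cluster with four samples and K = 2.
accuracy = discrimination_accuracy([1.0, -1.0],
                                   [[2.0, 1.0],
                                    [1.0, 3.0],
                                    [0.5, 0.25],
                                    [1.0, 1.0]])
```

A model output of exactly zero is counted as not positive here, which is a choice of this sketch.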

[0080] A higher representative discriminant model accuracy indicates better performance in classifying the evaluation item.

[0081] The variance D is calculated using the following equation (4). In equation (4), y on the right-hand side is the result of substituting the discrete values Xk of each amino acid in the amino acid sequence 302 of a sample within cluster CLp into the discriminant model represented by the coefficient data 332 of that sample. yp on the right-hand side is the result of substituting the discrete values Xk of each amino acid in the amino acid sequence 302 of a sample within cluster CLp into the representative discriminant model RMp of cluster CLp. t is the number of samples within cluster CLp. A smaller variance D indicates less variability of the coefficient data 332 from the representative discriminant model RMp.

[0082]

[Equation (4)]
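Equation (4) is an image in the source and is not reproduced in this text. Going only by the definitions in paragraph [0081] (y, yp, and the sample count t), one plausible form, offered as an assumption rather than as the published formula, is the mean squared deviation over the samples j of cluster CLp:

```latex
D = \frac{1}{t} \sum_{j=1}^{t} \left( y_j - y_{p,j} \right)^2
```

Here y_j and y_{p,j} denote the values of y and yp for the j-th sample; the index j is introduced for this sketch.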

[0083] (Step S805) The analysis device 101 calculates the information standard quantity for each cluster and calculates the average information standard quantity, which is the average of the information standard quantities. The information standard quantity is the sum of the discrimination accuracy and the variance D.

[0084] (Step S806) The analysis device 101 determines whether the set number of clusters P is equal to the upper limit number of clusters Pmax. If the set number of clusters P is not equal to the upper limit number of clusters Pmax (step S806: No), the process proceeds to step S807. If the set number of clusters P is equal to the upper limit number of clusters Pmax (step S806: Yes), the process proceeds to step S808.

[0085] (Step S807) The analysis device 101 increments the set number of clusters P and returns to step S802. As a result, steps S802 to S805 are repeatedly executed until the set number of clusters P reaches the upper limit of the number of clusters Pmax.

[0086] (Step S808) Among the clustering results that minimize the average information standard quantity, the analysis device 101 adopts the one with the smallest number of clusters, and returns to step S701. The adopted clustering result is used in steps S705 to S707.
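The loop over steps S801 to S808 can be sketched as follows. The source does not name a clustering algorithm, so `cluster_and_score` is a hypothetical callable standing in for steps S802 to S805: it receives the set number of clusters P and returns the average information standard quantity for that P.

```python
def choose_cluster_count(p_max, cluster_and_score, p_init=2):
    """Return the smallest P in [p_init, p_max] with the minimal average value."""
    if p_max < p_init:
        raise ValueError("p_max must be at least p_init")
    averages = {}
    for p in range(p_init, p_max + 1):         # S802 to S807: try each P in turn
        averages[p] = cluster_and_score(p)     # S802 to S805 (stand-in)
    best = min(averages.values())
    # S808: among the minimizers, adopt the smallest number of clusters.
    return min(p for p, value in averages.items() if value == best)

# Hypothetical averages for P = 2 .. 5; P = 3 and P = 4 tie for the minimum.
recorded = {2: 0.91, 3: 0.74, 4: 0.74, 5: 0.80}
chosen = choose_cluster_count(5, recorded.__getitem__)
```

Ties are broken toward the smaller P, matching the wording of step S808.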

[0087] <Figure 11 Example of determining the possibility of a singularity> Figure 11 is an explanatory diagram illustrating an example of singularity determination in step S706. In Figure 11, cluster CLa is one of the cluster groups whose evaluation item is binding affinity, and representative discriminant model RMa is the representative discriminant model for cluster CLa. Similarly, cluster CLb is one of the cluster groups whose evaluation item is solubility, and representative discriminant model RMb is the representative discriminant model for cluster CLb. To simplify the explanation, let K=3. In this case, representative discriminant models RMa and RMb are expressed as first-order linear models with K=3 terms (Figure 12 is similar).

[0088] The analysis device 101 determines whether the signs of all coefficients at the same position are reversed between the representative discriminant models RMa and RMb. If the signs are all reversed, it is determined that there is a possibility of a singularity existing in the overlapping region 1100 of clusters CLa and CLb. The overlapping region 1100 of clusters CLa and CLb is a collection of coefficient data 332 belonging to both clusters CLa and CLb.
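A minimal sketch of this sign test, assuming each representative discriminant model is given as its list of K coefficients (the function name is hypothetical, and a zero coefficient is treated as "not reversed", which is a choice of this sketch):

```python
import numpy as np

def may_contain_singularity(rm_a, rm_b):
    """True if the coefficients at every position have opposite signs."""
    a = np.asarray(rm_a, dtype=float)
    b = np.asarray(rm_b, dtype=float)
    if a.shape != b.shape:
        raise ValueError("models must have the same number of coefficients")
    # Opposite signs at a position means the product there is negative.
    return bool(np.all(a * b < 0))

# Hypothetical coefficients with K = 3.
all_reversed = may_contain_singularity([1.0, -2.0, 3.0], [-0.5, 4.0, -1.0])
partly_reversed = may_contain_singularity([1.0, -2.0, 3.0], [-0.5, -4.0, -1.0])
```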

[0089] <Figure 12 Example of independence determination> Figure 12 is an explanatory diagram showing an example of independence determination in step S706. Independence means that changing the coefficient of one representative discriminant model RMa does not affect the other representative discriminant model RMb, and changing the coefficient of the other representative discriminant model RMb does not affect the first representative discriminant model RMa; in other words, the coefficients of representative discriminant models RMa and RMb can be changed independently. If independence is present, it indicates that the trade-off relationship between evaluation items is weak, and it is possible to treat each evaluation item independently. Therefore, it becomes easier to estimate evaluation items that have the potential for simultaneous optimization.

[0090] Specifically, for example, representative discriminant models RMa and RMb can be said to be independent if they are approximately orthogonal. For example, let RMa = a1x1 + a2x2 + a3x3 and RMb = b1x1 + b2x2 + b3x3. In that case, when coefficient a3 ≈ 0 and coefficient b3 ≈ 0, the analysis device 101 determines that the representative discriminant models RMa and RMb are approximately orthogonal if a1×b1 + a2×b2 ≈ 0.
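The orthogonality test can be sketched as below. The source only says the inner product is approximately zero; normalizing by the vector lengths and the tolerance `tol` are assumptions of this sketch, as is the function name.

```python
import math

def approximately_orthogonal(rm_a, rm_b, tol=1e-2):
    """True if the cosine between the two coefficient vectors is near zero."""
    if len(rm_a) != len(rm_b):
        raise ValueError("models must have the same number of coefficients")
    dot = sum(a * b for a, b in zip(rm_a, rm_b))
    norm_a = math.sqrt(sum(a * a for a in rm_a))
    norm_b = math.sqrt(sum(b * b for b in rm_b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("a zero coefficient vector has no direction")
    return abs(dot / (norm_a * norm_b)) <= tol

# Hypothetical models with a3 and b3 close to zero.
independent = approximately_orthogonal([1.0, 0.0, 0.001], [0.0, 2.0, 0.001])
dependent = approximately_orthogonal([1.0, 1.0, 0.0], [1.0, 0.5, 0.0])
```

Using the cosine rather than the raw inner product keeps the test independent of the scale of the coefficients.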

[0091] <Figure 13 Output Result> Figure 13 is an explanatory diagram showing an example of the output results in the output processing (step S707). The output results, as described above, are the cluster groups classified by the classification unit 602 for each evaluation item, the identification results by the identification unit 603, and the determination results by the determination unit 604.

[0092] The example shown in Figure 13 illustrates how the clustering results 1300, classified by evaluation item by the classification unit 602, and the identification results 1301 to 1303, performed by the identification unit 603, are output as output results. Identification results 1301 to 1303 are sample sets belonging to regions where Pareto solutions may exist.

[0093] The clustering result 1300 includes cluster groups CLa1 to CLa4 related to binding affinity, cluster groups CLb1 and CLb2 related to solubility, and cluster groups CLc1 and CLc2 related to hydrophobicity.

[0094] The analysis device 101 accepts the selection of multiple evaluation items from the input device 203 of the analysis device 101 or the input device 203 of the terminal device 102. If binding affinity, solubility, and hydrophobicity are selected, an identification result 1301 is obtained that exists on the characteristic space 1310 spanned by the three axes of binding affinity 331a, solubility 331b, and hydrophobicity 331c. The identification result 1301 is a sample population corresponding to coefficient data 332 belonging to the region where cluster CLa1 for binding affinity, cluster CLb2 for solubility, and cluster CLc1 for hydrophobicity overlap.

[0095] Furthermore, if binding affinity and hydrophobicity are selected, an identification result 1302 is obtained that exists on the characteristic space 1320 spanned by the two axes of binding affinity 331a and hydrophobicity 331c. The identification result 1302 is a sample population corresponding to the coefficient data 332 belonging to the overlapping region of cluster CLa1 related to binding affinity and cluster CLc1 related to hydrophobicity.

[0096] Furthermore, if solubility and hydrophobicity are selected, identification results 1303 and 1304 are obtained that exist on the characteristic space 1310 spanned by the two axes of solubility 331b and hydrophobicity 331c. Identification result 1303 is a combination of a sample population corresponding to coefficient data 332 belonging to the region where cluster CLb2 related to solubility and cluster CLc1 related to hydrophobicity overlap, and a sample population corresponding to coefficient data 332 belonging to the region where cluster CLb2 related to solubility and cluster CLc2 related to hydrophobicity overlap.
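One way to picture the overlapping regions is as a set intersection over sample IDs, assuming each cluster is stored as the set of sample IDs 301 whose coefficient data belong to it. The cluster memberships below are hypothetical.

```python
def overlapping_samples(*clusters):
    """Sample IDs that belong to every one of the given clusters."""
    sets = [set(cluster) for cluster in clusters]
    if not sets:
        return set()
    return set.intersection(*sets)

# Hypothetical memberships for one cluster per selected evaluation item.
cl_a1 = {"S1", "S2", "S15", "S28"}     # binding affinity
cl_b2 = {"S2", "S15", "S33"}           # solubility
cl_c1 = {"S2", "S15", "S47"}           # hydrophobicity

three_items = overlapping_samples(cl_a1, cl_b2, cl_c1)
two_items = overlapping_samples(cl_a1, cl_c1)
```

Selecting fewer evaluation items intersects fewer clusters, so the resulting sample population can only stay the same or grow.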

[0097] Furthermore, regarding the results of the singularity determination, the output unit 605 may output callouts 1311 to 1313 indicating this for the samples corresponding to the coefficient data 332 within the overlapping region 1100.

[0098] Furthermore, regarding the independence determination results, as shown in Figure 12, the output unit 605 may output the representative discrimination models RMa and RMb, which are substantially orthogonal to each other, along with the independently controllable coefficient a1 of the representative discrimination model RMa and the independently controllable coefficient b2 of the representative discrimination model RMb, in a manner that can be indicated by the callouts 1201 and 1202.

[0099] Thus, Example 1 has the effect of facilitating structural analysis near the Pareto front. In particular, it has the effect of facilitating structural analysis near the Pareto front when the domain is a discrete space containing data such as amino acid sequence 302. Furthermore, since clustering is performed using the coefficient data 332 of the discriminant model, even if singularities exist, the coefficient data 332 can be obtained by tangent approximation of the geodesic, and by clustering the coefficient data 332, the vicinity of the Pareto front where singularities exist can be easily searched.

[0100] Furthermore, by evaluating the discrimination accuracy with a representative discriminant model, it is possible to capture the global structure near the Pareto front. In addition, by increasing the number of clusters and decreasing the size of each cluster, a representative discriminant model closer to a linear model for each sample within a cluster can be derived, improving discrimination accuracy. Moreover, by controlling the number of clusters to reduce the variance D of the coefficient data 332 within each cluster, it becomes easier to capture the global structure near the Pareto front.

[Examples]

[0101] Example 2 shows an example of obtaining the discrimination model in Example 1. Note that components identical to those in Example 1 are denoted by the same reference numerals, and their descriptions may be omitted.

[0102] Artificial intelligence (AI) has the ability to solve linearly inseparable problems, but it is unclear why the AI made a particular decision. In particular, machine learning methods such as deep learning have high predictive accuracy but low explanatory power. For example, if an AI outputs a binding affinity 331a of "0.03" for an amino acid sequence 302, the user cannot explain why the AI arrived at that binding affinity 331a. If the AI could present the reason, the user could understand why the binding affinity 331a of the amino acid sequence 302 is "0.03".

[0103] <Figure 14: Example of rearrangement of coefficient data 332> Figure 14 is an explanatory diagram showing an example of rearranging coefficient data 332. (A) In the feature space SP1, let a certain coefficient data 332 be a coefficient vector x_i. The coefficient vector x_i is identified, for example, as either correct label La or Lb by the nonlinear discriminant model PM1. (B) In the feature space SP2, the coefficient vector x_i likewise exists and is identified, for example, as either correct label La or Lb by the linear discriminant model PM2.

[0104] In (A), machine learning methods such as deep learning learn a new linear regression to explain the nonlinear discriminant model PM1, which is the result of the classification. Specifically, for example, such a method performs post-hoc processing that locally approximates the nonlinear discriminant model PM1 with a straight line after the model has been obtained. However, with such post-hoc processing, it is unclear whether the locally linearized part of the nonlinear discriminant model PM1 can correctly explain the coefficient vector x_i. Furthermore, and most importantly, because a logistic regression is fitted for the linear approximation, machine learning must be run twice.

[0105] Since the discriminant model PM2 in (B) is linear, its slope shows by which parameters in the feature space SP2 the coefficient vector x_i is weighted, so the coefficient vector x_i can be explained correctly. In Example 1, the multiple coefficient vectors x_i in the feature space SP1 are rearranged into another feature space SP2 without obtaining a nonlinear discriminant model PM1 as in (A). This yields the linear discriminant model PM2, so it can be seen by which parameters in the feature space SP2 the coefficient vector x_i is weighted, and the coefficient vector x_i can be explained correctly according to their importance.

[0106] In other words, for each sample having a coefficient vector x_i, the user can see which factors (features) contributed to the discrimination result, making it easier to explain why that result was obtained. Therefore, it is possible to improve the explanatory power of machine learning. In the example above, it becomes possible to understand why the AI output a binding affinity of "0.03" for the amino acid sequence 302 of a certain sample (for example, because glycine G is located at a specific position). In addition, since it is not necessary to run machine learning twice as in (A), the efficiency of machine learning can also be improved. Therefore, explanations like the one above can be provided quickly.

[0107] <Figure 15: Example of a neural network structure> Figure 15 is an explanatory diagram showing an example of the structure of a neural network according to Embodiment 2. The neural network 1500 is provided in the acquisition unit 601. The neural network 1500 includes a data unit group DU, a reporting unit group RU, a harmonizing unit group HU, a reallocation unit RAU, a unifying unit UU, a decision unit DCU, and an importance unit IU.

[0108] The data unit group DU is configured by connecting multiple data units DUl (where l is the layer number, 1 ≤ l ≤ L; L is the layer number of the lowest layer, L = 4 in Figure 15) in series. The topmost data unit DU1 (l = 1) is the input layer 1501 of the neural network 1500, and the data units DUl with l ≥ 2 correspond to the intermediate layers (also called hidden layers) of the neural network 1500. Each data unit DUl is a perceptron that takes the output data from the preceding data unit DU(l-1) as input, calculates output data using its own learning parameters, and outputs that data.

[0109] However, the data unit DU1 holds the training data during training. Here, the training data is, for example, sample data composed of combinations {x_i, t_i} of a coefficient vector x_i and correct data t_i, where the correct data t_i is the value of an evaluation item (here, as an example, the value of binding affinity 331a). The coefficient vector x_i is a K-dimensional vector.

[0110] The correct data t_i is a one-dimensional vector that expresses, in one-hot format, the presence or absence of binding affinity 331a for the amino acid sequence 302 of the coefficient vector x_i.

[0111] The coefficient vector x_i ∈ R^K (where R^K denotes the K-dimensional real numbers) is a K-dimensional vector. The function h_D^(l+1) representing the data unit DU(l+1) is expressed by the following equation (5).

[0112]

[Equation (5)]
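Equation (5) is an image in the source and is not reproduced in this text. Read from the symbol definitions in paragraphs [0111] and [0113], a plausible reconstruction (an assumption, not the published formula) is:

```latex
h_D^{l+1} = f_D^{l}\left( W_D^{l}\, h_D^{l} \right), \qquad h_D^{1} = x_i
```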

[0113] In equation (5), the index l (an integer satisfying 1 ≤ l ≤ L) indicates the layer number (the same applies to the following equations). L is an integer greater than or equal to 1 and indicates the deepest layer number. f_D^l on the right-hand side is the activation function. Various activation functions can be used, such as the sigmoid function, the hyperbolic tangent function (tanh function), or the ReLU (Rectified Linear Unit) function. The matrix W_D^l is the learning parameter of the data unit DUl. The vector h_D^l on the right-hand side is the input vector to the data unit DUl, i.e., the output vector from the preceding data unit. Note that the output vector h_D^1 from data unit DU1, for which the layer number is l = 1, is h_D^1 = x_i.

[0114] Note that data unit DU1 holds the coefficient vector x_i as test data.

[0115] The reporting unit RUl (2 ≤ l ≤ L) receives the output vector h_D^l from the data unit DUl of the same layer and reduces the number of dimensions of the output vector h_D^l. The function h_R^l representing the reporting unit RUl is expressed by the following equation (6).

[0116]

[Equation (6)]
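Equation (6) is an image in the source and is not reproduced in this text. From paragraphs [0115] and [0117] (learning parameter W_R^l, sigmoid function σ, input h_D^l), a plausible reconstruction, stated as an assumption, is:

```latex
h_R^{l} = \sigma\left( W_R^{l}\, h_D^{l} \right)
```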

[0117] In equation (6), the matrix W_R^l is the learning parameter of the reporting unit RUl. By equation (6), the K-dimensional output vector h_D^l from the data unit DUl is reduced to the m-dimensional output vector h_R^l. σ is the sigmoid function.

[0118] A harmonizing unit HUl (2 ≤ l ≤ L) is provided between the intermediate layer data unit DUl and the reallocation unit RAU, for each intermediate layer data unit DUl. The harmonizing unit HUl converts the dimensionality of each output data from the intermediate layer data unit DUl to the same size. Therefore, the reallocation unit RAU receives output data that has been standardized to the same dimensionality by the harmonizing unit HUl.

[0119] In other words, the harmonizing unit HUl receives the output vector h_D^l from the data unit DUl of the same layer and converts the number of dimensions of the output vector h_D^l to a common number of dimensions. The function h_H^l representing the harmonizing unit HUl is expressed by equation (7) below.

[0120]

[Equation (7)]
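Equation (7) is an image in the source and is not reproduced in this text. From paragraphs [0119] and [0121] (learning parameter W_H^l, activation function f_H, input h_D^l), a plausible reconstruction, stated as an assumption, is:

```latex
h_H^{l} = f_H\left( W_H^{l}\, h_D^{l} \right)
```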

[0121] In equation (7), the matrix W_H^l is the learning parameter of the harmonizing unit HUl. By this, the K-dimensional output vector h_D^l from the data unit DUl is converted to the m-dimensional output vector h_H^l. m is a hyperparameter that determines the number of dimensions. K and m may be different values from K and m of the reporting unit RUl. f_H is the activation function.

[0122] The attention unit AU uses the output vector h_R^l from each reporting unit RUl to calculate the weight α of each data unit DUl. The function α representing the attention unit AU is expressed by equation (8) below.

[0123]

[Equation (8)]

[0124] In equation (8), the matrix W_A is the learning parameter of the attention unit AU. The softmax function, one of the activation functions, calculates a vector whose number of dimensions equals the number of layers L (L = 4 in the example of equation (9) below) from the vector h_R. As shown in equation (9) below, the vector h_R on the right side of equation (8) is the vector obtained by stacking the vectors h_R^l vertically.

[0125]

[Equation (9)]

[0126] Therefore, the matrix W_A is a matrix with L rows and M columns (M is the number of elements of the vector h_R). By adopting the softmax function for the attention unit AU, each element (the sum of all elements is 1) of the vector with L dimensions represents the weight of the corresponding data unit DUl.
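Equations (8) and (9) are images in the source and are not reproduced in this text. From paragraphs [0122] to [0126], a plausible reconstruction (an assumption, not the published formulas) is the following. The stack is written over the layers 2 to L that have a reporting unit according to paragraph [0115]; the exact index range in the published equation (9) cannot be recovered from the text.

```latex
\alpha = \operatorname{softmax}\left( W_A\, h_R \right), \qquad
h_R = \begin{bmatrix} h_R^{2} \\ \vdots \\ h_R^{L} \end{bmatrix}
```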

[0127] The reallocation unit RAU rearranges the coefficient vector x_i of a certain feature space into another feature space. Specifically, for example, as shown in Figure 14, since the discrimination model obtained from the coefficient vectors x_i in the feature space SP1 can be nonlinear, the reallocation unit RAU moves the coefficient vector x_i to the feature space SP2 so that a linear discrimination model can be obtained in the feature space SP2. The function h_T^l representing the reallocation unit RAU is expressed by the following equation (10).

[0128]

[Equation (10)]

[0129] For the function f_T, the Hadamard product between vectors, element-wise addition, and the like can be used. In this embodiment, the Hadamard product is used (see the following equation (11)). Equation (11) takes the Hadamard product of the output vector h_H^l from the harmonizing unit HUl and the coefficient vector x_i.

[0130]

[Equation (11)]
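Equations (10) and (11) are images in the source and are not reproduced in this text. From paragraphs [0127] and [0129] (function f_T applied to the harmonizing unit output and the coefficient vector, realized as a Hadamard product), a plausible reconstruction, stated as an assumption, is:

```latex
h_T^{l} = f_T\left( h_H^{l},\, x_i \right) \quad \text{(10)}, \qquad
h_T^{l} = h_H^{l} \odot x_i \quad \text{(11)}
```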

[0131] The unifying unit UU integrates the output vector h_T^l from the reallocation unit RAU with the output vector α from the attention unit AU. That is, the unifying unit UU weights the output vector h_T^l from the reallocation unit RAU by the output vector α from the attention unit AU. The function h_U representing the unifying unit UU is expressed by the following equation (12).

[0132]

[Equation (12)]
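Equation (12) is an image in the source and is not reproduced in this text. From paragraphs [0131] and [0133] (the outputs h_T of the reallocation unit weighted by the elements α[g]), a plausible reconstruction is given below as an assumption. The summation bounds and the exact mapping between the index g and the layer number are not recoverable from the text and are left open.

```latex
h_U = \sum_{g} \alpha[g]\, h_T^{g}
```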

[0133] In equation (12), α[g] on the right-hand side represents the g-th element (weight) of the output vector α in equation (8).

[0134] The decision unit DCU determines the discriminant value y_i and outputs it to the output layer 1503. Specifically, for example, the decision unit DCU weights the output vector h_U from the unifying unit UU by the weight vector w_o, which is one of the learning parameters, and applies the result to the sigmoid function σ to obtain the discriminant value y_i. The function y_i representing the decision unit DCU is expressed by the following equation (13). In equation (13), the t in w_o^t denotes transposition.

[0135]

[Equation (13)]
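Equation (13) is an image in the source and is not reproduced in this text. From paragraph [0134] (h_U weighted by the transposed weight vector w_o and passed through the sigmoid function σ), a plausible reconstruction, stated as an assumption, is:

```latex
y_i = \sigma\left( w_o^{t}\, h_U \right)
```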

[0136] The importance unit IU calculates the importance vector s_i^l, which indicates the importance of the features in each layer of the neural network, and outputs it to the output layer 1503. The function s_i^l representing the importance unit IU is expressed by the following equation (14).

[0137]

[Equation (14)]
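Equation (14) is an image in the source and is not reproduced in this text. From paragraphs [0136] and [0138] (the weight α[l] applied to f_T of the weight vector w_o and the harmonizing unit output, with f_T realized as a Hadamard product), a plausible reconstruction, stated as an assumption, is:

```latex
s_i^{l} = \alpha[l]\, f_T\left( w_o,\, h_H^{l} \right) = \alpha[l] \left( w_o \odot h_H^{l} \right)
```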

[0138] In equation (14), α[l] on the right-hand side represents the l-th element (weight) of the output vector α. For the function f_T, as in equation (10), the Hadamard product between vectors, element-wise addition, and the like can be used. In Example 1, the Hadamard product is used. Equation (14) takes the Hadamard product of the weight vector w_o and the output vector h_H^l from the harmonizing unit HUl. The importance vector s_i^l represents the importance of the coefficient vector x_i in layer l.
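To show how the units described above fit together, here is a forward-pass sketch in NumPy. It is not the implementation of the neural network 1500: it follows the prose descriptions of equations (5) to (14), and it assumes tanh for the activation functions f_D and f_H, harmonizing-unit outputs with the same length K as x_i (so that the Hadamard product is defined), one attention weight per intermediate layer, and random hypothetical parameters.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

def forward(x, W_D, W_R, W_H, W_A, w_o):
    """One forward pass; returns (y, alpha, list of importance vectors)."""
    h_D = [x]                                   # output of DU1 is x itself
    for W in W_D:                               # data units DU2 .. DUL
        h_D.append(np.tanh(W @ h_D[-1]))
    hidden = h_D[1:]                            # outputs of the intermediate layers
    # Reporting units: reduce each hidden output, then stack the results.
    h_R = np.concatenate([sigmoid(Wr @ h) for Wr, h in zip(W_R, hidden)])
    alpha = softmax(W_A @ h_R)                  # attention unit: layer weights
    h_H = [np.tanh(Wh @ h) for Wh, h in zip(W_H, hidden)]  # harmonizing units
    h_T = [hh * x for hh in h_H]                # reallocation: Hadamard product
    h_U = sum(a * ht for a, ht in zip(alpha, h_T))          # unifying unit
    y = sigmoid(w_o @ h_U)                      # decision unit
    s = [a * (w_o * hh) for a, hh in zip(alpha, h_H)]       # importance unit
    return y, alpha, s

rng = np.random.default_rng(0)
K, m, n_hidden = 5, 2, 3                        # hypothetical sizes (L = 4)
x = rng.normal(size=K)
W_D = [rng.normal(size=(K, K)) for _ in range(n_hidden)]
W_R = [rng.normal(size=(m, K)) for _ in range(n_hidden)]
W_H = [rng.normal(size=(K, K)) for _ in range(n_hidden)]
W_A = rng.normal(size=(n_hidden, n_hidden * m))
w_o = rng.normal(size=K)
y, alpha, s = forward(x, W_D, W_R, W_H, W_A, w_o)
```

Under these assumptions the discriminant value equals the sigmoid of the inner product between x and the sum of the importance vectors, which is one way to see why the importance vectors act as the coefficients of a linear model for that sample.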

[0139] <Example of functional configuration of acquisition unit 601> Figure 16 is a block diagram showing an example of the functional configuration of the acquisition unit 601 according to Embodiment 2. The acquisition unit 601 includes an input layer 1501, an intermediate layer 1502, an output layer 1503, a conversion unit 1601, a rearrangement unit 1602, a discrimination data calculation unit 1603, an importance calculation unit 1604, a setting unit 1605, an integration unit 1606, and a reduction unit 1607. These are examples of the internal configurations of the learning unit 412 and the discrimination unit 414.

[0140] As shown in equation (7), the conversion unit 1601 reduces the number of dimensions K of the output vector h_D^l based on the output vector h_D^l from each intermediate layer data unit DUl (l ≥ 2) and the matrix W_H^l, and outputs the converted output vector h_H^l. The conversion unit 1601 is the harmonizing unit group HU described above.

[0141] As shown in equations (10) and (11), the rearrangement unit 1602 rearranges the coefficient vector x_i of the feature space SP1 into the feature space SP2 based on the converted output vector h_H^l from the conversion unit 1601 and the coefficient vector x_i of the feature space SP1 given to the input layer 1501. The rearrangement unit 1602 is the reallocation unit RAU described above.

[0142] As shown in equation (13), the discrimination data calculation unit 1603 calculates the discriminant vector y_i for the coefficient vector x_i of the feature space SP1 based on the rearrangement result h_T^l from the rearrangement unit 1602 and the weight vector w_o. The discrimination data calculation unit 1603 is the decision unit DCU described above.

[0143] As shown in equation (14), the importance calculation unit 1604 calculates the importance vector s_i^l of the coefficient vector x_i in layer l of the intermediate layer 1502 based on the converted output vector h_H^l and the weight vector w_o. The importance calculation unit 1604 is the importance unit IU described above.

[0144] For example, for a coefficient vector x_i, assume that the output vector h_D^la of a certain layer la is a feature indicating the degree of influence of hydrogen bonding, and that the output vector h_D^lb of a certain layer lb (≠ la) is a feature indicating the degree of influence of van der Waals forces. In this case, by referring to the corresponding importance vectors s_i^la and s_i^lb, the user can explain which feature the acquisition unit 601 took into account when it determined that the binding affinity 331a of the amino acid sequence 302 of the coefficient vector x_i is 0.1 or less.

[0145] For example, if the importance vector s_i^la is low but the importance vector s_i^lb is high, the user can explain that the acquisition unit 601 determined that the binding affinity 331a of the amino acid sequence 302 of the coefficient vector x_i is 0.1 or less by taking into account the influence of van der Waals forces. The importance vectors s_i^l are extracted by the importance calculation unit 1604.

[0146] As shown in equations (8) and (9), the setting unit 1605 sets the weight α of the intermediate layer 1502 based on the output vector h_D^l from the intermediate layer 1502 and the matrix W_A. The setting unit 1605 is the attention unit AU described above.

[0147] As shown in equation (12), the integration unit 1606 integrates the rearrangement result h_T^l with the weight α set by the setting unit 1605. The integration unit 1606 is the unifying unit UU described above. In this case, the discrimination data calculation unit 1603 calculates the discriminant vector y_i based on the integration result h_U from the integration unit 1606 and the weight vector w_o. The importance calculation unit 1604 calculates the importance vector s_i^l based on the weight α set by the setting unit 1605, the converted output vector h_H^l, and the weight vector w_o.

[0148] As shown in equation (6), the reduction unit 1607 reduces the number of dimensions K of the output vector h_D^l based on the output vector h_D^l from the intermediate layer 1502 and the matrix W_R^l, and outputs the reduced output vector h_R^l. The reduction unit 1607 is the reporting unit group RU described above. In this case, the setting unit 1605 sets the weight α of the intermediate layer 1502 based on the reduced output vector h_R^l from the reduction unit 1607 and the matrix W_A.

[0149] When given training data including the coefficient vector x_i of the feature space SP1 and the correct label t_i for the discriminant vector y_i, the acquisition unit 601 uses the discriminant vector y_i and the correct label t_i to optimize the matrix W_D^l, which is the first learning parameter; the matrix W_H^l, which is the second learning parameter; the weight vector w_o, which is the third learning parameter; the matrix W_A, which is the fourth learning parameter; and the matrix W_R^l, which is the fifth learning parameter, so that, for example, the cross-entropy between the correct label t_i and the discriminant value y_i is minimized.
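The training objective named here can be sketched as a loss function. The binary form below assumes a single sigmoid output per sample with labels 0 or 1, which the source does not state explicitly; the function name and the clipping constant `eps` are choices of this sketch.

```python
import numpy as np

def binary_cross_entropy(t, y, eps=1e-12):
    """Mean cross-entropy between correct labels t and discriminant values y."""
    t = np.asarray(t, dtype=float)
    y = np.clip(np.asarray(y, dtype=float), eps, 1.0 - eps)  # avoid log(0)
    return float(-np.mean(t * np.log(y) + (1.0 - t) * np.log(1.0 - y)))

# Hypothetical labels and predictions: an uninformative model vs. a better one.
loss_uninformative = binary_cross_entropy([1, 0], [0.5, 0.5])
loss_better = binary_cross_entropy([1, 0], [0.9, 0.1])
```

Optimizing the five learning parameters would then mean lowering this value over the training data with a gradient-based method, a step that is outside this sketch.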

[0150] The discrimination unit 414 sets the optimized learning parameters in the neural network 1500 and gives the coefficient vector x'_i as test data to the input layer 1501, thereby finally causing the discrimination data calculation unit 1603 to calculate the discrimination vector y'_i.

[0151] Thus, in Example 2, by rearranging the coefficient vector x_i of a sample in advance, the importance of each feature can be calculated even when the neural network is multi-layered, and explanation can be facilitated for each sample (coefficient vector x_i) with high accuracy and efficiency. Moreover, because rearranging the samples (coefficient vectors x_i) in advance yields a linear discriminant model, discriminant values can be calculated with high accuracy and low load.

[0152] Furthermore, the acquisition unit 601 of Embodiment 2 includes a conversion unit 1601, a rearrangement unit 1602, and an importance calculation unit 1604. Therefore, by rearranging the coefficient vectors (x_i, x'_i) in advance, a linear discrimination model PM2 is obtained, allowing discrimination values to be calculated with high accuracy and low load.

[0153] Also, which features the coefficient vectors (x_i, x'_i) have can be understood from the importance for each layer l calculated by the importance calculation unit 1604. This facilitates, with high accuracy and efficiency, explanation of the coefficient vectors (x_i, x'_i) given to the neural network 1500 as analysis targets.

[0154] Furthermore, since the acquisition unit 601 has a discrimination data calculation unit 1603, explanation of why the discrimination results (y_i, y'_i) were obtained by the neural network 1500 for the coefficient vectors (x_i, x'_i) under analysis can be facilitated with high accuracy and efficiency.

[0155] Furthermore, since the acquisition unit 601 includes a setting unit 1605 and an integration unit 1606, the discrimination data calculation unit 1603 can calculate discrimination results based on the rearrangement results with high accuracy.

[0156] Furthermore, the acquisition unit 601 has a reduction unit 1607, which allows for more efficient data analysis through dimensionality reduction. In addition, the acquisition unit 601 can construct a highly accurate discrimination model by learning the learning parameters.

[0157] It should be noted that the present invention is not limited to the embodiments described above, but includes various modifications and equivalent configurations within the spirit of the attached claims. For example, the embodiments described above are described in detail to make the present invention easier to understand, and the present invention is not necessarily limited to having all of the described configurations. Furthermore, some of the configurations of one embodiment may be replaced with those of another embodiment. Furthermore, some of the configurations of one embodiment may be added to those of another embodiment. Furthermore, some of the configurations of each embodiment may be added, deleted, or replaced with other configurations.

[0158] Furthermore, each of the aforementioned configurations, functions, processing units, and processing means may be implemented in hardware, for example, by designing them as integrated circuits, or they may be implemented in software by having a processor interpret and execute programs that realize each function.

[0159] Information such as programs, tables, and files that implement each function can be stored in memory, hard disks, SSDs (Solid State Drives), or on recording media such as IC (Integrated Circuit) cards, SD cards, and DVDs (Digital Versatile Discs).

[0160] Furthermore, the control lines and information lines shown are those deemed necessary for explanation purposes and do not necessarily represent all control lines and information lines required for implementation. In reality, it can be assumed that almost all components are interconnected.

[Explanation of Symbols]

[0161]
100 Analysis system
101 Analysis device
102 Terminal device
103 Network
200 Computer
201 Processor
202 Storage device
302 Amino acid sequence
412 Learning unit
414 Discrimination unit
601 Acquisition unit
602 Classification unit
603 Identification unit
604 Determination unit
605 Output unit
1500 Neural network
1601 Conversion unit
1602 Rearrangement unit

Claims

1. An analysis device having a processor that executes a program and a storage device that stores the program, wherein the processor performs: an acquisition process of acquiring, for each of a plurality of characteristics of an analysis target represented by discrete data having a plurality of discrete values, coefficient data of a linear discrimination model that discriminates the characteristic, for each piece of the discrete data; a classification process of classifying the coefficient data acquired for each piece of the discrete data by the acquisition process into a plurality of clusters for each of the characteristics; an identification process of identifying the discrete data corresponding to the discrimination models of the coefficient data belonging to two or more clusters of different characteristics classified by the classification process; and an output process of outputting the discrete data identified by the identification process.

2. The analysis device according to claim 1, wherein, in the classification process, the processor controls the number of the plurality of clusters for each characteristic based on the plurality of pieces of coefficient data belonging to each cluster.

3. The analysis device according to claim 2, wherein, in the classification process, the processor generates, for each of the plurality of characteristics and based on the plurality of pieces of coefficient data belonging to a cluster, representative coefficient data of a representative discrimination model that represents the plurality of discrimination models of that coefficient data, and controls the number of the plurality of clusters based on the plurality of pieces of representative coefficient data.

4. The analysis device according to claim 3, wherein the processor performs a determination process of determining a feature of the geometric structure near the Pareto front based on a first representative discrimination model for a first cluster of a first characteristic and a second representative discrimination model for a second cluster of a second characteristic different from the first characteristic, and, in the output process, outputs the determination result of the determination process.

5. The analysis device according to claim 4, wherein, in the determination process, the processor determines the possibility of a singularity for the coefficient data belonging to the first cluster and the second cluster.

6. The analysis device according to claim 4, wherein, in the determination process, the processor determines the independence of a first coefficient of the first representative discrimination model and a second coefficient of the second representative discrimination model.

7. The analysis device according to claim 1, wherein, in the output process, the processor accepts a selection of two or more characteristics from the plurality of characteristics and outputs the discrete data corresponding to the coefficient data belonging to each cluster of the two or more selected characteristics.

8. The analysis device according to claim 7, wherein, in the output process, the processor outputs the discrete data in a space having each of the two or more selected characteristics as an axis.

9. The analysis device according to claim 1, wherein, in the acquisition process, the processor uses a neural network comprising an input layer, an output layer, and two or more intermediate layers between the input layer and the output layer, each intermediate layer performing a calculation by giving data from the preceding layer and a first learning parameter to an activation function and outputting the calculation result to the subsequent layer, and performs: a conversion process of converting, based on the output data from each intermediate layer and a second learning parameter, each piece of output data to the same number of dimensions and outputting the converted output data; and a rearrangement process of rearranging the coefficient data of a first space into a second space based on the converted output data from the conversion process and the coefficient data of the first space, the first space having as many dimensions as the number of coefficients given to the input layer.

10. An analysis method performed by an analysis device having a processor that executes a program and a storage device that stores the program, wherein the processor performs: an acquisition process of acquiring, for each of a plurality of characteristics of an analysis target represented by discrete data having a plurality of discrete values, coefficient data of a linear discrimination model that discriminates the characteristic, for each piece of the discrete data; a classification process of classifying the coefficient data acquired for each piece of the discrete data by the acquisition process into a plurality of clusters for each of the characteristics; an identification process of identifying the discrete data corresponding to the discrimination models of the coefficient data belonging to two or more clusters of different characteristics classified by the classification process; and an output process of outputting the discrete data identified by the identification process.

11. An analysis program that causes a processor to execute: an acquisition process of acquiring, for each of a plurality of characteristics of an analysis target represented by discrete data having a plurality of discrete values, coefficient data of a linear discrimination model that discriminates the characteristic, for each piece of the discrete data; a classification process of classifying the coefficient data acquired for each piece of the discrete data by the acquisition process into a plurality of clusters for each of the characteristics; an identification process of identifying the discrete data corresponding to the discrimination models of the coefficient data belonging to two or more clusters of different characteristics classified by the classification process; and an output process of outputting the discrete data identified by the identification process.
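The classification, identification, and output processes recited in the claims can be illustrated with a minimal sketch. It is one possible reading, not the claimed implementation, and rests on several assumptions: the coefficient data is taken as already acquired (the numbers below are made up), a small k-means routine stands in for the unspecified clustering method, and "identification" is read as selecting the discrete data whose coefficient data falls in a chosen cluster of every selected characteristic. The names `kmeans`, `identify`, `stability`, and `solubility` are hypothetical.

```python
def kmeans(vectors, k, iters=20):
    """Tiny k-means stand-in; returns one cluster index per coefficient vector."""
    centers = [list(v) for v in vectors[:k]]  # deterministic init: first k vectors
    labels = [0] * len(vectors)
    for _ in range(iters):
        # Assignment step: nearest center by squared Euclidean distance.
        labels = [
            min(range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(v, centers[c])))
            for v in vectors
        ]
        # Update step: move each center to the mean of its members.
        for c in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


def identify(labels_by_characteristic, selected_clusters):
    """Indices of samples whose coefficient data lies in every selected cluster."""
    n = len(next(iter(labels_by_characteristic.values())))
    return [
        i for i in range(n)
        if all(labels_by_characteristic[name][i] == cluster
               for name, cluster in selected_clusters.items())
    ]


# Discrete data (hypothetical amino acid sequences) and, per characteristic,
# one made-up coefficient vector per sequence, in the same order.
sequences = ["AAK", "GGK", "AAR", "GGR"]
coefficients = {
    "stability": [[1.0, 0.1], [-1.0, 0.2], [0.9, 0.0], [-0.9, 0.1]],
    "solubility": [[0.2, 1.0], [0.1, 0.9], [0.0, -1.0], [0.1, -0.8]],
}

# Classification process: cluster the coefficient data separately per characteristic.
labels = {name: kmeans(vecs, k=2) for name, vecs in coefficients.items()}

# Identification process: pick, for each characteristic, the cluster that
# contains the first sequence, then find the sequences lying in all of them.
selected = {name: labs[0] for name, labs in labels.items()}
hits = [sequences[i] for i in identify(labels, selected)]

# Output process.
print(hits)  # prints ['AAK']
```

In this toy data the two characteristics partition the sequences differently, so only one sequence lies in both selected clusters. Controlling the number of clusters, generating representative coefficient data, and the singularity and independence determinations of claims 2 to 6 are outside the scope of this sketch.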