A universal data-mining platform capable of analyzing mass spectrometry (MS) serum proteomic profiles and/or gene array data to produce a biologically meaningful classification, i.e., to group biologically related specimens into clades. This platform utilizes the principles of phylogenetics, such as parsimony, to reveal susceptibility to
cancer development (or other physiological or pathophysiological conditions), to diagnose and type cancer, to identify stages of cancer, and to perform post-treatment evaluation. To place specimens into their corresponding
clade(s), the invention utilizes two algorithms: a new data-mining
parsing algorithm, and a publicly available phylogenetic
algorithm (MIX). By outgroup comparison (i.e., using a normal set as the standard reference), the
parsing algorithm identifies under- and/or overexpressed gene values or, in the case of sera, (i) novel or (ii) vanished MS peaks and peaks signifying (iii) up- or (iv) down-regulated proteins, and scores the variations as either derived (they do not exist in the outgroup set) or ancestral (they exist in the outgroup set); a derived state is given a score of “1” and an ancestral state a score of “0”. These scores are called the polarized values. Furthermore, the shared derived characters that it identifies are
potential biomarkers for cancers and other conditions and their subclasses.
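
The polarization step described above can be illustrated with a minimal sketch. It is not the invention's parsing algorithm. It assumes that "exists in the outgroup set" means "falls within the minimum-to-maximum range observed across the normal specimens for that gene or peak", which the text does not specify. The function names (outgroup_ranges, polarize, shared_derived, to_phylip) and the sample values are hypothetical. The last helper writes the 0/1 matrix in the plain sequential PHYLIP discrete-character layout (a header with the taxon and character counts, then names padded to 10 characters), which is assumed here to be a suitable input for MIX.

```python
from typing import Dict, List, Tuple

Profile = List[float]  # one value per character (a gene or an MS peak bin)


def outgroup_ranges(outgroup: List[Profile]) -> List[Tuple[float, float]]:
    """Per-character (min, max) across the normal (outgroup) specimens."""
    return [(min(col), max(col)) for col in zip(*outgroup)]


def polarize(profile: Profile, ranges: List[Tuple[float, float]]) -> List[int]:
    """Score 1 (derived) when a value lies outside the outgroup range,
    0 (ancestral) when it lies inside. The range test is an assumption."""
    return [0 if lo <= v <= hi else 1 for v, (lo, hi) in zip(profile, ranges)]


def shared_derived(matrix: Dict[str, List[int]], group: List[str]) -> List[int]:
    """Indices of characters scored 1 in every specimen of `group`."""
    n_chars = len(matrix[group[0]])
    return [i for i in range(n_chars) if all(matrix[s][i] == 1 for s in group)]


def to_phylip(matrix: Dict[str, List[int]]) -> str:
    """Render the polarized matrix as sequential PHYLIP discrete-character text."""
    n_chars = len(next(iter(matrix.values())))
    lines = [f"{len(matrix)} {n_chars}"]
    for name, states in matrix.items():
        # Names are truncated or padded to exactly 10 characters.
        lines.append(name[:10].ljust(10) + "".join(str(s) for s in states))
    return "\n".join(lines) + "\n"


# Hypothetical data: three normal specimens, two tumor specimens, three characters.
normals = [[1.0, 5.0, 0.0], [1.2, 5.5, 0.0], [0.9, 4.8, 0.0]]
specimens = {"tumorA": [3.0, 5.1, 0.7], "tumorB": [2.8, 2.0, 0.0]}

ranges = outgroup_ranges(normals)
polarized = {name: polarize(values, ranges) for name, values in specimens.items()}
candidates = shared_derived(polarized, ["tumorA", "tumorB"])

if __name__ == "__main__":
    print(polarized)
    print(candidates)
    print(to_phylip(polarized), end="")
```

In this toy data set, the first character is overexpressed in both tumor specimens relative to the normal range, so it is the only shared derived character and the only candidate biomarker. The third character in tumorA stands in for a novel peak that is absent from every normal specimen.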