A system for discovering data artifacts in an on-line data object is described. One embodiment includes a data acquisition subsystem configured to parse the on-line data object into at least one string; a string pre-parser configured to divide each string into a set of separate characters; a lexical analyzer configured, for each set of separate characters, to aggregate the separate characters in that set of separate characters into a sequence of tokens, each token in the sequence of tokens being one of a word, a punctuation symbol, a HyperText-Markup-Language tag, and a number; a syntax analyzer configured, for each sequence of tokens during a first analysis phase, to determine, for each of a plurality of rule sets, whether the sequence of tokens includes one or more candidate data artifacts of a distinct type to which that rule set corresponds, each of the plurality of rule sets being adapted to discovery of the distinct type of data artifact to which that rule set corresponds, at least one rule set in the plurality of rule sets including a context-free grammar; compute, for each candidate data artifact of a distinct type, a probability ranking indicating a degree of likelihood that the candidate data artifact is a data artifact of that distinct type; and classify each candidate data artifact as a data artifact of the distinct type for which a most favorable probability ranking was computed for that candidate data artifact, the syntax analyzer being configured to associate with each classified data artifact a subject found within the on-line data object; and a storage subsystem including at least one data structure in which to store the classified data artifacts, the storage subsystem being configured to index and organize the classified data artifacts by subject for retrieval in response to a search query indicating a particular subject.
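
The flow described above can be illustrated with a minimal Python sketch. It is not the claimed implementation, and every name in it (`tokenize`, `number_triples`, `date_rules`, `phone_rules`, `subject_of`, `discover`) is hypothetical. It rests on several simplifying assumptions: a regular expression stands in for the separate pre-parser and lexical analyzer stages; two hand-written scoring functions stand in for the rule sets, neither of which is a context-free grammar; the scores 0.9 and 0.1 are arbitrary placeholders for a probability ranking; and the subject is taken from the words inside a `<title>` element.

```python
import re
from collections import defaultdict

# Assumed token pattern: HTML tag, number, word, or single punctuation symbol.
TOKEN_RE = re.compile(r"<[^>]+>|\d+|[A-Za-z]+|[^\sA-Za-z\d]")

def tokenize(text):
    """Aggregate characters into (kind, text) tokens."""
    tokens = []
    for match in TOKEN_RE.finditer(text):
        s = match.group()
        if len(s) > 2 and s[0] == "<" and s[-1] == ">":
            kind = "tag"
        elif s.isdigit():
            kind = "number"
        elif s.isalpha():
            kind = "word"
        else:
            kind = "punct"
        tokens.append((kind, s))
    return tokens

def number_triples(tokens):
    """Yield (parts, text) for each NUMBER SEP NUMBER SEP NUMBER run (SEP is '-' or '/')."""
    for i in range(len(tokens) - 4):
        window = tokens[i:i + 5]
        kinds = [kind for kind, _ in window]
        if (kinds == ["number", "punct", "number", "punct", "number"]
                and window[1][1] in "-/" and window[3][1] == window[1][1]):
            parts = [window[0][1], window[2][1], window[4][1]]
            yield parts, "".join(text for _, text in window)

def date_rules(parts):
    """Placeholder score for a month-day-year date."""
    month, day, year = parts
    ok = 1 <= int(month) <= 12 and 1 <= int(day) <= 31 and len(year) == 4
    return 0.9 if ok else 0.1

def phone_rules(parts):
    """Placeholder score for a 3-3-4 digit telephone number."""
    return 0.9 if [len(p) for p in parts] == [3, 3, 4] else 0.1

# One rule set per distinct artifact type.
RULE_SETS = {"date": date_rules, "phone": phone_rules}

def subject_of(tokens):
    """Assumed subject: the words found between <title> and the next tag."""
    words, inside = [], False
    for kind, text in tokens:
        if kind == "tag":
            inside = text.lower() == "<title>"
        elif inside and kind == "word":
            words.append(text)
    return " ".join(words) or "unknown"

def discover(document, index):
    """Score each candidate under every rule set, keep the best type, index by subject."""
    tokens = tokenize(document)
    subject = subject_of(tokens)
    for parts, text in number_triples(tokens):
        scores = {name: rule(parts) for name, rule in RULE_SETS.items()}
        best = max(scores, key=scores.get)
        index[subject].append((best, text, scores[best]))
    return index

index = defaultdict(list)
discover("<title>Holiday party</title><p>Call 555-867-5309 before 12-25-2023.</p>", index)
print(index["Holiday party"])
# [('phone', '555-867-5309', 0.9), ('date', '12-25-2023', 0.9)]
```

In this sketch a query for a subject is a dictionary lookup on `index`. When two rule sets tie, `max` keeps whichever appears first in `RULE_SETS`; a fuller version would need an explicit tie-breaking or rejection policy.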