The present invention relates to methods and algorithms for identifying sequence motifs that are under- or over-represented in a given nucleotide sequence, either relative to the frequency at which those motifs would be expected to occur by chance or relative to the frequency at which they occur in other nucleotide sequences, and to methods of scoring sequences based on the occurrence of these sequence motifs. Such sequence motifs may be biologically significant; for example, they may constitute transcription factor binding sites, mRNA stability/instability signals, epigenetic signals, and the like. The methods of the invention can also be used, inter alia, to classify sequences or organisms in terms of their phylogenetic relationships, or to identify the likely host of a pathogenic organism. The methods of the present invention can also be used to optimize expression of proteins.
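By way of illustration only, the comparison of an observed motif frequency against the frequency expected by chance may be sketched as follows. This sketch is not the method of the invention. It assumes a simple background model in which each position is drawn independently from the base composition of the sequence itself, and the function name `motif_representation` and its parameters are hypothetical.

```python
from collections import Counter
from itertools import product
from math import prod


def motif_representation(seq, k):
    """Return {motif: (observed, expected, ratio)} for all k-mers over the
    bases present in seq.

    ASSUMPTION (illustrative only): the expected count treats positions as
    independent draws from the sequence's own base frequencies.
    """
    seq = seq.upper()
    n_windows = len(seq) - k + 1  # number of overlapping k-mer positions
    if k < 1 or n_windows < 1:
        raise ValueError("k must be between 1 and len(seq)")

    # Base composition of the sequence, used as the chance model.
    base_freq = {b: c / len(seq) for b, c in Counter(seq).items()}

    # Observed counts of every overlapping k-mer.
    observed = Counter(seq[i:i + k] for i in range(n_windows))

    result = {}
    # Enumerate all possible k-mers so that absent motifs are also reported.
    for letters in product(sorted(base_freq), repeat=k):
        motif = "".join(letters)
        exp = n_windows * prod(base_freq[b] for b in motif)
        obs = observed.get(motif, 0)
        result[motif] = (obs, exp, obs / exp)
    return result
```

Under this sketch, a ratio well above 1 suggests over-representation of a motif and a ratio well below 1 suggests under-representation. The ratio alone is not a test of statistical significance. For the comparison against other nucleotide sequences, the base frequencies could instead be taken from a reference sequence.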