In linguistics, a corpus (plural corpora) or text corpus is a large and structured set of texts (nowadays usually electronically stored and processed). In corpus linguistics, corpora are used for statistical analysis and hypothesis testing, such as checking occurrences or validating linguistic rules within a specific language territory.

Search processing with automatic categorization of queries

Search results are processed using search requests, including analyzing received queries in order to provide a more sophisticated understanding of the information being sought. A concept network is generated from a set of queries by parsing the queries into units and defining various relationships between the units. From these concept networks, queries can be automatically assigned to categories or, more generally, associated with one or more nodes of a taxonomy. The categorization can be used to alter the search results or the presentation of the results to the user. As an example of such an alteration, the presentation might include a list of “suggestions” for related search query terms. As other examples, the corpus searched might vary depending on the category, or the ordering or selection of the results presented to the user might vary depending on the category. Categorization might be done using a learned set of query-node pairs, where a pair maps a particular query to a particular node in the taxonomy. The learned set might be initialized from a manual indication of which queries go with which nodes and enhanced as more searches are performed. One method of enhancement involves tracking post-query click activity to identify how a category estimate of a query might have varied from the actual category for the query, as evidenced by the category of the post-query click activity, e.g., the particular hits of the search results that the user selected following the query. Another method involves determining relationships between units in the form of clusters and using clustering to modify the query-node pairs.
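
The idea of a learned set of query-node pairs that is seeded manually and then enhanced by click activity can be illustrated with a minimal Python sketch. It is not the method claimed in the abstract: the class name QueryCategorizer, the seed queries, the node names, and the choice of treating each lowercase word as a unit and voting by simple counts are all assumptions made for illustration.

```python
from collections import Counter, defaultdict

# Hypothetical manual seed set of query-node pairs (illustrative only).
seed_pairs = {
    "cheap flights paris": "travel",
    "python list sort": "programming",
}


def parse_units(query):
    """Split a query into units. Assumption: a unit is one lowercase word."""
    return query.lower().split()


class QueryCategorizer:
    """Toy categorizer: each unit votes for the taxonomy nodes it was seen with."""

    def __init__(self, pairs):
        # Maps unit -> Counter of taxonomy nodes observed with that unit.
        self.unit_votes = defaultdict(Counter)
        for query, node in pairs.items():
            self.learn(query, node)

    def learn(self, query, node):
        """Add one query-node pair to the learned set."""
        for unit in parse_units(query):
            self.unit_votes[unit][node] += 1

    def categorize(self, query):
        """Return the node with the most unit votes, or None if no unit is known."""
        totals = Counter()
        for unit in parse_units(query):
            totals.update(self.unit_votes.get(unit, {}))
        if not totals:
            return None
        return totals.most_common(1)[0][0]

    def record_click(self, query, clicked_node):
        """Treat the category of a clicked hit as the actual category.

        Returns True when the earlier estimate differed from the clicked
        category, then folds the new pair into the learned set.
        """
        estimate = self.categorize(query)
        self.learn(query, clicked_node)
        return estimate != clicked_node


categorizer = QueryCategorizer(seed_pairs)
first_guess = categorizer.categorize("flights to paris")
corrected = categorizer.record_click("python snake habitat", "animals")
```

In this sketch the query "python snake habitat" is first estimated from the seed data alone, and the click on a result in the hypothetical "animals" node both flags the mismatch and adds new evidence for later queries. A fuller system would replace the word-level units with the units and relationships of the concept network.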

Organizing structured and unstructured database columns using corpus analysis and context modeling to extract knowledge from linguistic phrases in the database

Corpus analysis methods have previously been applied to text, typically to annotated text. The invention shows how to apply corpus analysis methods to information captured in databases, where the database columns include a mixture of both structured domains and unstructured domains containing text. It uses case-based methods to automatically organize cases for periodic review. The invention can help to identify opportunities for increasing knowledge about databases. By organizing a database around common lexical, semantic, pragmatic and syntactic relationships, the invention can be used to increase the effectiveness of previous corpus analysis methods and to apply them to a variety of commercial applications. The invention applies contextual constraints to focus the application of linguistic methods. It can provide a component for medical records, enterprise databases, information retrieval, question answering systems, interactive robots, interactive appliances, linguistically competent speech recognition, speech understanding and many other useful devices and applications that require a high level of linguistic competence within operational contexts.
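
One way to picture corpus analysis over a mixture of structured and unstructured columns is to use a structured column as the context and count word pairs in the free-text column within each context. The Python sketch below does only that. It is an illustrative assumption, not the invention's method: the column names dept and note, the sample rows, and the use of bigram counts as the linguistic feature are all hypothetical.

```python
import re
from collections import Counter, defaultdict

# Hypothetical rows: "dept" is a structured column, "note" is unstructured text.
rows = [
    {"dept": "cardiology", "note": "Patient reports chest pain after exercise."},
    {"dept": "cardiology", "note": "Chest pain resolved; follow up in two weeks."},
    {"dept": "orthopedics", "note": "Knee pain after fall; X-ray ordered."},
]


def tokenize(text):
    """Lowercase the text and keep runs of ASCII letters as tokens."""
    return re.findall(r"[a-z]+", text.lower())


def bigrams_by_context(records, context_col, text_col):
    """Count adjacent word pairs in text_col, grouped by the value of context_col."""
    counts = defaultdict(Counter)
    for record in records:
        tokens = tokenize(record[text_col])
        # zip(tokens, tokens[1:]) yields each pair of neighbouring tokens.
        counts[record[context_col]].update(zip(tokens, tokens[1:]))
    return counts


counts = bigrams_by_context(rows, "dept", "note")
top_cardiology = counts["cardiology"].most_common(1)[0]
```

Here the structured value acts as a simple contextual constraint: the phrase "chest pain" stands out within the hypothetical cardiology rows and is absent from the orthopedics rows, which is the kind of context-specific regularity that could be surfaced for periodic review.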
