Method for automatically generating regular expressions for relaxed matching of text patterns

A text-pattern and automatic-generation technology, applied to the automatic generation of regular expressions for relaxed matching of text patterns. It addresses three problems: reduced precision in information retrieval, the limited matching ability of synonym dictionaries, and the restricted usability of known regular expression generation tools.

Inactive Publication Date: 2009-03-12
IBM CORP
2 Cites, 69 Cited by

AI Technical Summary

Benefits of technology

[0010] Advantageously, the present invention provides a technique for automatically generating regular expressions for a relaxed matching of text patterns. Further, the present invention provides a generic, extensible, and widely applicable rule-based framework in which the automatic generation of regular expressions is based on the creation and updating of rules without requiring the writing and maintenance of complex and customized software programs.

Problems solved by technology

Being based on a natural language dictionary (e.g., standard English dictionary), the synonym dictionary is limited in its ability to match certain text pattern variations related to punctuation, spacing, new lines between words, arbitrary capitalization, colloquial abbreviations, etc.
Further, known query processing techniques that employ stemming and stop word removal decrease precision in information retrieval results.
These known regular expression generation tools are hampered by restricted usability because their users are required to have knowledge of the formulation and usage of syntactic constructs in regular expressions.



Examples


Embodiment Construction

1 Overview

[0026] The goal of information extraction (IE) is to extract structured information from unstructured text (a.k.a. plain text), such as documents, files, emails, and web pages. In rule-based IE, rules are written that describe textual patterns of interest, which are to be extracted from unstructured text. Regular expressions are used for expressing such textual patterns of interest. As used herein, a regular expression is defined as a compact representation that describes a set of strings without listing all the elements of the set. A regular expression matches each of the strings in the set.
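
As a small illustration of this definition, the following Python sketch uses the standard `re` module. The pattern and the sample strings are illustrative assumptions and do not come from the patent text.

```python
import re

# One compact pattern that stands for a set of four strings:
# "color", "colour", "colors", "colours".
pattern = re.compile(r"colou?rs?")

candidates = ["color", "colour", "colors", "colours", "colr"]

# fullmatch succeeds only when the whole string belongs to the set.
matched = [s for s in candidates if pattern.fullmatch(s)]
print(matched)
```

The pattern never lists the four strings explicitly, yet it matches each of them and rejects `"colr"`.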

[0027] For example, consider the information extraction task of identifying text patterns that associate a person with his or her phone number. A text pattern of interest for this example is the phrase “can be reached at”. Using such a pattern, a rule-based IE system identifies occurrences of the form “<Person> can be reached at <PhoneNumber>” and generates the corresponding pairs of related Persons and Pho...
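
A minimal sketch of such a rule, assuming Python's `re` module, might look as follows. The sub-patterns for a person name and a phone number (`PERSON`, `PHONE`) and the helper `extract_pairs` are hypothetical simplifications introduced only for this illustration; the patent excerpt does not specify them.

```python
import re

# Hypothetical, deliberately simple sub-patterns (assumptions for illustration):
# a person is one or more capitalized words, a phone number is NNN-NNN-NNNN.
PERSON = r"(?P<person>[A-Z][a-z]+(?: [A-Z][a-z]+)*)"
PHONE = r"(?P<phone>\d{3}-\d{3}-\d{4})"

# The phrase of interest is embedded literally between the two sub-patterns.
RULE = re.compile(PERSON + r" can be reached at " + PHONE)

def extract_pairs(text):
    """Return (person, phone) pairs for every occurrence of the rule."""
    return [(m.group("person"), m.group("phone")) for m in RULE.finditer(text)]

text = ("Alice Smith can be reached at 555-123-4567. "
        "Bob can be reached at 555-987-6543.")
pairs = extract_pairs(text)

# A harmless variation (two spaces after "be") defeats the literal phrase.
missed = extract_pairs("Carol can be  reached at 555-000-1111")
```

The second call returns no pairs, which shows why a literal phrase alone is brittle and why a relaxed regular expression is useful.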


Abstract

A method for automatically generating regular expressions for relaxed matching of text patterns. A received input phrase expressed in a natural language is determined to be a plain text pattern. The plain text pattern is automatically tokenized, thereby generating a first token list. Rules loaded from a predefined rule set are automatically applied to the first token list in an order specified by the predefined rule set to automatically modify a token list by applying a replace word, split-at-character or whitespace operator. The modified token list is automatically converted into a regular expression that matches the plain text pattern and one or more variations of the plain text pattern. A utilization of the regular expression for an information extraction facilitates a recall and a precision of the information extraction.
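
The pipeline described in the abstract (tokenize, apply an ordered rule set, convert to a regular expression) can be sketched roughly as below. This is an illustrative sketch under stated assumptions, not the patented implementation: the token representation, the function names, the contents of `RULES`, and the exact regex fragments emitted for each operator are all hypothetical.

```python
import re

def tokenize(phrase):
    """Split a plain text pattern into a first token list of literal words."""
    return [("lit", word) for word in phrase.split()]

def replace_word(tokens, word, alternatives):
    """Replace-word operator: swap a word for a set of alternatives."""
    return [("alt", alternatives) if kind == "lit" and value.lower() == word
            else (kind, value)
            for kind, value in tokens]

def split_at_character(tokens, char):
    """Split-at-character operator: break a word at char, keeping a marker."""
    out = []
    for kind, value in tokens:
        parts = [p for p in value.split(char) if p] if kind == "lit" else []
        if len(parts) > 1:
            for i, part in enumerate(parts):
                if i:
                    out.append(("sep", char))
                out.append(("lit", part))
        else:
            out.append((kind, value))
    return out

def whitespace(tokens):
    """Whitespace operator: allow any run of whitespace between words."""
    out = []
    for i, tok in enumerate(tokens):
        if i and tok[0] != "sep" and tokens[i - 1][0] != "sep":
            out.append(("ws", None))
        out.append(tok)
    return out

# Hypothetical rule set; rules are applied in the listed order.
RULES = [
    (replace_word, {"word": "at", "alternatives": ["at", "@"]}),
    (split_at_character, {"char": "-"}),
    (whitespace, {}),
]

def to_regex(tokens):
    """Convert the modified token list into a compiled regular expression."""
    parts = []
    for kind, value in tokens:
        if kind == "lit":
            parts.append(re.escape(value))
        elif kind == "alt":
            parts.append("(?:" + "|".join(re.escape(a) for a in value) + ")")
        elif kind == "sep":
            # The split character becomes optional and may also be whitespace.
            parts.append(r"(?:\s|" + re.escape(value) + r")*")
        elif kind == "ws":
            parts.append(r"\s+")
    return re.compile("".join(parts), re.IGNORECASE)

def generate_regex(phrase, rules=RULES):
    tokens = tokenize(phrase)
    for rule, kwargs in rules:
        tokens = rule(tokens, **kwargs)
    return to_regex(tokens)

reached = generate_regex("can be reached at")
email = generate_regex("e-mail me at")
```

With these assumed rules, `reached` matches the original phrase as well as variations in capitalization, line breaks, and spacing (for example `"CAN be\n reached  @"`), and `email` matches `"e-mail me at"`, `"email me at"`, and `"E mail me @"`. Changing the behavior means editing `RULES` rather than the conversion code, which mirrors the rule-based framing in the abstract.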

Description

FIELD OF THE INVENTION

[0001] The present invention relates to a method and system for automatically generating regular expressions for relaxed matching of text patterns.

BACKGROUND OF THE INVENTION

[0002] One category of information extraction employs query expansion and other query processing techniques in search engines. Conventional query expansion techniques generate an expanded output query from an original query, where the expanded output query includes additional words obtained from a synonym dictionary. The results of the expanded output query are documents that contain either the keywords of the original query or the additional words from the synonym dictionary. Being based on a natural language dictionary (e.g., standard English dictionary), the synonym dictionary is limited in its ability to match certain text pattern variations related to punctuation, spacing, new lines between words, arbitrary capitalization, colloquial abbreviations, etc. Further, known query processing te...


Application Information

Patent Type & Authority: Applications (United States)
IPC: IPC(8): G06F17/30
CPC: G06F17/30672, G06F17/30654, G06F16/3338, G06F16/3329
Inventor: LOESER, ALEXANDER STEPHAN; RAGHAVAN, SRIRAM; VAITHYANATHAN, SHIVAKUMAR
Owner: IBM CORP