
Corpus expansion system and method thereof

A corpus expansion technology, applied in information extraction, knowledge mining and other natural language processing applications, that improves the quality of the pre-tagged corpus, reduces operating cost, and improves corpus coverage.

Inactive Publication Date: 2007-03-29
IBM CORP

AI Technical Summary

Benefits of technology

The invention provides a system and method for automatically expanding a corpus by expanding new sample seeds. This allows for the creation of a larger and more comprehensive corpus of data, improving coverage and quality. The system includes a corpus collection unit, a sample seed expansion unit, a balancing unit, and a refining unit. The method involves collecting corpus, generating new sample seeds, determining a corpus expansion strategy, and refining the new sample seeds. Overall, the invention provides a cost-effective and convenient way to expand the corpus.

Problems solved by technology

In the existing methods for automatically collecting and tagging corpus, the corpus coverage depends entirely on the limited initial sample seeds.



Examples


Embodiment Construction

[0021] The method and system according to the invention will be described using named entity recognition as an example. However, it will be apparent to persons skilled in the art that the method and system according to the invention can be applied to other similar fields such as nominal entity recognition, relationship recognition and information extraction.

[0022] The terms used in the invention will be explained first.

[0023] Specific field: the particular field in which the corpus is collected, such as the financial field, the sports field, or the entertainment field.

Named entity class under a specific field (hereinafter "class"): a class with practical meaning that is defined under the specific field when collecting corpus for that field. For example, the classes under the banking field include the bank name class, the representative name class, and the city name class.

[0024] Named entity: a word sequence representing an entity name that has the practical meaning in each of the classes under sp...
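The terms defined above can be pictured as a small nested data structure. The following is only an illustrative sketch, not the patent's implementation: the names `Field`, `EntityClass`, and `add_entity`, and the sample entity strings, are hypothetical and chosen for this example.

```python
from dataclasses import dataclass, field
from typing import Dict, Set


@dataclass
class EntityClass:
    """A named entity class defined under a specific field."""
    name: str
    # Named entities: word sequences representing entity names in this class.
    entities: Set[str] = field(default_factory=set)


@dataclass
class Field:
    """A specific field (e.g. banking) for which corpus is collected."""
    name: str
    classes: Dict[str, EntityClass] = field(default_factory=dict)

    def add_entity(self, class_name: str, entity: str) -> None:
        # Create the class on first use, then record the entity under it.
        cls = self.classes.setdefault(class_name, EntityClass(class_name))
        cls.entities.add(entity)


# Hypothetical sample data using the class names from the banking example.
banking = Field("banking")
banking.add_entity("bank name", "Example Bank")
banking.add_entity("city name", "Example City")
banking.add_entity("bank name", "Example Bank")  # duplicates are ignored
```

Keeping the entities of each class in a set means the same name can be recorded repeatedly without inflating the seed list.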



Abstract

A system and method are provided for expanding new sample seeds to automatically expand corpora, in which sample seeds are used to collect corpus. New sample seeds are generated based on the existing sample seeds and the collected corpora. The corpus expansion strategy is determined based on all the sample seeds that have been used and the new sample seeds. The new sample seeds are refined based on the corpus expansion strategy, and the refined new sample seeds are used to further collect corpus. These steps are executed repeatedly until a predefined condition is satisfied. According to the invention, a corpus may be expanded automatically from the web or other resources at low cost and in a convenient way, improving the coverage of the corpora.
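The iterative loop in the abstract can be sketched as follows. This is a toy illustration under stated assumptions, not the patented method: the source does not say how candidates are generated, what the expansion strategy contains, or what the stopping condition is. Here an in-memory sentence list stands in for the web, candidates are simply capitalized tokens, the strategy is a cap on new seeds per round, and the loop stops when no new seeds remain or after `max_rounds`. All function and parameter names are hypothetical.

```python
import re
from collections import Counter


def collect(sentences, seeds):
    """Stand-in for web collection: sentences mentioning any seed."""
    return [s for s in sentences if any(seed in s for seed in seeds)]


def generate_candidates(corpus, used):
    """Toy generator: count capitalized tokens that are not used seeds."""
    counts = Counter()
    for sentence in corpus:
        for token in re.findall(r"\b[A-Z][a-z]+\b", sentence):
            if token not in used:
                counts[token] += 1
    return counts


def determine_strategy(used, candidates):
    """Toy strategy (assumption): admit at most as many new seeds per
    round as there are seeds already used."""
    return {"max_new": min(len(candidates), max(1, len(used)))}


def refine(candidates, strategy):
    """Keep the most frequent candidates allowed by the strategy."""
    ranked = sorted(candidates.items(), key=lambda kv: (-kv[1], kv[0]))
    return {token for token, _ in ranked[: strategy["max_new"]]}


def expand_corpus(sentences, initial_seeds, max_rounds=5):
    used, corpus = set(), []
    seeds = set(initial_seeds)
    for _ in range(max_rounds):
        if not seeds:  # assumed stopping condition: nothing new to try
            break
        for s in collect(sentences, seeds):
            if s not in corpus:
                corpus.append(s)
        used |= seeds
        candidates = generate_candidates(corpus, used)
        strategy = determine_strategy(used, candidates)
        seeds = refine(candidates, strategy)
    return corpus, used


# Hypothetical sample data.
SOURCE = [
    "Alpha opened a branch in Beta.",
    "Beta hosts the office of Gamma.",
    "Gamma reported earnings.",
    "Delta is unrelated.",
]
corpus, used = expand_corpus(SOURCE, {"Alpha"})
```

Starting from the single seed `Alpha`, each round admits a seed found in the sentences collected so far, so coverage grows beyond what the initial seed alone reaches, while the sentence that shares no seed is never collected.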

Description

FIELD OF THE INVENTION

[0001] The present invention relates to the field of information extraction, knowledge mining and other natural language processing applications, especially to a corpus expansion system and method for expanding corpus based on which the machine learning method is executed.

BACKGROUND OF THE INVENTION

[0002] Typically, the corpora collected manually or automatically are analyzed with the machine learning method to generate the classifier models of a certain specific class to be used in information extraction, knowledge mining and other natural language processing applications.

[0003] In the task-oriented or domain-oriented natural language processing applications, such as domain-specific information extraction and named entity recognition, collecting corpora with extensive coverage and tagging the collected corpora are the important factors for improving the recognition accuracy.

[0004] There exist some methods for automatically collecting and tagging corpus. I...


Application Information

Patent Type & Authority: Applications (United States)
IPC(8): G06F17/27, G06F40/00
CPC: G06F17/30731, G06F17/2715, G06F16/36, G06F40/216
Inventor: GUO, HONG LEI; ZHANG, LI; QIU, ZHAO MING; SHEN, LI QIN; GUO, ZHI LI
Owner IBM CORP