Method for identifying Web named entity based on statistical model

A named entity recognition technology, applied in computing, special data processing applications, instruments, etc., that addresses insufficient recognition accuracy and achieves lower computational complexity, higher recognition accuracy, and improved efficiency.

Inactive Publication Date: 2012-01-11
XIDIAN UNIV

AI Technical Summary

Problems solved by technology

The main problem to be solved by the present invention is the recognition of existing Web Chinese named entities, e

Examples

Embodiment 1

[0049] The present invention is a method for identifying named entities based on statistical models. It mainly preprocesses Web documents and provides a foundation for subsequent information extraction, machine translation, and question answering systems.

[0050] Taking a recruitment website as an example, the present invention uses statistical models to identify named entities in recruitment information on the Web. The named entities in recruitment information fall mainly into four types: location, time, organization, and position. The recognition process is shown in Figure 1. The experimental data in Table 1 of this example come from Zhaolian recruitment web pages, covering six categories of recruitment pages: computer, biomedicine, construction, environmental protection, machinery and chemical engineering, and secretarial. Entity extraction of position name, recruitment agency name, work location, and recruit...

Embodiment 2

[0077] The Web named entity recognition method based on the statistical model is the same as in Embodiment 1. The named entity feature extraction in step 2 of the present invention is explained further:

[0078] (1) The structural feature vector of a Web named entity is analyzed as follows:

[0079] Named entities in web pages are usually displayed with emphasis, and this can be exploited during recognition. For example, when a job name is displayed in a large red font, its display style clearly differs from that of the other text. Web pages use such emphasis to highlight important information and to meet users' browsing needs.

[0080] First, the display style of the Web named entity on the web page is expressed as a feature vector.

[0081] Structural characteristics refer to the display style of Web objects; the Cascading Style Sheets (CSS) attributes of the Web page are introduced to describe the structural cha...
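As an illustration of the idea in [0079]-[0081], the following minimal sketch turns an inline CSS style into a binary structural feature vector. The chosen CSS properties, the page baseline, and the function names `parse_inline_style` and `structural_vector` are assumptions made for this example; the source does not state which CSS attributes the invention uses or how they are encoded.

```python
# Hypothetical set of CSS properties; the source only says that CSS
# attributes describe the display style, not which ones are used.
CSS_FEATURES = ("font-size", "font-weight", "color", "text-decoration")


def parse_inline_style(style: str) -> dict[str, str]:
    """Parse an inline CSS string such as 'color: red; font-size: 18px'."""
    props = {}
    for decl in style.split(";"):
        if ":" not in decl:
            continue  # skip empty or malformed declarations
        name, value = decl.split(":", 1)
        props[name.strip().lower()] = value.strip().lower()
    return props


def structural_vector(style: str, baseline: dict[str, str]) -> list[int]:
    """One 0/1 entry per property: 1 if the value differs from the page baseline."""
    props = parse_inline_style(style)
    return [
        int(props.get(name, baseline[name]) != baseline[name])
        for name in CSS_FEATURES
    ]


# Assumed default style of ordinary body text on the page.
baseline = {
    "font-size": "12px",
    "font-weight": "normal",
    "color": "black",
    "text-decoration": "none",
}

# A job title shown in a large red font differs in font-size and color.
print(structural_vector("color: red; font-size: 18px", baseline))  # [1, 0, 1, 0]
```

A binary "differs from baseline" encoding is only one possible choice; numeric font sizes or one-hot color classes would fit the same description.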

Embodiment 3

[0113] The Web named entity recognition method based on the statistical model is the same as in Embodiments 1-2. This test compares the effect of the MR-GHMM method when multiple features are selected with its effect when a single feature is selected:

[0114] The evaluation criteria for the recognition effect of the present invention are:

[0115] When comparing the recognition effects for different entities, the present invention uses the recall rate and the precision rate as the evaluation criteria, and combines the two into a single score, namely F, the weighted harmonic mean of the recall rate and the precision rate.

[0116] (1) The precision rate is equal to the number of correct answers produced by the system divided by the number of all answers produced by the system.

[0117] (2) The recall rate is equal to the number of correct answers produced by the system divided by the number of all possible answers in the text (including both those the system found and those it missed).
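The definitions in [0116] and [0117] can be computed as in the sketch below. It uses the F-measure as commonly defined, the weighted harmonic mean F = (1 + β²)PR / (β²P + R). The weight `beta` and the function name `precision_recall_f` are assumptions for this example; the source does not state which weight the invention uses.

```python
def precision_recall_f(num_correct: int, num_system: int, num_gold: int,
                       beta: float = 1.0) -> tuple[float, float, float]:
    """Return (precision, recall, F) from raw counts.

    num_correct: correct answers produced by the system
    num_system:  all answers produced by the system
    num_gold:    all possible answers in the text
    beta:        assumed weight; beta = 1 treats precision and recall equally
    """
    precision = num_correct / num_system if num_system else 0.0
    recall = num_correct / num_gold if num_gold else 0.0
    denom = beta ** 2 * precision + recall
    f = (1 + beta ** 2) * precision * recall / denom if denom else 0.0
    return precision, recall, f


# Made-up counts: 100 entities output, 80 of them correct, 120 in the text.
p, r, f = precision_recall_f(num_correct=80, num_system=100, num_gold=120)
print(f"precision={p:.3f} recall={r:.3f} F={f:.3f}")
```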

[0118] ...

Abstract

The invention discloses a method for identifying Web named entities based on a statistical model. The method comprises the following steps: representing multiple characteristics of a Web named entity by its structural and text characteristics; combining a statistical method with a rule-based method and adopting an improved MR-GHMM (MR-Generalized Hidden Markov Model) to increase training efficiency; tagging the text with the improved GHMM so that each named entity is labeled and entity identification is realized; and handling complex Web named entities in two layers, in which the tagging result of the first layer is the input of the second layer, so that complex nested entities can be identified. Compared with the original identification algorithm, the algorithm used in the method has higher identification accuracy and a much lower time complexity of model training. Because the method represents multiple characteristics of the Web named entity, the entity characteristics can be modified for different fields so that named entities in different fields on the Web can be identified.
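To make the tagging step concrete, the sketch below shows Viterbi decoding for a plain first-order hidden Markov model. This is the standard HMM technique, not the improved MR-GHMM of the invention, whose details are not given in this excerpt. The states, tokens, and all probabilities are made-up values for illustration only.

```python
import math


def viterbi(observations, states, start_p, trans_p, emit_p, floor=1e-12):
    """Return the most likely state sequence under a first-order HMM."""
    if not observations:
        return []

    def log(p):
        # Floor tiny or zero probabilities so that log() is always defined.
        return math.log(max(p, floor))

    # best[t][s]: log-probability of the best path ending in state s at time t
    best = [{s: log(start_p[s]) + log(emit_p[s].get(observations[0], 0.0))
             for s in states}]
    back = [{}]
    for t in range(1, len(observations)):
        best.append({})
        back.append({})
        for s in states:
            prev, score = max(
                ((r, best[t - 1][r] + log(trans_p[r][s])) for r in states),
                key=lambda pair: pair[1],
            )
            best[t][s] = score + log(emit_p[s].get(observations[t], 0.0))
            back[t][s] = prev

    # Follow the back-pointers from the best final state.
    path = [max(states, key=lambda s: best[-1][s])]
    for t in range(len(observations) - 1, 0, -1):
        path.append(back[t][path[-1]])
    path.reverse()
    return path


# Toy model with three labels; every number here is an assumption.
states = ("POSITION", "LOCATION", "OTHER")
start_p = {"POSITION": 0.5, "LOCATION": 0.2, "OTHER": 0.3}
trans_p = {
    "POSITION": {"POSITION": 0.3, "LOCATION": 0.1, "OTHER": 0.6},
    "LOCATION": {"POSITION": 0.1, "LOCATION": 0.3, "OTHER": 0.6},
    "OTHER": {"POSITION": 0.3, "LOCATION": 0.4, "OTHER": 0.3},
}
emit_p = {
    "POSITION": {"engineer": 0.8, "in": 0.1, "Beijing": 0.1},
    "LOCATION": {"engineer": 0.1, "in": 0.1, "Beijing": 0.8},
    "OTHER": {"engineer": 0.1, "in": 0.8, "Beijing": 0.1},
}

print(viterbi(["engineer", "in", "Beijing"], states, start_p, trans_p, emit_p))
# ['POSITION', 'OTHER', 'LOCATION']
```

In a two-layer arrangement like the one the abstract describes, the label sequence returned by a first model of this kind would serve as the observation sequence of a second model that recognizes nested entities.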

Description

Technical field

[0001] The invention belongs to the technical field of natural language processing, and mainly relates to the field of Web information extraction, in particular to the recognition of Web named entities. Specifically, it is a method of identifying Web named entities based on statistical models, which is mainly used to identify Web named entities and realize the acquisition and preprocessing of web page information.

Background art

[0002] Web named entity recognition technology is mainly used to obtain the most basic data of information on Web pages. By obtaining data, the content of web pages can be identified, and various subsequent applications, such as information extraction, automatic question answering, and translation, all need the support of named entity recognition technology, which is also a basic task in natural language processing. With the rapid development of network technology and its wide application in various fields, research on it is very...

Application Information

IPC(8): G06F17/27
Inventor: 王静, 刘志镜, 曲建铭, 王燕, 贺文华, 王炜华, 王纵虎, 陈东辉, 姚勇, 朱旭东, 赵辉
Owner XIDIAN UNIV