Character data recognition and processing method and device

A character data recognition and processing technology, applied in the field of computer data retrieval, that addresses the problem of large recognition errors in character data, overcoming inaccurate recognition and improving recognition accuracy.

Inactive Publication Date: 2011-06-22
PEKING UNIV +3

AI Technical Summary

Problems solved by technology

[0015] The present invention aims to provide a method and device for identifying and processing character data, which can



Examples


Embodiment 1

[0033] Referring to Figure 1, the method for recognizing and processing character data according to this embodiment of the present invention mainly includes the following steps:

[0034] S11: Recognize the feature character data according to the reference corpus and the reference template, and obtain the different entity names corresponding to each named entity;

[0035] S12: Obtain the feature affix frequency of each entity name;

[0036] S13: Recognize the character data to be processed according to the feature affix frequency, the reference template, and the predefined corpus, and obtain the different entity names corresponding to each named entity;

[0037] S14: Perform subsequent analysis and processing using the entity names recognized from the character data to be processed as data parameters.
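
Steps S11 to S14 can be sketched roughly as follows. This is only an illustrative outline, not the patented implementation: the text above does not define the template format or how an affix is chosen, so the sketch assumes a regular expression as the reference template, the last word of a name as its feature affix, and a hypothetical threshold `min_count`. All names (`REFERENCE_CORPUS`, `recognize_pending`, and so on) are invented for the example.

```python
import re
from collections import Counter

# Hypothetical reference resources (assumptions, not taken from the patent text).
REFERENCE_CORPUS = {"ORG": {"Peking University", "Tsinghua University", "World Bank"}}
REFERENCE_TEMPLATE = re.compile(r"(?:[A-Z][a-z]+ )+[A-Z][a-z]+")  # runs of capitalized words


def recognize_feature_data(text, corpus, template):
    """S11: keep template matches that the reference corpus confirms."""
    found = {etype: set() for etype in corpus}
    for match in template.finditer(text):
        for etype, names in corpus.items():
            if match.group() in names:
                found[etype].add(match.group())
    return found


def affix_frequencies(entities):
    """S12: count the last word (the assumed affix) of each entity name, per type."""
    return {etype: Counter(name.split()[-1] for name in names)
            for etype, names in entities.items()}


def recognize_pending(text, freqs, template, predefined_corpus, min_count=2):
    """S13: accept a template match if it is listed in the predefined corpus
    or if its affix was seen at least `min_count` times in S12."""
    found = {etype: set() for etype in freqs}
    for match in template.finditer(text):
        candidate = match.group()
        for etype, counter in freqs.items():
            known = candidate in predefined_corpus.get(etype, set())
            if known or counter[candidate.split()[-1]] >= min_count:
                found[etype].add(candidate)
    return found


def analyze(entities):
    """S14: placeholder for subsequent analysis; here it counts entities per type."""
    return {etype: len(names) for etype, names in entities.items()}


feature_text = "Peking University and Tsinghua University signed with World Bank."
pending_text = "students from Fudan University met staff of Golden Dragon yesterday"

seeds = recognize_feature_data(feature_text, REFERENCE_CORPUS, REFERENCE_TEMPLATE)
freqs = affix_frequencies(seeds)
recognized = recognize_pending(pending_text, freqs, REFERENCE_TEMPLATE, {})
summary = analyze(recognized)
```

In this toy run, "Fudan University" is absent from every corpus but is still accepted because the affix "University" occurred twice among the seed names, while "Golden Dragon" is rejected. That is the role the affix frequency plays as an extra recognition feature for names the corpus does not list.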

Embodiment 2

[0040] The second embodiment of the method of the present invention is set forth below. The present invention can be applied to various kinds of character data, such as language symbol data, mathematical symbol data, and logical symbol data, in Chinese or in other languages, which are recognized and processed in units of words or characters. This embodiment takes Chinese news comments as an example. For instance, given an input news web page, once the news title, news text, and related set of comments have been correctly extracted, recognition results can be returned for the news text and for each comment, and corresponding data processing can be performed on the recognition results for person names, place names, and institution names in the database.

[0041] Embodiment 2 is illustrated using webpage text data as an example: named entity recognition is performed on news data in webpage text data. The most important part is named entity recognition in news comments, which rec...

Embodiment 3

[0110] Figure 4 shows a structural diagram of the device of the present invention. As shown in Figure 4, the device for recognizing and processing character data according to an embodiment of the present invention includes:

[0111] 1) Recognition unit 40, configured to recognize the feature character data according to the reference corpus and the reference template, obtaining the different entity names corresponding to each named entity, and to recognize the character data to be processed, obtaining the different entity names corresponding to each named entity;

[0112] 2) Statistical unit 41, configured to obtain the feature affix frequency of each entity name recognized by the recognition unit 40 from the feature character data;

[0113] 3) Processing unit 42, configured to perform subsequent analysis and processing using the entity names recognized from the character data to be processed as data parameters.
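
One way to picture how units 40, 41, and 42 might be wired together is the structural sketch below. It is an assumption-laden illustration rather than the device described in the patent: candidates are assumed to be strings already matched by the reference template, the affix is assumed to be the last word of a name, and the class names, method names, and `min_count` threshold are all hypothetical.

```python
from collections import Counter


class RecognitionUnit:
    """Unit 40: recognizes entity names in feature data and in data to be processed."""

    def __init__(self, reference_corpus):
        self.reference_corpus = set(reference_corpus)  # hypothetical set of known names

    def recognize_feature_data(self, candidates):
        # Assumption: `candidates` were already matched by the reference template.
        return [c for c in candidates if c in self.reference_corpus]

    def recognize_pending(self, candidates, affix_freq, min_count=2):
        # Assumption: accept a known name, or one whose last word is a frequent affix.
        return [c for c in candidates
                if c in self.reference_corpus or affix_freq[c.split()[-1]] >= min_count]


class StatisticsUnit:
    """Unit 41: counts the feature affix of each recognized entity name."""

    def count_affixes(self, names):
        return Counter(name.split()[-1] for name in names)


class ProcessingUnit:
    """Unit 42: placeholder for subsequent analysis; here it deduplicates and sorts."""

    def process(self, names):
        return sorted(set(names))


class Device:
    """Wires the three units together in the order the description lists them."""

    def __init__(self, recognition, statistics, processing):
        self.recognition = recognition
        self.statistics = statistics
        self.processing = processing

    def run(self, feature_candidates, pending_candidates):
        seeds = self.recognition.recognize_feature_data(feature_candidates)
        affix_freq = self.statistics.count_affixes(seeds)
        recognized = self.recognition.recognize_pending(pending_candidates, affix_freq)
        return self.processing.process(recognized)


device = Device(
    RecognitionUnit({"Peking University", "Tsinghua University"}),
    StatisticsUnit(),
    ProcessingUnit(),
)
result = device.run(
    ["Peking University", "Tsinghua University", "Blue Sky"],
    ["Fudan University", "Golden Dragon"],
)
```

Keeping the statistics in a separate unit means the affix counts could be recomputed from new feature data without touching the recognition or processing logic.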

[0114] Preferably, the identifica...


Abstract

The invention provides a character data recognition and processing method and a character data recognition and processing device. The method comprises the following steps: recognizing feature character data according to a reference corpus and a reference template, and obtaining the different entity names corresponding to each named entity; obtaining the feature affix frequency of each entity name; recognizing the character data to be processed according to the feature affix frequency, the reference template, and a predefined corpus, to obtain the different entity names corresponding to each named entity; and performing subsequent analysis and processing using the entity names recognized from the character data to be processed as data parameters. Because feature affixes form a recognition feature column, the method and device address the problem of relatively large recognition errors on predefined character data in subsequent retrieval and translation, improve the recognition accuracy of named entities, and prevent named entities that are expressed freely or non-standardly from going unrecognized or being misrecognized.

Description

Technical field

[0001] The present invention relates to the technical field of computer data retrieval, and in particular to a method and device for recognizing and processing character data.

Background

[0002] The Internet has developed rapidly since its birth in the early 1990s, and information is published on it mainly in the form of web pages. According to the latest estimates, the number of web pages on the Internet has exceeded 550 billion, and the Internet, as the world's largest information repository, covers all fields of the real world. Faced with such massive information sources, people urgently need automated tools to help them quickly find the truly important information, which is how information extraction research came into being. The main purpose of information extraction is to transform unstructured text into structured or semi-structured information, and store it in a specific form for user query or further analysis...


Application Information

IPC(8): G06F17/27, G06F17/30
Inventor: 赵立红, 万小军, 吴於茜, 杨建武, 肖建国
Owner: PEKING UNIV