
Geographic text corpus labeling method based on feature evaluation and keyword similarity

A corpus annotation and similarity technology, applied in special data processing applications, unstructured text data retrieval, semantic tool creation, and related areas. It addresses the inefficient construction of geographic corpus systems, low system data accuracy, and the disruption of work by corpus data errors, thereby reducing manpower and material costs, improving accuracy and recall, and improving corpus quality.

Pending Publication Date: 2021-11-05
UNIV OF ELECTRONICS SCI & TECH OF CHINA +1

AI Technical Summary

Problems solved by technology

[0003] The invention addresses three problems: the construction of current geographic corpus systems is inefficient, the accuracy of system data is low, and errors in corpus data readily disrupt normal work.



Examples


Embodiment Construction

[0023] The present invention is further described below with reference to the accompanying drawings and embodiments; it includes, but is not limited to, the following embodiments.

[0024] The present invention comprises the following steps:

[0025] S1: Use a web crawler to collect the structured text of geography-related Baidu Encyclopedia pages as the knowledge base, and the unstructured text of geography-related Baidu Encyclopedia pages as the original corpus;

[0026] S2: Further process the crawled webpage data: (1) delete unreadable special characters and convert the text to a uniform encoding; (2) convert uppercase English letters to lowercase and full-width characters to half-width; (3) delete hyperlinks and HTML code, remove irrelevant content such as advertisements, and delete useless symbols such as corner marks (superscripts and subscripts). Finally, the...
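The cleaning operations listed in S2 can be illustrated with a minimal Python sketch. It is not the patent's implementation: the function name `clean_text`, the regular expressions, and the assumption that corner marks look like bracketed numbers such as `[1]` are all hypothetical choices made for illustration, and advertisement removal is left out because it depends on the page layout.

```python
import re

# Full-width ASCII variants (U+FF01..U+FF5E) sit exactly 0xFEE0 above their
# half-width counterparts; the ideographic space (U+3000) maps to a plain space.
FULL_TO_HALF = {code: code - 0xFEE0 for code in range(0xFF01, 0xFF5F)}
FULL_TO_HALF[0x3000] = 0x20


def clean_text(raw: str) -> str:
    """Hypothetical cleaner covering the S2 operations (1) to (3)."""
    text = re.sub(r"<[^>]+>", "", raw)                # (3) drop HTML tags
    text = re.sub(r"https?://[\x21-\x7e]+", "", text)  # (3) drop bare ASCII URLs
    text = text.translate(FULL_TO_HALF)               # (2) full-width -> half-width
    text = re.sub(r"\[\d+(?:-\d+)?\]", "", text)      # (3) assumed corner-mark form
    # (1) drop control and format characters, keeping line breaks
    text = "".join(ch for ch in text if ch.isprintable() or ch == "\n")
    text = text.lower()                               # (2) uppercase -> lowercase
    return re.sub(r" +", " ", text).strip()           # collapse repeated spaces


print(clean_text("ＡＢＣ　１２３"))  # prints: abc 123
```

URLs are removed before the width conversion so that full-width punctuation following a link is not mistaken for part of it. Python strings are already Unicode, so the "uniform encoding" step would happen when the crawled bytes are decoded, which this sketch does not cover.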


Abstract

The invention provides a geographic text corpus labeling method based on feature evaluation and keyword similarity, which yields a high-quality labeled corpus for the geographic domain. The method comprises the following steps: crawling web text with a crawler to obtain a knowledge base and a corpus; preprocessing the corpus to obtain a cleaned corpus; aligning the knowledge base with the corpus according to the entity pairs in the text; computing the feature words of each sentence; computing the weights of the words in the geographic entity pairs; selecting the word with the maximum weight as the relation word; generating word vectors with a Word2Vec model; computing the similarity between the relation words in the sentences and the relation words in the knowledge base; and finding the relation word with the maximum similarity and labeling the corpus accordingly, finally yielding sentences labeled with entities and a relation type.
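Two of the steps above, aligning the knowledge base with the corpus and choosing the most similar relation word, can be sketched in Python as follows. This is only an illustration under stated assumptions: the function names `align` and `best_relation` are hypothetical, alignment is reduced to plain substring matching, cosine similarity is assumed as the similarity measure, and the tiny hand-written `VECTORS` table stands in for vectors that a trained Word2Vec model would supply. The feature-word weighting step is not shown because the abstract does not give its formula.

```python
import math


def align(sentences, triples):
    """Pair each sentence with every (head, relation, tail) triple
    whose two entities both occur in the sentence."""
    aligned = []
    for sent in sentences:
        for head, relation, tail in triples:
            if head in sent and tail in sent:
                aligned.append((sent, head, tail, relation))
    return aligned


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def best_relation(word, kb_relations, vectors):
    """Return the knowledge-base relation word most similar to `word`."""
    return max(kb_relations, key=lambda r: cosine(vectors[word], vectors[r]))


# Toy stand-in for Word2Vec output; the values are invented for illustration.
VECTORS = {
    "flows": [0.9, 0.1],
    "flows_into": [0.8, 0.2],
    "located_in": [0.1, 0.9],
}

triples = [("Min River", "flows_into", "Yangtze"),
           ("Chengdu", "located_in", "Sichuan")]
pairs = align(["The Min River flows into the Yangtze."], triples)
label = best_relation("flows", ["flows_into", "located_in"], VECTORS)
print(pairs[0][1], pairs[0][2], label)
```

In this toy run the sentence is matched to the Min River and Yangtze entity pair, and the candidate relation word "flows" is labeled with the knowledge-base relation whose vector is closest to it.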

Description

Technical field

[0001] The invention belongs to the field of natural language processing and relates to a geographic corpus labeling method based on feature evaluation and keyword similarity analysis.

Background

[0002] At present, most corpora come from news reports on the Internet and from specialist knowledge websites. Examples include the public Chinese-English relation extraction knowledge bases ACE2005 and SemEval-2010 Task 8 and the Chinese relation extraction knowledge base Chinese-Literature-NER-RE-Dataset. The data in these knowledge bases span most areas of everyday life, so they are open-domain knowledge bases. Professional fields, however, need corpus annotation methods designed around the characteristics of each field, so these open-domain annotation methods and corpora cannot be applied directly, resulting in the fact that there are currently no very good corpus annotation me...


Application Information

Patent Type & Authority: Applications (China)
IPC(8): G06F16/387, G06F16/36, G06F40/295, G06K9/62
CPC: G06F16/387, G06F16/374, G06F40/295, G06F18/22
Inventor: 罗欣, 冯倩, 耿昊天, 赫熙煦, 许文波, 冷庚
Owner: UNIV OF ELECTRONICS SCI & TECH OF CHINA