A term extraction method and system for academic papers

A term extraction technology for academic papers. It addresses the problems that existing methods ignore how terms are distributed across text blocks, achieve low accuracy, and therefore extract lower-quality terms. It aims to improve the accuracy and recall of term extraction and the accuracy of candidate term screening.

Active Publication Date: 2019-04-12
WUHAN SHUWEI TECH

AI Technical Summary

Problems solved by technology

[0002] Existing linguistic term extraction methods based on Chinese word-formation rules extract and filter candidate terms over the entire free text. They derive part-of-speech matching templates by analyzing the collocation and occurrence patterns of different words, and then use those templates to extract candidate terms. This approach ignores how term characteristics differ between types of text blocks and cannot make full use of term position information, which reduces extraction quality. Some corpora, such as academic papers, contain distinct text blocks (title, abstract, keywords, and so on), and terms are distributed differently across them. Applying a single extraction method to the whole paper therefore yields low term extraction accuracy.
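
The template-based extraction described in [0002] can be illustrated with a minimal sketch. It assumes the text is already segmented and tagged with single-letter part-of-speech tags; the tag letters, the template `[an]*n`, and the function name `extract_candidates` are hypothetical and are not taken from the patent.

```python
import re

# Hypothetical part-of-speech template: any run of adjectives ("a") or
# nouns ("n") that ends in a noun. Not the patent's actual template.
TEMPLATE = re.compile(r"[an]*n")


def extract_candidates(tagged_tokens, min_len=2):
    """Return candidate terms whose tag sequence matches TEMPLATE.

    tagged_tokens: list of (word, tag) pairs, each tag a single letter.
    min_len: minimum number of words in a candidate.
    """
    # One tag letter per token, so string offsets equal token indices.
    tags = "".join(tag for _, tag in tagged_tokens)
    candidates = []
    for match in TEMPLATE.finditer(tags):
        start, end = match.span()
        if end - start >= min_len:
            words = [word for word, _ in tagged_tokens[start:end]]
            candidates.append(" ".join(words))
    return candidates


# Toy tagged title ("p" = preposition).
title = [
    ("symmetric", "a"), ("conditional", "a"), ("probability", "n"),
    ("for", "p"), ("term", "n"), ("extraction", "n"),
]
print(extract_candidates(title))
```

Because the template is applied to one tagged sequence at a time, the same function could be run separately on each text block, which is the distinction the paragraph above says whole-text methods miss.
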
[0003] After the candidate term set is extracted, it must be screened to obtain the correct terms. There are many screening methods, and most rely on the unithood of a term and its domain relevance. For example, TF-IDF (term frequency-inverse document frequency) uses the frequency of a candidate term in the current document and its frequency across the entire corpus to judge the term's domain relevance; SCP (symmetrical conditional probability) judges whether the components of a compound term form a reasonable collocation; and C-value judges the domain relevance of compound terms. These methods screen candidate terms well in general. However, some corpora, such as academic papers, have clearly defined category attributes. The methods above do not take this attribute into account: they use no category information when screening terms from academic papers and do not fully consider the domain relevance of candidate terms, which results in low term extraction accuracy.
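
Two of the screening measures named in [0003] can be sketched as follows. This is an illustration of the textbook definitions, not the patent's implementation: the function names, the smoothed IDF variant, and the toy corpus are assumptions.

```python
import math
from collections import Counter


def tf_idf(term, doc_tokens, corpus):
    """TF-IDF of `term` in one document; `corpus` is a list of token lists."""
    tf = doc_tokens.count(term) / len(doc_tokens)
    docs_with_term = sum(1 for doc in corpus if term in doc)
    # Adding 1 to the denominator is one common smoothing choice (assumed here).
    idf = math.log(len(corpus) / (1 + docs_with_term))
    return tf * idf


def scp(first, second, corpus):
    """Symmetrical conditional probability of the bigram (first, second).

    SCP = p(x, y)^2 / (p(x) * p(y)). With one shared normalizer for all
    three probabilities, it reduces to count(x, y)^2 / (count(x) * count(y)).
    """
    unigrams = Counter(tok for doc in corpus for tok in doc)
    bigrams = Counter(pair for doc in corpus for pair in zip(doc, doc[1:]))
    if unigrams[first] == 0 or unigrams[second] == 0:
        return 0.0
    return bigrams[(first, second)] ** 2 / (unigrams[first] * unigrams[second])


# Toy corpus of already-segmented documents.
corpus = [
    ["term", "extraction", "method"],
    ["term", "extraction", "system"],
    ["image", "recognition", "method"],
    ["speech", "recognition", "system"],
]
print(tf_idf("image", corpus[2], corpus), tf_idf("method", corpus[2], corpus))
print(scp("term", "extraction", corpus), scp("extraction", "method", corpus))
```

Neither function looks at the category of the paper a term comes from, which is the gap the paragraph above identifies.
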

Examples

Embodiment Construction

[0062] In order to make the object, technical solution and advantages of the present invention clearer, the present invention will be further described in detail below in conjunction with the accompanying drawings and embodiments. It should be understood that the specific embodiments described here are only used to explain the present invention, not to limit the present invention. In addition, the technical features involved in the various embodiments of the present invention described below can be combined with each other as long as they do not constitute a conflict with each other.

[0063] The flow of the term extraction method for academic papers provided by this embodiment of the present invention is shown in Figure 1. It includes a preprocessing step, a candidate term extraction step, a candidate term screening step, and a comprehensive scoring and ranking step for candidate terms. The details are as follows:

[0064] (1) Preprocessing step: its flow process is as ...

Abstract

The invention discloses a term extraction method and system for academic papers. The method comprises the following steps: preprocessing an academic paper corpus, including text block annotation, text block screening, word segmentation and part-of-speech tagging, and noise word removal; extracting candidate terms from the title, abstract, and keyword text blocks to form a candidate term set; screening and filtering the single-word terms and the compound-word terms in the candidate term set separately to obtain a new candidate term set; and determining position weights from the position information of the candidate terms using the analytic hierarchy process, computing an overall score, ranking the candidate terms by score, and taking the top N candidate terms, or the candidate terms whose scores exceed a threshold, as the extracted terms. The method and system take full account of the term distribution characteristics and the category information of academic papers, and improve the accuracy and recall of term extraction for academic papers.
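
The final scoring step in the abstract can be sketched roughly as below. The abstract does not give the comparison matrix or the scoring formula, so everything here is an assumption for illustration: the pairwise judgments in `PAIRWISE`, the row geometric-mean approximation of the analytic hierarchy process weights, the weighted-sum score, and the names `ahp_weights` and `rank_terms`.

```python
import math

# Hypothetical text-block positions and pairwise importance judgments
# (row position compared with column position). Values are illustrative only.
POSITIONS = ["title", "keywords", "abstract"]
PAIRWISE = [
    [1.0, 2.0, 4.0],
    [0.5, 1.0, 2.0],
    [0.25, 0.5, 1.0],
]


def ahp_weights(pairwise):
    """Approximate AHP priority weights with the row geometric-mean method."""
    n = len(pairwise)
    geo_means = [math.prod(row) ** (1.0 / n) for row in pairwise]
    total = sum(geo_means)
    return [g / total for g in geo_means]


def rank_terms(occurrences, weights, top_n=None, threshold=None):
    """Score each term as a position-weighted sum of its occurrence counts.

    occurrences: {term: {position: count}}
    weights: {position: weight}
    Returns (term, score) pairs sorted by descending score, keeping only
    scores above `threshold` and at most `top_n` entries when those are given.
    """
    scores = {
        term: sum(weights[pos] * count for pos, count in counts.items())
        for term, counts in occurrences.items()
    }
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    if threshold is not None:
        ranked = [(term, score) for term, score in ranked if score > threshold]
    if top_n is not None:
        ranked = ranked[:top_n]
    return ranked


weights = dict(zip(POSITIONS, ahp_weights(PAIRWISE)))
occurrences = {
    "term extraction": {"title": 1, "abstract": 2},
    "recognition": {"abstract": 1},
}
print(rank_terms(occurrences, weights, top_n=1))
```

The geometric-mean method is a common stand-in for the principal-eigenvector computation and agrees with it when the judgment matrix is perfectly consistent, as the toy matrix here is.
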

Description

Technical field

[0001] The invention belongs to the technical field of computer natural language processing or pattern recognition, and more specifically relates to a term extraction method for academic papers.

Background

[0002] Existing linguistic term extraction methods based on Chinese word-formation rules extract and filter candidate terms over the entire free text. They derive part-of-speech matching templates by analyzing the collocation and occurrence patterns of different words, and then use those templates to extract candidate terms. This approach ignores how term characteristics differ between types of text blocks and cannot make full use of term position information, which reduces extraction quality. Some corpora, such as academic papers, contain distinct text blocks, such as titles, abstracts, keywords, etc., and the distribution of terms in different text...

Application Information

Patent Type & Authority: Patents (China)
IPC(8): G06F17/27
CPC: G06F40/216, G06F40/30
Inventor: 郑胜, 蒋丹, 徐涛, 张胜, 周可, 夏明
Owner: WUHAN SHUWEI TECH