Method for automatic term identification of Chinese patent literature

A method for automatic term identification in Chinese patent documents, applied in special data processing applications, instruments, electrical digital data processing, etc. It addresses problems such as short recognition length, heavy dependence on the linguistic knowledge of the rule writer, and unsatisfactory recognition of short terms, and it achieves high reliability.

Active Publication Date: 2016-01-06
BEIJING INFORMATION SCI & TECH UNIV

AI Technical Summary

Problems solved by technology

[0005] The first is a term recognition method that combines traditional rules and statistics. To generate a candidate term set, the Chinese text is first segmented and part-of-speech tagged, the set of part-of-speech rules that make up terms is summarized by observing the tagged corpus, and these rules are then matched against the corpus to produce candidate terms. Manually written part-of-speech rules give high recognition accuracy, but they rely too heavily on the writer's linguistic knowledge, and different people write inconsistent rules for the same corpus. Other methods do not need part-of-speech rules when obtaining candidate terms, but they depend too heavily on external resources when roughly segmenting sentences, and the quality of those resources often determines the quality of the resulting candidate term set. As for ranking the candidate term set, the commonly used ranking algorithms perform poorly on shorter terms and on lower-frequency terms;
[0006] The second approach uses machine learning algorithms, which have become a research hotspot in information extraction in recent years. Their disadvantages are that they place high demands on the size and quality of the training corpus, require manual labeling of large amounts of data, and take a long time to train.
[0007] In addition, the current mainstream candidate term ranking algorithms are not ideal for recognizing short terms.
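
The rule-matching stage described in [0005] can be illustrated with a minimal Python sketch. It assumes the text has already been segmented and tagged; the function name `match_rules`, the tag letters, and the sample sentence are hypothetical and are not taken from the patent.

```python
from typing import List, Tuple

Token = Tuple[str, str]  # (word, part-of-speech tag)


def match_rules(tagged: List[Token], rules: List[Tuple[str, ...]]) -> List[str]:
    """Return every word span whose tag sequence equals one of the rules."""
    rule_set = set(rules)
    lengths = sorted({len(rule) for rule in rule_set})
    found = []
    for start in range(len(tagged)):
        for n in lengths:
            window = tagged[start:start + n]
            if len(window) < n:
                break  # longer rules cannot fit at this position either
            tags = tuple(tag for _, tag in window)
            if tags in rule_set:
                found.append("".join(word for word, _ in window))
    return found


# Hypothetical tagged sentence (tags are illustrative: n = noun, v = verb, ...).
sentence = [("自动", "d"), ("识别", "v"), ("专利", "n"),
            ("文献", "n"), ("的", "u"), ("术语", "n")]
rules = [("n", "n"), ("n",)]
candidates = match_rules(sentence, rules)
print(candidates)
```

Every span that matches a rule becomes a candidate, so overlapping candidates are expected; separating real terms from non-terms is left to the later ranking stage.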



Embodiment Construction

[0030] In order to make the object, technical solution and advantages of the present invention clearer, the present invention will be further described below in conjunction with the accompanying drawings and specific embodiments. It should be understood that the specific embodiments described here are only used to explain the present invention, not to limit the present invention.

[0031] As shown in Figure 1, a method for automatic recognition of terms in Chinese patent literature comprises the following steps:

[0032] Step 1): Automatically generate part-of-speech rules from patent titles. A Chinese lexical analysis system segments each patent title into substrings and stop words; with the stop words as separators, the part-of-speech rule of each substring is extracted and used as a part-of-speech rule for generating candidate terms;
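
One possible reading of Step 1 is sketched below in Python. It assumes the titles have already been segmented and tagged by a lexical analyzer and that a stop word list is available. The function name `extract_pos_rules`, the sample titles, the tag letters, and the stop words are hypothetical, and counting how often each rule occurs is an assumption added for illustration.

```python
from collections import Counter
from typing import Iterable, List, Set, Tuple

Token = Tuple[str, str]  # (word, part-of-speech tag)


def extract_pos_rules(tagged_titles: Iterable[List[Token]],
                      stop_words: Set[str]) -> Counter:
    """Split each title at stop words; count the tag sequence of each substring."""
    rules: Counter = Counter()
    for title in tagged_titles:
        current: List[str] = []
        for word, tag in title:
            if word in stop_words:
                if current:  # a stop word closes the current substring
                    rules[tuple(current)] += 1
                current = []
            else:
                current.append(tag)
        if current:  # substring that runs to the end of the title
            rules[tuple(current)] += 1
    return rules


# Hypothetical tagged titles and stop words, for illustration only.
titles = [
    [("一种", "m"), ("专利", "n"), ("文献", "n"), ("的", "u"),
     ("术语", "n"), ("识别", "v"), ("方法", "n")],
    [("一种", "m"), ("图像", "n"), ("传感器", "n")],
]
stop_words = {"一种", "的", "方法"}
rules = extract_pos_rules(titles, stop_words)
print(rules.most_common())
```

Each key of `rules` is a tuple of tags such as `("n", "n")`, which can then serve as a pattern for generating candidate terms from the body text.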

[0033] Patent documents are generally records of inventions, utility models, and designs, and their titles are a high-...


Abstract

The invention relates to a method for automatic term identification in Chinese patent literature, comprising the following steps: in step (1), part-of-speech rules are automatically generated from patent titles; in step (2), a stop word list is generated manually; in step (3), the generated part-of-speech rules are classified according to the number of parts of speech they contain; and in step (4), candidate terms are ranked with the TermRank ranking algorithm. To overcome the shortcomings of manually summarized part-of-speech rules, the method first uses a statistical approach to automatically learn, from patent titles, the part-of-speech rules that make up terms. It then ranks the candidate terms with TermRank, which takes both the linguistic and the statistical characteristics of patent literature into account. The method can effectively distinguish terms from non-terms, has high reliability, and meets the demands of practical applications well.
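
Step (3) can be pictured as a simple grouping operation. The Python sketch below assumes each rule is represented as a tuple of part-of-speech tags; the function name `group_rules_by_length` and the sample rules are hypothetical, and the abstract does not say how the resulting classes are used afterwards.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

Rule = Tuple[str, ...]  # one part-of-speech rule, e.g. ("n", "n")


def group_rules_by_length(rules: Iterable[Rule]) -> Dict[int, List[Rule]]:
    """Group rules by how many part-of-speech tags each one contains."""
    groups: Dict[int, List[Rule]] = defaultdict(list)
    for rule in rules:
        groups[len(rule)].append(rule)
    return dict(groups)


# Hypothetical rules, for illustration only.
sample_rules = [("n",), ("n", "n"), ("v", "n"), ("n", "v", "n")]
groups = group_rules_by_length(sample_rules)
print(groups)
```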

Description

Technical field

[0001] The invention belongs to the technical field of automatic recognition of Chinese terms, and in particular relates to a method for automatic recognition of terms in Chinese patent documents.

Background

[0002] Chinese patent documents contain a large number of domain terms, and automatic recognition of these terms is an important task in the fields of information extraction and text mining. Automatic Term Recognition (ATR) is an important part of the research field of information extraction. It refers to the process of automatically identifying lexical strings that can represent general concepts in a professional field from free texts without human intervention or with as little human intervention as possible. The term base constructed by automatic term recognition technology is a very important basic data resource, providing indispensable data support for Chinese word segmentation, ontology construction, dictionary compilation and update, a...


Application Information

Patent Type & Authority: Applications (China)
IPC(8): G06F17/27, G06F17/30
Inventors: 吕学强, 董志安
Owner: BEIJING INFORMATION SCI & TECH UNIV