Chinese text keyword extraction method based on document theme structures and semantics

A keyword extraction technique based on document topic structure and semantics. It addresses two problems of manual keyword annotation: the effort annotators must spend reading and understanding each text, and the inconsistent results caused by annotators' differing ability to summarize and generalize. It aims to improve the quality of the extracted keywords.

Active Publication Date: 2018-06-22
厦门纵横集团科技股份有限公司
4 Cites, 20 Cited by

AI Technical Summary

Problems solved by technology

Because annotators differ in their knowledge, their understanding of keywords, and their ability to summarize and generalize, manual annotation is highly subjective, and different annotators extract different keywords.
Moreover, marking keywords by hand requires considerable effort to read and understand the text, which clearly cannot keep pace with the continual doubling of information resources.

Examples

Embodiment Construction

[0026] The following embodiments will further illustrate the present invention in conjunction with the accompanying drawings.

[0027] The present invention comprises the following steps:

[0028] 1) Text preprocessing steps:

[0029] The text documents used come mainly from sources such as web pages, PDF files, and Word files. Preprocessing is divided into two cases: preprocessing of web pages and preprocessing of the other text types.

[0030] Preprocessing of web pages: The aim of preprocessing news web pages is to extract the title, the content, and the manually marked keywords from each page. Extraction rules and conditional filters are written to pull these fields out of the page structure, and the results are saved as text. Most websites use different templates for their pages. A survey of websites showed that every news article on Sina News.com provides manually marked keywords, which can better reflect news conte...
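The rule-based extraction described in [0030] can be illustrated with a minimal Python sketch. It is not the patent's implementation: the page layout (a `<title>` element, a `<meta name="keywords">` tag with comma-separated values, and `<p>` body paragraphs) is an assumption made for illustration, and the names `NewsPageExtractor` and `extract_page` are hypothetical. A real site would need rules written for its own template.

```python
from html.parser import HTMLParser


class NewsPageExtractor(HTMLParser):
    """Collects title text, <p> text, and meta keywords from one page.

    Assumed (hypothetical) template: <title>, <meta name="keywords">, <p>.
    """

    def __init__(self):
        super().__init__()
        self.title = ""
        self.paragraphs = []
        self.keywords = []
        self._current = None  # tag whose text is currently being collected

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag in ("title", "p"):
            self._current = tag
            if tag == "p":
                self.paragraphs.append("")
        elif tag == "meta" and attrs.get("name") == "keywords":
            content = attrs.get("content") or ""
            # Rule: keywords are comma-separated; empty entries are dropped.
            self.keywords = [k.strip() for k in content.split(",") if k.strip()]

    def handle_endtag(self, tag):
        if tag == self._current:
            self._current = None

    def handle_data(self, data):
        if self._current == "title":
            self.title += data
        elif self._current == "p":
            self.paragraphs[-1] += data


def extract_page(html):
    """Return the structured fields of one page as a dict of plain text."""
    parser = NewsPageExtractor()
    parser.feed(html)
    parser.close()
    # Conditional filter: keep only paragraphs that contain text.
    content = "\n".join(p.strip() for p in parser.paragraphs if p.strip())
    return {
        "title": parser.title.strip(),
        "content": content,
        "keywords": parser.keywords,
    }
```

The returned dictionary can then be written to a text file, one page per file, which matches the "saved in the form of text" step only in spirit; the storage format used by the inventors is not stated in this excerpt.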

Abstract

The invention discloses a Chinese text keyword extraction method based on document topic structure and semantics, and relates to keyword extraction. The method comprises the following steps: text preprocessing; Chinese word segmentation and part-of-speech tagging; stop-word filtering and part-of-speech filtering; and keyword extraction. The basic concepts of text keyword extraction, the differences between Chinese and English word segmentation, and common Chinese text keyword extraction methods are introduced. A method based on document topic structure and a method based on semantics are studied, and their principles and existing implementations are analyzed. To overcome the difficulty of recognizing new words during Chinese word segmentation, segmentation quality is continuously improved with the aid of a dynamically updated segmentation dictionary. The method based on document topic structure is improved so that global keywords are extracted. The semantic similarity of Chinese words is taken into account to improve the algorithm further. The improved algorithm is verified on a self-built data set; the verification and comparison experiments give good results and show that the improved algorithm improves keyword extraction.
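The steps listed in the abstract can be sketched in Python under several loudly stated assumptions. The segmenter below uses forward maximum matching, a standard dictionary-based technique chosen here for illustration; this excerpt does not say which segmenter the inventors use. Part-of-speech tags are simply stored in the dictionary, and the final ranking is plain term frequency, which is a stand-in for the topic-structure and semantic ranking that the patent claims but does not detail here. All names (`SegmentationDictionary`, `segment`, `candidate_words`, `top_keywords`) and the tags `n`, `v`, `u`, `x` are hypothetical.

```python
from collections import Counter


class SegmentationDictionary:
    """Maps word -> part-of-speech tag; new words can be added at any time."""

    def __init__(self, entries):
        self.pos = dict(entries)
        self.max_len = max((len(w) for w in self.pos), default=1)

    def add_word(self, word, tag):
        # Dynamic update: a newly identified word is usable immediately.
        self.pos[word] = tag
        self.max_len = max(self.max_len, len(word))


def segment(text, dictionary):
    """Forward maximum matching; returns a list of (word, tag) pairs."""
    tokens, i = [], 0
    while i < len(text):
        longest = min(dictionary.max_len, len(text) - i)
        for size in range(longest, 0, -1):
            piece = text[i:i + size]
            if piece in dictionary.pos:
                tokens.append((piece, dictionary.pos[piece]))
                break
        else:
            # No dictionary entry starts here: emit one character, tag "x".
            piece = text[i]
            tokens.append((piece, "x"))
        i += len(piece)
    return tokens


def candidate_words(tokens, stop_words, keep_tags=("n", "v")):
    """Stop-word filtering followed by part-of-speech filtering."""
    return [w for w, tag in tokens if w not in stop_words and tag in keep_tags]


def top_keywords(candidates, k):
    """Placeholder ranking by frequency, not the patent's ranking method."""
    return [w for w, _ in Counter(candidates).most_common(k)]
```

A short usage example: with a dictionary containing 关键词 (noun), 提取 (verb), 方法 (noun), 研究 (verb), and 的 (particle), the text 关键词提取的方法研究关键词提取 is segmented into seven tokens; after 的 is removed as a stop word, `top_keywords(..., 2)` returns the two most frequent candidates. Calling `add_word("主题结构", "n")` before segmenting makes a previously unknown word come out as a single token instead of four unknown characters, which is the effect the dynamically updated dictionary is meant to achieve.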

Description

Technical field

[0001] The invention relates to keyword extraction, in particular to a Chinese text keyword extraction method based on document topic structure and semantics.

Background

[0002] In the 21st century, with the continuous advancement of science and technology and the rapid development of the Internet, information resources of all kinds have multiplied rapidly. People want to find the information that is genuinely useful to them quickly and accurately within huge information sources. Keywords concisely summarize the content of a document and reflect its theme, which makes them a powerful aid for finding resources.

[0003] In a document, the keywords are a distillation of its content and are generally given as several words or phrases. From a document's keywords, a reader can grasp the main content it describes and quickly determine whether it is a required resource. ...

Application Information

Patent Type & Authority: Applications (China)
IPC(8): G06F17/27
CPC: G06F40/284, G06F40/30
Inventor: 王晓黎, 林坤辉, 邱明, 王美红, 潘洋彬, 杜文源, 高楚楚
Owner: 厦门纵横集团科技股份有限公司