Natural language-based topic and keyword extraction method and system

A natural language extraction technology, applied in the field of topic and keyword extraction, that addresses the heavy computational cost of existing methods and the difficulty of guaranteeing and evaluating result quality, with the aim of efficient, high-quality extraction.

Inactive Publication Date: 2017-02-01
JIUYUAN QIANCHANG BEIJING TECH SERVICE CO LTD
6 Cites, 11 Cited by

AI Technical Summary

Problems solved by technology

[0004] Estimating the probability model consumes a considerable amount of computation.
At the same time, the results depend on manually specified prior probabilities, so their quality is difficult to guarantee and evaluate.
Therefore, this scheme faces considerable difficulties in both efficiency and quality when it is used in practice to extract domain topic models.

Examples

Embodiment 1

[0045] As shown in Figure 1, a method for extracting topics and keywords based on natural language includes:

[0046] Segment the continuous text into individual words and tag the part of speech of each word;

[0047] Extract the subject and predicate from each segmented sentence;

[0048] Cluster all subject-predicate pairs to compute the main topic clusters and associated keyword clusters across the whole corpus.

[0049] With the above scheme, the present invention obtains a topic-keyword set by clustering subject-predicate pairs and uses it to describe the dimensions of public opinion in a specific field, which provides a good basis for further quantitative analysis of public opinion.
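
The clustering step can be illustrated with a minimal sketch. The visible text does not specify the clustering algorithm, so the version below makes a loud assumption: pairs are grouped by exact subject match and ranked by frequency, which stands in for whatever clustering the invention actually uses. The function name `cluster_pairs`, the parameter `top_n`, and the sample data are hypothetical.

```python
from collections import Counter, defaultdict

def cluster_pairs(pairs, top_n=2):
    """Group (subject, predicate) pairs by subject and rank by frequency.

    Assumption: exact-match grouping stands in for the clustering step;
    the source text does not say which algorithm is used.
    """
    subject_counts = Counter(subject for subject, _ in pairs)
    predicates_by_subject = defaultdict(Counter)
    for subject, predicate in pairs:
        predicates_by_subject[subject][predicate] += 1

    clusters = {}
    # The most frequent subjects act as the "main topic clusters".
    for subject, _ in subject_counts.most_common(top_n):
        # Predicates seen with that subject act as its "keyword cluster".
        ranked = predicates_by_subject[subject].most_common()
        clusters[subject] = [predicate for predicate, _ in ranked]
    return clusters

# Hypothetical sample pairs, as they might come out of the extraction step.
pairs = [
    ("battery", "drains"), ("battery", "lasts"), ("screen", "cracks"),
    ("battery", "drains"), ("price", "rises"), ("screen", "flickers"),
]
topics = cluster_pairs(pairs, top_n=2)
print(topics)
```

In this sketch the low-frequency subject "price" is dropped because only the two most frequent subjects are kept; a real implementation would more likely merge semantically similar subjects rather than rely on exact string matches.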

Embodiment 2

[0051] This embodiment describes Embodiment 1 in detail. Preferably, segmenting the continuous text into individual words and tagging the part of speech includes:

[0052] Obtain the input Chinese and English text, and perform word segmentation and part-of-speech tagging on it; in the output, words are separated by spaces, and the part of speech of each word is marked with an agreed symbol.
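
The output format described above can be sketched as follows. The text does not say which symbol is agreed on or which tag set is used, so the `/` separator and the tags `n`, `v`, and `eng` are assumptions for illustration only, as are the function names `format_tagged` and `parse_tagged`.

```python
def format_tagged(tokens, sep="/"):
    """Serialize (word, tag) pairs as space-separated 'word<sep>tag' items.

    Assumption: '/' is the agreed symbol; the source does not name it.
    """
    return " ".join(f"{word}{sep}{tag}" for word, tag in tokens)

def parse_tagged(line, sep="/"):
    """Parse a serialized line back into (word, tag) pairs."""
    pairs = []
    for item in line.split():
        # rpartition splits on the last separator, so a word that itself
        # contains the separator still keeps its tag intact.
        word, _, tag = item.rpartition(sep)
        pairs.append((word, tag))
    return pairs

# Hypothetical segmenter output for a mixed Chinese-English sentence.
tokens = [("用户", "n"), ("喜欢", "v"), ("iPhone", "eng")]
line = format_tagged(tokens)
print(line)
print(parse_tagged(line))
```

A plain-text format like this keeps the preprocessing output readable and easy to pass to the next step, at the cost of needing a separator that cannot be confused with ordinary word content.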

[0053] Preferably, extracting the subject and predicate from each segmented sentence includes:

[0054] Extract the subject and predicate from the input sentence sequence, and output, for each sentence, the keyword of the subject phrase (the subject), the keyword of the predicate phrase (the predicate), and the subject-predicate pair they form.
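
A rough sketch of this extraction step is given below. It is a deliberately simple heuristic, not the method of the invention: it takes the first verb as the predicate and the nearest preceding noun or pronoun as the subject. The tag names (`n`, `r`, `v`) and the function name `extract_pair` are assumptions; a practical system would more likely rely on syntactic parsing.

```python
NOUN_TAGS = {"n", "r"}   # assumed tags: n = noun, r = pronoun
VERB_TAGS = {"v"}        # assumed tag: v = verb

def extract_pair(tagged_sentence):
    """Return a (subject, predicate) pair from a list of (word, tag) items.

    Heuristic: the predicate is the first verb; the subject is the last
    noun or pronoun seen before that verb. Either may be None.
    """
    subject = None
    for word, tag in tagged_sentence:
        if tag in VERB_TAGS:
            return (subject, word)
        if tag in NOUN_TAGS:
            subject = word
    return (subject, None)

# Hypothetical tagged sentence.
sentence = [("the", "u"), ("battery", "n"), ("drains", "v"), ("quickly", "d")]
pair = extract_pair(sentence)
print(pair)
```

Returning `None` for a missing subject, rather than skipping the sentence, leaves room for a later step to supply the subject from context.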

[0055] Preferably, where a sentence lacks a pronoun or a subject, an appropriate subject is supplied according to the context.
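
One simple way to supply a subject from context is to carry forward the most recent explicit subject, sketched below. This strategy is an assumption, since the text does not describe how the context is used. Treating pronouns as placeholders to be replaced is a further assumption, and the `PRONOUNS` set and the function name `fill_subjects` are hypothetical.

```python
# Assumed, illustrative pronoun list; not taken from the source.
PRONOUNS = {"it", "they", "this", "它", "他们"}

def fill_subjects(pairs):
    """Replace missing or pronoun subjects with the last explicit subject.

    `pairs` is a list of (subject, predicate) tuples in document order.
    If no explicit subject has been seen yet, the subject stays None.
    """
    filled = []
    last_subject = None
    for subject, predicate in pairs:
        if subject is None or subject in PRONOUNS:
            subject = last_subject
        else:
            last_subject = subject
        filled.append((subject, predicate))
    return filled

# Hypothetical pairs from four consecutive sentences.
pairs = [
    ("battery", "drains"), (None, "overheats"),
    ("it", "swells"), ("screen", "cracks"),
]
resolved = fill_subjects(pairs)
print(resolved)
```

Carrying the last subject forward is cheap and order-dependent; it will pick the wrong referent when the topic changes without an explicit subject, so it should be read as a baseline only.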

[0056] Preferably, all subject-predicate pairs are clustered...

Embodiment 3

[0063] As shown in Figure 2, corresponding to the above method embodiments, the present invention discloses a system for extracting topics and keywords based on natural language, including a natural language preprocessing subsystem, a subject-predicate extraction subsystem, and a clustering subsystem, wherein:

[0064] The natural language preprocessing subsystem is used to segment the continuous text into individual words and tag the part of speech;

[0065] The subject-predicate extraction subsystem is used to extract the subject and predicate from each segmented sentence;

[0066] The clustering subsystem is used to cluster all subject-predicate pairs and compute the main topic clusters and related keyword clusters across the whole corpus.
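
The three-subsystem structure can be sketched as a small pipeline object that wires the subsystems together. Everything named here (`TopicKeywordSystem`, `run`, and the three toy functions) is hypothetical, and the toy subsystems are placeholders that only show the data flow; they do not perform real segmentation, tagging, or clustering.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Tuple

Tagged = List[Tuple[str, str]]               # one sentence as (word, tag) items
Pair = Tuple[Optional[str], Optional[str]]   # (subject, predicate)

@dataclass
class TopicKeywordSystem:
    preprocess: Callable[[str], List[Tagged]]             # preprocessing subsystem
    extract: Callable[[Tagged], Pair]                     # extraction subsystem
    cluster: Callable[[List[Pair]], Dict[str, List[str]]] # clustering subsystem

    def run(self, text: str) -> Dict[str, List[str]]:
        sentences = self.preprocess(text)
        pairs = [self.extract(sentence) for sentence in sentences]
        return self.cluster(pairs)

def toy_preprocess(text):
    # Placeholder: split on periods; tag the first word "n", the rest "v".
    sentences = []
    for raw in text.split("."):
        words = raw.split()
        if words:
            sentences.append([(w, "n" if i == 0 else "v") for i, w in enumerate(words)])
    return sentences

def toy_extract(tagged):
    # Placeholder: first noun is the subject, first verb is the predicate.
    subject = next((w for w, t in tagged if t == "n"), None)
    predicate = next((w for w, t in tagged if t == "v"), None)
    return (subject, predicate)

def toy_cluster(pairs):
    # Placeholder: group predicates under identical subjects.
    clusters = {}
    for subject, predicate in pairs:
        if subject is not None and predicate is not None:
            clusters.setdefault(subject, []).append(predicate)
    return clusters

system = TopicKeywordSystem(toy_preprocess, toy_extract, toy_cluster)
result = system.run("battery drains. screen cracks. battery overheats.")
print(result)
```

Passing the subsystems in as callables keeps each one replaceable, which mirrors the way the text describes them as separate units with a fixed hand-off between them.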

[0067] Preferably, the natural language preprocessing subsystem segments the continuous text into individual words and tags the part of speech, specifically by:

[0068] Obtain the input Chinese and English text, and perform w...

Abstract

The invention discloses a natural language-based topic and keyword extraction method and system. The method comprises the steps of segmenting a continuous text into independent words, and performing part-of-speech annotation; extracting a subject and a predicate from each sentence subjected to word segmentation; and clustering all subject-predicate binary groups, and calculating primary topic clusters and related keyword clusters in all corpora. By adopting the technical scheme, a topic-keyword set is obtained based on the clustering of the subject and predicate binary groups, and the public opinion dimensionality of a specific field is described, so that a good foundation is laid for further quantitative analysis of public sentiments.

Description

Technical field

[0001] The invention belongs to the field of the Internet, and in particular relates to a method and system for extracting topics and keywords based on natural language.

Background

[0002] The latent information contained in the massive text data on the Internet has always been a hot spot in the application of natural language processing and data mining, and summarizing and counting the topics and keywords contained in a large number of natural language texts plays an indispensable role in applications such as public opinion analysis and user word-of-mouth analysis. However, extracting topics and keywords from text efficiently and accurately has always been difficult in practice.

[0003] Existing schemes generally use Dirichlet distribution to describe the distribution of topics in documents and the distribution of words under different topics. Through repeated statistical sampling of the input corpus, the v...

Application Information

IPC(8): G06F17/27, G06F17/30
CPC: G06F16/35, G06F40/258, G06F40/284, G06F40/289
Inventor 尹嘉路陈鸿丁文涛
Owner JIUYUAN QIANCHANG BEIJING TECH SERVICE CO LTD