TF-IDF keyword extraction-based improvement method

A TF-IDF keyword extraction technique, classified under special data processing applications, instruments, and electrical digital data processing. It addresses the lack of keyword tags on most articles and the time-consuming, laborious nature of manual keyword tagging, and aims to produce more accurate keyword extraction results.

Inactive
Publication Date: 2018-06-15
TONGJI UNIV
4 Cites, 11 Cited by

AI Technical Summary

Problems solved by technology

With the development of the Internet, online resources are becoming more and more abundant. However, most articles lack keyword tags, and tagging keywords manually is time-consuming, laborious, and highly subjective. Automatic keyword extraction from text is therefore of great practical significance.

Examples

Embodiment

[0026] The improved TF-IDF-based keyword extraction method of the present invention comprises the following steps, as shown in Figure 1:

[0027] S1, counting the number of occurrences of every word in each document of the document collection;

[0028] S2, calculating the weight of each word using the improved TF-IDF formula;

[0029] S3, sorting the words by weight in descending order and using the sorted result as the basis for keyword retrieval in the database.
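The three steps above can be sketched in Python as follows. This is a minimal illustration only: the improved formula of S2 is not reproduced in this excerpt, so the sketch substitutes the standard TF-IDF weight (term frequency times the logarithm of the inverse document frequency) as a placeholder. The function names and the sample documents are hypothetical.

```python
import math
from collections import Counter

def count_words(docs):
    """S1: count the occurrences of every word in each tokenized document."""
    return [Counter(doc) for doc in docs]

def tfidf_weights(counts):
    """S2 (placeholder): standard TF-IDF, NOT the patent's improved formula."""
    n_docs = len(counts)
    # Document frequency: in how many documents does each word appear?
    doc_freq = Counter()
    for c in counts:
        doc_freq.update(c.keys())
    weights = []
    for c in counts:
        total = sum(c.values())
        weights.append({
            word: (n / total) * math.log(n_docs / doc_freq[word])
            for word, n in c.items()
        })
    return weights

def rank_keywords(weights):
    """S3: sort the words of each document by weight, largest first."""
    return [
        sorted(w.items(), key=lambda kv: kv[1], reverse=True)
        for w in weights
    ]

# Hypothetical, already-tokenized sample documents.
docs = [
    ["keyword", "extraction", "keyword", "text"],
    ["text", "classification", "text"],
    ["text", "summary"],
]
ranked = rank_keywords(tfidf_weights(count_words(docs)))
print(ranked[0])
```

In this sample, "text" occurs in every document, so its placeholder weight is zero and it falls to the bottom of each ranking, while "keyword" ranks first in the first document.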

[0030] This method first considers that, in a short-text corpus, the parts of a text differ in importance. Generally speaking, verbs, nouns, and adjectives form the main part of a sentence and are also the most important for keyword extraction; numerals, pronouns, and the like are only modifiers that make the sentence more complete but have almost no effect on classifying it. Therefore, this method assigns different important coefficien...
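As a rough illustration of the part-of-speech idea in [0030], the sketch below scales a word's base weight by a coefficient chosen from its part of speech. The coefficient values, the default for unlisted tags, the multiplicative combination, and all names are assumptions made for this example; the excerpt is truncated before the patent states its actual coefficients or formula.

```python
# Hypothetical coefficients: content words keep their full weight,
# modifiers such as numerals and pronouns are down-weighted.
POS_COEFFICIENT = {
    "noun": 1.0,
    "verb": 1.0,
    "adjective": 1.0,
    "numeral": 0.2,
    "pronoun": 0.2,
}
DEFAULT_COEFFICIENT = 0.5  # assumed fallback for untagged or unlisted words

def apply_pos_coefficients(weights, pos_tags):
    """Multiply each word's base weight by its part-of-speech coefficient."""
    return {
        word: weight * POS_COEFFICIENT.get(pos_tags.get(word), DEFAULT_COEFFICIENT)
        for word, weight in weights.items()
    }

# Hypothetical base weights (for example, from a TF-IDF computation)
# and part-of-speech tags for one short text.
base = {"extract": 0.4, "three": 0.5, "it": 0.45}
tags = {"extract": "verb", "three": "numeral", "it": "pronoun"}

adjusted = apply_pos_coefficients(base, tags)
ranked = sorted(adjusted.items(), key=lambda kv: kv[1], reverse=True)
print(ranked)
```

With these assumed values, the verb "extract" moves ahead of the numeral and the pronoun even though both started with a higher base weight.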

Abstract

The invention relates to an improved keyword extraction method based on TF-IDF (term frequency-inverse document frequency). The method comprises the following steps: counting, for each text in a document set, how often every word occurs; calculating the weight of each word using an improved TF-IDF formula; and sorting the words by weight from high to low, taking the sorted result as the basis for text keyword retrieval. Compared with the prior art, the method distinguishes words by part of speech and ranks and optimizes keywords by favoring the words that actually represent the features of the text.

Description

Technical field

[0001] The invention relates to a natural language processing method, in particular to an improved method for extracting keywords based on TF-IDF.

Background technique

[0002] TF-IDF (term frequency–inverse document frequency) is a commonly used weighting technique for information retrieval and data mining. TF means Term Frequency, and IDF means Inverse Document Frequency.

[0003] Keyword extraction is a natural language processing technique that identifies meaningful and representative segments or words. Keyword extraction is one of the foundations and cores of natural language processing technology. Most unstructured text processing technologies, such as text summarization, text classification, text clustering, and automatic question answering, rely on it to improve accuracy. With the development of the Internet, online resources are becoming more and more abundant. However, most articles lack keyword tags. Manually tagging keywords is time-consuming an...

Application Information

IPC(8): G06F17/27
CPC: G06F40/20, G06F40/216
Inventor: 向阳, 郑惺, 张默涵, 赵雨晴
Owner: TONGJI UNIV