Keyword calculation method based on document clustering

A keyword calculation method based on document clustering, applied in the fields of text database clustering/classification, computation, and unstructured text data retrieval. It addresses the lack of an existing technical solution for this task.

Status: Inactive
Publication Date: 2015-12-16
HAINAN UNIVERSITY
5 Cites, 25 Cited by

AI Technical Summary

Problems solved by technology

However, the prior art offers no specific technical scheme for integrating the relevant technologies to further subdivide a document collection into groups and to mine representative keywords from each group.




Embodiment Construction

[0028] The technologies involved in the present invention are explained as follows:

[0029] 1. Text clustering:

[0030] Text clustering (document clustering) is based mainly on the well-known clustering assumption: documents of the same type are more similar to one another, while documents of different types are less similar. Text clustering can divide a relatively large collection of documents into several subcategories, so that similar documents are organized in the same category. As an unsupervised machine learning method, clustering requires neither a training process nor manual labeling of documents in advance, which gives it flexibility and a high degree of automation. It has become an effective method for organizing, summarizing, and navigating text information.
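
The clustering assumption can be illustrated with a small similarity check. The following is a minimal sketch, not taken from the patent: it assumes whitespace tokenization, raw term counts, and cosine similarity as the similarity measure, and the three sample sentences and the function name `cosine_similarity` are hypothetical.

```python
import math
from collections import Counter


def cosine_similarity(a: str, b: str) -> float:
    """Cosine similarity between whitespace-tokenized bag-of-words vectors."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    # Only terms present in both documents contribute to the dot product.
    dot = sum(va[t] * vb[t] for t in va.keys() & vb.keys())
    norm_a = math.sqrt(sum(c * c for c in va.values()))
    norm_b = math.sqrt(sum(c * c for c in vb.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Hypothetical sample documents: two on sports, one on finance.
sports_1 = "the team won the football match"
sports_2 = "the football team lost the match"
finance = "the bank raised interest rates"

same_type = cosine_similarity(sports_1, sports_2)
different_type = cosine_similarity(sports_1, finance)
print(same_type > different_type)
```

Under these assumptions the two sports sentences score higher with each other than either does with the finance sentence, which is the property a clustering algorithm relies on when it groups documents.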

[0031] The applications of text clustering technology mainly include:

[0032] Perform clustering operations on documents that users are interested in (such as news or products that us...


Abstract

The invention relates to a keyword calculation method based on document clustering. The method comprises the following steps: (1) obtaining a text document set; (2) segmenting the content of every document in the set into terms using a word segmentation algorithm; (3) building document vectors; (4) weighting the document vectors with TF-IDF (Term Frequency-Inverse Document Frequency); (5) compressing the dimensions of the document vectors; (6) performing the document clustering calculation; and (7) calculating representative keywords for each group of documents. The method has the following beneficial effects: it provides complete and feasible calculation steps, it supports dimension compression of the document vectors, and it is computationally efficient. The dimension compression step uses a concise and efficient method that differs from the techniques in the prior art. The method is described as the first technical scheme that links these stages through feasible calculation steps to calculate representative keywords from a document set.
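
The seven steps can be sketched end to end in a few lines. This is an illustrative sketch under stated assumptions, not the patented method: whitespace splitting stands in for the word segmentation algorithm, the dimension compression shown (keeping the highest-weighted terms) is a simple placeholder because the excerpt does not describe the patent's own technique, and k-means with cosine similarity is assumed for the clustering step. All function names and the sample documents are hypothetical.

```python
import math
from collections import Counter


def tokenize(text):
    # Assumption: whitespace splitting stands in for a word segmentation algorithm.
    return text.lower().split()


def tfidf_vectors(docs):
    """Steps 2-4: segment each document and build sparse TF-IDF vectors."""
    tokenized = [tokenize(d) for d in docs]
    n = len(tokenized)
    df = Counter(t for toks in tokenized for t in set(toks))
    return [
        {t: (c / len(toks)) * math.log(n / df[t]) for t, c in Counter(toks).items()}
        for toks in tokenized
    ]


def rank_terms(vectors):
    """Terms ordered by total weight, ties broken alphabetically."""
    totals = Counter()
    for v in vectors:
        for t, w in v.items():
            totals[t] += w
    return [t for t, _ in sorted(totals.items(), key=lambda kv: (-kv[1], kv[0]))]


def compress(vectors, keep):
    """Step 5 placeholder: keep only the `keep` highest-weighted terms.
    This is NOT the patent's compression technique, only a stand-in."""
    kept = set(rank_terms(vectors)[:keep])
    return [{t: w for t, w in v.items() if t in kept} for v in vectors]


def cosine(a, b):
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def kmeans(vectors, k, iterations=10):
    """Step 6 (assumed algorithm): k-means with cosine similarity.
    Farthest-first seeding keeps the sketch deterministic."""
    centroids = [dict(vectors[0])]
    while len(centroids) < k:
        far = min(vectors, key=lambda v: max(cosine(v, c) for c in centroids))
        centroids.append(dict(far))
    labels = [0] * len(vectors)
    for _ in range(iterations):
        labels = [max(range(k), key=lambda j: cosine(v, centroids[j])) for v in vectors]
        for j in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == j]
            if members:
                mean = Counter()
                for v in members:
                    for t, w in v.items():
                        mean[t] += w / len(members)
                centroids[j] = dict(mean)
    return labels


def cluster_keywords(vectors, labels, top_n=2):
    """Step 7: the highest-weighted terms within each cluster."""
    return {
        j: rank_terms([v for v, lab in zip(vectors, labels) if lab == j])[:top_n]
        for j in sorted(set(labels))
    }


# Step 1: a hypothetical document set covering two interests.
docs = [
    "football match football team",
    "football team goal",
    "stock market stock bank",
    "stock bank interest",
]
vectors = compress(tfidf_vectors(docs), keep=6)
labels = kmeans(vectors, k=2)
keywords = cluster_keywords(vectors, labels)
print(labels, keywords)
```

Sparse dictionaries are used for the vectors so that dropping terms during compression is a simple filter. A production pipeline would more likely use a proper segmenter and a numerical library, which this sketch deliberately avoids to stay self-contained.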

Description

Technical field

[0001] The invention belongs to the field of computer data mining, and in particular relates to a method for calculating keywords based on document clustering.

Background technique

[0002] In the Internet industry, users often search with groups of keywords to find articles that match their interests. In the prior art, a given document collection is treated as a complete and indivisible whole, and representative keywords are calculated over it. Typical applications include the personalized reading systems of news websites, which calculate a set of keywords representing a user's interests from the news the user has browsed, and recommend new articles based on this set of keywords. In practice, however, a user's interests often cover multiple, scattered aspects. The corresponding document collection can therefore be divided into several groups of documents, each group corresponds to a point of interest of the user, and the correlation between documents in each...


Application Information

Patent Type & Authority: Applications (China)
IPC: IPC(8): G06F17/30
CPC: G06F16/35
Inventor: 周辉, 段玉聪, 叶春杨, 王磊
Owner: HAINAN UNIVERSITY