A Topic Visualization Method for Chinese Document Collection

A topic visualization technology for document collections, applied in the fields of text visualization and topic analysis. It addresses problems such as existing techniques being inapplicable to Chinese documents, the lack of topic visualization technology for Chinese documents, and user misunderstandings.

Active Publication Date: 2017-01-11
SICHUAN UNIV

AI Technical Summary

Problems solved by technology

[0007] The above text visualization technologies lack generality and are not suitable for Chinese documents. So far, China still lacks visualization technology for analyzing the topics of Chinese documents.
In addition, the TIARA technology, which targets topic visualization of English documents only, has the following problems: 1) the shape and layout of the word clouds in the topic stream are unstable, which can easily mislead users and degrade topic analysis; 2) because of the limited area available, the generated word cloud cannot show all the key content of each topic.

Examples

Embodiment 1

[0072] Embodiment 1: Taking the journal data of the "Journal of Software" as an example, a method for visualizing the topics of a Chinese document set is described below with reference to Figure 1.

[0073] Step 1, classify the document set by topic: suppose the document set has n topics l_j, j = 0, 1, 2, ..., n-1. Classify all documents in the document set by topic to obtain n document subsets D_j, j = 0, 1, 2, ..., n-1, where D_j is the document subset corresponding to topic l_j. Specifically: input the paper data from the first to the ninth issue of the "Journal of Software". The document set is classified into five topics (system software and software engineering, database technology, computer network and information security, pattern recognition and artificial intelligence, and operating system), yielding five document subsets.
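The grouping in Step 1 can be sketched in Python as follows. This is a minimal illustration only: it assumes each document already carries a topic label, and the `topic` field and the function name `split_by_topic` are hypothetical names that do not come from the patent.

```python
def split_by_topic(documents, topics):
    """Return one document subset D_j per topic l_j, in the order of `topics`.

    Assumption: each document is a dict with a "topic" key whose value
    is one of the entries in `topics`.
    """
    index = {topic: j for j, topic in enumerate(topics)}
    subsets = [[] for _ in topics]
    for doc in documents:
        subsets[index[doc["topic"]]].append(doc)
    return subsets


# The five topics named in Embodiment 1.
TOPICS = [
    "system software and software engineering",
    "database technology",
    "computer network and information security",
    "pattern recognition and artificial intelligence",
    "operating system",
]
```

Calling `split_by_topic(documents, TOPICS)` yields five lists, one per topic, with subset j holding the documents labeled with topic j.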

[0074] Step 2, divide the time period of the document set: set the start time of the document se...
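Step 2 is truncated in this excerpt, so the following Python sketch is only an illustration under stated assumptions: the time span is split into equal-width segments, and the frequency of a topic in a segment is taken to be the number of its documents that fall in that segment. The `time` field, the `segments` parameter, and the function name `topic_frequency` are hypothetical.

```python
def topic_frequency(subsets, t_start, t_end, segments):
    """Count documents per topic per equal-width time segment.

    Assumptions: every document is a dict with a numeric "time" key,
    t_end > t_start, and segments >= 1. Documents outside
    [t_start, t_end] are ignored.
    """
    width = (t_end - t_start) / segments
    freq = [[0] * segments for _ in subsets]
    for j, subset in enumerate(subsets):
        for doc in subset:
            t = doc["time"]
            if t < t_start or t > t_end:
                continue
            # A document exactly at t_end belongs to the last segment.
            i = min(int((t - t_start) / width), segments - 1)
            freq[j][i] += 1
    return freq
```

Row j of the result is the frequency series of topic j over time, which is the kind of input a topic flow graph needs.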

Embodiment 2

[0082] Embodiment 2: In the above method for visualizing the topics of a Chinese document set, the topics are arranged in random order. When the topic streams are generated, if the intensity of a topic varies too much, the shapes of adjacent topics become distorted, which makes the result unsightly and the relative intensities of topics hard to discern. Distorted topics also affect word cloud placement. At the same time, among all topics in a document set, users tend to care most about the content of the topic with the greatest intensity. The invention therefore further improves the topic-sorting step of Embodiment 1 with a sorting method based on topic frequency and geometric complementarity. This sorting method is described below with reference to Figure 3:

[0083] Step 1: let OT_j denote the start time of topic l_j; when v_j,0 is not equal to zero, take the start time t of the document set ...
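The remaining steps of the sorting method are not available in this excerpt. As a loose illustration of frequency-based ordering only, and not of the patent's frequency-and-geometric-complementarity method, the Python sketch below uses a simple center-out ordering: the topic with the largest total frequency goes in the middle of the stack and weaker topics alternate outward. The function name `center_out_order` and the input layout (one frequency row per topic) are assumptions.

```python
def center_out_order(freq):
    """Order topic indices so the strongest topic sits near the middle.

    `freq[j]` is the frequency series of topic j. Topics are ranked by
    total frequency, then placed alternately below and above the
    topics already placed.
    """
    totals = [sum(row) for row in freq]
    ranked = sorted(range(len(freq)), key=lambda j: totals[j], reverse=True)
    order = []
    for rank, j in enumerate(ranked):
        if rank % 2 == 0:
            order.append(j)      # place on one side
        else:
            order.insert(0, j)   # place on the other side
    return order
```

This keeps the most intense topic away from the edges of the flow graph, but it does not consider how the shapes of neighboring topics fit together, which is what the geometric complementarity criterion is meant to handle.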

Embodiment 3

[0106] Embodiment 3: To address the unstable word cloud shape and layout in the TIARA technology, the invention also improves the word cloud. The topic is first divided into several sub-regions; a scalable algorithm (from the article "Tag Cloud++-Scalable Tag Clouds for Arbitrary Layouts") then represents each region as a set of horizontal line segments; keywords are then placed in sequence to generate the word cloud. The visual features are as follows: 1) the greater the weight of a keyword, the larger its font; 2) the greater the weight of a keyword, the closer it is to the center of the region. This is described below with reference to Figure 5 and Figure 6:

[0107] Step 1: On the topic flow graph, select the region G_j corresponding to topic l_j, whose start time and end time equal the start time t_start and end time t_end of the document set respectively. The time period [t_start, t_end] of region G_j is equally divided into m-1 segments, and the length of each time ...
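The two visual features listed in Embodiment 3 can be illustrated with a small Python sketch. It is a simplification under assumptions: font size is a linear function of keyword weight, the region is reduced to a fixed number of horizontal rows, and keywords are assigned to rows from the middle outward in weight order. It does not reproduce the Tag Cloud++ algorithm or the patent's sub-region handling, and all function names and the size range of 10 to 40 are hypothetical.

```python
def font_size(weight, w_min, w_max, size_min=10.0, size_max=40.0):
    """Map a keyword weight linearly onto [size_min, size_max]."""
    if w_max == w_min:
        return size_max
    return size_min + (weight - w_min) / (w_max - w_min) * (size_max - size_min)


def rows_center_out(row_count):
    """Row indices ordered by distance from the middle row."""
    mid = (row_count - 1) / 2
    return sorted(range(row_count), key=lambda r: (abs(r - mid), r))


def assign_rows(keywords, row_count):
    """Assign (word, weight) pairs to rows, heaviest keywords nearest the middle.

    Returns a list of (word, font size, row index) in descending weight order.
    """
    ranked = sorted(keywords, key=lambda kw: kw[1], reverse=True)
    if not ranked:
        return []
    weights = [w for _, w in ranked]
    w_min, w_max = min(weights), max(weights)
    rows = rows_center_out(row_count)
    return [
        (word, font_size(w, w_min, w_max), rows[i % row_count])
        for i, (word, w) in enumerate(ranked)
    ]
```

A real layout would also need to measure each keyword's rendered width and check that it fits inside the horizontal segment of its row, which this sketch leaves out.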

Abstract

The invention discloses a topic visualization method for a Chinese document set. The method comprises the steps of classifying the document set by topic, dividing the document set into time periods, calculating topic frequency, sorting the topics, generating a topic flow graph, extracting keywords that express the content of each topic, calculating and ranking the keyword weights, and generating a word cloud. The method further comprises a sorting method based on topic frequency and geometric complementarity, a word cloud layout method, and a method for generating a detailed word cloud. The method has the following advantages: topic visualization of a Chinese document set is achieved; the topic flow graph generated with the sorting method based on topic frequency and geometric complementarity is more attractive and flatter, uses space more efficiently, and is better suited to word cloud layout; the word cloud layout method uses space effectively and greatly improves layout efficiency; and the detailed word cloud can show all the keyword content of a topic.

Description

Technical field

[0001] The invention relates to the fields of text visualization and topic analysis, and specifically to a topic visualization method for a Chinese document set.

Background

[0002] Large document collections, such as news, scientific literature, web pages, electronic publications, and announcements, contain a great deal of information. With the development and spread of information digitization, document collections grow larger by the day. Quickly reading and understanding this vast amount of information and extracting useful knowledge from it has become an urgent problem.

[0003] A "topic" usually comprises a central event or activity together with all events and activities directly related to it. Topic detection methods use clustering, classification, retrieval, topic tracking, and other techniques to classify and hierarchically organize the document set by topic, which is convenient for users to ...

Application Information

Patent Type & Authority: Patents (China)
IPC: IPC(8): G06F17/30
CPC: G06F16/35
Inventor: 朱敏, 梁婷, 甘启宏, 李明召, 李一
Owner: SICHUAN UNIV