Subject visualization method for Chinese document set

A topic visualization technology for document sets, applied in the field of text visualization and topic analysis. It addresses problems such as existing visualization technologies lacking generality, being inapplicable to Chinese documents, and the absence of topic visualization technology for Chinese documents.

Active Publication Date: 2014-03-12
SICHUAN UNIV

AI Technical Summary

Problems solved by technology

[0007] The text visualization technologies described above lack generality and are not suitable for Chinese documents. To date, China still lacks visualization technologies for analyzing the topics of Chinese documents.
In addition, the TIARA technology, which only aims at the topic visualization of English doc



Examples


Embodiment 1

[0072] Example 1: Taking the journal data of the "Journal of Software" as an example, the topic visualization method for a Chinese document set is described below with reference to Figure 1.

[0073] Step 1: classify the document set by topic. Suppose the document set has n topics l_j, j = 0, 1, 2, ..., n-1. Classify all documents in the document set by topic to obtain n document subsets D_j, j = 0, 1, 2, ..., n-1, where topic l_j corresponds to document subset D_j. Specifically: input the paper data from issues 1 to 9 of the Journal of Software. The document set is classified under five topics (system software and software engineering, database technology, computer network and information security, pattern recognition and artificial intelligence, and operating system), yielding five document subsets.
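The grouping in this step can be illustrated with a minimal Python sketch. It assumes each document already carries a topic label; the source does not say how labels are assigned, and the record layout, the `classify_by_topic` name, and the sample titles are hypothetical.

```python
# The five topics named in Example 1; index j runs from 0 to n-1.
TOPICS = [
    "system software and software engineering",
    "database technology",
    "computer network and information security",
    "pattern recognition and artificial intelligence",
    "operating system",
]

def classify_by_topic(documents, topics):
    """Return subsets where subsets[j] holds the documents labelled topics[j]."""
    index_of = {name: j for j, name in enumerate(topics)}
    subsets = [[] for _ in topics]
    for doc in documents:
        # Assumption: every document has a "topic" field with a known label.
        j = index_of[doc["topic"]]
        subsets[j].append(doc)
    return subsets

# Hypothetical records, made up purely for illustration.
sample_docs = [
    {"title": "paper A", "topic": "database technology"},
    {"title": "paper B", "topic": "operating system"},
    {"title": "paper C", "topic": "database technology"},
]
subsets = classify_by_topic(sample_docs, TOPICS)
```

Here `subsets[j]` plays the role of D_j, so a topic with no documents simply maps to an empty list.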

[0074] Step 2: divide the time period of the document set. Let the start time of the document set be t_start and the end time be t_end. For the document set time perio...

Embodiment 2

[0082] Embodiment 2: In the method described above for visualizing the topics of a Chinese document set, the topics are arranged in random order. When the topic flow graph is generated, a large change in the strength of one topic distorts the shape of the adjacent topics, which makes the result unattractive and makes the relative strength of the topics hard to judge. The distorted topics also interfere with the placement of the word cloud. Moreover, among all the topics in a document set, users tend to care most about the content of the topic with the greatest strength. The present invention therefore improves the topic sorting step of Embodiment 1 and designs a sorting method based on topic frequency and geometric complementarity. The sorting method is explained in detail below with reference to Figure 3:

[0083] Step 1: let the start time of topic l_j be OT_j. When v_{j,0} is not equal to zero, take the start time t of the...

Embodiment 3

[0106] Example 3: Because the shape and layout of the word cloud in the TIARA technology are unstable, the present invention also improves the word cloud. First, the topic is divided into several sub-areas. A scalable algorithm (taken from the article "Tag Cloud++ - Scalable Tag Clouds for Arbitrary Layouts") is then used to express each area as a set of horizontal line segments, and keywords are placed in sequence to generate the word cloud. Its visual characteristics are: 1) the greater the weight of a keyword, the larger its font; 2) the greater the weight of a keyword, the closer it is to the center of the area. A detailed description follows with reference to Figures 5 and 6:
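The two visual characteristics can be illustrated with a short Python sketch. It is not the line-segment placement algorithm cited in the text: it only orders keywords by weight and assigns font sizes. The linear weight-to-font mapping, the font range, and the function name `plan_word_cloud` are assumptions made for illustration.

```python
def plan_word_cloud(keyword_weights, min_font=10.0, max_font=40.0):
    """Order keywords for placement and assign font sizes.

    Returns (keyword, font_size, placement_rank) tuples; rank 0 is the
    keyword to be placed closest to the centre of the area.
    """
    # Heaviest keyword first; ties are broken alphabetically for determinism.
    ordered = sorted(keyword_weights.items(), key=lambda kv: (-kv[1], kv[0]))
    if not ordered:
        return []
    weights = [w for _, w in ordered]
    w_min, w_max = min(weights), max(weights)
    span = w_max - w_min
    plan = []
    for rank, (word, w) in enumerate(ordered):
        # Assumed linear mapping; if all weights are equal, use max_font.
        scale = (w - w_min) / span if span > 0 else 1.0
        plan.append((word, min_font + scale * (max_font - min_font), rank))
    return plan

# Hypothetical keyword weights.
plan = plan_word_cloud({"software": 9, "network": 5, "kernel": 1})
```

A layout routine would then consume `plan` in rank order, so heavier keywords claim the central positions before lighter ones are placed.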

[0107] Step 1: On the topic flow graph, select the area G_j corresponding to topic l_j. Its start time and end time equal the start time t_start and end time t_end of the document set, respectively. The time period [t_start, t_end] of area G_j is divided equally into m-1 segments, and the length of each time segment is...
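The equal division of [t_start, t_end] can be sketched as follows. The source text is truncated before it states the segment length, so the value (t_end - t_start) / (m - 1) used here is inferred from "equally divided into m-1 segments" rather than quoted. The function name and the numeric time values are hypothetical.

```python
def segment_boundaries(t_start, t_end, m):
    """Return m equally spaced time points spanning m-1 segments."""
    if m < 2:
        raise ValueError("m must be at least 2 to form one segment")
    # Inferred segment length for an equal division into m-1 parts.
    step = (t_end - t_start) / (m - 1)
    return [t_start + i * step for i in range(m)]

# Hypothetical time axis: 0 to 8 in arbitrary units, m = 5 points.
points = segment_boundaries(0.0, 8.0, 5)
```

With these values the four segments each have length 2, and the first and last points coincide with t_start and t_end.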


Abstract

The invention discloses a subject (topic) visualization method for a Chinese document set. The method comprises the steps of classifying the document set by topic, dividing the time period of the document set, calculating topic frequency, sorting the topics, generating a topic flow graph, extracting keywords that express the content of each topic, calculating and ranking the keyword weights, and generating a word cloud. The method further comprises a sorting method based on topic frequency and geometric complementarity, a word cloud layout method, and a method for generating a detailed word cloud. The advantages are as follows: topic visualization of a Chinese document set is achieved; the topic flow graph generated with the sorting method based on topic frequency and geometric complementarity is more attractive and flatter, uses space more efficiently, and is better suited to word cloud layout; the word cloud layout method uses space effectively and greatly improves layout efficiency; and the detailed word cloud can show all the keywords of a topic.

Description

Technical field

[0001] The invention relates to the field of text visualization and topic analysis, and in particular to a topic visualization method of a Chinese document set.

Background technique

[0002] Large collections of documents, such as news, scientific literature, web pages and electronic publications, announcements, etc., contain a lot of information. With the development and popularization of information digitization, the scale of document collections is expanding day by day. Quickly reading and understanding the vast amount of information, and extracting useful knowledge from it, has become an urgent problem for people to solve.

[0003] "Theme" usually includes a core event or activity, and all events and activities directly related to it. The subject detection method uses technologies such as clustering, classification, retrieval, and subject tracking, and hierarchically categorizes and organizes the document set according to the subject, which is convenient for us...


Application Information

IPC(8): G06F17/30
CPC: G06F16/35
Inventor: 朱敏, 梁婷, 甘启宏, 李明召, 李一
Owner: SICHUAN UNIV