Self-information based discovery method for co-occurrent topic in interdisciplinary field

A self-information-based discovery method, applied in special data processing applications, instruments, unstructured text data retrieval, etc., that addresses the problems that co-occurring topic information cannot be extracted well and that existing methods cannot be used to extract co-occurring topic words.

Status: Inactive. Publication Date: 2015-12-09
SHANGHAI UNIV
8 Cites, 0 Cited by

AI Technical Summary

Problems solved by technology

However, existing methods do not extract co-occurring topic information in interdisciplinary fields well: in evaluative texts from interdisciplinary fields, the topic is sometimes carried by low-frequency topic words rather than high-frequency words.
Most existing topic discovery methods favor high-frequency words, so they cannot be used to extract co-occurring topic words that have low-frequency characteristics, that is, low-frequency topic words.
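
The link between rarity and informativeness can be made concrete with Shannon self-information. The following is the general definition, not the patent's evaluation coefficient (which is not reproduced in this section); p(w) is assumed here to be the relative frequency of word w in the document set.

```latex
I(w) = -\log_2 p(w)
\qquad
p(w) = \tfrac{1}{8} \;\Rightarrow\; I(w) = 3 \text{ bits},
\qquad
p(w) = \tfrac{1}{1024} \;\Rightarrow\; I(w) = 10 \text{ bits}
```

Under this definition a rarer word carries more information per occurrence, which is why a purely frequency-ranked method tends to overlook it.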

Examples

Embodiment 1

[0033] Referring to Figure 1, this self-information-based method for discovering co-occurring topics in interdisciplinary fields comprises the following operation steps:

[0034] (1) Data collection: collect the set of self-assessment documents in which authors of highly cited documents assess the success of their scientific research;

[0035] (2) Data processing: extract and digitize the text part of the self-assessments;

[0036] (3) Extract candidate low-frequency topic words;

[0037] (4) Calculate the low-frequency topic evaluation coefficient;

[0038] (5) Set the threshold of the low-frequency topic word evaluation coefficient;

[0039] (6) Filter the low-frequency topic words.
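
Steps (3) to (6) can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented computation: the formula for the evaluation coefficient is not given in this section, so plain self-information of a word's relative frequency stands in for it. The function names, the `max_count` and `min_docs` cutoffs, and the regex tokenizer are all hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase and split on non-letters (a stand-in for a real tokenizer)."""
    return re.findall(r"[a-z]+", text.lower())


def score_low_frequency_words(docs, max_count, min_docs, threshold):
    """Return {word: coefficient} for low-frequency words that pass the threshold.

    Assumptions (not from the patent text):
      - a candidate is a word with corpus count <= max_count that still
        appears in at least min_docs documents (so it co-occurs across texts);
      - the coefficient is self-information, -log2(count / total), in bits.
    """
    tokenized = [tokenize(d) for d in docs]
    term_counts = Counter(w for toks in tokenized for w in toks)
    doc_counts = Counter(w for toks in tokenized for w in set(toks))
    total = sum(term_counts.values())

    kept = {}
    for word, count in term_counts.items():
        # Step (3): candidate low-frequency topic words.
        if count > max_count or doc_counts[word] < min_docs:
            continue
        # Step (4): evaluation coefficient (assumed form).
        coefficient = -math.log2(count / total)
        # Steps (5)-(6): apply the threshold and filter.
        if coefficient >= threshold:
            kept[word] = coefficient
    return kept


docs = [
    "persistence and luck helped the method",
    "the method needed persistence",
    "the method the method",
]
result = score_low_frequency_words(docs, max_count=2, min_docs=2, threshold=2.0)
```

In this toy corpus only "persistence" is both rare overall and shared by more than one document, so it is the only word retained; "the" and "method" are excluded as high-frequency words.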

Embodiment 2

[0040] This embodiment is basically the same as Embodiment 1, with the following particular features:

[0041] The specific operation of the data collection in step (1) is: from the self-assessments, collected by Garfield, the founder of the citation database SCI, in which authors of highly cited classic documents assess the success of their scientific research work, collect an author self-assessment document set covering 3,790 highly cited classic documents.

[0042] The specific operation of the data processing in step (2) is: digitize and extract the text in the document set; in addition, extract three types of information: the text content of the self-assessment, information related to the self-assessment, and information related to the original highly cited document.

[0043] The specific operation of extracting candidate low-frequency topic words in step (3) is: first use the "Natural Language Toolse...

Embodiment 3

[0054] As shown in Figure 1, this self-information-based method for discovering co-occurring topics in interdisciplinary fields includes the following steps:

[0055] (1) Data collection. More than 5,000 documents in PDF format were accessed from the Garfield Electronic Library at the University of Pennsylvania. After three data preprocessing tasks (deleting noise data, deleting duplicate data, and discarding records with missing data), a total of 3,790 usable documents with complete information were obtained, and a self-assessment document set was established.

[0056] (2) Data processing. The text portion of each self-assessment in the document set was extracted and digitized. In addition, three types of information were extracted: the text content of the self-assessment, information related to the self-assessment (such as its author, the author's address, its year, and its subject field label), and the o...
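
The record layout and the three preprocessing tasks described in steps (1) and (2) might look like the sketch below. The class name, field names, and the `min_chars` noise rule are hypothetical; the source does not say how noise or duplicates were detected, and the fields of the third information category are cut off, so that category is left as an open mapping.

```python
from dataclasses import dataclass, field, replace


@dataclass
class SelfAssessmentRecord:
    text: str                 # text content of the self-assessment
    author: str               # author of the self-assessment
    author_address: str       # address of the author
    year: int                 # year of the self-assessment
    subject_label: str        # subject field label
    # Information about the original highly cited document; its fields
    # are not listed in the source, so no schema is assumed here.
    original_document: dict = field(default_factory=dict)


def preprocess(records, min_chars=20):
    """Drop incomplete, noisy, and duplicate records (assumed rules)."""
    seen = set()
    kept = []
    for rec in records:
        text = " ".join(rec.text.split())  # normalize whitespace
        # Discard missing data: require the core fields to be present.
        if not (text and rec.author and rec.year and rec.subject_label):
            continue
        # Delete noise data: hypothetical rule, very short texts are noise.
        if len(text) < min_chars:
            continue
        # Delete duplicate data: compare case-insensitive normalized text.
        key = text.lower()
        if key in seen:
            continue
        seen.add(key)
        kept.append(replace(rec, text=text))
    return kept


records = [
    SelfAssessmentRecord("We developed  a new assay for enzymes.", "A", "X Univ", 1980, "Biochemistry"),
    SelfAssessmentRecord("we developed a new assay for enzymes.", "A", "X Univ", 1980, "Biochemistry"),
    SelfAssessmentRecord("A long enough account of the work.", "", "Y Univ", 1981, "Physics"),
    SelfAssessmentRecord("??", "B", "Z Univ", 1982, "Chemistry"),
]
clean = preprocess(records)
```

Here the second record is dropped as a duplicate, the third for a missing author, and the fourth as noise, leaving a single usable record.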

Abstract

The invention discloses a self-information-based method for discovering co-occurring topics in interdisciplinary fields. The method comprises the following steps: (1) data collection: collecting a set of self-evaluation documents in which authors of highly cited documents assess the success of their scientific research; (2) data processing: extracting and digitizing the text part of the self-evaluations; (3) extracting candidate low-frequency topic words; (4) calculating a low-frequency topic evaluation coefficient; (5) setting a threshold for the low-frequency topic word evaluation coefficient; and (6) filtering the low-frequency topic words. The method provides a new idea for research on topic discovery: high-frequency words are not the only words closely related to a topic, and low-frequency words are also a usable resource. The method can be applied to topic discovery in evaluative document sets, for example extracting the common experiences of individuals from autobiographical texts or common indexes for stock evaluation from stock commentary, thereby extracting co-occurring topics from document sets in different disciplinary fields.

Description

Technical field

[0001] The invention relates to a self-information-based method for discovering co-occurring topics in interdisciplinary fields, and belongs to the field of text mining.

Background

[0002] In recent years, topic discovery, a hot research direction in text mining, has received increasing attention from researchers. Topic discovery can mine key topic information from massive unstructured texts, making it possible to understand the main content of texts more efficiently and to obtain their deep semantic information. It also supports deeper analysis of topics and the discovery of more latent knowledge in texts.

[0003] Existing topic discovery methods mainly include topic models and word frequency statistics. A topic model is a probabilistic generative model in which topics are hidden variables and documents and terms are observations. Through the training of the model, the probability distribution of te...

Application Information

IPC(8): G06F17/30
CPC: G06F16/3344, G06F16/36
Inventor: 夏晴, 周文, 张亚军, 刘孟
Owner SHANGHAI UNIV