Inferring emerging and evolving topics in streaming text

A streaming-text topic technology, applied in the field of document analysis, that addresses the drift of underlying topics over time and the new twists that streaming datasets introduce.

Inactive Publication Date: 2013-06-13
IBM CORP

AI Technical Summary

Benefits of technology

[0015] In one embodiment of the invention, given streaming matrices, a sequence of NMFs is learned with two forms of temporal regularization. The first regularizer enforces smooth evolution of topics via constraints on the amount of drift allowed. The second regularizer applies to an additional “topic bandwidth” introduced into the system for early detection of emerging trends. Implicitly, this regularizer extracts smooth trends of candidate emerging topics and then encourages the discovery of those that are rapidly growing over a short time window. This setup is formulated as an objective function which reduces to rank-one subproblems involving projections onto the probability simplex and SVM-like optimization with additional non-negativity constraints. Embodiments of the invention provide efficient algorithms for finding stationary points of this objective function. Since they mainly involve matrix-vector operations and linear-time subroutines, these algorithms scale gracefully to large datasets.

Problems solved by technology

Streaming datasets present new twists to the problem.
Consider the problem of building compact, dynamic representations of streaming datasets such as those that arise in social media.
Each new batch of documents arriving at a timepoint is completely unorganized and may do any combination of the following: contribute to ongoing, unknown topics of discussion (potentially causing underlying topics to drift over time), initiate new themes that may or may not become significant going forward, or simply inject irrelevant “noise”.



Examples


case 1

[0049] $h_i$ is evolving: For an evolving topic, the optimization needs to be performed under the constraints of equations (4) and (3). Thus the optimum $h_i^*$ is obtained by projection onto the set $\{h_i : h_i \in \Delta_D,\ l_j \le h_{ij} \le u_j\}$ for appropriate constants $l_j$ and $u_j$. This is equivalent to a projection onto a simplex with box constraints. Adapting a method due to [P. M. Pardalos and N. Kovoor. An algorithm for singly constrained class of quadratic programs subject to upper and lower bounds. Mathematical Programming, 46:321-328, 1990], we can find the minimizer in O(D) time, i.e., linear in the number of coordinates.
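To make the box-constrained projection concrete, the following is a minimal sketch that finds the projection by bisection on a single shift parameter. It is not the linear-time Pardalos and Kovoor method cited above; bisection is swapped in here because it is short and easy to check. The function name, the tolerance, and the example bounds are assumptions introduced for illustration only.

```python
def project_onto_box_simplex(v, lower, upper, tol=1e-12, max_iter=200):
    """Project v onto {h : sum(h) = 1, lower[j] <= h[j] <= upper[j]}.

    Illustrative bisection sketch (hypothetical helper, not the cited
    O(D) algorithm). The projection has the form
    h[j] = clip(v[j] - tau, lower[j], upper[j]) for a scalar tau chosen
    so that the coordinates sum to one.
    """
    if sum(lower) > 1.0 or sum(upper) < 1.0:
        raise ValueError("box constraints leave no point on the simplex")

    def clipped(tau):
        return [min(max(x - tau, lo), hi) for x, lo, hi in zip(v, lower, upper)]

    # At tau_low every coordinate sits at its upper bound (sum >= 1);
    # at tau_high every coordinate sits at its lower bound (sum <= 1).
    tau_low = min(x - hi for x, hi in zip(v, upper))
    tau_high = max(x - lo for x, lo in zip(v, lower))

    # The clipped sum is non-increasing in tau, so bisection applies.
    for _ in range(max_iter):
        mid = 0.5 * (tau_low + tau_high)
        if sum(clipped(mid)) > 1.0:
            tau_low = mid
        else:
            tau_high = mid
        if tau_high - tau_low < tol:
            break
    return clipped(0.5 * (tau_low + tau_high))


# Example with made-up bounds: the first coordinate may not exceed 0.5.
h = project_onto_box_simplex([0.9, 0.2, 0.1], [0.0, 0.0, 0.0], [0.5, 1.0, 1.0])
```

In this sketch the bounds play the role of the constants $l_j$ and $u_j$; how those constants are derived from equations (3) and (4) is not shown in this excerpt, so the values above are placeholders.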

case 2

[0050] $h_i$ is emerging: For an emerging topic, the constraint set is $\{h_i : h_i \in \Delta_D\}$ and the optimization equation (8) becomes equivalent to a projection onto the simplex $\Delta_D$. The same algorithm [P. M. Pardalos and N. Kovoor, An algorithm for singly constrained class of quadratic programs subject to upper and lower bounds, Mathematical Programming, 46:321-328, 1990] again gives us the minimizer in linear time O(D).
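As an illustration of projecting onto the probability simplex $\Delta_D$, here is a minimal sketch using the well-known sort-and-threshold procedure. It runs in O(D log D) because of the sort, so it is a simpler stand-in for the linear-time method cited above, not a reproduction of it. The function name is hypothetical.

```python
def project_onto_simplex(v):
    """Euclidean projection of v onto {h : h[j] >= 0, sum(h) = 1}.

    Sort-based sketch: find the threshold tau such that
    sum(max(0, v[j] - tau)) == 1, then shift and clip.
    """
    sorted_desc = sorted(v, reverse=True)
    cumulative = 0.0
    tau = 0.0
    for k, value in enumerate(sorted_desc, start=1):
        cumulative += value
        candidate = (cumulative - 1.0) / k
        # The condition holds for a prefix of the sorted values; the
        # last k that satisfies it determines the final threshold.
        if value - candidate > 0.0:
            tau = candidate
    return [max(0.0, x - tau) for x in v]


# Example: an unnormalized word-weight vector mapped to a distribution.
topic = project_onto_simplex([0.2, 0.1, 0.9])
```

A vector that already lies on the simplex is returned unchanged, and a vector dominated by one large entry collapses onto the corresponding vertex.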

[0051] Optimization over evolving $w_i$: When $w_i \in W_{ev}$, the second term in equation (7) does not contribute and, using the RRI scheme, the optimization problem can be written as $w_i^* = \arg\min_{w_i \ge 0} \|R - w_i h_i^\top\|^2$. Similar to equation (8), simple algebraic operations show that the above minimization is equivalent to the following simple projection problem:

$$\arg\min_{w_i \ge 0} \left\| w_i - \frac{R h_i}{\|h_i\|^2} \right\|^2 \qquad (9)$$

The corresponding minimizer is simply given by

$$w_{ij} = \max\left(0,\ \frac{1}{\|h_i\|^2} (R h_i)_j\right).$$

[0052] Emerging $w_i$: When $w_i \in W_{em}$, the RRI step of the corresponding optimization problem looks like

$$\arg\min_{w_i \ge 0} \|R - w_i h_i^\top\|^2 + \mu L(S w_i),$$

which is equivalent to

$$\arg\min_{w_i \ge 0} \left\| w_i - \frac{R h_i}{\|h_i\|^2} \right\|^2 + \frac{\mu L(S w_i)}{\|h_i\|^2} \qquad (10)$$

[0053] Noting t...
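The closed-form update for an evolving $w_i$ (the minimizer of the projection problem (9)) can be sketched in a few lines. This is an illustrative sketch only: the function name is hypothetical, R is assumed to be the residual matrix stored as a list of rows, and h is the current topic vector $h_i$. The emerging case (10) is not covered, since the loss L and the matrix S are not defined in this excerpt.

```python
def update_evolving_w(R, h):
    """Return w with w[j] = max(0, (R h)[j] / ||h||^2).

    R: residual matrix as a list of rows, each of length len(h).
    h: topic vector h_i (must not be all zeros).
    """
    h_norm_sq = sum(x * x for x in h)
    if h_norm_sq == 0.0:
        raise ValueError("h must be non-zero")
    # One matrix-vector product, then clip negative entries to zero.
    return [
        max(0.0, sum(r * x for r, x in zip(row, h)) / h_norm_sq)
        for row in R
    ]


# Example with a toy 2 x 2 residual matrix.
w = update_evolving_w([[1.0, 2.0], [-3.0, 0.0]], [1.0, 1.0])
```

The update costs a single matrix-vector product, which matches the earlier remark that the algorithms mainly involve matrix-vector operations.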


Abstract

A method, system and computer program product for inferring topic evolution and emergence in a set of documents. In one embodiment, the method comprises forming a group of matrices using text in the documents, and analyzing these matrices to identify evolving topics and emerging topics. The matrices include a matrix X identifying a multitude of words in each of the documents, a matrix W identifying a multitude of topics in each of the documents, and a matrix H identifying a multitude of words for each of the multitude of topics. These matrices are analyzed to identify the evolving and emerging topics. In an embodiment, two forms of temporal regularizers are used to help identify the evolving and emerging topics. In another embodiment, a two-stage approach involving detection and clustering is used to help identify the evolving and emerging topics.

Description

CROSS REFERENCE TO RELATED APPLICATION

[0001] This application is a continuation of copending U.S. patent application Ser. No. 13/315,798, filed Dec. 9, 2011, the entire content and disclosure of which is hereby incorporated herein by reference.

BACKGROUND OF THE INVENTION

[0002] The present invention generally relates to document analysis, and more specifically, to inferring topic evolution and emergence in streaming documents.

[0003] Learning a dictionary of basis elements with the objective of building compact data representations is a problem of fundamental importance in statistics, machine learning and signal processing. In many settings, data points appear as a stream of high dimensional feature vectors. Streaming datasets present new twists to the problem. On one hand, basis elements need to be dynamically adapted to the statistics of incoming datapoints, while on the other hand, many applications require early detection of rising new trends. The analysis of social media streams for...


Application Information

Patent Type & Authority: Applications (United States)
IPC(8): G06F17/30
CPC: G06F17/2785, G06F17/30619, G06F16/316, G06F40/30
Inventors: ANKAN, SAHA; BANERJEE, ARINDAM; KASIVISWANATHAN, SHIVA P.; LAWRENCE, RICHARD D.; MELVILLE, PREM; SINDHWANI, VIKAS; TING, EDISON L.
Owner: IBM CORP