Abstract
In machine learning and data mining, multidimensional scaling (MDS) and MDS-like methods are extensively used for dimensionality reduction and for gaining insights into overwhelming amounts of data through visualization. With the growth of the Web and of the activities of Web users, the amount of data is not only growing exponentially but is also becoming available in the form of streams, where new data instances constantly flow into the system, requiring the algorithm to update the model in near-real time. This paper presents an algorithm for document stream visualization through an MDS-like distance-preserving projection onto a 2D canvas. The visualization algorithm is essentially a pipeline employing several methods from machine learning. Experimental verification shows that each stage of the pipeline is able to process a batch of documents in constant time. It is shown that, in the experimental setting with a limited buffer capacity and a constant document batch size, it is possible to process roughly 2.5 documents per second, which corresponds to approximately 25% of the entire blogosphere rate and should be sufficient for most real-life applications.
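To make the idea of a distance-preserving projection onto a 2D canvas concrete, the following is a minimal sketch of classical (Torgerson) MDS in Python with NumPy. It is not the pipeline described in the paper, whose stages are not detailed in this abstract; the function names, the use of cosine distance, and the toy term-count matrix are all assumptions made for illustration only.

```python
import numpy as np


def cosine_distance_matrix(X):
    """Pairwise cosine distances between the rows of X (hypothetical helper)."""
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    # Guard against all-zero rows to avoid division by zero.
    unit = X / np.where(norms == 0, 1.0, norms)
    sim = np.clip(unit @ unit.T, -1.0, 1.0)
    return 1.0 - sim


def classical_mds(D, dim=2):
    """Embed n points in `dim` dimensions from an n-by-n distance matrix D."""
    n = D.shape[0]
    # Double-centre the squared distances to obtain a Gram matrix.
    J = np.eye(n) - np.ones((n, n)) / n
    B = -0.5 * J @ (D ** 2) @ J
    # Keep the eigenvectors with the largest eigenvalues.
    eigvals, eigvecs = np.linalg.eigh(B)
    order = np.argsort(eigvals)[::-1][:dim]
    # Negative eigenvalues (non-Euclidean distances) are clipped to zero.
    top_vals = np.clip(eigvals[order], 0.0, None)
    return eigvecs[:, order] * np.sqrt(top_vals)


# Toy term-count matrix: 4 documents over 3 terms (made-up data).
docs = np.array([
    [2.0, 0.0, 1.0],
    [1.0, 0.0, 1.0],
    [0.0, 3.0, 0.0],
    [0.0, 2.0, 1.0],
])
layout = classical_mds(cosine_distance_matrix(docs), dim=2)
print(layout.shape)  # one 2D coordinate pair per document
```

Classical MDS recomputes the whole layout from the full distance matrix, so its cost grows with the number of documents. A streaming setting such as the one described above would need an incremental scheme, for example one that updates the layout per batch over a bounded buffer.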
Copyright information
© 2010 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Grčar, M., Podpečan, V., Juršič, M., Lavrač, N. (2010). Efficient Visualization of Document Streams. In: Pfahringer, B., Holmes, G., Hoffmann, A. (eds) Discovery Science. DS 2010. Lecture Notes in Computer Science, vol 6332. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-16184-1_13
DOI: https://doi.org/10.1007/978-3-642-16184-1_13
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-16183-4
Online ISBN: 978-3-642-16184-1
eBook Packages: Computer Science, Computer Science (R0)