ABSTRACT
Automated analysis of unstructured text documents (e.g., web pages, newswire articles, research publications, business reports) is a key capability for solving important problems in areas including decision making, risk assessment, social network analysis, intelligence analysis, scholarly research, and others. However, as data sizes continue to grow in these areas, scalable processing, modeling, and semantic analysis of text collections become essential. In this paper, we present the ParaText text analysis engine, a distributed-memory software framework for processing, modeling, and analyzing collections of unstructured text documents. Results on several document collections using hundreds of processors are presented to illustrate the flexibility, extensibility, and scalability of the entire process of text modeling, from raw data ingestion to application analysis.
Index Terms
- ParaText: scalable text modeling and analysis