Research article, DOI: 10.1145/1851476.1851526

ParaText: scalable text modeling and analysis

Published: 21 June 2010

ABSTRACT

Automated analysis of unstructured text documents (e.g., web pages, newswire articles, research publications, business reports) is a key capability for solving important problems in areas including decision making, risk assessment, social network analysis, intelligence analysis, and scholarly research. However, as data sizes continue to grow in these areas, scalable processing, modeling, and semantic analysis of text collections become essential. In this paper, we present the ParaText text analysis engine, a distributed-memory software framework for processing, modeling, and analyzing collections of unstructured text documents. Results on several document collections using hundreds of processors are presented to illustrate the flexibility, extensibility, and scalability of the entire process of text modeling, from raw data ingestion to application analysis.


Published in

HPDC '10: Proceedings of the 19th ACM International Symposium on High Performance Distributed Computing
June 2010
911 pages
ISBN: 9781605589428
DOI: 10.1145/1851476

Copyright © 2010 ACM


Publisher: Association for Computing Machinery, New York, NY, United States

Overall acceptance rate: 166 of 966 submissions, 17%
