ParaText: scalable text modeling and analysis

Authors:
Daniel M. Dunlavy, Sandia National Laboratories, Albuquerque, NM
Timothy M. Shead, Sandia National Laboratories, Albuquerque, NM
Eric T. Stanton, Sandia National Laboratories, Albuquerque, NM

HPDC '10: Proceedings of the 19th ACM International Symposium on High Performance Distributed Computing
June 2010, Pages 344–347
https://doi.org/10.1145/1851476.1851526

Published: 21 June 2010

ABSTRACT

Automated analysis of unstructured text documents (e.g., web pages, newswire articles, research publications, business reports) is a key capability for solving important problems in areas including decision making, risk assessment, social network analysis, intelligence analysis, and scholarly research. However, as data sizes continue to grow in these areas, scalable processing, modeling, and semantic analysis of text collections become essential. In this paper, we present the ParaText text analysis engine, a distributed-memory software framework for processing, modeling, and analyzing collections of unstructured text documents. Results on several document collections using hundreds of processors are presented to illustrate the flexibility, extensibility, and scalability of the entire process of text modeling, from raw data ingestion to application analysis.

Index Terms

ParaText: scalable text modeling and analysis
1. Computing methodologies
  1. Artificial intelligence
    1. Natural language processing
      1. Language resources

Published in
HPDC '10: Proceedings of the 19th ACM International Symposium on High Performance Distributed Computing
June 2010, 911 pages
ISBN: 9781605589428
DOI: 10.1145/1851476
General Chairs: Salim Hariri (University of Arizona), Kate Keahey (University of Chicago)
Copyright © 2010 ACM
Publisher: Association for Computing Machinery, New York, NY, United States
Author Tags: text analysis
Qualifiers: research-article
Overall Acceptance Rate: 166 of 966 submissions, 17%