Chrum: The Tool for Convenient Generation of Apache Oozie Workflows

Part of the book series: Studies in Computational Intelligence (SCI, volume 541)

Abstract

Conducting research in an efficient, repeatable, evaluable, and also convenient (in terms of development) way has always been a challenge. To satisfy these requirements in the long term while minimizing the costs of the software engineering process, one has to follow a certain set of guidelines. This article describes such guidelines, based on the research environment called the Content Analysis System (CoAnSys), created at the Center for Open Science (CeON). In addition to best practices for working in the Apache Hadoop environment, a tool for the convenient generation of Apache Oozie workflows is presented.

Notes

  1. http://arkitus.com/PRML/
  2. http://github.com/CeON/CoAnSys
  3. http://code.google.com/p/protobuf/
  4. http://thrift.apache.org/
  5. http://avro.apache.org/
  6. http://oozie.apache.org/
  7. http://www.openaire.eu/
  8. http://pbn.nauka.gov.pl/
  9. http://www.infona.pl/
  10. http://github.com/CeON/chrum

Author information

Corresponding authors

Correspondence to Piotr Jan Dendek, Artur Czeczko, Mateusz Fedoryszak, Adam Kawa, Piotr Wendykier or Łukasz Bolikowski.

Copyright information

© 2014 Springer International Publishing Switzerland

About this chapter

Cite this chapter

Dendek, P.J., Czeczko, A., Fedoryszak, M., Kawa, A., Wendykier, P., Bolikowski, Ł. (2014). Chrum: The Tool for Convenient Generation of Apache Oozie Workflows. In: Bembenik, R., Skonieczny, Ł., Rybiński, H., Kryszkiewicz, M., Niezgódka, M. (eds) Intelligent Tools for Building a Scientific Information Platform: From Research to Implementation. Studies in Computational Intelligence, vol 541. Springer, Cham. https://doi.org/10.1007/978-3-319-04714-0_12

  • DOI: https://doi.org/10.1007/978-3-319-04714-0_12

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-04713-3

  • Online ISBN: 978-3-319-04714-0

  • eBook Packages: Engineering, Engineering (R0)
