DOI: 10.1145/3524842.3528504
short-paper

LAGOON: an analysis tool for open source communities

Published: 17 October 2022

ABSTRACT

This paper presents LAGOON, an open source platform for understanding the complex ecosystems of Open Source Software (OSS) communities. The platform currently uses spatiotemporal graphs to store and investigate the artifacts produced by these communities and to help analysts identify bad actors who might compromise an OSS project's security. LAGOON ingests artifacts from several common sources, including source code repositories, issue trackers, mailing lists, and content scraped from project websites. Ingestion uses a modular architecture that supports incremental updates from data sources and provides a generic identity fusion process that can recognize the same community members across disparate accounts. A user interface supports visualization and exploration of an OSS project's complete sociotechnical graph, and scripts are provided for applying machine learning to identify patterns within the data. While the current focus is on identifying bad actors in the Python community, the platform's reusability makes it easily extensible with new data and analyses, paving the way for LAGOON to become a comprehensive means of assessing various OSS-based projects and their communities.
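
To make the idea of identity fusion concrete, the following is a minimal sketch and not LAGOON's actual implementation, which the abstract does not detail. It assumes a hypothetical input format (account IDs mapped to optional "name" and "email" fields) and substitutes a simple technique: accounts are merged with a union-find structure whenever they share a normalized email or display name. The function name `fuse_identities` and the sample accounts are invented for illustration.

```python
from collections import defaultdict


def fuse_identities(accounts):
    """Group account IDs that share a normalized email or display name.

    `accounts` maps a hypothetical account ID to a dict with optional
    "email" and "name" fields. Returns a list of sets of account IDs.
    This is an illustrative heuristic, not LAGOON's fusion process.
    """
    parent = {acc_id: acc_id for acc_id in accounts}

    def find(x):
        # Follow parent links to the root, shortening the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    # Remember the first account seen for each normalized attribute value;
    # any later account with the same value is merged into its group.
    seen = {}
    for acc_id, fields in accounts.items():
        for key in ("email", "name"):
            value = (fields.get(key) or "").strip().lower()
            if not value:
                continue
            token = (key, value)
            if token in seen:
                union(seen[token], acc_id)
            else:
                seen[token] = acc_id

    groups = defaultdict(set)
    for acc_id in accounts:
        groups[find(acc_id)].add(acc_id)
    return list(groups.values())


# Hypothetical accounts from a repository, a mailing list, and an issue tracker.
accounts = {
    "git:1": {"name": "Ada Lovelace", "email": "ada@example.org"},
    "ml:7": {"name": "A. Lovelace", "email": "ADA@example.org"},
    "issues:3": {"name": "ada lovelace", "email": None},
    "git:2": {"name": "Grace Hopper", "email": "grace@example.org"},
}
fused = fuse_identities(accounts)
```

Here the three "Ada" accounts end up in one group because the mailing-list account shares an email with the repository account and the issue-tracker account shares a name with it, while the fourth account stays separate. Exact matching like this is deliberately naive; a real system would likely need fuzzier or probabilistic matching to handle name variants and to avoid merging distinct people who share a common name.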


Published in

MSR '22: Proceedings of the 19th International Conference on Mining Software Repositories
May 2022
815 pages
ISBN: 9781450393034
DOI: 10.1145/3524842

Copyright © 2022 ACM

Publication rights licensed to ACM. ACM acknowledges that this contribution was authored or co-authored by an employee, contractor or affiliate of the United States government. As such, the Government retains a nonexclusive, royalty-free right to publish or reproduce this article, or to allow others to do so, for Government purposes only.

Publisher

Association for Computing Machinery, New York, NY, United States
