ABSTRACT
This paper presents LAGOON, an open source platform for understanding the complex ecosystems of Open Source Software (OSS) communities. The platform currently uses spatiotemporal graphs to store and investigate the artifacts produced by these communities, helping analysts identify bad actors who might compromise an OSS project's security. LAGOON ingests artifacts from several common sources, including source code repositories, issue trackers, mailing lists, and content scraped from project websites. Ingestion uses a modular architecture that supports incremental updates from data sources and provides a generic identity fusion process that can recognize the same community members across disparate accounts. A user interface supports visualization and exploration of an OSS project's complete sociotechnical graph, and scripts are provided for applying machine learning to identify patterns within the data. While the current focus is on identifying bad actors in the Python community, the platform's reusability makes it easily extensible with new data and analyses, paving the way for LAGOON to become a comprehensive means of assessing various OSS-based projects and their communities.
Index Terms
- LAGOON: an analysis tool for open source communities