ClassifyHub: An Algorithm to Classify GitHub Repositories

Soll, Marcus; Vosgerau, Malte

doi:10.1007/978-3-319-67190-1_34

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 10505)

Included in the following conference series:

Joint German/Austrian Conference on Artificial Intelligence (Künstliche Intelligenz)

Abstract

Classifying the repositories found on GitHub is a hard task. Solving it, however, would benefit many applications (e.g. recommender systems). In this paper we present ClassifyHub, an algorithm based on ensemble learning and developed for the InformatiCup 2017 competition, which tackles this classification problem with high precision and recall. In addition, we provide a data set of classified repositories for further research.
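
The abstract names ensemble learning, precision and recall without showing them, so the following is a minimal Python sketch of the general idea only, not ClassifyHub's actual implementation. The base classifiers (`by_readme`, `by_extension`, `by_name`), the repository dictionary layout and the labels `DOCS` and `DEV` are all hypothetical and chosen for illustration; majority voting is one common way to combine an ensemble and is assumed here, not taken from the paper.

```python
from collections import Counter


def majority_vote(predictions):
    """Return the most frequent label; ties go to the earliest voter's label."""
    counts = Counter(predictions)
    best = max(counts.values())
    for label in predictions:
        if counts[label] == best:
            return label


def precision_recall(true_labels, predicted_labels, target):
    """Compute precision and recall for one target class."""
    pairs = list(zip(true_labels, predicted_labels))
    tp = sum(t == target and p == target for t, p in pairs)
    fp = sum(t != target and p == target for t, p in pairs)
    fn = sum(t == target and p != target for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Hypothetical base classifiers: each looks at one aspect of a repository.
def by_readme(repo):
    return "DOCS" if "documentation" in repo["readme"].lower() else "DEV"


def by_extension(repo):
    markdown_files = sum(name.endswith(".md") for name in repo["files"])
    return "DOCS" if markdown_files * 2 > len(repo["files"]) else "DEV"


def by_name(repo):
    return "DOCS" if "docs" in repo["name"].lower() else "DEV"


CLASSIFIERS = [by_readme, by_extension, by_name]


def classify(repo):
    """Combine the base classifiers' votes into a single label."""
    return majority_vote([classifier(repo) for classifier in CLASSIFIERS])


example_repo = {
    "name": "project-docs",
    "readme": "Documentation for the project.",
    "files": ["index.md", "guide.md", "build.py"],
}
print(classify(example_repo))
```

Each base classifier may be weak on its own; the ensemble only needs most of them to agree. Precision and recall are computed per class, so a multi-class evaluation would call `precision_recall` once per label.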

Acknowledgements

We would like to thank the organisers, the jury and all participants of the InformatiCup 2017, for which ClassifyHub was developed.

Author information

Authors and Affiliations

University of Hamburg, Mittelweg 177, 20148, Hamburg, Germany
Marcus Soll & Malte Vosgerau

Authors

Marcus Soll
Malte Vosgerau

Corresponding author

Correspondence to Marcus Soll.

Editor information

Editors and Affiliations

Fakultät für Informatik, Technische Universität Dortmund, Dortmund, Germany
Gabriele Kern-Isberner
FB Informatik, TU Darmstadt, Darmstadt, Hessen, Germany
Johannes Fürnkranz
FB Informatik, Universität Koblenz, Koblenz, Rheinland-Pfalz, Germany
Matthias Thimm

About this paper

Cite this paper

Soll, M., Vosgerau, M. (2017). ClassifyHub: An Algorithm to Classify GitHub Repositories. In: Kern-Isberner, G., Fürnkranz, J., Thimm, M. (eds) KI 2017: Advances in Artificial Intelligence. KI 2017. Lecture Notes in Computer Science, vol 10505. Springer, Cham. https://doi.org/10.1007/978-3-319-67190-1_34

DOI: https://doi.org/10.1007/978-3-319-67190-1_34
Published: 19 September 2017
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-67189-5
Online ISBN: 978-3-319-67190-1
eBook Packages: Computer Science, Computer Science (R0)