Software Skills Identification: A Multi-Class Classification on Source Code Using Machine Learning


Dimitris Bamidis
Ilias Kalouptsoglou
Apostolos Ampatzoglou
Alexandros Chatzigeorgiou

Keywords

Machine Learning, Supervised Learning, Multi-class Classification, Neural Network, Transfer Learning, Source Code Analysis

Abstract

In the fast-changing tech industry, accurately assessing developers' software skills is critical for effective workforce management. This study presents a machine learning approach that classifies software development knowledge through source code analysis, focusing on Java-based technologies. A dataset of source code files spanning multiple domains of software development was compiled from public repositories and labeled for multi-class classification.

The methodology used both non-pretrained neural networks and pre-trained models to improve classification accuracy. The high performance achieved with transfer learning underlines the suitability of pre-trained CodeBERT models for classifying software skills. The results confirm the feasibility of using machine learning to identify developers' programming proficiencies and provide a foundation for more sophisticated assessment tools. Future work aims to refine the classification by incorporating functional task identification and commit-based analysis for a more comprehensive evaluation of coding skills. Overall, the study illustrates the potential of machine learning to streamline developer assessment and advance software engineering methodologies.
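
To make the task concrete, the following is a minimal sketch of multi-class classification over source code. It is not the authors' pipeline: it swaps in a multinomial naive Bayes classifier over identifier tokens as a stand-in for the neural and CodeBERT-based models described above, and the skill labels (spring, android, jdbc), the toy Java snippets, and the function names tokenize, train, and predict are all hypothetical, chosen only for illustration.

```python
import math
import re
from collections import Counter, defaultdict

# Hypothetical skill labels and toy Java snippets; NOT the study's dataset.
TRAIN = [
    ("spring", '@RestController public class UserApi { @GetMapping("/users") List<User> all() { return repo.findAll(); } }'),
    ("spring", "@Service public class OrderService { @Autowired OrderRepository repo; }"),
    ("android", "public class MainActivity extends AppCompatActivity { protected void onCreate(Bundle b) { setContentView(R.layout.main); } }"),
    ("android", "Intent intent = new Intent(this, DetailActivity.class); startActivity(intent);"),
    ("jdbc", "Connection c = DriverManager.getConnection(url); PreparedStatement ps = c.prepareStatement(sql);"),
    ("jdbc", "ResultSet rs = ps.executeQuery(); while (rs.next()) { rs.getString(1); }"),
]


def tokenize(code):
    """Split source text into lowercase identifier-like tokens."""
    return [t.lower() for t in re.findall(r"[A-Za-z_][A-Za-z0-9_]*", code)]


def train(samples):
    """Count documents per class and token occurrences per class."""
    class_docs = Counter()
    token_counts = defaultdict(Counter)
    vocab = set()
    for label, code in samples:
        tokens = tokenize(code)
        class_docs[label] += 1
        token_counts[label].update(tokens)
        vocab.update(tokens)
    return class_docs, token_counts, vocab


def predict(model, code):
    """Return the label with the highest log-posterior (add-one smoothing)."""
    class_docs, token_counts, vocab = model
    total_docs = sum(class_docs.values())
    best_label, best_score = None, -math.inf
    for label in class_docs:
        score = math.log(class_docs[label] / total_docs)
        denom = sum(token_counts[label].values()) + len(vocab)
        for tok in tokenize(code):
            if tok in vocab:  # tokens never seen in training are ignored
                score += math.log((token_counts[label][tok] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


model = train(TRAIN)
print(predict(model, "PreparedStatement ps = conn.prepareStatement(q); ResultSet rs = ps.executeQuery();"))
```

A transfer-learning variant would replace the token counts with representations from a pre-trained code model and train a classification head on top; the sketch above only illustrates the input/output shape of the problem (source file in, one skill label out).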
