Software Repositories and Machine Learning Research in Cyber Security: Conclusions, Acknowledgment

Too Long; Didn't Read

A paper on how machine learning and topic modeling enhance early cyber threat detection in software development, leveraging CAPEC and CVE repositories.

This paper is available on arXiv under a CC 4.0 license.

Authors:

(1) Mounika Vanamala, Department of Computer Science, University of Wisconsin-Eau Claire, United States;

(2) Keith Bryant, Department of Computer Science, University of Wisconsin-Eau Claire, United States;

(3) Alex Caravella, Department of Computer Science, University of Wisconsin-Eau Claire, United States.

Conclusion

Because cyber security vulnerability controls matter as early as the software requirements phase, we identified the CAPEC software vulnerability repository as the most practical repository for this study. CAPEC's arrangement of attack patterns supports precise identification of a pattern and direct referral back to CAPEC for its recommended defense strategies. In this paper we defined and elaborated on topic modeling and on unsupervised and supervised machine learning (ML) methods, presenting recent research examples and the applicability of each approach.

As our research continues, we will implement supervised ML. The CAPEC repository provides a prelabeled data set, which is a valuable source of training data. Supervised ML also allows evaluation metrics to be used to fine-tune the ML process, enabling thorough evaluation and improvement. A training set for the software requirements specification (SRS) documents must still be either created or located before supervised ML can be applied to them.

Because we found no comparable research framework that employs supervised ML, our future work will assess and compare results from the Naïve Bayes and Random Forest (RF) methods. Naïve Bayes performs well on both large and small data sets, making it suitable for the modest data set of SRS documents as well as the larger data set of CAPEC attack patterns. RF's resistance to overfitting suits the intricate data in CAPEC. The algorithm that returns the most accurate CAPEC attack pattern recommendations for an SRS document will be used in an automated tool for processing and visualizing results.
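To make the planned supervised approach concrete, the following is a minimal sketch, not the authors' implementation. It shows a multinomial Naïve Bayes text classifier that ranks attack patterns for a requirement sentence and reports accuracy as an evaluation metric. It uses only the Python standard library. The training strings, the requirement sentences, and the names NaiveBayes, rank, and accuracy are hypothetical and chosen for illustration; real CAPEC entries and SRS documents are far longer, and the Random Forest comparison is not shown.

```python
import math
from collections import Counter, defaultdict

# Hypothetical toy data: (attack pattern label, short description).
# These strings are illustrative only and are not taken from CAPEC.
TRAIN = [
    ("SQL Injection", "attacker injects sql statements through user input into database query"),
    ("SQL Injection", "unsanitized input alters sql query sent to database"),
    ("Cross-Site Scripting", "attacker injects script into web page viewed by browser"),
    ("Cross-Site Scripting", "malicious script runs in victim browser via web page input"),
    ("Brute Force", "attacker guesses password by repeated login attempts"),
    ("Brute Force", "repeated login attempts try many password values"),
]


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately simple tokenizer)."""
    return text.lower().split()


class NaiveBayes:
    """Multinomial Naive Bayes with Laplace (add-one) smoothing."""

    def fit(self, samples):
        self.class_docs = Counter()              # documents per label
        self.word_counts = defaultdict(Counter)  # token counts per label
        self.vocab = set()
        for label, text in samples:
            self.class_docs[label] += 1
            for tok in tokenize(text):
                self.word_counts[label][tok] += 1
                self.vocab.add(tok)
        self.total_docs = sum(self.class_docs.values())
        return self

    def log_scores(self, text):
        # Tokens never seen in training are ignored.
        tokens = [t for t in tokenize(text) if t in self.vocab]
        scores = {}
        for label, n_docs in self.class_docs.items():
            total = sum(self.word_counts[label].values())
            score = math.log(n_docs / self.total_docs)  # log prior
            for tok in tokens:
                count = self.word_counts[label][tok]
                score += math.log((count + 1) / (total + len(self.vocab)))
            scores[label] = score
        return scores

    def rank(self, text):
        """Return labels ordered from most to least likely."""
        scores = self.log_scores(text)
        return sorted(scores, key=scores.get, reverse=True)


def accuracy(model, samples):
    """Fraction of samples whose top-ranked label matches the true label."""
    hits = sum(model.rank(text)[0] == label for label, text in samples)
    return hits / len(samples)


model = NaiveBayes().fit(TRAIN)

# Hypothetical labeled requirement sentences standing in for an SRS test set.
REQUIREMENTS = [
    ("SQL Injection", "the system shall store user input in the database using a query"),
    ("Brute Force", "the user shall login with a password"),
]

ranking = model.rank(REQUIREMENTS[0][1])
held_out_accuracy = accuracy(model, REQUIREMENTS)
print(ranking)
print(held_out_accuracy)
```

In a fuller comparison, the same labeled requirement sentences would be scored by each candidate algorithm, and the one with the better metric would feed the recommendation tool described above.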

Acknowledgment

Funding Information


Author’s Contributions


Keith Bryant and Alex Caravella: Data acquisition, data analysis and interpretation, and writing of content.


Keith Bryant, Alex Caravella, and Mounika Vanamala: Conception and design of the article, generation of intellectual content, and critical review of the article.


Mounika Vanamala: Contribution to the ideation of intellectual content, review of the article, and coordination of publication.

Ethics

This article is original and contains unpublished material. The corresponding author confirms that all of the other authors have read and approved the manuscript and that no ethical issues are involved.

References

Al-Sabahi, K., Zuping, Z., & Kang, Y. (2018). Latent semantic analysis approach for document summarization based on word embeddings. arXiv preprint arXiv:1807.02748. https://doi.org/10.3837/tiis.2019.01.015


Alyami, H., Nadeem, M., Alharbi, A., Alosaimi, W., Ansari, M. T. J., Pandey, D., ... & Khan, R. A. (2021). The evaluation of software security through quantum computing techniques: A durability perspective. Applied Sciences, 11(24), 11784. https://doi.org/10.3390/app112411784


Asim, M. N., Ghani, M. U., Ibrahim, M. A., Mahmood, W., Dengel, A., & Ahmed, S. (2021). Benchmarking performance of machine and deep learning-based methodologies for Urdu text document classification. Neural Computing and Applications, 33, 5437-5469. https://doi.org/10.1007/s00521-020-05321-8


Bedi, G. (2018). A guide to Text Classification (NLP) using SVM and Naive Bayes with Python. Medium, Nov.


Bellaouar, S., Bellaouar, M. M., & Ghada, I. E. (2021, February). Topic modeling: Comparison of LSA and LDA on scientific publications. In 2021 4th International Conference on Data Storage and Data Engineering (pp. 59-64). https://doi.org/10.1145/3456146.3456156


CISA. (2021). What is cybersecurity? | CISA. https://www.cisa.gov/uscert/ncas/tips/ST04-001


CVE. (2022). https://cve.mitre.org


Delli, U., & Chang, S. (2018). Automated process monitoring in 3D printing using supervised machine learning. Procedia Manufacturing, 26, 865-870. https://doi.org/10.1016/j.promfg.2018.07.111


Guo, Y., & Li, J. (2021). Distributed Latent Dirichlet Allocation on Streams. ACM Transactions on Knowledge Discovery from Data (TKDD), 16(1), 1-20. https://doi.org/10.1145/3451528


Prasad, S. G., Badrinarayanan, M. K., & Sharmila, V. C. (2022). Efficacy and Security Effectiveness: Key Parameters in Evaluation of Network Security. International Journal of Performability Engineering, 18(4), 282. https://doi.org/10.23940/ijpe.22.04.p6.282288


IBM. (2019). What is machine learning? https://www.ibm.com/topics/machinelearning?lnk=fle


Mallet, J., Pryor, L., Dave, R., Seliya, N., Vanamala, M., & Sowells-Boone, E. (2022, March). Hold on and swipe: A touch-movement based continuous authentication schema based on machine learning. In 2022 Asia Conference on Algorithms, Computing and Machine Learning (CACML) (pp. 442-447). IEEE. https://doi.org/10.1109/CACML55074.2022.00081


Kanakogi, K., Washizaki, H., Fukazawa, Y., Ogata, S., Okubo, T., Kato, T., ... & Yoshioka, N. (2022). Comparative Evaluation of NLP-Based Approaches for Linking CAPEC Attack Patterns from CVE Vulnerability Information. Applied Sciences, 12(7), 3400. https://doi.org/10.3390/app12073400


Kim, D., & Im, T. (2022). A Systematic Review of Virtual Reality-Based Education Research Using Latent Dirichlet Allocation: Focus on Topic Modeling Technique. Mobile Information Systems, 2022. https://doi.org/10.1155/2022/1201852


Krzeszewska, U., Poniszewska-Marańda, A., & Ochelska-Mierzejewska, J. (2022). Systematic comparison of vectorization methods in classification context. Applied Sciences, 12(10), 5119. https://doi.org/10.3390/app12105119


León-Paredes, G. A., Barbosa-Santillán, L. I., & Sánchez-Escobar, J. J. (2017). A heterogeneous system based on latent semantic analysis using GPU and multi-CPU. Scientific Programming, 2017. https://doi.org/10.1155/2017/8131390


Livingston, F. (2005). Implementation of Breiman’s random forest machine learning algorithm. ECE591Q Machine Learning Journal Paper, 1-13.


Macsai, D. (2012). The most important company you’ve never heard of. Fast Company. https://www.fastcompany.com/3017927/30mitre


McAllister, P., Zheng, H., Bond, R., & Moorhead, A. (2018). Combining deep residual neural network features with supervised machine learning algorithms to classify diverse food image datasets. Computers in Biology and Medicine, 95, 217-233. https://doi.org/10.1016/j.compbiomed.2018.02.008


Mounika, V., Yuan, X., & Bandaru, K. (2019, December). Analyzing CVE database using unsupervised topic modelling. In 2019 International Conference on Computational Science and Computational Intelligence (CSCI) (pp. 72-77). IEEE. https://doi.org/10.1109/CSCI49370.2019.00019


MITRE ATT&CK®. (2022). https://attack.mitre.org


Mohamed, A. E. (2017). Comparative study of four supervised machine learning techniques for classification. International Journal of Applied, 7(2), 1-15. https://www.ijastnet.com/journal/index/859


NIST. (2022). About NIST. https://www.nist.gov/about-nist


Prakash, A., Singh, N. K., & Saha, S. K. (2022). Automatic extraction of similar poetry for study of literary texts: An experiment on Hindi poetry. ETRI Journal, 44(3), 413-425. https://doi.org/10.4218/etrij.2019-0396


Rahman, A. S., Shamrat, F. J. M., Tasnim, Z., Roy, J., & Hossain, S. A. (2019). A comparative study on liver disease prediction using supervised machine learning algorithms. International Journal of Scientific & Technology Research, 8(11), 419-422. http://www.ijstr.org/final-print/nov2019/AComparative-Study-On-Liver-Disease-PredictionUsing-Supervised-Machine-LearningAlgorithms.pdf


Rustam, F., Reshi, A., Mehmood, S., Ullah, S., On, B., Aslam, W., & Choi, G. (2020). COVID-19 future forecasting using supervised machine learning models. IEEE Access, 101489-101499. https://doi.org/10.1109/ACCESS.2020.2997311


Sanguri, K., Bhuyan, A., & Patra, S. (2020). A semantic similarity adjusted document co-citation analysis: a case of tourism supply chain. Scientometrics, 125(1), 233-269. https://doi.org/10.1007/s11192-020-03608-0


Schrider, D. R., & Kern, A. D. (2018). Supervised machine learning for population genetics: a new paradigm. Trends in Genetics, 34(4), 301-312. https://doi.org/10.1016/j.tig.2017.12.005


Shalev-Shwartz, S., & Ben-David, S. (2014). Understanding machine learning: From theory to algorithms. Cambridge University Press. https://www.cs.huji.ac.il/~shais/UnderstandingMachineLearning/


Sharma, C., Sharma, S., & Sakshi. (2022). Latent DIRICHLET allocation (LDA) based information modelling on BLOCKCHAIN technology: A review of trends and research patterns used in integration. Multimedia Tools and Applications, 81(25), 36805-36831. https://doi.org/10.1007/s11042-022-13500-z


Siddiqui, N., Dave, R., Vanamala, M., & Seliya, N. (2022). Machine and deep learning applications to mouse dynamics for continuous user authentication. Machine Learning and Knowledge Extraction, 4(2), 502-518. https://doi.org/10.3390/make4020023


Sweeney, E. M., Vogelstein, J. T., Cuzzocreo, J. L., Calabresi, P. A., Reich, D. S., Crainiceanu, C. M., & Shinohara, R. T. (2014). A comparison of supervised machine learning algorithms and feature vectors for MS lesion segmentation using multimodal structural MRI. PloS One, 9(4), e95753. https://doi.org/10.1371/journal.pone.0095753


Uddin, S., Khan, A., Hossain, M. E., & Moni, M. A. (2019). Comparing different supervised machine learning algorithms for disease prediction. BMC Medical Informatics and Decision Making, 19(1), 1-16. https://doi.org/10.1186/s12911-019-1004-8


Ullah, F., Wang, J., Farhan, M., Jabbar, S., Naseer, M. K., & Asif, M. (2020). LSA based smart assessment methodology for SDN infrastructure in IoT environment. International Journal of Parallel Programming, 48, 162-177. https://doi.org/10.1007/s10766-018-0570-1


Ullah, F., Jabbar, S., & Mostarda, L. (2021). An intelligent decision support system for software plagiarism detection in academia. International Journal of Intelligent Systems, 36(6), 2730-2752. https://doi.org/10.1002/int.22399


Vanamala, M., Gilmore, J., Yuan, X., & Roy, K. (2020a, December). Recommending attack patterns for software requirements document. In 2020 International Conference on Computational Science and Computational Intelligence (CSCI) (pp. 1813-1818). IEEE. https://doi.org/10.1109/CSCI51800.2020.00334


Vanamala, M., Yuan, X., & Roy, K. (2020b, August). Topic modeling and classification of Common Vulnerabilities and Exposures database. In 2020 International Conference on Artificial Intelligence, Big Data, Computing and Data Communication Systems (icABCD) (pp. 1-5). IEEE. https://doi.org/10.1109/icABCD49160.2020.9183814


Zhu, L., He, Y., & Zhou, D. (2020). A neural generative model for joint learning topics and topic-specific word embeddings. Transactions of the Association for Computational Linguistics, 8, 471-485. https://doi.org/10.1162/tacl_a_00326