This paper is under CC 4.0 license. available on arxiv Authors: (1) Mounika Vanamala, Department of Computer Science, University of Wisconsin-Eau Claire, United States; (2) Keith Bryant, Department of Computer Science, University of Wisconsin-Eau Claire, United States; (3) Alex Caravella, Department of Computer Science, University of Wisconsin-Eau Claire, United States. Table of Links Abstract & Introduction Discussions Conclusions, Acknowledgment, and References Conclusion Upon recognizing the significance of cyber security vulnerability controls during the software requirement phase, the CAPEC software vulnerability repository emerged as the most practical repository for this study. The arrangement of attack patterns thus facilitates precise identification and seamless referral back to CAPEC for recommended defense strategies. We define and elaborate on topic modeling, as well as unsupervised and supervised ML methods, showcasing recent research instances and the applicability of these approaches. As our research continues, our efforts will involve the implementation of supervised machine learning. The CAPEC repository provides a prelabeled dataset, a valuable asset for training data set implementation. Supervised ML offers the added benefit of proficiently utilizing metrics to fine-tune the ML process, thus enabling thorough evaluation and process enhancement. A training set for the SRS document must either be crafted or located for supervised ML execution. Given the absence of a comparable research framework employing supervised ML, our future endeavors will assess and compare results stemming from Naïve Bayes and RF ML methodologies. Naïve Bayes showcases statistical prowess across both large and small data sets, making it suitable for the modest data set of SRS documents as well as the larger data set encompassing CAPEC Vulnerabilities. RF's capacity to counteract overfitting aligns well with the intricate data from CAPEC. The algorithm returning the most accurate recommendations for CAPEC attack patterns from an SRS document will be harnessed to deploy an automated tool for result processing and visualization. Acknowledgment Funding Information Author’s Contributions Acquisition of data and analysis and interpretation of data and content written. Keith Bryant and Alex Caravella: Conception and design of the article, intellectual content generation, critically reviewed the article. Keith Bryant, Alex Caravella, and Mounika Vanamala: Contribution to intellectual content ideation and reviewed the article along with the coordination for publication. Mounika Vanamala: Ethics This article is original and contains unpublished material. The corresponding author confirms that all of the other authors have read and approved the manuscript and that no ethical issues are involved. References Al-Sabahi, K., Zuping, Z., & Kang, Y. (2018). Latent semantic analysis approach for document summarization based on word embeddings. arXiv preprint arXiv:1807.02748. https://doi.org/10.3837/tiis.2019.01.015 Alyami, H., Nadeem, M., Alharbi, A., Alosaimi, W., Ansari, M. T. J., Pandey, D., ... & Khan, R. A. (2021). The evaluation of software security through quantum computing techniques: A durability perspective. Applied Sciences, 11(24), 11784. https://doi.org/10.3390/app112411784 Asim, M. N., Ghani, M. U., Ibrahim, M. A., Mahmood, W., Dengel, A., & Ahmed, S. (2021). Benchmarking performance of machine and deep learning-based methodologies for Urdu text document classification. Neural Computing and Applications, 33, 5437-5469. https://doi.org/10.1007/s00521-020-05321-8 Bedi, G. (2018). A guide to Text Classification (NLP) using SVM and Naive Bayes with Python. Medium, Nov. Bellaouar, S., Bellaouar, M. M., & Ghada, I. E. (2021, February). Topic modeling: Comparison of LSA and LDA on scientific publications. In 2021 4th International Conference on Data Storage and Data Engineering (pp. 59-64). https://doi.org/10.1145/3456146.3456156 CISA. (2021). c? | CISA. https://www.cisa.gov/uscert/ncas/tips/ST04-001 CVE. (2022). https://cve.mitre.org Delli, U., & Chang, S. (2018). Automated process monitoring in 3D printing using supervised machine learning. Procedia Manufacturing, 26, 865-870. https://doi.org/10.1016/j.promfg.2018.07.111 Guo, Y., & Li, J. (2021). Distributed Latent Dirichlet Allocation on Streams. ACM Transactions on Knowledge Discovery from Data (TKDD), 16(1), 1-20. https://doi.org/10.1145/3451528 Prasad, S. G., Badrinarayanan, M. K., & Sharmila, V. C. (2022). Efficacy and Security Effectiveness: Key Parameters in Evaluation of Network Security. International Journal of Performability Engineering, 18(4), 282. https://doi.org/10.23940/ijpe.22.04.p6.282288 IBM. (2019). What is machine learning? https://www.ibm.com/topics/machinelearning?lnk=fle Mallet, J., Pryor, L., Dave, R., Seliya, N., Vanamala, M., & Sowells-Boone, E. (2022, March). Hold on and swipe: A touch-movement based continuous authentication schema based on machine learning. In 2022 Asia Conference on Algorithms, Computing and Machine Learning (CACML) (pp. 442-447). IEEE. https://doi.org/10.1109/CACML55074.2022.00081 Kanakogi, K., Washizaki, H., Fukazawa, Y., Ogata, S., Okubo, T., Kato, T., ... & Yoshioka, N. (2022). Comparative Evaluation of NLP-Based Approaches for Linking CAPEC Attack Patterns from CVE Vulnerability Information. Applied Sciences, 12(7), 3400. https://doi.org/10.3390/app12073400 Kim, D., & Im, T. (2022). A Systematic Review of Virtual Reality-Based Education Research Using Latent Dirichlet Allocation: Focus on Topic Modeling Technique. Mobile Information Systems, 2022. https://doi.org/10.1155/2022/1201852 Krzeszewska, U., Poniszewska-Marańda, A., & Ochelska-Mierzejewska, J. (2022). Systematic comparison of vectorization methods in classification context. Applied Sciences, 12(10), 5119. https://doi.org/10.3390/app12105119 León-Paredes, G. A., Barbosa-Santillán, L. I., & SánchezEscobar, J. J. (2017). A heterogeneous system based on latent semantic analysis using GPU and multiCPU. Scientific Programming, 2017. https://doi.org/10.1155/2017/8131390 Livingston, F. (2005). Implementation of Breiman’s random forest machine learning algorithm. ECE591Q Machine Learning Journal Paper, 1-13. Macsai, D. 2012. The most important company you’ve never heard of. 1 Minute Read. Fast Company. https://www.fastcompany.com/3017927/30mitre McAllister, P., Zheng, H., Bond, R., & Moorhead, A. (2018). Combining deep residual neural network features with supervised machine learning algorithms to classify diverse food image datasets. Computers in Biology and Medicine, 95, 217-233. https://doi.org/10.1016/j.compbiomed.2018.02.008 Mounika, V., Yuan, X., & Bandaru, K. (2019, December). Analyzing CVE database using unsupervised topic modelling. In 2019 International Conference on Computational Science and Computational Intelligence (CSCI) (pp. 72-77). IEEE. https://doi.org/10.1109/CSCI49370.2019.00019 MITRE ATT&CK®. (2022). https://attack.mitre.org Mohamed, A. E. (2017). Comparative study of four supervised machine learning techniques for classification. International Journal of Applied, 7(2), 1-15. https://www.ijastnet.com/journal/index/859 NIST. (2022). About NIST. https://www.nist.gov/about-nist Prakash, A., Singh, N. K., & Saha, S. K. (2022). Automatic extraction of similar poetry for study of literary texts: An experiment on Hindi poetry. ETRI Journal, 44(3), 413-425. https://doi.org/10.4218/etrij.2019-0396 Rahman, A. S., Shamrat, F. J. M., Tasnim, Z., Roy, J., & Hossain, S. A. (2019). A comparative study on liver disease prediction using supervised machine learning algorithms. International Journal of Scientific & Technology Research, 8(11), 419-422. http://www.ijstr.org/final-print/nov2019/AComparative-Study-On-Liver-Disease-PredictionUsing-Supervised-Machine-LearningAlgorithms.pdf Rustam, F., A. Reshi, S. Mehmood, S. Ullah, B. On, W. Aslam and G. Choi. 2020. COVID-19 Future Forecasting Using Supervised Machine Learning Models. IEEE Access, pp: 101489-99. https://doi.org/10.1109/ACCESS.2020.2997311 Sanguri, K., Bhuyan, A., & Patra, S. (2020). A semantic similarity adjusted document co-citation analysis: a case of tourism supply chain. Scientometrics, 125(1), 233-269. https://doi.org/10.1007/s11192-020-03608-0 Schrider, D. R., & Kern, A. D. (2018). Supervised machine learning for population genetics: a new paradigm. Trends in Genetics, 34(4), 301-312. https://doi.org/10.1016/j.tig.2017.12.005 Shalev-Shwartz, S., & Ben-David, S. (2014). Understanding machine learning: From theory to algorithms. Cambridge university press. https://www.cs.huji.ac.il/~shais/UnderstandingMach ineLearning/ Sharma, C., Sharma, S., & Sakshi. (2022). Latent DIRICHLET allocation (LDA) based information modelling on BLOCKCHAIN technology: A review of trends and research patterns used in integration. Multimedia Tools and Applications, 81(25), 36805-36831. https://doi.org/10.1007/s11042-022-13500-z Siddiqui, N., Dave, R., Vanamala, M., & Seliya, N. (2022). Machine and deep learning applications to mouse dynamics for continuous user authentication. Machine Learning and Knowledge Extraction, 4(2), 502-518. https://doi.org/10.3390/make4020023 Sweeney, E. M., Vogelstein, J. T., Cuzzocreo, J. L., Calabresi, P. A., Reich, D. S., Crainiceanu, C. M., & Shinohara, R. T. (2014). A comparison of supervised machine learning algorithms and feature vectors for MS lesion segmentation using multimodal structural MRI. PloS One, 9(4), e95753. https://doi.org/10.1371/journal.pone.0095753 Uddin, S., Khan, A., Hossain, M. E., & Moni, M. A. (2019). Comparing different supervised machine learning algorithms for disease prediction. BMC Medical Informatics and Decision Making, 19(1), 1-16. https://doi.org/10.1186/s12911-019-1004-8 Ullah, F., Wang, J., Farhan, M., Jabbar, S., Naseer, M. K., & Asif, M. (2020). LSA based smart assessment methodology for SDN infrastructure in IoT environment. International Journal of Parallel Programming, 48, 162-177. https://doi.org/ 10.1007/s10766-018-0570-1 Ullah, F., Jabbar, S., & Mostarda, L. (2021). An intelligent decision support system for software plagiarism detection in academia. International Journal of Intelligent Systems, 36(6), 2730-2752 https://doi.org/10.1002/int.22399 Vanamala, M., Gilmore, J., Yuan, X., & Roy, K. (2020a, December). Recommending attack patterns for software requirements document. In 2020 International Conference on Computational Science and Computational Intelligence (CSCI) (pp. 1813-1818). IEEE. https://doi.org/10.1109/CSCI51800.2020.00334 Vanamala, M., Yuan, X., & Roy, K. (2020b, August). Topic modeling and classification of Common Vulnerabilities and Exposures database. In 2020 International Conference on Artificial Intelligence, Big Data, Computing and Data Communication Systems (icABCD) (pp. 1-5). IEEE. https://doi.org/10.1109/icABCD49160.2020.9183814 Zhu, L., He, Y., & Zhou, D. (2020). A neural generative model for joint learning topics and topic-specific word embeddings. Transactions of the Association for Computational Linguistics, 8, 471-485. https://doi.org/10.1162/tacl_a_00326

Software Repositories and Machine Learning Research in Cyber Security: Conclusions, Acknowledgment

About Author

Comments

TOPICS

THIS ARTICLE WAS FEATURED IN

Related Stories

15 Common Types of Unethical Behavior Found in Open-Source Projects

108 Stories To Learn About Cybersecurity Tips

10 Cybersecurity Books Every Business Owner Should Read

10 Best Practices for Securing Your API

122 Stories To Learn About Cybercrime

3 Cybersecurity Priorities for 2021: Threat Fatigue; Remote Work; Budget

15 Common Types of Unethical Behavior Found in Open-Source Projects

108 Stories To Learn About Cybersecurity Tips

10 Cybersecurity Books Every Business Owner Should Read

10 Best Practices for Securing Your API

122 Stories To Learn About Cybercrime

3 Cybersecurity Priorities for 2021: Threat Fatigue; Remote Work; Budget

Light-Mode

Classic

Newspaper

Minty

Dark-Mode

Neon Noir

Minty

HN StartUps