
Software Repositories and Machine Learning Research in Cyber Security: Discussions


Too Long; Didn't Read

A paper on how machine learning and topic modeling enhance early cyber threat detection in software development, leveraging CAPEC and CVE repositories.

This paper is available on arXiv under a CC 4.0 license.

Authors:

(1) Mounika Vanamala, Department of Computer Science, University of Wisconsin-Eau Claire, United States;

(2) Keith Bryant, Department of Computer Science, University of Wisconsin-Eau Claire, United States;

(3) Alex Caravella, Department of Computer Science, University of Wisconsin-Eau Claire, United States.

Abstract & Introduction

Discussions

Conclusions, Acknowledgment, and References

Discussions

The semantics of words play a crucial role in categorizing words correctly through ML. Two different words can be preprocessed into the same token, which can lead to inaccurate classification. For example, preprocessing maps both desert and deserted to desert, so the meaning of deserted is lost. For an ML model to make recommendations about relevant vulnerabilities using the CAPEC database, it must therefore be effective at semantic analysis. The next consideration is whether to implement an unsupervised, supervised, or semi-supervised ML model. The goal of this research is to compare keywords from an SRS document against the keywords of CAPEC vulnerabilities.
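
As a minimal sketch of the preprocessing issue described above, the snippet below uses NLTK's PorterStemmer (the choice of stemmer is an assumption, not something the paper specifies) to show how desert and deserted collapse into the same token:

```python
# Hypothetical illustration of stemming collapsing distinct words.
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
for word in ["desert", "deserted"]:
    print(word, "->", stemmer.stem(word))

# Both lines print "desert": the sense of "deserted" (abandoned) is lost,
# which can mislead keyword matching against CAPEC entries.
```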


Unsupervised Machine Learning (ML) algorithms are primarily useful for segregating data into clusters, uncovering underlying relationships in the data, and reducing dimensionality. Dimensionality reduction, for instance, becomes a valuable tool when dealing with an extensive dataset like CAPEC, since it streamlines the data while preserving its structure.
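
As an illustrative sketch rather than the paper's implementation, the snippet below vectorizes a few invented CAPEC-style descriptions, reduces their dimensionality with truncated SVD, and clusters the result; the corpus, component count, and cluster count are all assumptions:

```python
# Hypothetical sketch: unsupervised dimensionality reduction and clustering
# of CAPEC-style attack-pattern descriptions.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.cluster import KMeans

capec_docs = [
    "Adversary injects SQL commands through a vulnerable web form.",
    "Attacker embeds malicious script in user-supplied page content.",
    "Phishing message lures the victim into revealing credentials.",
    "Spoofed website captures usernames and passwords.",
]

# High-dimensional sparse vectors are compressed while preserving structure.
tfidf = TfidfVectorizer(stop_words="english").fit_transform(capec_docs)
reduced = TruncatedSVD(n_components=2, random_state=0).fit_transform(tfidf)

# K-means then groups the reduced vectors into like clusters.
clusters = KMeans(n_clusters=2, random_state=0, n_init=10).fit_predict(reduced)
print(clusters)
```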


In the realm of text analysis, research on Latent Dirichlet Allocation (LDA) found that it requires substantial adaptation to achieve satisfactory outcomes, largely because of its semantic limitations. The Latent Semantic Analysis (LSA) algorithm, by contrast, is designed to capture semantics and establish connections between the vectors into which words are segmented. LSA has frequently been coupled with techniques such as Singular Value Decomposition (SVD) or other, more elaborate algorithms to enhance its effectiveness. Evaluating unsupervised methods in general is challenging because there are no well-defined metrics for model accuracy, and this lack of clear metrics makes it harder to judge the quality of the results that unsupervised ML approaches produce.
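
A hedged sketch of the LSA-plus-SVD idea discussed above: TF-IDF vectors of invented CAPEC-style descriptions are projected into a low-dimensional latent space with scikit-learn's TruncatedSVD, and an SRS-style keyword phrase is compared to them by cosine similarity. The corpus, query, and component count are illustrative assumptions:

```python
# Hypothetical LSA sketch: SVD over TF-IDF vectors, then cosine similarity
# to relate an SRS-style query to CAPEC-style descriptions.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics.pairwise import cosine_similarity

capec_docs = [
    "Adversary injects SQL statements through unsanitized form input.",
    "Attacker intercepts unencrypted traffic to capture session tokens.",
    "Malicious actor sends phishing email to harvest login credentials.",
]

vectorizer = TfidfVectorizer(stop_words="english")
tfidf = vectorizer.fit_transform(capec_docs)

# LSA applies SVD to the TF-IDF matrix; rows become latent semantic vectors.
svd = TruncatedSVD(n_components=2, random_state=0)
latent_docs = svd.fit_transform(tfidf)

# Project an SRS keyword phrase into the same latent space and compare.
query = ["user login form accepts raw database input"]
latent_query = svd.transform(vectorizer.transform(query))
print(cosine_similarity(latent_query, latent_docs))
```

The highest-scoring CAPEC entries in such a comparison would be the candidates an LSA-based recommender could surface for the SRS document.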


Supervised ML is a less complex process and requires fewer tools than unsupervised ML (IBM, 2019). It uses a labeled training dataset and validation techniques to reach accurate results more quickly than unsupervised ML, which instead clusters objects into like groups identified by the algorithm. The largest limitation of supervised ML is the need to obtain a training dataset to prepare the implemented algorithm. Supervised ML is also significantly better at producing metrics for the accuracy of its results.
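
For contrast, a minimal supervised sketch, assuming a small hand-labeled set of attack-pattern descriptions (the labels and classifier choice are assumptions, not the paper's method). A train/validation split makes the accuracy metric explicit in a way the unsupervised approaches above do not:

```python
# Hypothetical supervised sketch: labeled attack-pattern text, a
# train/validation split, and an explicit accuracy metric.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.pipeline import make_pipeline

texts = [
    "SQL injection through unsanitized login input",
    "Malicious SQL statements executed against the database",
    "Cross-site scripting in a comment field",
    "Script injection into user-generated page content",
    "Phishing email harvesting user credentials",
    "Spoofed login page steals user passwords",
    "Social engineering call tricks staff into revealing secrets",
    "Fraudulent message lures victim into sharing a password",
]
labels = ["injection"] * 4 + ["social_engineering"] * 4

X_train, X_val, y_train, y_val = train_test_split(
    texts, labels, test_size=0.25, random_state=0, stratify=labels
)

model = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
model.fit(X_train, y_train)

# Held-out labels make the model's accuracy straightforward to quantify.
print("validation accuracy:", accuracy_score(y_val, model.predict(X_val)))
```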