Fraud Anomaly Model: A Powerful ML Tool for Detecting Unusual Activity

by Phuc Tran, July 28th, 2023

I - Introduction

The Fraud Anomaly Model, a sophisticated technique used in fraud detection, plays a crucial role in identifying suspicious patterns and data points that may indicate fraudulent activity.


The model is specifically designed to learn from historical data and leverage that knowledge to spot potential fraud in real time.


As fraudulent activities continue to evolve and grow in complexity, traditional rule-based methods and manual reviews fall short of keeping up with new fraud trends.


This article delves into the significance of the Fraud Anomaly Model and explores how ML is employed to tackle the challenges posed by fraudulent activities.

II - Understanding Fraud and Its Detection

Fraud, particularly in the context of chargebacks or potential chargebacks that result from unauthorized transactions, poses significant financial losses and security risks for businesses and individuals alike.

Current fraud structure of e-commerce companies

Traditional fraud prevention tools include rule-based systems, which offer flexibility for specific users or industries, and manual reviews by human analysts, which deliver high accuracy but lack scalability to handle large transaction volumes.


Additionally, fraud machine learning models, while scalable, often struggle to accurately detect new fraud patterns that have not been previously encountered (i.e., no historical chargebacks with similar patterns).


The Fraud Anomaly Model presents a solution to two major problems faced in fraud detection:

Two remaining problems

  • Detecting Fraud Trends Faster: The real challenge in fraud detection lies in promptly identifying trends in accepted transactions that may indicate fraudulent activity. Anomaly models leverage machine learning to efficiently analyze vast amounts of data, making it possible to detect unusual patterns quickly.


  • Spotting New Fraud Attacks Easier: As fraudsters continually adapt their tactics to evade detection, traditional models struggle to keep up with these novel attacks. Anomaly models excel at identifying new fraud attacks, even without historical chargeback data exhibiting similar patterns.

III - The Working Principles of the Fraud Anomaly Model

Anomaly models are based on the analysis of data points, aiming to identify patterns that deviate significantly from the norm. Each data point is assigned an anomaly score based on its dissimilarity from the rest of the data.


Higher anomaly scores indicate a higher likelihood of potential fraud, signaling the need for further investigation.

Key Steps in Implementing a Fraud Anomaly Model:

1. Dataset, Features, and Target - EDA

Exploratory Data Analysis (EDA) is an approach to visualizing, summarizing, and interpreting the information hidden in tabular (row-and-column) data. In this case, I take my sample dataset and visualize the results and their meaning (a minimal loading-and-plotting sketch follows the list below).


  • Dataset: A collection of instances; here, the dataset covers the full year of 2019.


  • Feature: A single column of data (a component of an observation). The model currently uses 55 features.


  • Target: Fraud is 1 and Non-Fraud is 0

Sample target visualization of the dataset
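
As a quick illustration, here is a minimal loading-and-plotting sketch in pandas; the file name and column names are hypothetical stand-ins, since the article's dataset is not public:

import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical: load the full-year-2019 transaction dataset.
df = pd.read_csv("transactions_2019.csv")

# Class balance of the target: 1 = Fraud, 0 = Non-Fraud.
# With a highly imbalanced target, expect the fraud share to be around 1%.
print(df["Target"].value_counts(normalize=True))

# Visualize the (imbalanced) target distribution.
df["Target"].value_counts().plot(kind="bar", title="Fraud (1) vs Non-Fraud (0)")
plt.show()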

HIGHLIGHTS:

  • The distribution of Target (Fraud or not) is highly imbalanced.


  • Fraud is more prevalent in some segments than in others, for example when the market is broken down by Affiliate or by Payment method.


  • During Exploratory Data Analysis (EDA), it is essential to examine the correlations between features, which become crucial factors when data scientists adjust the algorithm.

Feature correlations

For example, during the exploratory data analysis (EDA), I observed correlations among various features (a quick way to surface such pairs is sketched after the list below):


  • Features with high correlations (greater than or equal to 0.7):
    • OrderTotalAmount and FlightProductCost

    • TotalNumberOfLegs, TotalNumberOfInboundSegments, and NumberOfStopOvers

    • TotalNumberOfPassengers and TotalNumberOfAdults
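
A minimal sketch of how such pairs can be surfaced automatically, reusing the hypothetical df from the earlier sketch:

import numpy as np

# Absolute pairwise correlations between numeric features.
corr = df.select_dtypes("number").corr().abs()

# Keep each pair once (upper triangle) and report pairs with |r| >= 0.7.
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
high_pairs = upper.stack().loc[lambda s: s >= 0.7].sort_values(ascending=False)
print(high_pairs)  # e.g., (OrderTotalAmount, FlightProductCost)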


Additionally, the EDA delved into the numerical features to gain a better understanding of the dataset. Here are some examples:


  1. Total Amount EUR (Payment) within Fraud (1) and Non-Fraud (0):

    • The highest value is approximately €25,000.

    • Median Order value for non-fraudulent transactions: €480

    • Median Order Value for fraudulent transactions: €780


  2. Leadtime (Gap time from the purchase date to Flight departure date) within Fraud (1) and Non-Fraud (0):


    • The highest value is approximately 500 days, and the lowest value is 0 days.

    • Median lead time for non-fraudulent transactions: 30 days

    • Median lead time for fraudulent transactions: 8 days


By analyzing these correlations and numerical features, we can gain valuable insights into the dataset, which will aid in building an effective Fraud Anomaly Model.
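
For instance, the median comparisons above can be reproduced with a simple groupby, assuming hypothetical column names TotalAmountEUR and Leadtime:

# Median order value and lead time, split by fraud label.
# Expected pattern from the EDA: fraud orders are larger (~EUR 780 vs ~EUR 480)
# and booked much closer to departure (~8 days vs ~30 days).
print(df.groupby("Target")[["TotalAmountEUR", "Leadtime"]].median())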

Numerical feature EDA

This step serves to define and refine the selection of important feature variables that will be used in the model.

2. Model Training: Isolation Forest Model

The Isolation Forest algorithm is a popular choice for detecting anomalies in data. It creates a forest of decision trees, where each tree isolates a data point by randomly selecting a feature and generating split values.

The number of splits required to isolate a data point serves as its anomaly score. Lower split counts indicate higher anomaly scores, implying a more anomalous data point.

Isolate the inlier

Isolate the outlier

There is a tendency for an abnormal point in a dataset to be EASIER to separate from the rest of the sample than a normal point.


In order to isolate a data point, the algorithm recursively generates partitions on the sample by randomly selecting a feature and then randomly selecting a split value for the feature, between the minimum and maximum values allowed for that attribute.
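
A minimal training sketch, for example with scikit-learn's IsolationForest; the feature subset and hyperparameters below are illustrative assumptions, not the article's production settings:

from sklearn.ensemble import IsolationForest

# Illustrative subset of the 55 model features (hypothetical column names).
FEATURE_COLUMNS = ["TotalAmountEUR", "Leadtime", "TotalNumberOfPassengers"]
X = df[FEATURE_COLUMNS]

# contamination approximates the expected anomaly share (assumed ~1% here).
iso = IsolationForest(n_estimators=200, contamination=0.01, random_state=42)
iso.fit(X)

# score_samples: lower (more negative) = fewer splits needed = more anomalous.
# predict: -1 = anomaly (outlier), +1 = inlier.
scores = iso.score_samples(X)
flags = iso.predict(X)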

3. Model Testing and Prediction

The model's performance is evaluated using recall and precision metrics. Recall measures the fraud coverage rate (i.e., the percentage of actual fraud cases that the model flags as anomalies). Precision, or model accuracy, measures the proportion of flagged anomalies that are true fraud cases.


Some notes on fitting and evaluating the model (a minimal split-and-threshold sketch follows these notes):

  • Training data is used during the learning process to fit the model


  • Test data is used to provide an unbiased evaluation of the final model fit

Threshold setting
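
A minimal split-and-threshold sketch, continuing the hypothetical variables from the training sketch above; the 30% test split and the bottom-1% cut-off are assumptions used only to illustrate the idea:

import numpy as np
from sklearn.model_selection import train_test_split

# Hold out a test set, preserving the (imbalanced) fraud ratio.
X_train, X_test, y_train, y_test = train_test_split(
    df[FEATURE_COLUMNS], df["Target"],
    test_size=0.3, random_state=42, stratify=df["Target"],
)

# Fit on training data only, then score the unseen test data.
iso.fit(X_train)
test_scores = iso.score_samples(X_test)

# Flag the most anomalous 1% of test transactions (cut-off is illustrative).
threshold = np.percentile(test_scores, 1)
is_anomaly = test_scores <= threshold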

  • Recall (Fraud Coverage Rate): This metric measures the proportion of actual fraud cases that were correctly identified by the model. It is calculated as the ratio of the number of true fraud cases detected to the total number of actual fraud cases.


Recall = True Positives / (True Positives + False Negatives) = (Anomaly and Fraud count) / (Total Fraud count)


  • Precision (Model Accuracy): Precision measures the proportion of flagged anomalies that are true fraud cases out of all flagged anomalies. It helps assess the accuracy of the model in correctly classifying anomalies.


Precision = True Positives / (True Positives + False Positives) = (Anomaly and Fraud count) / (Total Anomaly count)


  • F1 Score: The F1 score is the harmonic mean of recall and precision. It provides a balance between the two metrics, offering a comprehensive evaluation of the model's overall performance.


F1 Score = 2 * (Precision * Recall) / (Precision + Recall)
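
These three metrics can be computed directly from the anomaly flags and the true labels, e.g. with scikit-learn, continuing the hypothetical variables from the sketches above:

from sklearn.metrics import precision_score, recall_score, f1_score

# y_test: true labels (1 = fraud); is_anomaly: the model's anomaly flags.
y_pred = is_anomaly.astype(int)

print("Recall   :", recall_score(y_test, y_pred))     # fraud coverage rate
print("Precision:", precision_score(y_test, y_pred))  # model accuracy
print("F1 score :", f1_score(y_test, y_pred))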


Dummy model comparison

Compare with a Dummy Classifier Model: To establish a baseline for comparison, the model's performance is compared with that of a Dummy Classifier, which generates predictions based on the class distribution of the training data (a minimal baseline sketch follows the bullets below).


  • For example, in a dataset with a fraud ratio of 0.9%, the Dummy Classifier will randomly classify 0.9% of the cases as anomalies.


  • The Fraud Anomaly Model is expected to outperform the Dummy Classifier in terms of recall, precision, and F1 score.
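
Such a baseline can be sketched with scikit-learn's DummyClassifier, whose "stratified" strategy predicts labels in proportion to the training class distribution:

from sklearn.dummy import DummyClassifier
from sklearn.metrics import f1_score

# Baseline that guesses fraud at roughly the training fraud rate (~0.9%).
dummy = DummyClassifier(strategy="stratified", random_state=42)
dummy.fit(X_train, y_train)

dummy_pred = dummy.predict(X_test)
print("Dummy F1:", f1_score(y_test, dummy_pred))
# The Fraud Anomaly Model should beat this baseline on recall, precision, and F1.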

By following these comprehensive evaluation steps, organizations can gain valuable insights into the effectiveness of their Fraud Anomaly Model, allowing them to make informed decisions and strengthen their fraud detection strategies.


IV - Conclusion

The Fraud Anomaly Model represents a powerful and scalable approach to combat the ever-evolving nature of fraudulent activities.


By harnessing the capabilities of machine learning, anomaly models can quickly detect fraud trends and identify new attack patterns that traditional rule-based systems and manual reviews might miss.

As fraud continues to pose a significant threat to businesses and consumers, the adoption of advanced fraud detection techniques like the Fraud Anomaly Model becomes increasingly vital in safeguarding financial interests and data security.