Home » Your Daily TechBytes » Anomaly Detection in Financial Transactions
Anomaly Detection in Financial Transactions

A case study to Shift from Unsupervised Learning to Supervised

In the financial data domain, whenever we have business goals to be met precisely as the stakeholder’s demand, there might be the chance that we don’t have the data domain aligned to those goals.

Sometimes we have data without labels or the outcomes that can be utilized in preparing the models. Several workarounds to address this problem totally depend upon the KPIs we are addressing. Some data experts believe that simple rule-based KPIs are fine with intense supervision of financial analysts can help and then, over time, improve the rules.

Some devise semi-supervised machine learning models and artificial intelligence solutions where they get a small portion of data verified and annotated, then use it to train the models. Suppose we don’t have any kind of supervision or annotation in the future. 

In that case, Anomaly detection algorithms along with supervised models can replace the supervised models temporarily quite for some time.  

What is Anomaly detection? 

Anomaly detection refers to the process of detecting unexpected, unprecedented, and rare events in a data set. These are commonly known as anomalous events. Finding data anomalies is helpful because it paints a clear picture of the data. It is a fundamental concept behind many applications, mainly in real-time apps. Spotting anomalies in health, infrastructure, and processes is crucial for security and fintech apps.  

Some notable Anomaly Detection algorithms: 

  • Isolation Forest (ISO Forest)

    It isolates observations through a random selection of features. And then randomly selects a split value in-between maximum and minimum values of the chosen feature. Recursive partitioning or division is presentable in the form of a tree structure. The total number of splits needed to isolate a sample equals the path length from the root to the terminating node. 

  • Cluster-based Local Outlier Factor (CBLOS)

    CBLOF takes data as the input and generates a cluster model through a clustering algorithm. It categorizes one cluster into several small and large clusters with the help of alpha and beta parameters. It then calculates the anomaly score by considering the cluster size of the point and its distance to the nearest large cluster. 

  • Histogram-based Outlier Detection (HBOS)

    It is another efficient yet unsupervised method for anomaly detection. HBOS calculates the degree of outlyingness based on feature independence and histograms.  

How to shift from Unsupervised model to Supervised? 

The method intends to separate the anomalies and separately create an anomaly model for every KPI so that we can create labels for every KPI. The model will tell us whether the specific transaction is risky in light of that KPI. The transaction can be difficult in light of multiple KPIs, or it can be difficult in one KPI and non-risky in another. 

Data preparation steps are as similar as common data science problems; that is why a standard layer should be prepared to be used in unsupervised model preparation and in supervised model preparation. Data cleaning to Feature extraction is generally a common layer, then unsupervised models and supervised models can utilize them to develop the models. 

Anomaly detection unsupervised model to supervised

Data Extraction and Preparation 

The data lies typically in databases or cloud platforms in different settings. One must design the data pulling job carefully to include the important columns. This is a very critical job for Data engineers as well as Data scientists. If we dive into this topic, our original focus of this blog will shift as it is a very challenging problem, and its details are overwhelming. 

Once the data is available to the data scientist, the first and foremost step is to realize the column types and their nature. There might be some columns that are usable for later stages, like data, time, etc., or maybe obliterated. One can then identify and correct the types.

Data cleaning includes cleaning ‘nan’ values, data imputation column to column basis, converting numeric to float, removing outliers, and performing data discretization if necessary. 

Some other steps: 

  1. Convert known category variables to categorical 
  2. Add ‘year’ and ‘month’ features for later usage 
  3. Run Pandas’ describe’ method to pull basic stats about the variables 
  4. Print top values and other information for variables 

Correlation Analysis 

Sample the data for further testing and skew measures. In this way, the distribution of the column can be accurately visualized and corrected if skewed not evenly.

Anomaly detection in correlation analysis

Run correlations of categorical and numeric features to identify and remove overly correlated features. We must carefully select the correlated columns. There might be a case where two columns correlated strongly; we might choose the wrong column which does not align with a business goal. Besides correlation analysis, columns must be selected according to the model’s requirement and the use case.

Anomaly detection in correlation analysis - dealing with textual features

Dealing with textual features. Here are the steps to follow: 

Fix the text features with standard tfidf vectorizer parameters¶ 

  • Review the vocabulary for abbreviations that can be normalized 
  • apply case, space, and punctuation normalizations 

Normalize reduced variable set based on correlation and skew 

  • Convert categorical features to encoded vectors for machine-readable input 
  • Convert numeric features to scaled versions so that numeric features all cover the same range of values between 0 and 1 

Also, run the Principal Components Analysis Explained Variance measure and graph to choose the component count cut-off 

Choose number of PCA components based on PCA explained variance intersection at 95% variance.

Anomaly detection

Identify the most weighted contributing feature and category from each PCA component. Also, you can always backtrack and see which columns are preferable to use in models. PCA has simplified this feature selection process by this mechanism.

PCA feature selection process

 Train Anomaly detection Algorithms 

For every KPI, train separate anomaly detection algorithms so that we can identify the different risks at different levels.  

It is always a good practice to separate Train, Validation, and Test Set. But from where we would identify the correct working of the model. Remember if we can make simple rules of KPIs and mark the data with those rules. These can serve as weak comparison benchmarks. So, in this way, we can always verify whether the model is performing the theme in Validation and Test Set training. Moreover, apart from the PYOD and sklearn library given parameters to check the model performance, this additional step can ensure the model is trained accurately.

Train anomaly detection algorithms

Training supervised models 

Through the data and its nature, ml models can be easily selected. Therefore, the metrics must be precision, recall, and F1 to measure risky/non-risky behavior against every KPI. Models like XGBoost, Neural Nets, Random Forest, Gradient Boosting Classifier, and Support Vector Machine (SVM) can be tried to check which ones are the better candidate.  

Moreover, two libraries can be compelling in optimizing the models and explaining the model’s features. Optuna is very powerful tool for selecting the optimum hyperparameters to maximize the classification accuracy.  

The other tool, SHAP (SHapely Additive exPlanations), is a tool of explainable AI to explain the model’s outputs.

shap AI tool

So, the important columns and their impact are identifiable through the SHAP, and we can eliminate those columns with low impact in modeling.

training supervised model

Advantages of using the Supervised model instead of Anomaly models 

We are not using anomaly detection algorithms in production. Instead, we intend to use the labels generated and train a multi-label machine learning model. The reasons for that are: 

  1. Instead of several models, only one multi-label will address all KPIs, and the computation cost would decrease. 
  2. Training one model will be beneficial once we have the human-level annotation. Also, the shift from unsupervised to supervised learning will be much easier. The labels will then easily fine-tune or re-train one model instead of several anomaly models. 
  3. Multi-label classification will open doors to several varieties of models for each KPI instead of limited choice in clustering or anomaly algorithms. 
  4. Metrics will be adequately defined, and you can measure the accuracy for each label much more quickly and easily. 


Abdul Rehman

Abdul Rehman is Senior Software Engineer in AI team at Xavor Corporation. He has vast experience in building AI systems for Surveillance, Cybersecurity, Healthsector and Remote Sensing. Currently he is actively involved in Smart Products development, a Digital Transformation initiative of Xavor Corporation.