Applying Machine Learning in Internal Audit with Sparsely Labeled Data

Jesse He

Uber

Machine Learning, SQL

Overview

This article describes how Uber applied machine learning to internal auditing when only sparsely labeled data was available, covering both the challenges and the methods used. It traces the shift from traditional audit methods to machine learning models for detecting potential fraud involving cash intermediaries.

What You'll Learn

1. How to apply machine learning techniques to identify potential fraud in internal audits

2. Why labeled data is crucial for training machine learning models in auditing

3. When to use dual-model architectures for complex data relationships

Prerequisites & Requirements

Understanding of machine learning concepts and auditing processes
Familiarity with SQL and data analysis tools (optional)

Key Questions Answered

How did Uber leverage machine learning to enhance internal audits?

Uber utilized machine learning to analyze sparsely labeled data, particularly focusing on identifying cash intermediaries that could pose fraud risks. By employing a dual-model architecture, they improved their ability to predict and analyze vendor transactions, leading to better insights into potential fraudulent activities.

What challenges did Uber face with labeled data in their auditing process?

Uber had very little labeled data: only 47 out of 477 vendors were confirmed as Agents, the cash intermediaries the audit was trying to identify. That was too few examples to train an effective model, so the team expanded the dataset to include purchase orders.

What machine learning models were used in Uber's auditing project?

Uber initially used K-nearest neighbors (KNN) and later switched to a Random Forest classifier. Both models aimed to predict vendor behavior from transaction data. The Random Forest model averaged 95.9% accuracy in 4-fold cross-validation on purchase order (PO) level data.

What was the final architecture design for Uber's machine learning models?

The final architecture involved a dual-model setup where the first model predicted suspicious transactions based on transaction-level data, while the second model aggregated results to predict vendor-level behavior. This approach allowed for more comprehensive fraud detection.
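The two-stage idea can be sketched roughly as follows. This is a minimal illustration on synthetic data, not Uber's implementation: the features, the per-vendor aggregates (mean score, max score, share of flagged transactions), the choice of Random Forest for stage one and KNN for stage two, and the rule that each transaction inherits its vendor's label are all assumptions made for the example.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)

# Synthetic data (hypothetical): 40 vendors, 25 transactions each, 3 features.
n_vendors, tx_per_vendor = 40, 25
vendor_label = np.array([1] * 10 + [0] * 30)  # 1 = Agent-like vendor
vendor_id = np.repeat(np.arange(n_vendors), tx_per_vendor)
tx_label = vendor_label[vendor_id]  # assumption: transactions inherit the vendor label
X_tx = rng.normal(size=(len(vendor_id), 3)) + 1.5 * tx_label[:, None]

# Split by vendor so no vendor appears in both training and evaluation.
train_vendor_mask = np.arange(n_vendors) % 2 == 0
train_tx_mask = train_vendor_mask[vendor_id]

# Stage 1: transaction-level model produces a suspicion score per transaction.
stage1 = RandomForestClassifier(n_estimators=100, random_state=0)
stage1.fit(X_tx[train_tx_mask], tx_label[train_tx_mask])
tx_score = stage1.predict_proba(X_tx)[:, 1]


def vendor_features(scores, ids, n):
    """Aggregate transaction scores into one feature row per vendor."""
    feats = np.zeros((n, 3))
    for v in range(n):
        s = scores[ids == v]
        feats[v] = [s.mean(), s.max(), (s > 0.5).mean()]
    return feats


# Stage 2: vendor-level model trained on the aggregated stage-1 scores.
# (In practice, out-of-fold stage-1 scores would be a safer training input.)
X_vendor = vendor_features(tx_score, vendor_id, n_vendors)
stage2 = KNeighborsClassifier(n_neighbors=3)
stage2.fit(X_vendor[train_vendor_mask], vendor_label[train_vendor_mask])

vendor_pred = stage2.predict(X_vendor[~train_vendor_mask])
accuracy = (vendor_pred == vendor_label[~train_vendor_mask]).mean()
print(f"held-out vendor accuracy: {accuracy:.2f}")
```

The aggregation step is the link between the two models: it turns a variable number of transaction-level scores into a fixed-length row that a vendor-level classifier can consume.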

Key Statistics & Figures

Number of vendors labeled as Agents

47 out of 477

This limited number of labeled vendors highlighted the challenges in training effective machine learning models.

Average accuracy of Random Forest Classifier

95.9%

Averaged over 4-fold cross-validation on purchase order (PO) level data.

Precision of Random Forest Classifier

95.8%

The share of cases the model predicted as positive that were actually positive.

Recall of Random Forest Classifier

97.5%

The share of actual positive cases that the model correctly identified.
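To make the three metrics above concrete, the sketch below computes accuracy, precision, and recall from confusion-matrix counts and averages them across four validation folds. The counts are made up for illustration and are not Uber's results; `FoldCounts` and `mean_over_folds` are hypothetical helper names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FoldCounts:
    """Confusion-matrix counts for one validation fold."""
    tp: int  # predicted positive, actually positive
    fp: int  # predicted positive, actually negative
    tn: int  # predicted negative, actually negative
    fn: int  # predicted negative, actually positive


def accuracy(c: FoldCounts) -> float:
    total = c.tp + c.fp + c.tn + c.fn
    return (c.tp + c.tn) / total if total else 0.0


def precision(c: FoldCounts) -> float:
    predicted_pos = c.tp + c.fp
    return c.tp / predicted_pos if predicted_pos else 0.0


def recall(c: FoldCounts) -> float:
    actual_pos = c.tp + c.fn
    return c.tp / actual_pos if actual_pos else 0.0


def mean_over_folds(folds, metric) -> float:
    """Average a per-fold metric, as in k-fold cross-validation."""
    return sum(metric(f) for f in folds) / len(folds)


# Hypothetical counts for four folds (illustration only).
folds = [
    FoldCounts(tp=45, fp=3, tn=50, fn=2),
    FoldCounts(tp=47, fp=2, tn=49, fn=2),
    FoldCounts(tp=44, fp=1, tn=52, fn=3),
    FoldCounts(tp=46, fp=2, tn=51, fn=1),
]

for name, metric in [("accuracy", accuracy), ("precision", precision), ("recall", recall)]:
    print(f"{name}: {mean_over_folds(folds, metric):.3f}")
```

With heavily imbalanced labels, accuracy alone can look high even when few positives are found, which is why precision and recall are reported alongside it.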

Technologies & Tools

Database

SQL

Used to analyze and query transaction data in the auditing process.

Machine Learning

Random Forest

Employed as a classification model to predict vendor behavior based on transaction data.

Machine Learning

K-nearest Neighbors

Initially used for vendor-level predictions in the auditing process.

Key Actionable Insights

1. Implementing machine learning in internal audits can improve fraud detection.
Machine learning models can analyze complex data relationships that traditional audit methods may overlook, which helps surface potential fraud.

2. Expanding datasets beyond labeled data can improve model training outcomes.
In Uber's case, incorporating purchase orders into the dataset allowed for a more robust analysis, addressing the challenges posed by limited labeled data.

3. Dual-model architectures can handle relationships that span more than one level of data.
Modeling transactions and vendors separately gives a more nuanced view of vendor behavior and transaction patterns, which is crucial in auditing scenarios.
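The second insight, moving from a small set of labeled vendors to a larger set of purchase-order rows, could look roughly like the sketch below. It assumes each purchase order inherits the label of its vendor, which the summary does not spell out; the vendor IDs, PO fields, and the `build_po_training_rows` helper are hypothetical.

```python
from collections import Counter

# Hypothetical vendor labels: True = confirmed Agent, False = confirmed non-Agent.
vendor_is_agent = {"V001": True, "V002": False, "V003": False}

# Hypothetical purchase orders: (po_id, vendor_id, amount).
purchase_orders = [
    ("PO-1", "V001", 1200.0),
    ("PO-2", "V001", 950.0),
    ("PO-3", "V002", 300.0),
    ("PO-4", "V003", 4100.0),
    ("PO-5", "V003", 80.0),
    ("PO-6", "V999", 75.0),  # vendor with no label: excluded from training
]


def build_po_training_rows(purchase_orders, vendor_is_agent):
    """Create one labeled training row per PO whose vendor has a known label."""
    rows = []
    for po_id, vendor_id, amount in purchase_orders:
        if vendor_id not in vendor_is_agent:
            continue  # unlabeled vendors can be scored later, not trained on
        rows.append(
            {
                "po_id": po_id,
                "vendor_id": vendor_id,
                "amount": amount,
                "label": int(vendor_is_agent[vendor_id]),
            }
        )
    return rows


rows = build_po_training_rows(purchase_orders, vendor_is_agent)
label_counts = Counter(row["label"] for row in rows)
print(f"{len(vendor_is_agent)} labeled vendors -> {len(rows)} labeled PO rows")
print(dict(label_counts))
```

Because several rows now share a vendor, validation splits are usually made by vendor rather than by row, so that a vendor's purchase orders do not appear on both sides of the split.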

Common Pitfalls

1. Relying solely on a single model for predictions can lead to inaccurate results.
A single model may not capture relationships that span transaction-level and vendor-level data, which is why a dual-model architecture was adopted in this case.

2. Ignoring the importance of labeled data can hinder model performance.
Without sufficient labeled data, machine learning models struggle to learn effectively, leading to poor predictions.

Related Concepts

Machine Learning In Auditing

Fraud Detection Methodologies

Data Labeling Techniques

Dual-model Architectures In Machine Learning