Financial Statement Fraud Detection Machine Learning Model Design
We propose designing a financial statement fraud detection machine learning model in this study. The model is a classification algorithm chosen among several tested in this study. The decision tree has given better precision compared to others. The main features were based on the conjunction of two proven mathematic empirical models, Beneish MScore and Altman ZScore. Both models were combined to give one new model. It is proven that each model separately can detect fraud with at least 90% accuracy.
Research question: How to design a machine learning classification using Beinish MScore and Altman Zscore?
Introduction
Financial reporting is a mandatory document for any company; different parties use it for diverse goals. Investors can evaluate the financial health of a firm before investing. The government uses it for tax purposes. Stakeholders can use it for financial analysis. False reporting misleads all analysis and interpretation of financial reporting with severe consequences. Some wellknown scandals, such as the Waste Management Scandal in 1998, the Eron Corp scandal in 2001, and the Worldcom scandal, resulted in billions of dollars in losses and bankruptcy. Many models to detect financial distress have been developed. Among them, this study focuses on two models: Beinish MScore [1] and the Altman ZScore model [2]. These models are mathematically probabilistic. (Altman et al., 2010) [2]. It Showed that the model was accurate, with a correct prediction percentage of about 95%, and received many positive reactions.
Beinish (1999) [1] correctly predicted 76% of financial distress while incorrectly indicating 17.5% of nonfinancial distress.
Both Beinish MScore and Altman ZScore models are based on empirical mathematical formulas, and the limit zone of decision is constant, which is good when evaluating only one company once. Generally, a single mathematical formula is applied to one element only and not to a dataset, so if the variable environment changes, the formula will not dynamically catch the change. In our study of financial reporting fraud behavior, we deal with a very dynamic economic environment, so that some macroeconomic change can affect the trade between companies, or a financial crisis can affect the industry with a relative impact per sector. This situation can systematically falsify the results when using these static empirical models. Thus, using industry performance as a benchmark and adapting the zone limit decision, which was constant for a single company, can now be variable when evaluating one company concerning another company's performance. This approach is more realistic and allows us to make better decisions.
Muntari (2015) [3], during his study of the Eron Corp scandal, used both Altman ZScore and Beinish MScore to evaluate the Eron Corp financial manipulation, got a contradiction in the year 1999, when the result showed for Alman ZScore a value of 3.040 and Beinish MScore a value of 1.323. This situation makes Eron Corp good considering the Altman ZScore decision and bad for Beinish MScore. This contradiction cannot be solved by lecturing both models separately using their constant limit zone of decision. In this study, we propose to solve this situation with a datadriven model that uses industry performance to make decisions.
Combining both model types and building one that detects the different zones based on the industry’s behavior. If the binary classification is studied, then the distress zone is not roughly determined by the mean of the empirical mathematical formula, which is more of a determinist approach, but rather by learning from other companies' data and then classifying the studied company. A correlation study can determine a combined model based on the two models.
This study is structured with the following points:
 Data analysis to help build study assumptions
 Engineering datadriven models with various assumptions and features
 Machine learning binary classification compares the outcomes of different classification algorithms.
Empirical model MScore and ZScore Overview
ALTMAN ZScore Context
Altman‘s ZScore is a model used to predict whether a company is in financial distress.
The Altman Zscore is based on five financial ratios calculated from a financial statement report. It uses profitability, leverage, liquidity, solvency, and activity to predict whether a company has a high probability of being insolvent.
Muntari Mahama [3] shows that ZScore can predict financial distress with 95% accuracy. He also demonstrated that the model could detect financial distress in the early stages of the Eron fraud scandal in 1997, before the bankruptcy in 2001. The ZScore model confirmed the correlation between corporate governance and corporate failure.
ALTMAN ZScore Mathematical Model
ZScore [2] is done by the following formula:
The interpretation of the Zscore is given below:
 Z ˃ 2.67 : interval free of distress
 1.81 ˂ Z ˂ 2.67 interval warning of distress
 Z ˂ 1.81 interval of distress
Beinish MScore Context
According to Muntari [3] and Ganga et al. [4], MScore is a mathematical model that employs some financial metrics to determine the extent of a company's earnings. The MScore is like the ZScore, except the MScore concentrates on estimating the extent of earnings manipulation instead of determining when a company becomes bankrupt. The MScore is composed of eight ratios that capture either financial statement distortions that can result from earnings manipulation or indicate a predisposition to engage in earnings manipulation, showing that MScore can detect financial manipulation accurately. He also demonstrated that the model could detect financial manipulation in the Eron fraud scandal in the early stages of 1997.
Beinish MScore Mathematical Model
The following mathematical expression gives the Beneish model [1]:
DSRI: Days’ sale in receivables index.
The index measures the ratio of days that sales are in accounts receivable in a year compared to that of the prior year. A value greater than 1.0 indicates possible revenue inflation.
Where AR is Account Receivable
GMI: Gross Margin Index.
The gross margin index measures the prior year ‘s gross margin ratio with the one under review. A value greater than 1.0 indicates that the gross margin has worsened in the period under review. It indicates the possibility of revenue manipulation.
Where COGS is Cost of Goods Sold
AQI: The index indicates the change in asset realization risk by comparing current assets and property, plant, and equipment with total assets.
When AQI value is greater than 1.0, the company has potentially increased its cost deferral or increased its intangible assets and created earnings manipulation.
Where CA is the Current Assets, PPE is the Plant Property, and equipment TA is the Total Assets.
DSRI: This index measures the growth in revenue in one year over the prior year’s revenue.
If the value exceeds 1.0, it indicates positive growth, while less than 1.0 indicates negative growth in the year under review. Though other factors may be responsible, growth in sales may be interpreted to mean earnings manipulation.
Where S is Sales
DEPI: This index measures the ratio of depreciation expense and gross value of PPE in one year over a prior year.
An index value greater than 1.0 could reflect an upward adjustment in the useful life of PPE. This can potentially to manipulate a company's earnings in the current fiscal year.
Where D is Depreciation expense and PPE is the Property Plant Equipment
SGAI: Sales, General, and Administrative Expenses Index.
This index measures the ratio of sales, general and administrative expenses to sales in one year over a prior year. A disproportionate increase in sales, as compared to SGAI, would serve as a negative indication concerning a company‘s future prospects.
Where: SGAE: Sales, General and Administrative Expenses, S is Sales
TATA: Total Accruals to Total Assets.
The index assesses the extent to which management employs discretionary accounting policies that result in earnings.
Where CNC Change in Working Capital, CC Change In Cash, CITP Change In Income Tax Payable, D&A Depreciation & Amortization, TA Total Assets
LEVI: The leverage index measures the ratio of total debt to total assets.
An index greater than 1.0 can be interpreted as an increase in the company's gearing and, for that matter, exposed to manipulation.
Where: CL Current Liabilities, TA: Total Assets, LTD Long Term Debt
Input Data Sample
Table 1: Sample of calculated score for 4 years

Figure 1: Beinish and Altman Score of all years in data set
The raw plot of all year shows similar behavior each year. Therefore only one year is sufficient for our study since we do not need trend analysis. In the remaining of the study, we are going to use only the year 2015.
Figure 2: Beinish and Altman Score of year 2015
There is no visual correlation between Beinish MScore and Altman ZScore.
Machine Learning Model Designing
We analyze basis statistics to make assumptions for a model and visualize some results. Only one year of data is necessary for our analysis. Other years can be used for the validation of the model. The Alman’s ZScore outcome is a value from an empiric mathematical formula and the Beneish MScore, so the range values of fraud evaluation are given below.
Table 2: ZScore and MScore Range












Basic Statistics
Table 3: Basic statistics for year 2015



























The general overview of this statistic seems normal, with more than 50% of each score being in the green zone. This means we assume that, in reality, most of companies are not engaging in financial statement manipulation. For this sample, Altman ZScore’s mean is greater than 2.99, Beinish MScore’s mean is lower than 2.22, and all 50% are in the green zone. These observations suggest that we assume that the mean can be used as a benchmark to evaluate a company's financial distress compared to others in its industry.
Feature engineering
Features are selected to build the model that the classification algorithm will learn to achieve the expected results. The proper selection will lead to the final labels, which the classifier will use to train and test the model.
The first features calculated are the Altman ZScore and Bernish MScore, calculated from the formulas above. Let's assign a variable name to each feature:
Altman calculated Z. ZScore and M: derived from the Beinish MScore, and D is the decision variable, and the feature selection aims to give D one of two values: true (1) or false (0), based on the model assuming that we will develop in the subsequent development. The issue is one of binary classification. The company is classified as financial fraudulent when D = 0 and not financial fraudulent when D = 1.
Let's create small z and m, two variables value of Z and M, define the first decision path and handle the problematic case using datadriven assumptions and thus make a decision value. All features are dependent variables of Z and M, and all created features are independent of each other for a single company.
Table 3: First step of features selection








































F4 to F7 are made to address the contradiction between M and Z. Both Z and M have a percentage of incorrect predictions. So, in this combined module, the idea is that when one original model M or Z, shows a red or yellow zone, we look at the other variable zone as a comparison. Still, in this case, we add a third factor benchmark on the variable in the green zone. The idea is strong enough compared to the industry, leading to a calculated weight factor of wz and wm. Both weights are calculated as follows:
Where µz is the mean of the ZScore of the dataset. Similarly, we have:
Where µm: the mean of the MScore of the dataset.
 Assumption on the mean:
 In reality, we believe that the most of companies do not manipulate financial reports.
 The chosen dataset has to be well chosen with enough companies. It is preferable to get those companies from the same sector.
 Each mean is strictly in the green zone, so µz> 2.99 and µm < 2.22
 Features Decision Value for problematic cases
 In the case of F4 and F5, if wz > 1, F4 = 1 and F5 = 1, otherwise F4 = 0 and F5 = 0.
 In the case of F6 and F7, if wm is 1, then F6 = 1 and F7 = 1, otherwise F6 = 0 and F7 = 0.
These simple assumptions solve the problem of contradiction and use the mean metric of the whole dataset for comparison, which is more realistic for benchmarking.
 Decision variable and label for the model. All added features are Boolean and mutually exclusive. Let's model the features as a matrix of features in 2 dimensions. M, Mij is a value where i represents the lines and j represents the columns. M can be divided into two matrices, S and F. Where S is the Score Matrix and F the Dummies' feature matrix
S=ZM Z and M are one column matrix with the respective values of ZScore and MScore.
. F=F1F2F3F4F5F6F7 Fi with i∈1..7 is a one column matrix with either 0 or 1 value
Let’s call D the one column decision matrix D is done by the logical OR operation on all Fi.
D=F1 OR F2 OR F3 OR F4 OR F5 OR F6 OR F7
Computation sample result of features
Table 4: Features




































































































































Feature Scaling and Dimension Reduction
Feature Scaling
It is recommended that they be rewritten at the same scale using standardization for improved model performance. We do not need to standardize Boolean features as F.
The result of standardization also called z scores, is that the features will be rescaled so that they'll have the properties of a standard normal distribution with the mean = 0 and the standard deviation = 1. The standard score of a sample is calculated as follows:
Standardizing the features to make them centered around 0 with a standard deviation of 1 is essential when comparing measurements with different units; it is also a general requirement for the optimal performance of many machine learning algorithms.
Dimensionality Reduction
The primary purpose is to avoid overfitting and underfitting the model in the test data.
 Overfitting occurs when the model captures the patterns in the training data well but fails to generalize well to test data. Overfitting can happen when we have too many correlated features. In this case, the model has low bias and high variance.
 Underfitting happens when the model is not complex enough to capture the patterns in the training data well and, therefore, suffers from low performance on unseen data. In this case, the model has high bias and low variance.
Because all Fi in our study are linearly independent, and D is the result of the OR operation of all Fi, we can only keep Z and M as features, and D is the label column.
Table 5: Feature with scaling and dimensionality reduced
















































Processing model
The described model is processed with four different classifier algorithms to evaluate the best performing algorithm for our problem and select it for usage.
Classification result of the model
Figure 3: training dataset classification with four different algorithm
Figure 4: test dataset classification with four different algorithm
In these pictures, the blue zone (0) and the small blue square represent the companies that the model detected are engaging in fraud manipulation. In contrast the orange (1) and small triangles represent companies not found to be engaging in fraud manipulation.
Performance Analysis
Confusion Matrix and Score Overview
The confusion matrix (or error matrix) is one way to summarise the performance of a classifier for binary classification tasks. This square matrix consists of columns and rows that list the number of instances as absolute or relative "actual class" vs. "predicted class" ratios.
Let P be the label of class 1 and N be the label of a second class or the label of all classes not in class 1 in a multiclass setting.
Figure 5: Confusion Matrix definition
The score uses the confusion matrix result to return the ratio of accurate predictions. The formula does this ratio in percentage.
So, the higher the score, the better the model's performance. The final appreciation is done in the test set to validate if the result is not misled by any problem such as overfitting or underfitting.
Table 6: Performance of the four classification algorithm on the model















The general overview of the result shows that the feature selections were good, and there is no warning of overfitting or underfitting. The best performance is done by the decision tree classification algorithm, with 100% accuracy. So, at the end of this study, we choose the bestperforming algorithm to build our model. In this case, it is the decision tree.
After building a model, the next step is to deploy it in production, but this phase is not part of this study. So, in production, if we want to classify one company, all we have to do is put the exact features of that company as input and then run the model to get the binary classification as an output.
Data Driven Research Theory
Mathematical models like Beinish’s MScore and Altman’s ZScore were initially made for one input. So those models are static; they don’t dynamically adapt their boundary decisions to the environment. Furthermore, financial statement behaviors can be different by industry sector, so it is essential to consider the dynamics of the industry to make a decision dynamically. One possible general approach for future research is redefining these models as follows.
Altman’s ZScore and Beinish’s MScore Probabilistic modeling
This model is also a probabilistic model, so we can assume that the model is following a normal distribution. So both models can be expressed as follows:
Standard Normal Distribution
The following formula calculates the Z score.
Where Xk is the standardized value, in this case, it is either the value of Altman’s ZScore or Beinish Mscore. With k=z for Altman’s Zscore and k= m for Beinish Mscore.
µ is the mean of the distribution, as the standard deviation of the distribution
With this assumption, we see that dataset's Altman ZScore and Beneish MScore can be expressed as a function of the mean and the standard deviation. In this approach, the mean and standard deviation are considered constant for the dataset, but they change for each dataset. Therefore, we hypothesize that the decision boundary in a datadriven can also be expressed in terms of the mean and the standard deviation. So the boundary limit should look like this:
Table 7: Data driven decision boundaries












This approach can be generalized to all kinds of models that are applied on entry and have a boundary decision constant. This opens a research area based on the experience of many datasets. The probabilistic model with the assumption of the normal distribution can be used as one research trail.
Conclusion
Altman’s ZScore and Beinish’s MScore are two proven financial reporting fraud, detection models. These models are based on mathematical formulas suitable for one company at a time. This study aimed to analyze one datadriven model using these two mathematical formulas as features and then build a new model of classification. We defined a new model that handles the problem of contradiction resulting from the twobased formulas. The problem we solved was a binary classification problem with a twoway decision on whether a company in the dataset is subject to financial manipulation or not. Ultimately, we trained and tested the model with four different algorithms, and the decision tree showed us better precision. In this case, the analysis of this datadriven model using a decision tree is our choice.
Author: William Tchoudi
Bibliography
[1] Beneish, Messod D. The Detection of Earnings Manipulation. Financial Analysts Journal, no. 5, Informa UK Limited, Sept. 1999, pp. 24–36. Crossref, doi:10.2469/faj.v55.n5.2296.
[2] Altman, E. (1968). Financial ratios. Discriminant analysis and the prediction of
corporate Bankruptcy. The Journal of Finance, Vol. 23(4), pp 589609
[3] Muntari Mahama, Detecting Corporate Fraud And Financial Distress Using The Altman And Beneish Models, International Journal of Economics, Commerce and Management, UK, 2015
[4] Ganga Bhavani, Christian Tabi Amponsah, MScore And ZScore For Detection Of Accounting Fraud, Accountancy Business and the Public Interest 2017