
Credit card validator python project

This recipe shows how to evaluate a classification model with stratified K-fold cross-validation in scikit-learn. The steps are:

  • Step 1 - Import the library.
  • Step 2 - Setup the Data.
  • Step 3 - Building the model and Cross Validation model.
  • Step 4 - Building Stratified K fold cross validation.

Step 1 - Import the library

    from sklearn.datasets import load_breast_cancer
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import StratifiedKFold
    from statistics import mean

Here sklearn.datasets is used to import a classification dataset. We have also imported LogisticRegression to build the model, while StratifiedKFold will help us in performing stratified K-fold cross-validation; mean is used at the end to average the per-fold accuracies.

Step 2 - Setup the Data

    X, y = load_breast_cancer(return_X_y=True)

Here we have used the load_breast_cancer function to import our dataset as two arrays (X and y), and have therefore kept return_X_y set to True.
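As a quick sanity check (not part of the original recipe), "stratified" means every fold keeps roughly the same class balance as the full dataset. A minimal sketch that verifies this, assuming the X and y loaded in Step 2; skf_check is a throwaway name introduced here:

    import numpy as np
    from sklearn.model_selection import StratifiedKFold

    skf_check = StratifiedKFold(n_splits=10, shuffle=True, random_state=1)
    for fold, (train_idx, test_idx) in enumerate(skf_check.split(X, y), start=1):
        # np.bincount counts how many samples of each class fall in the test fold
        print('Fold', fold, 'test class counts:', np.bincount(y[test_idx]))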

Step 3 - Building the model and Cross Validation model

    model = LogisticRegression()
    skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=1)

We have simply built a classification model with LogisticRegression using its default values. For StratifiedKFold we have set n_splits to 10, dividing our dataset into 10 folds.

Step 4 - Building Stratified K fold cross validation

    lst_accu_stratified = []
    for train_index, test_index in skf.split(X, y):
        X_train_fold, X_test_fold = X[train_index], X[test_index]
        y_train_fold, y_test_fold = y[train_index], y[test_index]
        model.fit(X_train_fold, y_train_fold)
        lst_accu_stratified.append(model.score(X_test_fold, y_test_fold))
    print('Maximum Accuracy:', max(lst_accu_stratified))
    print('Minimum Accuracy:', min(lst_accu_stratified))
    print('Overall Accuracy:', mean(lst_accu_stratified))

skf.split has divided our dataset into 10 random train/test index sets. We then fit the model on each training set and record the accuracy score on the corresponding test fold. Here we get the maximum, minimum and average accuracy across the 10 validation folds.
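As an aside, scikit-learn can collapse Step 4 into a single call. The following is a minimal equivalent sketch (an alternative, not the recipe's own code), assuming the model, skf, X and y defined above:

    from sklearn.model_selection import cross_val_score

    # Runs the same 10 stratified folds; returns one accuracy score per fold
    scores = cross_val_score(model, X, y, cv=skf)
    print('Overall Accuracy:', scores.mean())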
This machine learning project is about detecting fraudulent credit card transactions.

READING THE DATA

The data set 'creditcard.csv' can be downloaded from the Kaggle credit card fraud detection page.

    import numpy as np
    import pandas as pd
    import seaborn as sns
    import matplotlib.pyplot as plt
    from sklearn.metrics import classification_report, accuracy_score
    from sklearn.ensemble import IsolationForest
    from sklearn.neighbors import LocalOutlierFactor

    df = pd.read_csv('creditcard.csv')

VISUALIZING AND UNDERSTANDING THE DATA

Understanding the data is important, as it gives us an intuitive feel for the data and helps us identify the necessary preprocessing steps. The 'df' data frame has shape (284807, 31), which means 284807 cases and 31 columns. Though training the model with large data samples gives better results, it comes at the cost of computational power and time. Hence the data frame is downsampled to one-tenth of its previous size by dropping rows:

    df = df.sample(frac=0.1)  # Size of data frame is reduced

Now data frame 'df' has shape (28481, 31).

    df.columns   # Prints columns of data frame 'df'
    df.nunique() # Prints total number of unique elements in each column

'Time' and 'Amount' denote the time and amount of a transaction respectively. 'Class' denotes whether a transaction is fraudulent or not. 'V1' to 'V28' are reduced features of transaction details which can't be disclosed.

'Class' is only weakly correlated with 'Amount' and 'Time', which suggests it is hard to predict whether a transaction is fraudulent from the 'Amount' and 'Time' details alone. There are no significant correlations between the reduced features ('V1' to 'V28'), and the other correlations are relatively small. Therefore we don't drop any of the columns, as they are fairly unrelated to each other.

Summary :- Since there are no missing values, no columns to be dropped and no incorrect data, no preprocessing steps are required and we proceed to training the model.
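The correlation observations above come from inspecting the correlation matrix. The article already imports seaborn and matplotlib, so here is a minimal sketch of how such a plot can be drawn; this exact snippet is my addition, not code recovered from the article:

    # Visualize pairwise correlations between all 31 columns of df
    plt.figure(figsize=(12, 9))
    sns.heatmap(df.corr(), cmap='coolwarm', center=0)
    plt.title('Correlation matrix of the credit card data')
    plt.show()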


TRAINING THE MODEL

The data set has been preprocessed and is ready to be trained. A comparison is made between 2 models :- Local Outlier Factor & Isolation Forest.

The Local Outlier Factor is an unsupervised outlier detection method. It calculates an anomaly score for each sample by measuring the local deviation in density of the sample with respect to its neighbours; the anomaly score depends on how isolated the sample is with respect to its surrounding neighbourhood.

The Isolation Forest is an unsupervised algorithm for anomaly detection that works on the principle of isolating anomalies. Instead of trying to build a model of normal instances, it explicitly isolates anomalous points in the dataset. It is a very fast algorithm with a low memory demand.

First, the inputs, the output and the outlier fraction are to be calculated.

Input and Output

    X = df.drop('Class', axis=1)  # X is input
    y = df['Class']               # y is output
    fraud = df[df['Class'] == 1]  # Fraudulent transactions
    valid = df[df['Class'] == 0]  # Valid transactions
    outlier_fraction = len(fraud) / float(len(valid))

Performance of Local Outlier Factor model

    a = LocalOutlierFactor(n_neighbors=20, contamination=outlier_fraction)
    y_prediction1 = a.fit_predict(X)        # Fitting the model
    y_prediction1[y_prediction1 == 1] = 0   # Valid transactions are labelled as 0
    y_prediction1[y_prediction1 == -1] = 1  # Fraudulent transactions are labelled as 1
    errors1 = (y_prediction1 != y).sum()    # Total number of errors is calculated
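The Isolation Forest half of the comparison does not survive in the text reproduced here. The following is a sketch of how it would typically mirror the LOF block above, using the IsolationForest and metric imports the article already brings in; the names b, y_prediction2 and errors2 are placeholders chosen to parallel the LOF code, and random_state is my addition for reproducibility:

    # Performance of Isolation Forest model (sketch; mirrors the LOF block above)
    b = IsolationForest(contamination=outlier_fraction, random_state=42)
    y_prediction2 = b.fit_predict(X)        # fit_predict returns 1 = inlier, -1 = outlier
    y_prediction2[y_prediction2 == 1] = 0   # Valid transactions relabelled as 0
    y_prediction2[y_prediction2 == -1] = 1  # Fraudulent transactions relabelled as 1
    errors2 = (y_prediction2 != y).sum()    # Total number of misclassified transactions

    # Compare the two models with the imported metrics
    print('Local Outlier Factor:', errors1, 'errors, accuracy', accuracy_score(y, y_prediction1))
    print('Isolation Forest:', errors2, 'errors, accuracy', accuracy_score(y, y_prediction2))
    print(classification_report(y, y_prediction2))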





