Wednesday 7 February 2018

Pima Indians Dataset: Logistic Regression, KNN, PCA, Random Forests

In [1]:
import pandas as pd
from sklearn.model_selection import cross_val_score
Importing the dataset from a local CSV file.
In [2]:
df = pd.read_csv('c:\\Python\\PIMAINDIAN.csv',sep=',',names=['Pregnant','Plasma','bp','tricep','insulin','bmi','dpf','age','class'])
In [3]:
df.head()
Out[3]:
Pregnant Plasma bp tricep insulin bmi dpf age class
0 6 148 72 35 0 33.6 0.627 50 1
1 1 85 66 29 0 26.6 0.351 31 0
2 8 183 64 0 0 23.3 0.672 32 1
3 1 89 66 23 94 28.1 0.167 21 0
4 0 137 40 35 168 43.1 2.288 33 1

KMeans

Now we'll apply the KMeans clustering algorithm to the dataset.
In [4]:
from sklearn.cluster import KMeans
In [5]:
kmeans = KMeans(n_clusters = 2)
In [6]:
kmeans.fit(df)
Out[6]:
KMeans(algorithm='auto', copy_x=True, init='k-means++', max_iter=300,
    n_clusters=2, n_init=10, n_jobs=1, precompute_distances='auto',
    random_state=None, tol=0.0001, verbose=0)
Now we have made two clusters using the KMeans algorithm. The goal is to find out how the diabetic people are distributed between the two clusters, and to decide whether the split makes any sense.
In [7]:
labels = kmeans.labels_ #this gives us the cluster number of the entries
In [8]:
count1 = 0  # diabetic people assigned to cluster 0
count2 = 0  # diabetic people assigned to cluster 1
for i in range(len(labels)):
    if labels[i] == 0:
        if df.iloc[i]['class'] == 1:
            count1 += 1
    else:
        if df.iloc[i]['class'] == 1:
            count2 += 1
In [9]:
accuracy1 = 1.0*count1/(len(df)-sum(labels))  # fraction of cluster-0 members who are diabetic
In [10]:
accuracy2 = 1.0*count2/sum(labels)  # fraction of cluster-1 members who are diabetic
In [11]:
print ('KMeans: Cluster 1 with ',accuracy1*100,'% diabetic people and cluster 2 with',accuracy2*100,'%')
KMeans: Cluster 1 with  30.18242122719735 % diabetic people and cluster 2 with 52.121212121212125 %

Result


Here we have two clusters: one with 30.18% diabetic people and the other with 52.12%. This tells us that a person in the first cluster is more likely to be non-diabetic; the second cluster is split almost evenly, so it doesn't tell us much.
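
As a side note, the same per-cluster breakdown can be computed more compactly with a crosstab. A minimal sketch, assuming the kmeans object and df from above (at this point df still contains the 'class' column, which the clustering also saw as a feature):

import pandas as pd

# Rows are cluster labels, columns are the diabetes class;
# normalising within each row gives the proportion of each class per cluster.
ct = pd.crosstab(kmeans.labels_, df['class'], normalize='index')
print(ct)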

Logistic Regression

We'll try the logistic regression over the dataset now.
In [12]:
from sklearn.linear_model import LogisticRegression
In [14]:
y = df['class']
df.drop(['class'], axis=1, inplace=True)
In [16]:
lr = LogisticRegression()
A LogisticRegression object has been created, which we'll pass to the cross_val_score() function; it performs cross-validation over the dataset, using the predictors and the target.
In [17]:
score = cross_val_score(lr,df,y,cv=10)
In [18]:
print('LogisticRegression :',(score.mean()*100),'%')
LogisticRegression : 76.566985645933 %

Result


So, as we can see, logistic regression gives about 76.57% cross-validated accuracy on the dataset.
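
For intuition, cross_val_score(lr, df, y, cv=10) is roughly what the following manual loop does (scikit-learn uses stratified folds for classifiers by default). A sketch, assuming df, y and the LogisticRegression import from above:

from sklearn.model_selection import StratifiedKFold
import numpy as np

skf = StratifiedKFold(n_splits=10)
fold_scores = []
for train_idx, test_idx in skf.split(df, y):
    model = LogisticRegression()
    model.fit(df.iloc[train_idx], y.iloc[train_idx])                       # fit on 9 folds
    fold_scores.append(model.score(df.iloc[test_idx], y.iloc[test_idx]))   # score on the held-out fold
print(np.mean(fold_scores)*100, '%')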

PCA + Logistic Regression

Here, we'll use PCA (principal component analysis) to reduce the dimensionality of the problem to the most important directions of variation (a very brief explanation). Then we'll run logistic regression on this transformed dataset.

This can be done via pipelining. A Pipeline chains the steps together: we put the PCA transformer and the logistic regression object into it, and the pipeline applies the PCA transform to the data before handing the result to the classifier. After that, we again use cross_val_score() to find the cross-validated accuracy of the model.
In [19]:
from sklearn.decomposition import PCA
In [21]:
from sklearn.pipeline import Pipeline
In [22]:
pca = PCA()
In [24]:
pipeline = Pipeline(steps=[('pca',pca),('logistic',lr)])
score = cross_val_score(pipeline,df,y)
print('PCA+logisticRegression :',(score.mean()*100),'%') 
PCA+logisticRegression : 76.69706152564787 %

Result


Well, as we can see, this procedure gives an accuracy of 76.70%, which makes us say Cooooooool, but not cool enough, as the accuracy barely rose. Shit happens.
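
One reason the number barely moves is that PCA() with no arguments keeps all the components, so the pipeline feeds logistic regression a rotated but not reduced dataset. A sketch of actually inspecting and truncating the components, assuming df, y and the imports above (n_components=4 is an arbitrary illustrative choice, not something tuned here):

pca_probe = PCA().fit(df)
print(pca_probe.explained_variance_ratio_)   # variance carried by each principal component

reduced = Pipeline(steps=[('pca', PCA(n_components=4)), ('logistic', LogisticRegression())])
print(cross_val_score(reduced, df, y, cv=10).mean()*100, '%')

In practice PCA is usually run on standardised features, since the raw Pima columns are on very different scales.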

Random Forests:

Where are my random forests?


Ohh. Not again! You're growing over the fan now? Why are you so random?

Okay they're back.

So random forests are a very cool method for supervised learning that uses bagging (bootstrap aggregation). A good source to learn about them is this video by Trevor Hastie and Robert Tibshirani. (I would strongly recommend their book An Introduction to Statistical Learning, which is free and has a parallel Stanford course too.)
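
To make the bagging idea concrete, here is a sketch of plain bootstrap aggregation of decision trees using scikit-learn's BaggingClassifier; random forests additionally subsample the features considered at each split. This block is illustrative only and assumes df, y and cross_val_score from above:

from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import BaggingClassifier

# Each tree is fit on a bootstrap resample of the data; predictions are a majority vote.
bag = BaggingClassifier(DecisionTreeClassifier(), n_estimators=100)
print(cross_val_score(bag, df, y, cv=10).mean()*100, '%')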
In [25]:
from sklearn.ensemble import RandomForestClassifier
In [26]:
rf = RandomForestClassifier()
score = cross_val_score(rf,df,y,cv=10)
In [27]:
print('RandomForests :',(score.mean()*100),'%')
RandomForests : 73.96103896103897 %

Result:


Random forests give an accuracy of 73.96%. Not that good, no.
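
One thing worth noting is that older scikit-learn versions default to only 10 trees (n_estimators=10), which is a very small forest. A quick sketch of trying more trees, assuming df and y as above (100 is a common but arbitrary choice):

rf_big = RandomForestClassifier(n_estimators=100, random_state=0)
print(cross_val_score(rf_big, df, y, cv=10).mean()*100, '%')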

K Nearest Neighbors

Well, back to basics. If you're new, see what the majority of nearby people are doing, and then imitate them. But how many of those nearby people should you look at?

For this case, we'll do KNN for $k=\{5,10,15,20\}$
In [28]:
from sklearn.neighbors import KNeighborsClassifier
In [29]:
for i in {5,10,15,20}:
    neigh = KNeighborsClassifier(n_neighbors=i)
    score = cross_val_score(neigh,df,y,cv=10)
    print('KNN :',(score.mean()*100),'%')
KNN : 74.34723171565277 %
KNN : 74.61893369788108 %
KNN : 72.13773069036226 %
KNN : 74.48051948051948 %
Can we get a $k$ versus accuracy graph, please?
In [30]:
import matplotlib.pyplot as plt
accuracies = []
for i in range(20):
    neigh = KNeighborsClassifier(n_neighbors=i+1)
    score = cross_val_score(neigh,df,y,cv=10)
    accuracies.append(score.mean())
k = [i+1 for i in range(20)]    

    
In [31]:
plt.plot(k,accuracies)
Out[31]:
[<matplotlib.lines.Line2D at 0x446f5c0dd8>]

Result

We get maximum accuracy for $k=18$. Acceptable accuracy can be obtained at $k = 13$ too.
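
Since KNN is distance-based and the Pima columns live on very different scales (insulin versus the pedigree function, for example), a scaled pipeline plus a grid search over k is a natural next step. A sketch, not something run above, assuming df, y, Pipeline and KNeighborsClassifier from earlier; the k range is arbitrary:

from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import GridSearchCV

# Standardise the features, then let GridSearchCV pick k by 10-fold cross-validation.
knn_pipe = Pipeline([('scale', StandardScaler()), ('knn', KNeighborsClassifier())])
grid = GridSearchCV(knn_pipe, {'knn__n_neighbors': list(range(1, 21))}, cv=10)
grid.fit(df, y)
print(grid.best_params_, grid.best_score_*100, '%')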
