Wednesday 7 February 2018

Pima Indians Dataset: Logistic Regression, KNN, PCA, Random Forests

In [1]:
import pandas as pd
from sklearn.model_selection import cross_val_score
Importing the dataset from a local CSV file.
In [2]:
df = pd.read_csv('c:\\Python\\PIMAINDIAN.csv',sep=',',names=['Pregnant','Plasma','bp','tricep','insulin','bmi','dpf','age','class'])
In [3]:
df.head()
Out[3]:
Pregnant Plasma bp tricep insulin bmi dpf age class
0 6 148 72 35 0 33.6 0.627 50 1
1 1 85 66 29 0 26.6 0.351 31 0
2 8 183 64 0 0 23.3 0.672 32 1
3 1 89 66 23 94 28.1 0.167 21 0
4 0 137 40 35 168 43.1 2.288 33 1

KMeans

Now we'll apply the KMeans clustering algorithm to the dataset.
In [4]:
from sklearn.cluster import KMeans
In [5]:
kmeans = KMeans(n_clusters = 2)
In [6]:
kmeans.fit(df)
Out[6]:
KMeans(algorithm='auto', copy_x=True, init='k-means++', max_iter=300,
    n_clusters=2, n_init=10, n_jobs=1, precompute_distances='auto',
    random_state=None, tol=0.0001, verbose=0)
Now we have made two clusters using the KMeans algorithm. The goal is to find out how the diabetic people are distributed between the two clusters, and to decide whether the split makes any sense.
In [7]:
labels = kmeans.labels_ #this gives us the cluster number of the entries
In [8]:
count1 = 0  # diabetic people assigned to cluster 0
count2 = 0  # diabetic people assigned to cluster 1
for i in range(len(labels)):
    if labels[i] == 0:
        if df.iloc[i]['class'] == 1:
            count1 += 1
    else:
        if df.iloc[i]['class'] == 1:
            count2 += 1
In [9]:
accuracy1 = 1.0*count1/(len(df)-sum(labels))  # fraction of cluster-0 members who are diabetic
In [10]:
accuracy2 = 1.0*count2/sum(labels)  # fraction of cluster-1 members who are diabetic
In [11]:
print ('KMeans: Cluster 1 with ',accuracy1*100,'% diabetic people and cluster 2 with',accuracy2*100,'%')
KMeans: Cluster 1 with  30.18242122719735 % diabetic people and cluster 2 with 52.121212121212125 %

Result


Here we have two clusters: one with 30.18% diabetic people and the other with 52.12%. This tells us that a person in the first cluster is more likely to be non-diabetic; the second cluster is split almost evenly, so it doesn't tell us much.
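
As a side note, the same per-cluster breakdown can be computed more compactly with a crosstab. A minimal sketch, assuming the kmeans object and df from above (at this point df still contains the 'class' column, which the clustering also saw as a feature):

import pandas as pd

# Rows are cluster labels, columns are the diabetes class;
# normalising within each row gives the proportion of each class per cluster.
ct = pd.crosstab(kmeans.labels_, df['class'], normalize='index')
print(ct)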

Logistic Regression

We'll try the logistic regression over the dataset now.
In [12]:
from sklearn.linear_model import LogisticRegression
In [14]:
y = df['class']
df.drop(['class'], axis=1, inplace=True)
In [16]:
lr = LogisticRegression()
A LogisticRegression object has been created, which we'll pass to the cross_val_score() function; it performs cross-validation over the dataset, using the predictors and the target.
In [17]:
score = cross_val_score(lr,df,y,cv=10)
In [18]:
print('LogisticRegression :',(score.mean()*100),'%')
LogisticRegression : 76.566985645933 %

Result


So, as we can see, logistic regression gives about 76.57% cross-validated accuracy on the dataset.
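
For intuition, cross_val_score(lr, df, y, cv=10) is roughly what the following manual loop does (scikit-learn uses stratified folds for classifiers by default). A sketch, assuming df, y and the LogisticRegression import from above:

from sklearn.model_selection import StratifiedKFold
import numpy as np

skf = StratifiedKFold(n_splits=10)
fold_scores = []
for train_idx, test_idx in skf.split(df, y):
    model = LogisticRegression()
    model.fit(df.iloc[train_idx], y.iloc[train_idx])                       # fit on 9 folds
    fold_scores.append(model.score(df.iloc[test_idx], y.iloc[test_idx]))   # score on the held-out fold
print(np.mean(fold_scores)*100, '%')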

PCA + Logistic Regression

Here, we'll use PCA (principal component analysis) to reduce the dimensionality of the problem to the most important directions of variation (a very brief explanation). Then we'll run logistic regression on this transformed dataset.

This can be done via pipelining. A Pipeline chains the steps together: we put the PCA transformer and the logistic regression object into it, and the pipeline applies the PCA transform to the data before handing the result to the classifier. After that, we again use cross_val_score() to find the cross-validated accuracy of the model.
In [19]:
from sklearn.decomposition import PCA
In [21]:
from sklearn.pipeline import Pipeline
In [22]:
pca = PCA()
In [24]:
pipeline = Pipeline(steps=[('pca',pca),('logistic',lr)])
score = cross_val_score(pipeline,df,y)
print('PCA+logisticRegression :',(score.mean()*100),'%') 
PCA+logisticRegression : 76.69706152564787 %

Result


Well, as we can see, this procedure gives an accuracy of 76.70%, which makes us say Cooooooool, but not cool enough, as the accuracy barely rose. Shit happens.
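
One reason the number barely moves is that PCA() with no arguments keeps all the components, so the pipeline feeds logistic regression a rotated but not reduced dataset. A sketch of actually inspecting and truncating the components, assuming df, y and the imports above (n_components=4 is an arbitrary illustrative choice, not something tuned here):

pca_probe = PCA().fit(df)
print(pca_probe.explained_variance_ratio_)   # variance carried by each principal component

reduced = Pipeline(steps=[('pca', PCA(n_components=4)), ('logistic', LogisticRegression())])
print(cross_val_score(reduced, df, y, cv=10).mean()*100, '%')

In practice PCA is usually run on standardised features, since the raw Pima columns are on very different scales.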

Random Forests:

Where are my random forests?


Ohh. Not again! You're growing over the fan now? Why are you so random?

Okay they're back.

So random forests are a very cool method for supervised learning that uses bagging (bootstrap aggregation). A good source to learn about them is this video by Trevor Hastie and Robert Tibshirani. (I would strongly recommend their book An Introduction to Statistical Learning, which is free and has a parallel Stanford course too.)
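
To make the bagging idea concrete, here is a sketch of plain bootstrap aggregation of decision trees using scikit-learn's BaggingClassifier; random forests additionally subsample the features considered at each split. This block is illustrative only and assumes df, y and cross_val_score from above:

from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import BaggingClassifier

# Each tree is fit on a bootstrap resample of the data; predictions are a majority vote.
bag = BaggingClassifier(DecisionTreeClassifier(), n_estimators=100)
print(cross_val_score(bag, df, y, cv=10).mean()*100, '%')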
In [25]:
from sklearn.ensemble import RandomForestClassifier
In [26]:
rf = RandomForestClassifier()
score = cross_val_score(rf,df,y,cv=10)
In [27]:
print('RandomForests :',(score.mean()*100),'%')
RandomForests : 73.96103896103897 %

Result:


Random forests give an accuracy of 73.96%. Not that good, no.
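
One thing worth noting is that older scikit-learn versions default to only 10 trees (n_estimators=10), which is a very small forest. A quick sketch of trying more trees, assuming df and y as above (100 is a common but arbitrary choice):

rf_big = RandomForestClassifier(n_estimators=100, random_state=0)
print(cross_val_score(rf_big, df, y, cv=10).mean()*100, '%')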

K Nearest Neighbors

Well, back to basics. If you're new, see what the majority of nearby people are doing, and then imitate them. But how many of those nearby people should you look at?

For this case, we'll do KNN for $k=\{5,10,15,20\}$
In [28]:
from sklearn.neighbors import KNeighborsClassifier
In [29]:
for i in {5,10,15,20}:
    neigh = KNeighborsClassifier(n_neighbors=i)
    score = cross_val_score(neigh,df,y,cv=10)
    print('KNN :',(score.mean()*100),'%')
KNN : 74.34723171565277 %
KNN : 74.61893369788108 %
KNN : 72.13773069036226 %
KNN : 74.48051948051948 %
Can we get a $k$ versus accuracy graph, please?
In [30]:
import matplotlib.pyplot as plt
accuracies = []
for i in range(20):
    neigh = KNeighborsClassifier(n_neighbors=i+1)
    score = cross_val_score(neigh,df,y,cv=10)
    accuracies.append(score.mean())
k = [i+1 for i in range(20)]    

    
In [31]:
plt.plot(k,accuracies)
Out[31]:
[<matplotlib.lines.Line2D at 0x446f5c0dd8>]

Result

We get maximum accuracy for $k=18$. Acceptable accuracy can be obtained at $k = 13$ too.
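
Since KNN is distance-based and the Pima columns live on very different scales (insulin versus the pedigree function, for example), a scaled pipeline plus a grid search over k is a natural next step. A sketch, not something run above, assuming df, y, Pipeline and KNeighborsClassifier from earlier; the k range is arbitrary:

from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import GridSearchCV

# Standardise the features, then let GridSearchCV pick k by 10-fold cross-validation.
knn_pipe = Pipeline([('scale', StandardScaler()), ('knn', KNeighborsClassifier())])
grid = GridSearchCV(knn_pipe, {'knn__n_neighbors': list(range(1, 21))}, cv=10)
grid.fit(df, y)
print(grid.best_params_, grid.best_score_*100, '%')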
