In [1]:
import pandas as pd
from sklearn.model_selection import cross_val_score
Importing the dataset from a local file.
In [2]:
df = pd.read_csv('c:\\Python\\PIMAINDIAN.csv',sep=',',names=['Pregnant','Plasma','bp','tricep','insulin','bmi','dpf','age','class'])
In [3]:
df.head()
Out[3]:
KMeans¶
Now we'll apply the KMeans clustering algorithm to the dataset.
In [4]:
from sklearn.cluster import KMeans
In [5]:
kmeans = KMeans(n_clusters = 2)
In [6]:
kmeans.fit(df)  # note: df still includes the 'class' column at this point
Out[6]:
Now we have two clusters from the KMeans algorithm. The next step is to find the distribution of diabetic people across the two clusters and decide whether the split makes any sense.
In [7]:
labels = kmeans.labels_ #this gives us the cluster number of the entries
In [8]:
count1 = 0  # diabetic people assigned to cluster 0
count2 = 0  # diabetic people assigned to cluster 1
for i in range(len(labels)):
    if labels[i] == 0:
        if df.iloc[i]['class'] == 1:
            count1 += 1
    else:
        if df.iloc[i]['class'] == 1:
            count2 += 1
In [9]:
accuracy1 = 1.0*count1/(len(df)-sum(labels))  # fraction of cluster 0 that is diabetic (cluster 0 has len(df)-sum(labels) points)
In [10]:
accuracy2 = 1.0*count2/sum(labels)  # fraction of cluster 1 that is diabetic (cluster 1 has sum(labels) points)
In [11]:
print('KMeans: Cluster 1 with', accuracy1*100, '% diabetic people and cluster 2 with', accuracy2*100, '%')
Result
Here we have two clusters: one with 30.18% diabetic people and the other with 52.12%. So a person in the first cluster is more likely to be non-diabetic; the second cluster is close to an even split, so we can't say much about it.
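(As an aside, the same per-cluster breakdown can be computed in one line with pandas. This is just a sketch, not part of the run above; it assumes the labels variable from In [7] and that df still contains the class column, which is true at this point.)
In [ ]:
# Sketch (not executed in this notebook): fraction of diabetic people in each cluster
print(df.groupby(labels)['class'].mean() * 100)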
Logistic Regression¶
We'll try the logistic regression over the dataset now.
In [12]:
from sklearn.linear_model import LogisticRegression
In [14]:
y = df['class']
df.drop(columns=['class'], inplace=True)  # drop the target so df contains only the predictors
In [16]:
lr = LogisticRegression()
A LogisticRegression object has been created, which we'll pass to the cross_val_score() function; it performs cross-validation over the dataset using the predictors and the outputs.
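Roughly speaking, cross_val_score() splits the data into folds, trains on all folds but one, scores on the held-out fold, and repeats. A hand-rolled sketch of the idea is below, just for intuition; it is not exactly what scikit-learn does (for a classifier and an integer cv, the real function uses stratified folds and clones the estimator), and the actual call used in this notebook follows right after.
In [ ]:
# Sketch (not executed in this notebook): 10-fold cross-validation by hand
from sklearn.model_selection import KFold
import numpy as np

fold_scores = []
for train_idx, test_idx in KFold(n_splits=10).split(df):
    lr.fit(df.iloc[train_idx], y.iloc[train_idx])                         # train on 9 folds
    fold_scores.append(lr.score(df.iloc[test_idx], y.iloc[test_idx]))     # score on the held-out fold
print(np.mean(fold_scores))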
In [17]:
score = cross_val_score(lr,df,y,cv=10)
In [18]:
print('LogisticRegression :',(score.mean()*100),'%')
Result
So, as we can see, logistic regression gave 76.566% accuracy over the dataset.
PCA + Logistic Regression¶
Here, we'll use PCA (principal component analysis) to reduce the dimensions of the problem to the most important ones, and then run Logistic Regression on this reduced dataset. This can be done via pipelining: just like a physical pipeline, the data flows through the stages in order, so we put the PCA step and the LogisticRegression object into a Pipeline. After that we again use cross_val_score() to find the cross-validated accuracy of the whole model.
In [19]:
from sklearn.decomposition import PCA
In [21]:
from sklearn.pipeline import Pipeline
In [22]:
pca = PCA()
In [24]:
pipeline = Pipeline(steps=[('pca',pca),('logistic',lr)])
score = cross_val_score(pipeline, df, y)  # no cv= given here, so scikit-learn's default number of folds is used
print('PCA+logisticRegression :',(score.mean()*100),'%')
Result
Well, as we can see, this procedure results in an accuracy of 76.69%, which makes us say:
Cooooooool, but not cool enough, as the accuracy didn't rise by much. Shit happens.
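One thing the run above doesn't look at is how much variance each principal component actually explains, which would tell us how many components are worth keeping. A quick sketch, not part of the original run:
In [ ]:
# Sketch (not executed in this notebook): variance explained by each principal component
pca.fit(df)
print(pca.explained_variance_ratio_)
Since the features are on very different scales, the leading component is probably dominated by the large-valued columns (insulin in particular), which may be one reason plain PCA doesn't change the accuracy much here.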
Random Forests¶
Where are my random forests?
Ohh. Not again! You're growing over the fan now? Why are you so random?
Okay they're back.
So random forests are a very cool supervised learning method that uses bagging (bootstrap aggregation). A good source to learn about them is this video by Trevor Hastie and Robert Tibshirani. (I would strongly recommend their book
Introduction to Statistical Learning, which is free and has a parallel Stanford course too.)
In [25]:
from sklearn.ensemble import RandomForestClassifier
In [26]:
rf = RandomForestClassifier()
score = cross_val_score(rf,df,y,cv=10)
In [27]:
print('RandomForests :',(score.mean()*100),'%')
Result:
Random forests give an accuracy of 73.96%. Not that good, no.
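The run above uses RandomForestClassifier with its default settings; one knob worth turning is the number of trees. A sketch of trying a larger forest is below; n_estimators=200 is an arbitrary choice, and the result is not part of the original run.
In [ ]:
# Sketch (not executed in this notebook): a larger forest, everything else left at defaults
rf_big = RandomForestClassifier(n_estimators=200)
print('RandomForests (200 trees):', cross_val_score(rf_big, df, y, cv=10).mean()*100, '%')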
K Nearest Neighbors¶
Well, back to basics. If you're new somewhere, see what the majority of nearby people are doing, and then imitate. But how many of those nearby people should you look at? For this case, we'll run KNN for $k \in \{5, 10, 15, 20\}$.
In [28]:
from sklearn.neighbors import KNeighborsClassifier
In [29]:
for i in [5, 10, 15, 20]:
    neigh = KNeighborsClassifier(n_neighbors=i)
    score = cross_val_score(neigh, df, y, cv=10)
    print('KNN (k =', i, '):', (score.mean()*100), '%')
Can we get a $k$ versus accuracy graph, please?
In [30]:
import matplotlib.pyplot as plt

accuracies = []
for i in range(20):
    neigh = KNeighborsClassifier(n_neighbors=i+1)   # try k = 1..20
    score = cross_val_score(neigh, df, y, cv=10)
    accuracies.append(score.mean())
k = [i+1 for i in range(20)]
In [31]:
plt.plot(k,accuracies)
Out[31]:
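For a slightly more readable figure, the axes can be labelled; this is a small cosmetic addition, not part of the original run.
In [ ]:
plt.plot(k, accuracies)
plt.xlabel('k (number of neighbors)')
plt.ylabel('cross-validated accuracy')
plt.show()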