
Hierarchical Clustering¶
Importing libraries¶
In [1]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
Loading the dataset¶
In [3]:
df = pd.read_csv("../data/Mall_Customers.csv")
In [4]:
df
Out[4]:
| | CustomerID | Genre | Age | Annual Income (k$) | Spending Score (1-100) |
---|---|---|---|---|---|
0 | 1 | Male | 19 | 15 | 39 |
1 | 2 | Male | 21 | 15 | 81 |
2 | 3 | Female | 20 | 16 | 6 |
3 | 4 | Female | 23 | 16 | 77 |
4 | 5 | Female | 31 | 17 | 40 |
... | ... | ... | ... | ... | ... |
195 | 196 | Female | 35 | 120 | 79 |
196 | 197 | Female | 45 | 126 | 28 |
197 | 198 | Male | 32 | 126 | 74 |
198 | 199 | Male | 32 | 137 | 18 |
199 | 200 | Male | 30 | 137 | 83 |
200 rows × 5 columns
Selecting the features¶
In [5]:
X = df.loc[:, 'Annual Income (k$)':]
In [6]:
X
Out[6]:
| | Annual Income (k$) | Spending Score (1-100) |
---|---|---|
0 | 15 | 39 |
1 | 15 | 81 |
2 | 16 | 6 |
3 | 16 | 77 |
4 | 17 | 40 |
... | ... | ... |
195 | 120 | 79 |
196 | 126 | 28 |
197 | 126 | 74 |
198 | 137 | 18 |
199 | 137 | 83 |
200 rows × 2 columns
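The open-ended label slice in `In [5]` takes every column from `Annual Income (k$)` to the end of the frame. If that cell is re-run after a `Group` column has been appended to `df`, the cluster labels end up among the features. A minimal sketch of the explicit alternative, on a hypothetical three-row frame (`demo` and `feature_cols` are illustrative names, not from the notebook):

```python
import pandas as pd

# Hypothetical miniature of the customer table (illustrative values only).
demo = pd.DataFrame({
    "CustomerID": [1, 2, 3],
    "Annual Income (k$)": [15, 15, 16],
    "Spending Score (1-100)": [39, 81, 6],
})
demo["Group"] = [4, 3, 4]  # labels appended later, as the notebook does

# Open-ended slice: silently picks up the appended label column.
sliced = demo.loc[:, "Annual Income (k$)":]

# Explicit list: always exactly the two features.
feature_cols = ["Annual Income (k$)", "Spending Score (1-100)"]
explicit = demo[feature_cols]

print(list(sliced.columns))
print(list(explicit.columns))
```

Naming the columns keeps the feature matrix stable regardless of the order in which cells are executed.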
Draw a dendrogram and find the optimal number of clusters.¶
In [7]:
import scipy.cluster.hierarchy as sch
In [12]:
# Build the merge tree with Ward linkage, then draw it.
sch.dendrogram(sch.linkage(X, method='ward'))
plt.title("Dendrogram")
plt.xlabel("Customers")
plt.ylabel("Euclidean Distances")
plt.show()
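Reading the cluster count off a dendrogram usually means finding the tallest vertical stretch that no merge crosses. As a rough sketch, the same idea can be computed from the linkage matrix by looking for the largest jump between consecutive merge heights. This is only a heuristic, and the data below is synthetic (three hypothetical blobs named `X_demo`), not the mall dataset, so it says nothing about which count is right for the notebook's data.

```python
import numpy as np
import scipy.cluster.hierarchy as sch

# Synthetic stand-in for the feature matrix: three well-separated blobs.
rng = np.random.default_rng(0)
centers = np.array([[20.0, 20.0], [80.0, 20.0], [50.0, 72.0]])
X_demo = np.vstack([c + rng.normal(scale=2.0, size=(30, 2)) for c in centers])

Z = sch.linkage(X_demo, method="ward")

# Column 2 of the linkage matrix holds the merge heights (non-decreasing for Ward).
heights = Z[:, 2]
gaps = np.diff(heights)

# Cutting inside the largest gap keeps merges 0..i and rejects the rest,
# which leaves n - (i + 1) clusters.
i = int(np.argmax(gaps))
n = X_demo.shape[0]
k = n - (i + 1)

# Flat cluster labels for that cut (scipy numbers them from 1).
labels = sch.fcluster(Z, t=k, criterion="maxclust")
print(k, np.unique(labels))
```

On real data the largest gap is often less clear-cut, so it is worth comparing the result with the plotted dendrogram.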
Training the Hierarchical Clustering model¶
In [13]:
from sklearn.cluster import AgglomerativeClustering
In [14]:
hc = AgglomerativeClustering(n_clusters=5)  # 5 clusters; linkage defaults to 'ward'
In [16]:
y_pred = hc.fit_predict(X)
In [17]:
df["Group"] = y_pred
In [18]:
df
Out[18]:
| | CustomerID | Genre | Age | Annual Income (k$) | Spending Score (1-100) | Group |
---|---|---|---|---|---|---|
0 | 1 | Male | 19 | 15 | 39 | 4 |
1 | 2 | Male | 21 | 15 | 81 | 3 |
2 | 3 | Female | 20 | 16 | 6 | 4 |
3 | 4 | Female | 23 | 16 | 77 | 3 |
4 | 5 | Female | 31 | 17 | 40 | 4 |
... | ... | ... | ... | ... | ... | ... |
195 | 196 | Female | 35 | 120 | 79 | 2 |
196 | 197 | Female | 45 | 126 | 28 | 0 |
197 | 198 | Male | 32 | 126 | 74 | 2 |
198 | 199 | Male | 32 | 137 | 18 | 0 |
199 | 200 | Male | 30 | 137 | 83 | 2 |
200 rows × 6 columns
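The integer labels in the `Group` column are arbitrary identifiers: only the partition they describe is meaningful. The sketch below illustrates this on synthetic data (hypothetical blobs, not the mall dataset) by comparing the labels from `AgglomerativeClustering` with a cut of the scipy Ward tree at the same number of clusters. The two use different numbering, so they are compared with the adjusted Rand index, which ignores label names.

```python
import numpy as np
import scipy.cluster.hierarchy as sch
from sklearn.cluster import AgglomerativeClustering
from sklearn.metrics import adjusted_rand_score

# Synthetic stand-in for the feature matrix: three well-separated blobs.
rng = np.random.default_rng(0)
centers = np.array([[20.0, 20.0], [80.0, 20.0], [50.0, 72.0]])
X_demo = np.vstack([c + rng.normal(scale=2.0, size=(30, 2)) for c in centers])

k = 3

# scikit-learn: labels run from 0 to k - 1.
sk_labels = AgglomerativeClustering(n_clusters=k, linkage="ward").fit_predict(X_demo)

# scipy: cut the Ward tree into k flat clusters; labels run from 1 to k.
sp_labels = sch.fcluster(sch.linkage(X_demo, method="ward"), t=k, criterion="maxclust")

# The adjusted Rand index is 1.0 when two labelings describe the same partition.
ari = adjusted_rand_score(sk_labels, sp_labels)
print(ari)
```

Because the numbering can differ between libraries and runs, downstream code is safer when it identifies a group by its characteristics rather than by a fixed label.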
Inspecting the cluster assignments¶
In [22]:
plt.figure(figsize=[12, 8])
# One scatter call per cluster label, each with its own color.
colors = ['red', 'blue', 'green', 'cyan', 'magenta']
for k, color in enumerate(colors):
    mask = y_pred == k
    plt.scatter(X.values[mask, 0], X.values[mask, 1], s=100, c=color, label=f'Cluster {k + 1}')
plt.title('Clusters of customers')
plt.xlabel('Annual Income (k$)')
plt.ylabel('Spending Score (1-100)')
plt.legend()
plt.show()
Adding the group information to the dataset¶
To send marketing emails, let's retrieve only the customers in group 3.¶
In [23]:
df.loc[df["Group"] == 3]
Out[23]:
| | CustomerID | Genre | Age | Annual Income (k$) | Spending Score (1-100) | Group |
---|---|---|---|---|---|---|
1 | 2 | Male | 21 | 15 | 81 | 3 |
3 | 4 | Female | 23 | 16 | 77 | 3 |
5 | 6 | Female | 22 | 17 | 76 | 3 |
7 | 8 | Female | 23 | 18 | 94 | 3 |
9 | 10 | Female | 30 | 19 | 72 | 3 |
11 | 12 | Female | 35 | 19 | 99 | 3 |
13 | 14 | Female | 24 | 20 | 77 | 3 |
15 | 16 | Male | 22 | 20 | 79 | 3 |
17 | 18 | Male | 20 | 21 | 66 | 3 |
19 | 20 | Female | 35 | 23 | 98 | 3 |
21 | 22 | Male | 25 | 24 | 73 | 3 |
23 | 24 | Male | 31 | 25 | 73 | 3 |
25 | 26 | Male | 29 | 28 | 82 | 3 |
27 | 28 | Male | 35 | 28 | 61 | 3 |
29 | 30 | Female | 23 | 29 | 87 | 3 |
31 | 32 | Female | 21 | 30 | 73 | 3 |
33 | 34 | Male | 18 | 33 | 92 | 3 |
35 | 36 | Female | 21 | 33 | 81 | 3 |
37 | 38 | Female | 30 | 34 | 73 | 3 |
39 | 40 | Female | 20 | 37 | 75 | 3 |
41 | 42 | Male | 24 | 38 | 92 | 3 |
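Group 3 here is the low-income, high-spending segment, but that number is just whatever label the clustering happened to assign. A hedged sketch of selecting the segment by its profile instead of a hard-coded label follows. The frame `demo`, its values, and the "below-median income, highest spending" rule are all illustrative assumptions, not part of the notebook.

```python
import pandas as pd

# Hypothetical miniature of the labelled table (illustrative values only).
demo = pd.DataFrame({
    "CustomerID": [1, 2, 3, 4, 5, 6],
    "Annual Income (k$)": [15, 16, 120, 126, 17, 137],
    "Spending Score (1-100)": [81, 77, 79, 28, 40, 18],
    "Group": [3, 3, 2, 0, 4, 0],
})

income, spending = "Annual Income (k$)", "Spending Score (1-100)"
means = demo.groupby("Group")[[income, spending]].mean()

# Keep groups whose mean income is below the median group income,
# then take the one with the highest mean spending score.
low_income = means[means[income] < means[income].median()]
target_group = low_income[spending].idxmax()

# Customers to contact.
targets = demo.loc[demo["Group"] == target_group, "CustomerID"].tolist()
print(target_group, targets)
```

The selection rule would need to be adapted to whatever defines the target segment in practice.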
In [ ]:
# Compute the mean income and spending score for each group.
In [27]:
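# Genre is non-numeric, so drop it before averaging; recent pandas versions may raise instead of skipping it.
pd.pivot_table(df.drop(columns="Genre"), index="Group", aggfunc="mean")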
Out[27]:
Group | Age | Annual Income (k$) | CustomerID | Spending Score (1-100) |
---|---|---|---|---|
0 | 41.000000 | 89.406250 | 166.250000 | 15.593750 |
1 | 42.482353 | 55.811765 | 87.894118 | 49.129412 |
2 | 32.692308 | 86.538462 | 162.000000 | 82.128205 |
3 | 25.333333 | 25.095238 | 22.000000 | 80.047619 |
4 | 45.217391 | 26.304348 | 23.000000 | 20.913043 |
In [26]:
df.groupby("Group")[['Annual Income (k$)','Spending Score (1-100)']].describe()
Out[26]:
| | Annual Income (k$) | | | | | | | | Spending Score (1-100) | | | | | | | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Group | count | mean | std | min | 25% | 50% | 75% | max | count | mean | std | min | 25% | 50% | 75% | max |
0 | 32.0 | 89.406250 | 16.612975 | 71.0 | 78.0 | 86.5 | 98.25 | 137.0 | 32.0 | 15.593750 | 8.936548 | 1.0 | 9.75 | 15.0 | 20.5 | 39.0 |
1 | 85.0 | 55.811765 | 9.731508 | 39.0 | 48.0 | 57.0 | 63.00 | 79.0 | 85.0 | 49.129412 | 7.281399 | 29.0 | 43.00 | 49.0 | 55.0 | 65.0 |
2 | 39.0 | 86.538462 | 16.312485 | 69.0 | 75.5 | 79.0 | 95.00 | 137.0 | 39.0 | 82.128205 | 9.364489 | 63.0 | 74.50 | 83.0 | 90.0 | 97.0 |
3 | 21.0 | 25.095238 | 7.133756 | 15.0 | 19.0 | 24.0 | 30.00 | 38.0 | 21.0 | 80.047619 | 10.249274 | 61.0 | 73.00 | 77.0 | 87.0 | 99.0 |
4 | 23.0 | 26.304348 | 7.893811 | 15.0 | 19.5 | 25.0 | 33.00 | 39.0 | 23.0 | 20.913043 | 13.017167 | 3.0 | 9.50 | 17.0 | 33.5 | 40.0 |
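When only a few statistics are needed, `groupby` with named aggregation gives a compact table with flat column names, which can be easier to read than the full `describe()` output. A minimal sketch on a hypothetical six-row frame (`demo` and the output column names `mean_income`, `mean_spending`, and `customers` are illustrative):

```python
import pandas as pd

# Hypothetical miniature of the labelled table (illustrative values only).
demo = pd.DataFrame({
    "CustomerID": [1, 2, 3, 4, 5, 6],
    "Genre": ["Male", "Female", "Female", "Female", "Female", "Male"],
    "Annual Income (k$)": [15, 16, 120, 126, 17, 137],
    "Spending Score (1-100)": [81, 77, 79, 28, 40, 18],
    "Group": [3, 3, 2, 0, 4, 0],
})

# Named aggregation: output_column=(input_column, function).
summary = demo.groupby("Group").agg(
    mean_income=("Annual Income (k$)", "mean"),
    mean_spending=("Spending Score (1-100)", "mean"),
    customers=("CustomerID", "count"),
)
print(summary)
```

Since each output column names its input explicitly, non-numeric columns such as `Genre` never enter the computation.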