๋™์•„๋ฆฌ,ํ•™ํšŒ/GDGoC

[AI ์Šคํ„ฐ๋””] Section 6 : ๋น„์ง€๋„ ํ•™์Šต ๋ชจ๋ธ

egahyun 2024. 12. 26. 05:45

Clustering

: ๋น„์Šทํ•œ object๋“ค ๋ผ๋ฆฌ ๋ชจ์œผ๋Š” ๊ฒƒ

ํŠน์ง•

  1. Label data๊ฐ€ ์—†์Œ (๋ถ„๋ฅ˜์™€์˜ ์ฐจ์ด์ )
  2. Unsupervised ML

์ ์šฉ ์‚ฌ๋ก€

  • ๊ณ ๊ฐ์˜ ๊ตฌ๋งค ํ˜•ํƒœ๋ณ„ ๋ถ„๋ฅ˜
  • ๊ณ ๊ฐ์˜ ์ทจํ–ฅ์— ๋งž๋Š” ์ฑ…, ๋™์˜์ƒ ๋“ฑ์˜ ์ถ”์ฒœ
  • ์‹ ์šฉ์นด๋“œ ์‚ฌ์šฉ์˜ fraud detection
  • ๋‰ด์Šค ์ž๋™ ๋ถ„๋ฅ˜ ๋ฐ ์ถ”์ฒœ
  • ์œ ์ „์ž ๋ถ„์„ ๋“ฑ

์ข…๋ฅ˜

  1. K-Means Clustering
  2. Hierarchical Clustering (dendrogram)
  3. Density-based Clustering (DBSCAN)

K-Means Clustering ์•Œ๊ณ ๋ฆฌ์ฆ˜

[ Distance calculation ]

  1. Distance = Euclidean distance
  2. Formula: $\sqrt{\sum_{i=1}^{n} (x_{1i} - x_{2i})^2}$
  3. Example (checked in the NumPy sketch below)

     Customer   Age   Income   Education
     1 → x1      54      190           3
     2 → x2      50      200           8

     Distance(x1, x2) = $\sqrt{(54-50)^2 + (190-200)^2 + (3-8)^2}$
     ⇒ the distance between the two points

[ ์•Œ๊ณ ๋ฆฌ์ฆ˜ ๊ณผ์ • ] 

  1. Random ํ•˜๊ฒŒ k ๊ฐœ์˜ centroid (์ค‘์‹ฌ์ ) ๋ฅผ ์ •ํ•œ๋‹ค.
  2. ๊ฐ centroid ๋กœ ๋ถ€ํ„ฐ ๊ฐ data point ๊นŒ์ง€์˜ ๊ฑฐ๋ฆฌ๋ฅผ ๊ณ„์‚ฐ.
  3. ๊ฐ data point ๋ฅผ ๊ฐ€์žฅ ๊ฐ€๊นŒ์šด centroid๋กœ ํ• ๋‹น
  4. ๊ฐ ํด๋Ÿฌ์Šคํ„ฐ์˜ centroid ์˜ ์œ„์น˜๋ฅผ ๋‹ค์‹œ ๊ณ„์‚ฐ : ํด๋Ÿฌ์Šคํ„ฐ๋ณ„๋กœ, ๋ฐ์ดํ„ฐ ํฌ์ธํŠธ๋“ค์˜ ํ‰๊ท ์„ ๊ณ„์‚ฐํ•ด ๊ทธ ์ ์œผ๋กœ centroid๋ฅผ ์˜ฎ๊น€
  5. ⇒ ํด๋Ÿฌ์Šคํ„ฐ ๋ณ„๋กœ ์†ํ•˜๋Š” ์ ๋“ค์ด ๊ณ„์† ๋ฐ”๋€œ
  6. centroid ๊ฐ€ ๋” ์ด์ƒ ์›€์ง์ด์ง€ ์•Š์„ ๋•Œ๊นŒ์ง€ 2-4 ๋‹จ๊ณ„๋ฅผ ๋ฐ˜๋ณต

[ K ์ •ํ•˜๊ธฐ ]

  1. ๋ฐฉ๋ฒ• : ๊ฒฝํ—˜์ ์œผ๋กœ $k=\sqrt{n} \ \ (n : data \ sample \ ๊ฐœ์ˆ˜)$
    (๊ฐ ํด๋Ÿฌ์Šคํ„ฐ์˜ centroid์—์„œ ๋ฐ์ดํ„ฐ ํฌ์ธํŠธ๋“ค์˜ ํ‰๊ท  ๊ฑฐ๋ฆฌ๊ฐ€ ์–ด๋А ์ˆœ๊ฐ„์ด ๋˜๋ฉด ๊ธ‰๊ฒฉํžˆ ์ค„๋‹ค๊ฐ€ ์™„๋งŒํžˆ ๊ฐ์†Œํ•˜๋Š” ํ˜•ํƒœ๋กœ ๋ณ€ํ•  ๋•Œ์˜ ์ )
    elbow point์˜ k๊ฐ’์„ ์‚ฌ์šฉ
  2. ํŠน์ง•
    • K ๋ฅผ ์ž˜ ์ •ํ•˜๋Š” ๊ฒƒ์ด ์ค‘์š”ํ•˜๋‹ค
    • ํ”ผ์ฒ˜๊ฐ€ ๋งŽ์„ ์ˆ˜๋ก ๊ณ„์‚ฐ๋Ÿ‰ ์ฆ๊ฐ€
    • K๋ฅผ ๋งŽ์ด ์ค„์ˆ˜๋ก, ๊ฐ ํด๋Ÿฌ์Šคํ„ฐ์˜ centroid์—์„œ์˜ ๋ฐ์ดํ„ฐ ํฌ์ธํŠธ๋“ค์˜ ํ‰๊ท  ๊ฑฐ๋ฆฌ๊ฐ€ ๊ฐ์†Œ

[ ํŠน์ง• ]

  1. ๋‹จ์  1: ์ž„์˜๋กœ cluster ์ง€์ •ํ•˜๋ฏ€๋กœ same cluster ๋‚ด์˜ data point ๋“ค์ด ์‹ค์ œ๋กœ๋Š” ์œ ์‚ฌํ•˜์ง€ ์•Š์„ ์ˆ˜ ์žˆ๋‹ค
  2. ๋‹จ์  2: ์ด์ƒ์น˜ ๊ฐ์ง€ ๋ถˆ๊ฐ€ : ๊ฐ€์žฅ ๊ฐ€๊นŒ์šด centroid๋กœ ์ง‘์–ด ๋„ฃ์œผ๋ฏ€๋กœ, ์ด์ƒ์น˜๋„ ์–ด๋”˜๊ฐ€์˜ ํด๋Ÿฌ์Šคํ„ฐ์— ์†ํ•˜๊ฒŒ๋œ๋‹ค
  3. Spherical-shape clusters : ๊ตฌํ˜•์œผ๋กœ๋œ shape์˜ ํด๋Ÿฌ์Šคํ„ฐ

DBSCAN (Density-Based Spatial Clustering of Applications with Noise)

  1. Characteristics
    • Separates high-density regions from low-density regions, so points in the same cluster really are similar (density = the number of data points within a given radius)
    • Can detect outliers: because grouping is based on density, points with no dense neighborhood have nowhere to belong and can be flagged as outliers
      ⇒ these first two points fix the drawbacks of K-Means above
    • Less affected by outliers
    • The number of clusters does not need to be specified in advance
    • Arbitrary-shape clusters
  2. How it runs (a rough sketch follows this list)
    • Choose a Radius and a Minimum Neighbor number
    • Classify each point as Core, Border, or Outlier
      • Core: a point with at least Minimum Neighbor number data points within the Radius
      • Border: a point with fewer than Minimum Neighbor number data points within the Radius
      • Outlier: a point with no neighboring points within the Radius
    • Repeat the same process for every point

๊ตฐ์ง‘ํ™” ์‹ค์Šต

KMeans

ํŒŒ๋ผ๋ฏธํ„ฐ

  1. init : initialization method -> k-means++ (smart choosing of centroids)
  2. n_clusters : k ๊ฐ’
  3. n_init : ๋ฐ˜๋ณตํšŸ์ˆ˜

ํ† ์ด ๋ฐ์ดํ„ฐ ์ƒ์„ฑ

import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs

# seed = 101: the random generator produces the same sequence every time, so the data is reproducible
np.random.seed(101)
# 3 centroid locations: arbitrary points chosen to spread the cluster centers apart
centroidLocation = [[3,2], [1,-1], [-1,2]]
# unsupervised learning, so the labels are discarded with _ and only the point positions are kept
X, _ = make_blobs(n_samples=1500, centers=centroidLocation)
# visualize the data distribution
plt.scatter(X[:,0], X[:,1], marker='.')
# data shape
X.shape # (1500, 2)

Kmeans ์‹คํ–‰

from sklearn.cluster import KMeans
# build the model
nclusters = 3
k_means = KMeans(n_clusters=nclusters)
# fit
k_means.fit(X)
# labels: which of 0, 1, 2 each point belongs to
k_means.labels_
# the 3 learned centroid positions (close to [3,2], [1,-1], [-1,2]) => [1.04677914, -0.97038147], [3.14135743, 2.01895659], [-0.97958037, 2.04290344]
centers = k_means.cluster_centers_

ํด๋Ÿฌ์Šคํ„ฐ๋ง ๋œ ๊ฒฐ๊ณผ ์‹œ๊ฐํ™”

from matplotlib.colors import ListedColormap

colors_bold = ListedColormap(['#FF0000', '#00FF00', '#0000FF'])
colors_light = ListedColormap(['#FFAAAA', '#AAFFAA', '#AAAAFF'])

plt.figure(figsize=(8,6))

for i in range(nclusters):
    members = k_means.labels_ == i
    plt.plot(X[members, 0], X[members, 1], '.', 
             color=colors_light(i), markersize=10, label=i)
    plt.plot(centers[i, 0], centers[i, 1], 'o', 
             color=colors_bold(i), markeredgecolor='k', markersize=20)

plt.title("KMeans")
plt.legend()

DBSCAN

ํŒŒ๋ผ๋ฏธํ„ฐ

  1. eps : epsilon (radius)
  2. min_sample : minimum samples within the radius

๋ฐ์ดํ„ฐ ์ƒ์„ฑ

# ์ด์ƒ์น˜๊ฐ€ ์žˆ๋Š” ๋ฐ์ดํ„ฐ ํฌ์ธํŠธ ์ƒ์„ฑ
X1, _ = make_blobs(n_samples=500, centers=[[-3,-3]])
X2, _ = make_blobs(n_samples=500, centers=[[3,3]])
X3 = np.random.rand(500, 2) * 3 + 4
X4 = np.random.randn(10, 2) * 3  #outlier

X1.shape, X2.shape, X3.shape, X4.shape # ((500, 2), (500, 2), (500, 2), (10, 2))

# ๋ฐ์ดํ„ฐ ๋ถ„ํฌ ์‹œ๊ฐํ™”
plt.figure(figsize=(8, 6))
plt.scatter(X1[:, 0], X1[:, 1], marker='.')
plt.scatter(X2[:, 0], X2[:, 1], marker='.')
plt.scatter(X3[:, 0], X3[:, 1], marker='.')
plt.scatter(X4[:, 0], X4[:, 1], marker='.')

# 4๊ฐœ์˜ ๋ฐ์ดํ„ฐ๋ฅผ ํ•˜๋‚˜๋กœ ํ•ฉ์นจ
X = np.vstack([X1, X2, X3, X4])
X.shape # (1510, 2)
# ํ•ฉ์นœ ๋ฐ์ดํ„ฐ ์‹œ๊ฐํ™”
plt.figure(figsize=(8, 6))
plt.scatter(X[:, 0], X[:, 1], marker='.')

DBSCAN ๋ชจ๋ธ๋ง

from sklearn.cluster import DBSCAN
# ๋ชจ๋ธ๋ง
epsilon = 0.3 # ๋ฐ˜๊ฒฝ radius
minimumSamples = 7
db = DBSCAN(eps=epsilon, min_samples=minimumSamples).fit(X)

# ๋ ˆ์ด๋ธ” ํ™•์ธ
labels = db.labels_ # [ 0  0 -1 ...  4 -1 -1]
unique_labels = set(labels) # {-1, 0, 1, 2, 3, 4, 5}  -1 : outlier

print(labels.shape) # (1510, )
print(db.core_sample_indices_.shape) # (1234, )
print(db.core_sample_indices_) # [   0    1    3 ... 1499 1505 1507]

DBSCAN ์‹œ๊ฐํ™”

colors = plt.cm.Spectral(np.linspace(0, 1, len(unique_labels)))

plt.figure(figsize=(8, 6))

for k, col in zip(unique_labels, colors):
    members = (labels == k)
    plt.scatter(X[members, 0], X[members, 1], color=col, 
                marker='o', s=10)

plt.title('DBSCAN')
plt.show()

์ฐจ์› ์ถ•์†Œ ๊ธฐ๋ฒ• : PCA

์ฐจ์›์˜ ์ €์ฃผ (Curse of Dimensionality)

: ์ฐจ์›์ด ์ฆ๊ฐ€ํ•จ์— ๋”ฐ๋ผ vector ๊ณต๊ฐ„๋‚ด์˜ space ๋„ ์ฆ๊ฐ€ํ•˜๋Š”๋ฐ ๋ฐ์ดํ„ฐ์˜ ์–‘์ด ์ ์œผ๋ฉด ๋นˆ๊ณต๊ฐ„์ด ๋งŽ์ด ๋ฐœ์ƒํ•˜์—ฌ ์˜ˆ์ธก์˜ ์ •ํ™•๋„๊ฐ€ ๋–จ์–ด์ง„๋‹ค.

  (⇒ ํ”ผ์ฒ˜๊ฐ€ ์ฆ๊ฐ€ํ•˜๋ฉด ์ถ•์ด ๋Š˜์–ด๋‚˜๋ฏ€๋กœ, ์ฐจ์›์ด ์ฆ๊ฐ€ํ•œ๋‹ค. ๊ทธ๋Ÿฌ๋ฏ€๋กœ ํ”ผ์ฒ˜๊ฐ€ ๋Š˜์–ด๋‚  ์ˆ˜๋ก ๋ฐ์ดํ„ฐ์˜ ์–‘์ด ๋Š˜์–ด๋‚˜์•ผํ•œ๋‹ค.)

  1. ํ•ด๊ฒฐ๋ฐฉ๋ฒ• : ์œ ์‚ฌํ•œ ์„ฑ๊ฒฉ์˜ feature๋Š” ํ•˜๋‚˜์˜ ์ƒˆ๋กœ์šด feature๋กœ ์„ฑ๋ถ„์„ ํ•ฉ์น  ์ˆ˜ ์žˆ์Œ (์˜ˆ, ํ‚ค, ์‹ ์žฅ, ์•‰์€ํ‚ค ⇒ ํ‚ค)
  2. ์ฐจ์› ์ถ•์†Œ์‹œ, ์ •๋ณด ์†Œ์‹ค ๋ฐœ์ƒ
    ⇒ PC(principal component) : X, Y์ถ•์˜ ์ •๋ณด๋ฅผ ์–ด๋А์ •๋„ ๋ณด์กดํ•˜๋Š” ์ƒˆ๋กœ์šด ์„ 
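
A small sketch of that idea: two strongly correlated toy features (height and sitting height, made up here) collapse onto a single principal component while losing very little information:

import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
height = rng.normal(170, 10, size=200)
sitting_height = 0.52 * height + rng.normal(0, 1, size=200)  # highly correlated with height
X_toy = np.column_stack([height, sitting_height])

pca = PCA(n_components=1)
X_reduced = pca.fit_transform(X_toy)      # 2 features -> 1 new feature (the PC)
print(pca.explained_variance_ratio_)      # close to 1.0: almost no information is lost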

 

PCA (Principal Component Analysis)

1. ๋ฐฉ๋ฒ• : ์„ ํ˜•๋Œ€์ˆ˜ํ•™์˜ SVD (ํŠน์ด๊ฐ’ ๋ถ„ํ•ด) ๋ฅผ ์ด์šฉํ•˜์—ฌ ๋ถ„์‚ฐ์ด ์ตœ๋Œ€์ธ ์ถ•์„ ์ฐพ์Œ

  • ๋ฐ์ดํ„ฐ์˜ ๋ถ„์‚ฐ์„ ์ตœ๋Œ€ํ•œ ๋ณด์กดํ•˜๋ฉด์„œ ์„œ๋กœ ์ง๊ตํ•˜๋Š” ์ƒˆ ์ถ•์„ ์ฐพ์Œ
    ⇒ ๋ฐ์ดํ„ฐ๊ฐ€ ํผ์ ธ์žˆ๋Š” ์ •๋„๊ฐ€ ๊ฐ€์žฅ ์ž˜ ๋ณด์กด๋˜๋Š” ์„ ์„ ์ฐพ๋Š” ๊ฒƒ
         (๋ถ„์‚ฐ์„ ๋ณด์กดํ•œ๋‹ค == ์ตœ๋Œ€ํ•œ ์›๋ž˜์˜ ์ •๋ณด๋ฅผ ๋ณด์กดํ•œ๋‹ค)
  • ๊ณ ์ฐจ์› ๊ณต๊ฐ„์˜ ํ‘œ๋ณธ๋“ค์„ ์„ ํ˜• ์—ฐ๊ด€์„ฑ์ด ์—†๋Š” ์ €์ฐจ์› ๊ณต๊ฐ„์œผ๋กœ ๋ณ€ํ™˜

PCA ํŒŒ๋ผ๋ฏธํ„ฐ

  • components_
    • array, shape (n_components, n_features)
    • the principal component axes in the n_features-dimensional space
    • the directions that preserve the maximum variance of the data
    • sorted in order of explained_variance_
  • explained_variance_
    • shape (n_components,)
    • the amount of variance explained by each selected component
  • explained_variance_ratio_
    • shape (n_components,)
    • the percentage of variance explained by each selected component

PCA ์‹ค์Šต

# ๊ฐ ํ–‰์€ ๊ณ ๊ฐ์„ ๋‚˜ํƒ€๋‚ด๊ณ  ๊ฐ ์—ด์€ ๊ณ ๊ฐ์˜ ์†์„ฑ ํ‘œ์‹œ
# ์ง€๋‚œ๋‹ฌ์— ํƒˆํšŒํ•œ ๊ณ ๊ฐ์— ๋Œ€ํ•œ ์ •๋ณด๊ฐ€ ํฌํ•จ (Churn 1.0 - ํƒˆํšŒ, 0.0 - ์œ ์ง€)
# 28๊ฐœ์˜ ํ”ผ์ฒ˜๋ฅผ ์ฐจ์› ์ถ•์†Œํ•˜์—ฌ ํ•ด๋‹น ๊ณ ๊ฐ์ด ํƒˆํšŒํ•  ๊ฒƒ์ธ์ง€๋ฅผ ์˜ˆ์ธก
# 2์ฐจ์› ์ƒ์— ์‹œ๊ฐํ™”ํ•  ์ˆ˜ ์—†๊ธฐ ๋•Œ๋ฌธ์— pca๋ฅผ ํ†ตํ•ด 2์ฐจ์›์œผ๋กœ ์ถ•์†Œํ•˜๋Š” ๊ฒƒ

churn_df = pd.read_csv("datasets/ChurnData.csv")
# ๋ฐ์ดํ„ฐ ํ™•์ธ
churn_df.head()
# ์นผ๋Ÿผ ํ™•์ธ
churn_df.columns
# Index(['tenure', 'age', 'address', 'income', 'ed', 'employ', 'equip',
       'callcard', 'wireless', 'longmon', 'tollmon', 'equipmon', 'cardmon',
       'wiremon', 'longten', 'tollten', 'cardten', 'voice', 'pager',
       'internet', 'callwait', 'confer', 'ebill', 'loglong', 'logtoll',
       'lninc', 'custcat', 'churn'],
      dtype='object')
# churn : ์˜ˆ์ธกํ•  ํ”ผ์ฒ˜

# shape ํ™•์ธ
churn_df.shape # (200,28)

# features: every column except the churn target
X = churn_df.drop('churn', axis=1)
y = churn_df['churn']

churn_df.head() (columns truncated with "..."):

   tenure   age  address  income   ed  employ  equip  ...  loglong  logtoll  lninc  custcat  churn
0    11.0  33.0      7.0   136.0  5.0     5.0    0.0  ...    1.482    3.033  4.913      4.0    1.0
1    33.0  33.0     12.0    33.0  2.0     0.0    0.0  ...    2.246    3.240  3.497      1.0    1.0
2    23.0  30.0      9.0    30.0  1.0     2.0    0.0  ...    1.841    3.240  3.401      3.0    0.0
3    38.0  35.0      5.0    76.0  2.0    10.0    1.0  ...    1.800    3.807  4.331      4.0    0.0
4     7.0  35.0     14.0    80.0  2.0    15.0    0.0  ...    1.960    3.091  4.382      3.0    0.0

๋ฐ์ดํ„ฐ ๋ถ„๋ฆฌ

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# train / test dataset split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
# check the final data shapes
print(X_train.shape) # (160, 27)
print(X_test.shape) # (40, 27)
# scaling
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)

PCA ์ ์šฉ ์ „, logistic regression ๊ฒฐ๊ณผ : ํƒˆํšŒ ์—ฌ๋ถ€ ์˜ˆ์ธก

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Fitting Logistic Regression
clf = LogisticRegression(solver='lbfgs', random_state=0)
clf.fit(X_train, y_train)

# predict test set
y_pred = clf.predict(X_test)
# accuracy
accuracy_score(y_test, y_pred) # 0.775

PCA ์ ์šฉ ํ›„, logistic regression ๊ฒฐ๊ณผ : ํƒˆํšŒ ์—ฌ๋ถ€ ์˜ˆ์ธก

PCA ์ ์šฉ : ํ”ผ์ฒ˜ ๊ฐœ์ˆ˜ 27 -> 2

# Apply PCA
from sklearn.decomposition import PCA

pca = PCA(n_components=2)  # reduce to 2 components

X_train_pca = pca.fit_transform(X_train) # PCA finds the axes of maximum variance in the multi-dimensional space -> returns the data projected onto the two new axes
X_test_pca = pca.transform(X_test) # the test set must be projected the same way, so use transform only (not fit_transform)

print("X_train after dimensionality reduction:", X_train_pca.shape) # (160, 2)
print("X_test after dimensionality reduction:", X_test_pca.shape) # (40, 2)
print(pca.components_.shape) # (2, 27)

print("์ฒซ๋ฒˆ์งธ ์ฃผ์„ฑ๋ถ„(๊ณ ์œ ๋ฒกํ„ฐ) :")
print(pca.components_[0])

์ฒซ๋ฒˆ์งธ ์ฃผ์„ฑ๋ถ„(๊ณ ์œ ๋ฒกํ„ฐ) :
[0.18870382 0.09407626 0.06999421 0.02405283 0.08039882 0.10532452
 0.0913006  0.22012759 0.24022929 0.17035807 0.26783154 0.16389682
 0.25017627 0.28326203 0.17638143 0.27435707 0.22993114 0.24062665
 0.26112415 0.0747603  0.22729268 0.21587761 0.06959358 0.17457864
 0.17900112 0.07082588 0.29324012]

print("๋‘๋ฒˆ์งธ ์ฃผ์„ฑ๋ถ„(๊ณ ์œ ๋ฒกํ„ฐ) :")
print(pca.components_[1])

๋‘๋ฒˆ์งธ ์ฃผ์„ฑ๋ถ„(๊ณ ์œ ๋ฒกํ„ฐ) :
[ 0.2917276   0.18411246  0.24507417  0.04409899 -0.16349343  0.2142686
 -0.23252316  0.07599265 -0.23520159  0.29224485 -0.06742566 -0.24108039
  0.12944092 -0.20861384  0.29806582  0.04425842  0.22376079 -0.18866967
 -0.17737835 -0.24830636 -0.0674659  -0.05215805 -0.21541163  0.28738933
  0.03136523  0.07488121 -0.10196478]

print('์„ค๋ช…๋œ ๋ถ„์‚ฐ(๊ณ ์œ ๊ฐ’)์˜ ๋น„์œจ: {}, ๋‘ ์„ฑ๋ถ„์˜ ํ•ฉ: {:.2f}'
      .format(pca.explained_variance_ratio_,sum(pca.explained_variance_ratio_)))

์„ค๋ช…๋œ ๋ถ„์‚ฐ(๊ณ ์œ ๊ฐ’)์˜ ๋น„์œจ: [0.25193472 0.21764464], ๋‘ ์„ฑ๋ถ„์˜ ํ•ฉ: 0.47

Applying logistic regression

# Fitting Logistic Regression
clf = LogisticRegression(solver='lbfgs', random_state=0)
clf.fit(X_train_pca, y_train)

# predict test set
y_pred = clf.predict(X_test_pca)
y_pred  # array([0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 1., 1., 0., 1., 1., 0., 0., 0., 0., 0., 0., 0., 0.])

# Accuracy Score
from sklearn.metrics import accuracy_score
accuracy_score(y_test, y_pred) # 0.725 -> hardly any difference even after reducing the dimensions: almost no information was lost

์‹œ๊ฐํ™” : ์ฐจ์›์ถ•์†Œ๋œ churn data

→ 2๊ฐœ๋กœ ์ถ•์†Œ๋˜์—ˆ๊ธฐ ๋•Œ๋ฌธ์— ํ‰๋ฉด์ƒ์˜ ์‹œ๊ฐํ™” ๊ฐ€๋Šฅ

→ X_train_pca ⇒ X1, X2 ์ถ•์œผ๋กœ ์‹œ๊ฐํ™”

→ y_train : 0,1์„ ์ƒ‰์œผ๋กœ ๊ตฌ๋ถ„ (0 : ํƒˆํšŒ X, 1 : ํƒˆํšŒ O)
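
A sketch of that plot, assuming X_train_pca and y_train from above (colors and markers are arbitrary choices):

import matplotlib.pyplot as plt

plt.figure(figsize=(8, 6))
# y_train == 0.0 : customers who stayed, y_train == 1.0 : customers who churned
for label, color in [(0.0, 'tab:blue'), (1.0, 'tab:red')]:
    members = (y_train == label).to_numpy()
    plt.scatter(X_train_pca[members, 0], X_train_pca[members, 1],
                c=color, marker='.', label=f'churn = {label}')
plt.xlabel('PC1')
plt.ylabel('PC2')
plt.legend()
plt.title('Churn data reduced to 2 dimensions with PCA')
plt.show()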