DATA - 29. Bivariate Data Visualization

Baek Kyun Shin 2019. 6. 18. 22:50

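The previous post, on univariate data visualization, covered the simplest plots: histograms and bar charts. This chapter looks at plots for bivariate data, that is, data with two variables.

The full code and data are available on my GitHub.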

import

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sb

%matplotlib inline

fuel_econ = pd.read_csv('./fuel-econ.csv')

Two numeric variables

Two numeric variables can be visualized with a scatter plot, a heat map, or a line plot. Each is covered below. Data is divided into numeric and categorical data; for details, see DATA - 5. Types of Data (Quantitative Data, Qualitative Data).

Scatter plot

plt.scatter

plt.scatter(data = fuel_econ, x = 'city', y = 'highway', alpha = 1/8)
plt.xlabel('City Fuel Eff. (mpg)')
plt.ylabel('Highway Fuel Eff. (mpg)');

sb.regplot

sb.regplot(data=fuel_econ, x='city', y='highway', fit_reg=False,
          scatter_kws={'alpha': 0.2})
plt.xlabel('City Fuel Eff. (mpg)')
plt.ylabel('Highway Fuel Eff. (mpg)');

Heat map

bins_x = np.arange(0.6, fuel_econ['displ'].max()+0.4, 0.4)
bins_y = np.arange(0, fuel_econ['co2'].max()+50, 50)
plt.hist2d(data = fuel_econ, x = 'displ', y = 'co2', bins = [bins_x, bins_y], 
           cmap = 'viridis_r', cmin = 0.5)
plt.colorbar()
plt.xlabel('Displacement (l)')
plt.ylabel('CO2 (g/mi)');
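
The `+0.4` and `+50` in the bin definitions are there because `np.arange` excludes its stop value. Adding one bin width to the maximum guarantees that the last edge lies at or beyond the largest observation. Here is a minimal sketch with made-up numbers (the values are illustrative and not taken from fuel-econ.csv):

```python
import numpy as np

# Made-up values standing in for a column such as 'co2'
co2_like = np.array([29, 180, 310, 423])
bin_width = 50

# Without the extra bin width, the last edge falls short of the maximum
edges_short = np.arange(0, co2_like.max(), bin_width)

# With it, the last edge is at or beyond the maximum
edges_full = np.arange(0, co2_like.max() + bin_width, bin_width)

print(edges_short[-1], edges_full[-1], co2_like.max())
```

With the short version, the largest value would sit outside every bin and silently drop out of the heat map.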

The code below annotates each cell of the heat map with its count.

bins_x = np.arange(0.6, fuel_econ['displ'].max()+0.4, 0.4)
bins_y = np.arange(0, fuel_econ['co2'].max()+50, 50)
h2d = plt.hist2d(data = fuel_econ, x = 'displ', y = 'co2', bins = [bins_x, bins_y], 
           cmap = 'viridis_r', cmin = 0.5)
plt.colorbar()
plt.xlabel('Displacement (l)')
plt.ylabel('CO2 (g/mi)')

counts = h2d[0]

# loop through the cell counts and add text annotations for each
for i in range(counts.shape[0]):
    for j in range(counts.shape[1]):
        c = counts[i,j]
        if c >= 250: # increase visibility on darkest cells
            plt.text(bins_x[i]+0.2, bins_y[j]+25, int(c),
                     ha = 'center', va = 'center', color = 'white')
        elif c > 0:
            plt.text(bins_x[i]+0.2, bins_y[j]+25, int(c),
                     ha = 'center', va = 'center', color = 'black')
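
In the loop above, `counts[i, j]` is indexed with the x bin first and the y bin second, which is why the text is placed using `bins_x[i]` and `bins_y[j]`. The sketch below checks that convention with `np.histogram2d`, the function `plt.hist2d` relies on for binning. The three data points and the bin edges are made up for illustration.

```python
import numpy as np

# Two points in the first x bin / first y bin, one in the last x bin / last y bin
x = np.array([0.5, 0.5, 2.5])
y = np.array([10.0, 10.0, 30.0])

edges_x = np.array([0.0, 1.0, 2.0, 3.0])  # 3 bins along x
edges_y = np.array([0.0, 20.0, 40.0])     # 2 bins along y

counts, _, _ = np.histogram2d(x, y, bins=[edges_x, edges_y])

# First axis follows x, second axis follows y
print(counts.shape)
print(counts[0, 0], counts[2, 1])
```

Note that with `cmin = 0.5`, `plt.hist2d` replaces empty cells with NaN, so neither `c >= 250` nor `c > 0` is true for them and no label is drawn there.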

Line plot

# set bin edges, compute centers
bin_size = 10.0
xbin_edges = np.arange(10.0, fuel_econ['city'].max()+bin_size, bin_size)
xbin_centers = (xbin_edges + bin_size/2)[:-1]

# compute statistics in each bin
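data_xbins = pd.cut(fuel_econ['city'], xbin_edges, right=False, include_lowest=True)
# observed=False keeps empty bins so the results stay aligned with xbin_centers
y_means = fuel_econ['highway'].groupby(data_xbins, observed=False).mean()
y_sems = fuel_econ['highway'].groupby(data_xbins, observed=False).sem()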

# plot the summarized data
plt.errorbar(x=xbin_centers, y=y_means.values, yerr=y_sems.values)
plt.xlabel('city')
plt.ylabel('highway');
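
The line plot is a summary of the scatter plot: `pd.cut` assigns each city value to a bin, and the mean and standard error of highway are then computed per bin. Here is a small sketch of the same steps on invented numbers (not from fuel-econ.csv), so the results can be checked by hand:

```python
import numpy as np
import pandas as pd

# Invented data: three cars in the [10, 20) bin, three in the [20, 30) bin
df = pd.DataFrame({
    'city':    [12, 14, 18, 22, 25, 28],
    'highway': [20, 22, 24, 30, 34, 38],
})

bin_size = 10.0
edges = np.arange(10.0, df['city'].max() + bin_size, bin_size)  # 10, 20, 30
centers = (edges + bin_size / 2)[:-1]                           # 15, 25

# Left-closed bins, matching right=False in the code above
bins = pd.cut(df['city'], edges, right=False)
means = df['highway'].groupby(bins, observed=False).mean()
sems = df['highway'].groupby(bins, observed=False).sem()

print(centers)
print(means.values)
print(sems.values)
```

The standard error is the sample standard deviation divided by the square root of the bin size, so bins with few cars get longer error bars.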

One numeric variable and one categorical variable

When one variable is numeric and the other is categorical, a violin plot or a box plot can be used.

Violin plot (basic form)

sb.violinplot(data=fuel_econ, x='fuelType', y='displ')
plt.xticks(rotation=15);

Violin plot (changing the data type)

sedan_classes = ['Minicompact Cars', 'Subcompact Cars', 'Compact Cars', 'Midsize Cars', 'Large Cars']
pd_ver = pd.__version__.split(".")
if (int(pd_ver[0]) > 0) or (int(pd_ver[1]) >= 21): # v0.21 or later
    vclasses = pd.api.types.CategoricalDtype(ordered=True, categories=sedan_classes)
    fuel_econ['VClass'] = fuel_econ['VClass'].astype(vclasses)
else: # pre-v0.21
    fuel_econ['VClass'] = fuel_econ['VClass'].astype('category', ordered=True,
                                                     categories=sedan_classes)

base_color = sb.color_palette()[0]
sb.violinplot(data=fuel_econ, x='VClass', y='displ', 
              color=base_color, inner='quartile')
plt.xticks(rotation=15);
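
Converting VClass to an ordered categorical is what makes seaborn draw the classes from smallest to largest rather than in order of appearance. Here is a minimal sketch on a hand-made Series. Any value missing from the category list becomes NaN, so the list needs to cover every class you want to keep. 'Pickup Trucks' below is a hypothetical extra value used only to show this.

```python
import pandas as pd

sedan_classes = ['Minicompact Cars', 'Subcompact Cars', 'Compact Cars',
                 'Midsize Cars', 'Large Cars']
vclasses = pd.api.types.CategoricalDtype(categories=sedan_classes, ordered=True)

# Hand-made values; the last one is not in sedan_classes
raw = pd.Series(['Large Cars', 'Compact Cars', 'Minicompact Cars', 'Pickup Trucks'])
typed = raw.astype(vclasses)

# Sorting follows the declared size order, not alphabetical order
sorted_values = typed.sort_values().tolist()

# Values outside the category list were turned into NaN
n_missing = int(typed.isna().sum())

print(sorted_values)
print(n_missing)
```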

Box plot

sb.boxplot(data=fuel_econ, x='VClass', y='displ', color=base_color)
plt.xticks(rotation=15);
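
For reference, the box in a box plot spans the first to the third quartile, with a line at the median. With the default settings the whiskers reach the most extreme points within 1.5 times the interquartile range of the box, and anything beyond is drawn as an individual point. The sketch below computes these values by hand for a made-up sample.

```python
import numpy as np

# Made-up sample with one clearly extreme value
sample = np.array([1.0, 2.0, 2.0, 3.0, 3.0, 3.0, 4.0, 4.0, 5.0, 12.0])

q1, median, q3 = np.percentile(sample, [25, 50, 75])
iqr = q3 - q1

# Whiskers end at the most extreme data points inside the 1.5 * IQR fences
lower_whisker = sample[sample >= q1 - 1.5 * iqr].min()
upper_whisker = sample[sample <= q3 + 1.5 * iqr].max()

# Points outside the fences are plotted individually
outliers = sample[(sample < q1 - 1.5 * iqr) | (sample > q3 + 1.5 * iqr)]

print(q1, median, q3)
print(lower_whisker, upper_whisker, outliers)
```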

Two categorical variables

Clustered bar chart

fuel_econ_sub = fuel_econ.loc[fuel_econ['fuelType'].isin(['Premium Gasoline', 'Regular Gasoline'])]

ax = sb.countplot(data=fuel_econ_sub, x='VClass', hue='fuelType')
ax.legend(loc=4, framealpha=1, title='fuelType') # lower right, no transparency
plt.xticks(rotation=15);

Heat map

type_counts = fuel_econ_sub.groupby(['VClass', 'fuelType']).size()
type_counts = type_counts.reset_index(name = 'count')
type_counts = type_counts.pivot(index='fuelType', columns='VClass', values='count')
sb.heatmap(type_counts, annot = True, fmt = 'd');
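
`sb.heatmap` expects a 2-D table, so the long list of (VClass, fuelType) counts has to be pivoted first. The sketch below runs the same groupby and pivot steps on a tiny invented frame and compares the result with `pd.crosstab`, which builds the same count table in one call.

```python
import pandas as pd

# Invented rows, not taken from fuel-econ.csv
df = pd.DataFrame({
    'VClass':   ['Compact Cars', 'Compact Cars', 'Large Cars',
                 'Large Cars', 'Large Cars'],
    'fuelType': ['Regular Gasoline', 'Premium Gasoline', 'Regular Gasoline',
                 'Regular Gasoline', 'Premium Gasoline'],
})

# Long format: one row per (VClass, fuelType) pair with its count
counts = df.groupby(['VClass', 'fuelType']).size().reset_index(name='count')

# Wide format: fuelType as rows, VClass as columns
table = counts.pivot(index='fuelType', columns='VClass', values='count')

# One-call alternative
alt = pd.crosstab(df['fuelType'], df['VClass'])

print(table)
```

One difference worth knowing: if a combination never occurs, `pivot` can leave a NaN in that cell, which may make `fmt = 'd'` fail, whereas `pd.crosstab` fills it with 0.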

Faceting

THRESHOLD = 80
make_frequency = fuel_econ['make'].value_counts()
idx = np.sum(make_frequency > THRESHOLD)

most_makes = make_frequency.index[:idx]
fuel_econ_sub = fuel_econ.loc[fuel_econ['make'].isin(most_makes)]

# average only 'comb'; a frame-wide mean() fails on text columns in recent pandas
make_means = fuel_econ_sub.groupby('make')['comb'].mean()
comb_order = make_means.sort_values(ascending=False).index

# plotting
g = sb.FacetGrid(data=fuel_econ_sub, col='make', col_wrap=6, height=2,
                 col_order=comb_order)
g.map(plt.hist, 'comb', bins=np.arange(12, fuel_econ_sub['comb'].max()+2, 2))
g.set_titles('{col_name}');

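Here idx = 18. The comparison make_frequency > THRESHOLD yields True or False for each row, and summing treats True as 1 and False as 0. idx is therefore the number of makes whose frequency is greater than THRESHOLD. Because value_counts() sorts in descending order of frequency, make_frequency.index[:idx] selects exactly those makes.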
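
The counting step can be tried on a toy Series. The make names 'A', 'B', 'C' and the threshold of 1 are made up for this sketch.

```python
import numpy as np
import pandas as pd

# Made-up makes: 'A' appears 3 times, 'B' twice, 'C' once
makes = pd.Series(['A', 'A', 'A', 'B', 'B', 'C'])
THRESHOLD = 1

make_frequency = makes.value_counts()  # sorted from most to least frequent
mask = make_frequency > THRESHOLD      # True, True, False
idx = int(np.sum(mask))                # True counts as 1, False as 0

# The first idx index labels are the makes above the threshold
most_makes = make_frequency.index[:idx].tolist()

print(idx, most_makes)
```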

Attribution-NonCommercial-NoDerivatives
