Charming ['ㅡ'] Ham !

Python | Data Visualization

Charming_ham 2021. 1. 20. 12:39

Visualization¶

The libraries used for visualization here are Matplotlib and Seaborn. Let's install them first.

$ pip install matplotlib
$ pip install seaborn

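Plotting in Python is not very different from drawing a chart by hand.

Let's start with a bar chart!

Steps for visualizing with a graph¶

A plot can be built in a few simple steps.

Prepare the data to plot: categories, numeric values, and so on
Lay out the canvas: create a figure
Draw the axes: add_subplot()
Add labels and a title
Display the result: plt.show()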

1. Drawing a bar chart¶

In [6]:

# Import the module used for plotting
import matplotlib.pyplot as plt

# IPython magic command.
# It renders matplotlib figures inline as rich output (results such as images, sound, or animation)
%matplotlib inline

# Category labels for the graph
subject = ['English', 'Math', 'Korean', 'Science', 'Computer']

# Numeric values for the graph
points = [40, 90, 50, 60, 100]

# Draw the axes
# Lay out one canvas (figure), add a graph (subplot) to it, and draw the axes
# Evaluating fig on its own, before adding a subplot, only prints the figure size.
fig = plt.figure()
ax1 = fig.add_subplot(1, 1, 1)

# Draw the graph
ax1.bar(subject, points)

# Add labels and a title
plt.xlabel('Subject')
plt.ylabel('Point')
plt.title("Yuna's Test Result")

# Display
# Save the finished graph as an image
plt.savefig("./barplot.png")

# Show the image as the cell output
plt.show()

Defining the data¶

In [11]:

import matplotlib.pyplot as plt
%matplotlib inline

# Graph data
subject = ['English', 'Math', 'Korean', 'Science', 'Computer']
points = [40, 90, 50, 60, 100]

Laying out the canvas and preparing to draw¶

In [7]:

# The size of the canvas can also be set.
fig = plt.figure(figsize = (5, 2))
ax1 = fig.add_subplot(1, 1, 1)

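Deciding how many graphs to draw¶

In [8]:

fig = plt.figure()

# Several axes can be drawn on one figure.
# add_subplot(2, 2, n) divides the figure into a 2x2 grid (4 cells in total) and selects cell n;
# here we draw at positions 1, 2, and 4.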
ax1 = fig.add_subplot(2, 2, 1)
ax2 = fig.add_subplot(2, 2, 2)
ax3 = fig.add_subplot(2, 2, 4)
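
As a rough check of how the third argument of add_subplot() is numbered, the sketch below records which grid cell each axes ends up in. The Agg backend line and the positions dictionary are assumptions added so the snippet runs outside a notebook; they are not part of the original cells.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless backend so this runs outside a notebook
import matplotlib.pyplot as plt

fig = plt.figure()

# add_subplot(rows, cols, index): index starts at 1 and runs
# left to right, then top to bottom.
positions = {}
for index in (1, 2, 4):
    ax = fig.add_subplot(2, 2, index)
    spec = ax.get_subplotspec()
    # rowspan / colspan are zero-based ranges of the grid cells the axes covers
    positions[index] = (spec.rowspan.start, spec.colspan.start)
    ax.set_title(f"index {index}")

print(positions)  # {1: (0, 0), 2: (0, 1), 4: (1, 1)}
```

Index 3 (bottom left) is simply left empty, which is why the figure shows three graphs in a four-cell layout.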

Preparing the data and drawing the graph¶

In [9]:

# Prepare the graph data
subject = ['English', 'Math', 'Korean', 'Science', 'Computer']
points = [40, 90, 50, 60, 100]

# Draw the axes
fig = plt.figure()
ax1 = fig.add_subplot(1,1,1)

# Draw the graph
ax1.bar(subject,points)

Out[9]:

<BarContainer object of 5 artists>

Adding elements to the graph (labels and decoration)¶

In [10]:

# Graph data
subject = ['English', 'Math', 'Korean', 'Science', 'Computer']
points = [40, 90, 50, 60, 100]

# Draw the axes
fig = plt.figure()
ax1 = fig.add_subplot(1,1,1)

# Draw the graph
ax1.bar(subject, points)

# Add labels and a title
plt.xlabel('Subject')
plt.ylabel('Points')
plt.title("Yuna's Test Result")

Out[10]:

Text(0.5, 1.0, "Yuna's Test Result")

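2. Drawing a line graph¶

Defining (preparing) the data¶

$ wget https://aiffelstaticprd.blob.core.windows.net/media/documents/AMZN.csv
$ mv AMZN.csv ~/data_represent/data

Use your own data, or download the file with the wget command above, and save it in a directory of your choice. The path and the way you load the data may differ slightly depending on your setup.

Plotting with a pandas Series¶

A pandas Series has a structure that is well suited to line graphs. In the code below, price.plot uses pandas' plot while drawing onto the matplotlib subplot passed in through ax.
The basic structure of a plot is simple, but decorating it is comparatively harder. Read through the whole code once, then go over it slowly.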

In [17]:

from datetime import datetime
import pandas as pd
import os

# Graph data
csv_path = os.getenv("HOME") + "/data_represent/data/AMZN.csv"
data = pd.read_csv(csv_path, index_col = 0, parse_dates = True)

# pandas Series
price = data['Close']

# Draw the axes and plot the series onto them
fig = plt.figure()
ax = fig.add_subplot(1, 1, 1)
price.plot(ax = ax, style = 'black')

# Set the axis ranges
plt.ylim([1600, 2200])
plt.xlim(['2019-05-01', '2020-03-01'])


# Annotate
# Use the annotate() method to add text, arrows, and other notes inside the graph
important_data = [(datetime(2019, 6, 3), "Low Price"), (datetime(2020, 2, 19), "Peak Price")]

for d, label in important_data:
    ax.annotate(label, xy = (d, price.asof(d) + 10),
               xytext = (d, price.asof(d) + 100),
               arrowprops = dict(facecolor = 'red'))
    
# Add a grid and a title
plt.grid()
ax.set_title("StockPrice")

# Display
plt.show()
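
The cell above relies on two pandas behaviours that are easy to miss: Series.plot(ax=...) draws onto the axes it is given, and Series.asof(d) returns the last known value at or before d, which is what lets an annotation sit on a date that may not be in the index. The following is a minimal sketch with made-up prices and dates (not the AMZN data), and the Agg backend line is an assumption for running outside a notebook.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless backend so this runs outside a notebook
import matplotlib.pyplot as plt
import pandas as pd
from datetime import datetime

# Made-up closing prices on four irregularly spaced dates (illustrative only)
price = pd.Series(
    [100.0, 90.0, 95.0, 97.0],
    index=pd.to_datetime(["2019-05-31", "2019-06-03", "2019-06-04", "2019-06-07"]),
)

fig = plt.figure()
ax = fig.add_subplot(1, 1, 1)

# pandas draws onto the axes passed through ax and hands the same axes back
returned = price.plot(ax=ax, color="black")

# asof() looks up the last value at or before the given date
weekend_value = price.asof(datetime(2019, 6, 2))   # no row for this date -> uses 2019-05-31
exact_value = price.asof(datetime(2019, 6, 3))     # date is in the index

print(returned is ax)   # True
print(weekend_value)    # 100.0
print(exact_value)      # 90.0
```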

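Drawing graphs with plt.plot()¶

Earlier we said that laying out the canvas and drawing the axes (figure, add_subplot) is required. These steps can, however, be skipped. Calling plt.plot() directly draws on the figure and subplot that matplotlib used most recently, and creates them if none exist yet.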

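Strictly speaking, this is less a matter of skipping the steps than of reusing the previous layout; if you want to draw with a different layout, the steps cannot really be skipped.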
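
The behaviour described above can be checked with a small sketch. It assumes the Agg backend (so it runs outside a notebook) and uses throwaway data; the variable names are illustrative.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless backend so this runs outside a notebook
import matplotlib.pyplot as plt

plt.close("all")  # start from a clean state

fig = plt.figure()
ax = fig.add_subplot(1, 1, 1)

# plt.plot() draws on the current axes, i.e. the one created most recently
plt.plot([0, 1, 2], [0, 1, 4])
drew_on_existing = plt.gca() is ax and len(ax.lines) == 1

# With no figure open, plt.plot() creates a figure and axes implicitly
plt.close("all")
plt.plot([0, 1, 2], [4, 1, 0])
open_figures = len(plt.get_fignums())

print(drew_on_existing, open_figures)  # True 1
```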

The arguments of plt.plot() include the x data, the y data, marker options, the color, and so on.

In [19]:

import numpy as np

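# np.linspace is a NumPy function that is handy for plotting.
# It is called as np.linspace(start, stop, num), where num is the number of samples, not a step size.
# The line below therefore produces 100 evenly spaced values from 0 to 10, both ends included.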
x = np.linspace(0, 10, 100)

plt.plot(x, np.sin(x), 'o')
plt.plot(x, np.cos(x), '--', color = 'black')
plt.show()
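
Because the third argument of np.linspace is easy to mistake for a step, here is a short comparison with np.arange, which does take a step. The small sizes are chosen only to keep the output readable.

```python
import numpy as np

# linspace: third argument is the NUMBER of samples; the stop value is included
samples = np.linspace(0, 10, 6)
print(samples.tolist())   # [0.0, 2.0, 4.0, 6.0, 8.0, 10.0]

# arange: third argument is the STEP; the stop value is excluded
stepped = np.arange(0, 10, 2)
print(stepped.tolist())   # [0, 2, 4, 6, 8]

# The call used in the cell above yields exactly 100 points ending at 10
x = np.linspace(0, 10, 100)
print(len(x), x[-1])      # 100 10.0
```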

In [20]:

# Subplots can also be added with plt.subplot

x = np.linspace(0, 10, 100)

plt.subplot(2, 1, 1)
plt.plot(x, np.sin(x), 'o', color = 'orange')

plt.subplot(2, 1, 2)
plt.plot(x, np.cos(x), 'orange')
plt.show()

linestyle and marker options¶

The line style is passed as an argument to plot() and can be written in several ways.

In [21]:

# Various line styles

x = np.linspace(0, 10, 100)

plt.plot(x, x + 0, linestyle='solid') 
plt.plot(x, x + 1, linestyle='dashed') 
plt.plot(x, x + 2, linestyle='dashdot') 
plt.plot(x, x + 3, linestyle='dotted')
plt.plot(x, x + 0, '-g') # solid green 
plt.plot(x, x + 1, '--c') # dashed cyan 
plt.plot(x, x + 2, '-.k') # dashdot black 
plt.plot(x, x + 3, ':r'); # dotted red
plt.plot(x, x + 4, linestyle='-') # solid 
plt.plot(x, x + 5, linestyle='--') # dashed 
plt.plot(x, x + 6, linestyle='-.') # dashdot 
plt.plot(x, x + 7, linestyle=':'); # dotted
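
The cell above covers line styles only. As a complement, this sketch shows the marker side: a format string can combine marker, line style, and color, and the same settings can be given as keywords. The Agg backend line and the variable names are assumptions for running this outside a notebook.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless backend so this runs outside a notebook
import matplotlib.pyplot as plt
import numpy as np

x = np.linspace(0, 10, 20)
fig, ax = plt.subplots()

# Format string: marker 'o', line style '--', color 'r'
(line_a,) = ax.plot(x, np.sin(x), "o--r")

# The same kind of settings spelled out as keyword arguments
(line_b,) = ax.plot(x, np.cos(x), marker="s", linestyle=":", color="green", markersize=4)

print(line_a.get_marker(), line_a.get_linestyle(), line_a.get_color())  # o -- r
print(line_b.get_marker(), line_b.get_linestyle(), line_b.get_color())  # s : green
```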

Plotting with pandas¶

pandas can also draw many kinds of graphs through plot(). It is mostly used together with matplotlib, and its main parameters are listed below.

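Parameters of the pandas plot method

* label : name shown in the legend
* ax : matplotlib subplot object to draw on
* style : style string passed to matplotlib, such as 'ko--'
* alpha : opacity (0 to 1)
* kind : type of graph (line, bar, barh, kde)
* logy : use a log scale on the y-axis
* use_index : whether to use the object's index as tick labels
* rot : rotation of the tick labels (0 to 360)
* xticks, yticks : values to use for the x-axis and y-axis ticks
* xlim, ylim : x-axis and y-axis ranges
* grid : whether to show the axis grid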
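
A minimal sketch of several of these parameters used together on a Series. The data is random and the names (s, line) are illustrative; the Agg backend line is an assumption for running outside a notebook.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless backend so this runs outside a notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
s = pd.Series(rng.random(10).cumsum(), index=np.arange(0, 100, 10))

fig, ax = plt.subplots()

# ax: where to draw, style: black circles joined by a dashed line,
# alpha: opacity, label: legend entry, rot: tick label rotation, grid: show grid
s.plot(ax=ax, style="ko--", alpha=0.7, label="demo", rot=45, grid=True)
ax.legend()

line = ax.get_lines()[0]
print(line.get_marker(), line.get_linestyle(), line.get_label())  # o -- demo
```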

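Additional plot parameters when the pandas data is a DataFrame

* subplots : draw each DataFrame column in its own subplot
* sharex : if subplots = True, share the same x-axis, including ticks and range
* sharey : if subplots = True, share the same y-axis
* figsize : size of the graph, given as a tuple
* title : title of the graph, given as a string
* sort_columns : draw the columns sorted in ascending order (deprecated in recent pandas releases)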
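
The DataFrame-specific parameters can be sketched as follows. The random data, the column names, and the title text are placeholders; the Agg backend line is an assumption for running outside a notebook.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless backend so this runs outside a notebook
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df_demo = pd.DataFrame(rng.random((6, 4)), columns=list("ABCD"))

# subplots=True gives one axes per column; sharex=True makes them share the x-axis
axes = df_demo.plot(subplots=True, sharex=True, figsize=(6, 8), title="Random columns")

fig = axes[0].get_figure()
print(len(axes))                       # 4
print(tuple(fig.get_size_inches()))    # (6.0, 8.0)
```

With subplots=True the call returns an array of axes rather than a single one, so any further decoration has to be applied per axes.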

In [23]:

# Drawing bar graphs

fig, axes = plt.subplots(2, 1)
data = pd.Series(np.random.rand(5), index=list('abcde'))
data.plot(kind='bar', ax=axes[0], color='blue', alpha=1)
data.plot(kind='barh', ax=axes[1], color='red', alpha=0.3)

Out[23]:

<AxesSubplot:>

In [25]:

# Drawing a line graph

df = pd.DataFrame(np.random.rand(6,4), columns=pd.Index(['A','B','C','D']))
df.plot(kind='line')

Out[25]:

<AxesSubplot:>

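Drawing commonly used graphs¶

The most commonly used graphs are the following.

Bar graph
Line graph
Scatter plot
Histogram

When preparing data, Seaborn's load_dataset() makes it easy to download example datasets through its API.

Datasets downloaded this way are stored by default in ~/seaborn-data/.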

In [27]:

# Prepare the data

import pandas as pd
import seaborn as sns

tips = sns.load_dataset('tips')

In [39]:

# Explore the data (EDA)

# Look at the data with pandas
df = pd.DataFrame(tips)

# Check the first 5 rows
df.head()

Out[39]:

	total_bill	tip	sex	smoker	day	time	size
0	16.99	1.01	Female	No	Sun	Dinner	2
1	10.34	1.66	Male	No	Sun	Dinner	3
2	21.01	3.50	Male	No	Sun	Dinner	3
3	23.68	3.31	Male	No	Sun	Dinner	2
4	24.59	3.61	Female	No	Sun	Dinner	4

In [38]:

# Check the shape of the data
df.shape

Out[38]:

(244, 7)

In [37]:

# Summary statistics of the data
df.describe()

Out[37]:

	total_bill	tip	size
count	244.000000	244.000000	244.000000
mean	19.785943	2.998279	2.569672
std	8.902412	1.383638	0.951100
min	3.070000	1.000000	1.000000
25%	13.347500	2.000000	2.000000
50%	17.795000	2.900000	2.000000
75%	24.127500	3.562500	3.000000
max	50.810000	10.000000	6.000000

In [36]:

# Overview of the data
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 244 entries, 0 to 243
Data columns (total 7 columns):
 #   Column      Non-Null Count  Dtype   
---  ------      --------------  -----   
 0   total_bill  244 non-null    float64 
 1   tip         244 non-null    float64 
 2   sex         244 non-null    category
 3   smoker      244 non-null    category
 4   day         244 non-null    category
 5   time        244 non-null    category
 6   size        244 non-null    int64   
dtypes: category(4), float64(2), int64(1)
memory usage: 7.3 KB

In [35]:

# Count the rows in each category of the categorical variables

print(df['sex'].value_counts())
print("===========================")

print(df['time'].value_counts())
print("===========================")

print(df['smoker'].value_counts())
print("===========================")

print(df['day'].value_counts())
print("===========================")

print(df['size'].value_counts())
print("===========================")

Male      157
Female     87
Name: sex, dtype: int64
===========================
Dinner    176
Lunch      68
Name: time, dtype: int64
===========================
No     151
Yes     93
Name: smoker, dtype: int64
===========================
Sat     87
Sun     76
Thur    62
Fri     19
Name: day, dtype: int64
===========================
2    156
3     38
4     37
5      5
6      4
1      4
Name: size, dtype: int64
===========================

Categorical data¶

Categorical data is usually summarized with bar graphs, most often horizontal, vertical, stacked, or grouped ones.

In this dataset, the categorical variables are the sex, smoker, day, time, and size columns.

Bar graph¶

matplotlib's plotting functions do not take a whole pandas DataFrame directly; the data has to be split into x and y values, each passed as a Series or a list.
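
The split into x and y can be sketched on a tiny hand-made table (the values below are invented for illustration and are not the tips data). The sketch also gives the bars a label, since plt.legend() only has something to show when at least one artist is labeled. The Agg backend line is an assumption for running outside a notebook.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless backend so this runs outside a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Invented example rows, standing in for a real dataset
demo = pd.DataFrame({
    "sex": ["Male", "Female", "Male", "Female"],
    "tip": [3.0, 2.0, 4.0, 3.0],
})

# Aggregate with pandas, then hand matplotlib plain x and y sequences
means = demo.groupby("sex")["tip"].mean()
x = list(means.index)    # category names
y = list(means.values)   # mean tip per category

fig, ax = plt.subplots()
bars = ax.bar(x, y, label="mean tip")
ax.set_ylabel("tip[$]")
ax.legend()

heights = [bar.get_height() for bar in bars]
print(x, heights)  # ['Female', 'Male'] [2.5, 3.5]
```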

In [42]:

# Mean of the tip column grouped by sex

grouped = df['tip'].groupby(df['sex'])

grouped.mean()

Out[42]:

sex
Male      3.089618
Female    2.833448
Name: tip, dtype: float64

In [44]:

# Number of tips received per sex (size)

grouped.size()

Out[44]:

sex
Male      157
Female     87
Name: tip, dtype: int64

In [45]:

# Visualize the tip data by sex (bar graph)

import numpy as np

sex = dict(grouped.mean())
x = list(sex.keys())
y = list(sex.values())

# A label is needed so that plt.legend() has an entry to show
plt.bar(x, y, label = 'mean tip')
plt.ylabel('tip[$]')
plt.title('Tip by Sex')
plt.legend()

Out[45]:

<matplotlib.legend.Legend at 0x7f63e49e7c90>

In [46]:

# A simpler way using Seaborn together with Matplotlib

sns.barplot(data = df, x = 'sex', y = 'tip')

Out[46]:

<AxesSubplot:xlabel='sex', ylabel='tip'>

In [47]:

# Add options such as figsize and title with Matplotlib

plt.figure(figsize = (10, 6))
sns.barplot(data = df, x = 'sex', y = 'tip')
plt.ylim(0, 4)
plt.title('Tip by sex')

Out[47]:

Text(0.5, 1.0, 'Tip by sex')

In [48]:

# Tip by day of the week

plt.figure(figsize = (10, 6))
sns.barplot(data = df, x = 'day', y = 'tip')
plt.ylim(0, 4)
plt.title('Tip by day')

Out[48]:

Text(0.5, 1.0, 'Tip by day')

In [54]:

# Showing categorical data with barplot and violinplot

fig = plt.figure(figsize = (13, 10))
ax1 = fig.add_subplot(2, 2, 1)
sns.barplot(data = df, x = 'day', y = 'tip', palette = 'ch:.25')
ax2 = fig.add_subplot(2, 2, 2)
sns.barplot(data = df, x = 'sex', y = 'tip')
ax3 = fig.add_subplot(2, 2, 4)
sns.violinplot(data = df, x = 'sex', y = 'tip')
ax4 = fig.add_subplot(2, 2, 3)
sns.violinplot(data = df, x = 'day', y = 'tip', palette = 'ch:.25')

Out[54]:

<AxesSubplot:xlabel='day', ylabel='tip'>

In [55]:

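# Categorical plot with catplot
# (the default kind is 'strip', so this draws a strip plot without jitter rather than bars)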

sns.catplot(x = 'day', y = 'tip', jitter = False, data = tips)

Out[55]:

<seaborn.axisgrid.FacetGrid at 0x7f63e01b6410>
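
catplot() draws a strip plot unless told otherwise, so to get bars from it the kind has to be set explicitly. The sketch below uses a small invented table instead of the tips data so that it runs without a download; the Agg backend line is an assumption for running outside a notebook.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless backend so this runs outside a notebook
import pandas as pd
import seaborn as sns

# Invented example rows, standing in for the tips dataset
demo = pd.DataFrame({
    "day": ["Thur", "Thur", "Fri", "Fri"],
    "tip": [2.0, 4.0, 1.0, 3.0],
})

# Default kind="strip": one point per row
strip_grid = sns.catplot(data=demo, x="day", y="tip", jitter=False)

# kind="bar": one bar per category, showing the mean of y
bar_grid = sns.catplot(data=demo, x="day", y="tip", kind="bar")

print(type(bar_grid).__name__)  # FacetGrid
```

Both calls return a FacetGrid, a figure-level object, which is why catplot() is not passed an ax the way barplot() can be.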

Numerical data¶

The graphs best suited to numerical data are scatter plots and line graphs.

Let's visualize how the tip varies with the total bill!

Scatter plot¶

In [56]:

sns.scatterplot(data = df, x = 'total_bill', y = 'tip', palette = "ch:r=-.2, d=.3_r")

Out[56]:

<AxesSubplot:xlabel='total_bill', ylabel='tip'>

In [57]:

sns.scatterplot(data = df, x = 'total_bill', y = 'tip', hue = 'day')

Out[57]:

<AxesSubplot:xlabel='total_bill', ylabel='tip'>

Visualizing with a line graph¶

This is the type of graph that plot draws by default.

In [64]:

# Draw a line graph from random values

plt.plot(np.random.rand(50).cumsum())

Out[64]:

[<matplotlib.lines.Line2D at 0x7f63e042ad50>]

In [65]:

x = np.linspace(0, 10, 100)
plt.plot(x, np.sin(x), 'o')
plt.plot(x, np.cos(x))
plt.show()

In [66]:

# Drawing line graphs with Seaborn
# x and y are passed as keyword arguments; positional use is deprecated from seaborn 0.12

sns.lineplot(x = x, y = np.sin(x))
sns.lineplot(x = x, y = np.cos(x))

Out[66]:

<AxesSubplot:>

Histogram¶

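A histogram is a frequency distribution table drawn as a graph. It is described by:

Horizontal axis : classes (intervals of the variable, also called bins or buckets)

Vertical axis : frequency (the count in each class)

Total count : n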
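
To make the three quantities concrete, np.histogram can compute the frequency table without drawing anything. This is a sketch on random data; the mean of 100 and standard deviation of 15 are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000
x = 100 + 15 * rng.standard_normal(n)

# counts: frequency per bin (vertical axis); edges: bin boundaries (horizontal axis)
counts, edges = np.histogram(x, bins=50)
print(len(counts), len(edges))   # 50 51
print(counts.sum() == n)         # True: the frequencies add up to the total count n

# density=True rescales the bars so that their total area is 1
density, _ = np.histogram(x, bins=50, density=True)
area = (density * np.diff(edges)).sum()
```

Fifty bins need fifty-one edges, and switching to density changes only the vertical scale, not the shape.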

In [69]:

# The means of x1 and x2 are 100 and 130
# Use 50 bins and show counts rather than probability density
# Histogram

# Graph data

mu1, mu2, sigma = 100, 130, 15
x1 = mu1 + sigma*np.random.randn(10000)
x2 = mu2 + sigma*np.random.randn(10000)

# Draw the axes
fig = plt.figure()
ax1 = fig.add_subplot(1, 1, 1)

# Draw the graph
patches = ax1.hist(x1, bins = 50, density = False)
patches = ax1.hist(x2, bins = 50, density = False, alpha = 0.5)
ax1.xaxis.set_ticks_position('bottom')
ax1.yaxis.set_ticks_position('left')

# Add labels and a title
plt.xlabel('Bins')
plt.ylabel('Number of Values in bin')
ax1.set_title('Two Frequency Distributions')

# Display
plt.show()

In [70]:

# Histograms of the tips data
# distplot is deprecated; histplot is the axes-level replacement suggested by seaborn

sns.histplot(df['total_bill'], kde = True, stat = 'density')
sns.histplot(df['tip'], kde = True, stat = 'density')

Out[70]:

<AxesSubplot:xlabel='tip', ylabel='Density'>

In [72]:

# Histogram of the tip as a fraction of the total bill

df['tip_pct'] = df['tip'] / df['total_bill']
df['tip_pct'].hist(bins = 50)

Out[72]:

<AxesSubplot:>

In [74]:

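# Probability density graph using kind = 'kde'
# A density graph represents a continuous probability distribution.
# It is usually approximated as a mixture of kernels,
# typically simple distributions such as the normal distribution.
# This density graph is a KDE (Kernel Density Estimate) graph.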

df['tip_pct'].plot(kind = 'kde')

Out[74]:

<AxesSubplot:ylabel='Density'>
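
To illustrate what "a mixture of kernels" means, here is a hand-written kernel density estimate using only NumPy. It is a sketch of the idea, not the code pandas runs internally: the samples are random stand-ins for tip_pct, the function name gaussian_kde_1d is hypothetical, and the bandwidth uses Scott's rule of thumb as an assumption.

```python
import numpy as np

rng = np.random.default_rng(0)
# Random stand-in values (illustrative only, not the real tip_pct column)
samples = rng.normal(loc=0.16, scale=0.06, size=200)

def gaussian_kde_1d(samples, grid, bandwidth):
    """Average of Gaussian bumps, one centred on each sample."""
    z = (grid[:, None] - samples[None, :]) / bandwidth
    kernels = np.exp(-0.5 * z**2) / np.sqrt(2 * np.pi)
    return kernels.mean(axis=1) / bandwidth

# Assumption: Scott's rule of thumb for the bandwidth
bandwidth = samples.std(ddof=1) * len(samples) ** (-1 / 5)

grid = np.linspace(samples.min() - 5 * bandwidth, samples.max() + 5 * bandwidth, 1000)
density = gaussian_kde_1d(samples, grid, bandwidth)

# A density estimate is non-negative and integrates to approximately 1
dx = grid[1] - grid[0]
area = (density * dx).sum()
```

A smaller bandwidth gives a bumpier curve that follows individual samples, while a larger one gives a smoother curve that can hide real structure.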