Pandas ๋ผ์ด๋ธŒ๋Ÿฌ๋ฆฌ ์‚ฌ์šฉ๋ฒ•

BoostCamp AI Tech

Python

Pandas

01/27/2021


๋ณธ ์ •๋ฆฌ ๋‚ด์šฉ์€ Naver BoostCamp AI Tech์˜ edwith์—์„œ ์ˆ˜๊ฐ•ํ•œ ๋‚ด์šฉ์„ ์ •๋ฆฌํ•œ ๊ฒƒ์ž…๋‹ˆ๋‹ค.
์‚ฌ์‹ค๊ณผ ๋‹ค๋ฅธ ๋ถ€๋ถ„์ด ์žˆ๊ฑฐ๋‚˜, ์ˆ˜์ •์ด ํ•„์š”ํ•œ ์‚ฌํ•ญ์€ ๋Œ“๊ธ€๋กœ ๋‚จ๊ฒจ์ฃผ์„ธ์š”.


Pandas

panel datas์—์„œ ๋‚˜์˜จ ์ด๋ฆ„์œผ๋กœ, ๊ตฌ์กฐํ™”๋œ ๋ฐ์ดํ„ฐ์˜ ์ฒ˜๋ฆฌ๋ฅผ ์ง€์›ํ•˜๋Š” Python์˜ ๋ผ์ด๋ธŒ๋Ÿฌ๋ฆฌ์ด๋‹ค. Python์˜ ์—‘์…€์ด๋ผ๊ณ  ํ•  ์ˆ˜ ์žˆ๋‹ค.

  • ๊ณ ์„ฑ๋Šฅ array ๊ณ„์‚ฐ์šฉ ๋ผ์ด๋ธŒ๋Ÿฌ๋ฆฌ์ธ numpy์™€ ํ†ตํ•ฉํ•˜์—ฌ ๊ฐ•๋ ฅํ•œ ์Šคํ”„๋ ˆ๋“œ์‹œํŠธ ์ฒ˜๋ฆฌ ๊ธฐ๋Šฅ์„ ์ œ๊ณตํ•œ๋‹ค.
  • ์ธ๋ฑ์‹ฑ, ์—ฐ์‚ฐ์šฉ ํ•จ์ˆ˜, ์ „์ฒ˜๋ฆฌ ํ•จ์ˆ˜ ๋“ฑ์„ ์ œ๊ณตํ•œ๋‹ค.
  • ๋ฐ์ดํ„ฐ ์ฒ˜๋ฆฌ ๋ฐ ํ†ต๊ณ„ ๋ถ„์„์„ ์œ„ํ•ด ์‚ฌ์šฉ๋œ๋‹ค.
  • Tabular ๋ฐ์ดํ„ฐ(ํ…Œ์ด๋ธ”ํ˜• ๋ฐ์ดํ„ฐ)๋ฅผ ๋‹ค๋ฃจ๋Š” ๋ฐ์— ์ตœ์ ํ™”๋˜์–ด์žˆ๋‹ค.

pandas_data_type

์ด ์ค‘, Data Table ์ „์ฒด๋ฅผ ํฌํ•จํ•˜๋Š” ๊ฐ์ฒด๋ฅผ DataFrame, Column ํ•˜๋‚˜์— ํ•ด๋‹นํ•˜๋Š” ๋ฐ์ดํ„ฐ์…‹ ๊ฐ์ฒด๋ฅผ Series๋ผ๊ณ  ํ•œ๋‹ค.

Series

PYTHON
list_data = [1,2,3,4,5]
example_obj = Series(data = list_data)
example_obj
'''
0 1
1 2
2 3
3 4
4 5
dtype: int64
'''

data ํŒŒ๋ผ๋ฏธํ„ฐ๋ฅผ ๋ฐ›์•„ ๊ฐ์ฒด๋ฅผ ์ƒ์„ฑํ•˜๋ฉฐ, Numpy ์˜ ndarray ๋ฅผ ๊ธฐ๋ฐ˜์œผ๋กœ ๋งŒ๋“ค์–ด์ง„, Pandas ์—์„œ ์‚ฌ์šฉํ•˜๊ธฐ ์œ„ํ•œ wrapper ๊ฐ์ฒด๋ผ๊ณ  ๋ณผ์ˆ˜๋„ ์žˆ๋‹ค.(Subclass of numpy.ndarray)

ํŠน์ดํ•œ ์ ์€, ์œ„์™€ ๊ฐ™์ด [1,2,3,4,5] ๋ผ๋Š” ๋ฐ์ดํ„ฐ๋งŒ ๋“ค์–ด๊ฐ€๋Š”๊ฒƒ์ด ์•„๋‹ˆ๋ผ, ์ด ๋ฐ์ดํ„ฐ์˜ ์ธ๋ฑ์Šค ๊ฐ’ [0,1,2,3,4] ๋„ ๊ฐ™์ด ์ €์žฅ๋œ๋‹ค. ์ด ์ธ๋ฑ์Šค๊ฐ’์€ ๋ฌธ์ž์—ด๋กœ๋„ ์ง€์ • ๊ฐ€๋Šฅํ•˜๋‹ค.

Series ๊ฐ์ฒด๋ฅผ ์ƒ์„ฑํ•  ๋•Œ, Index๋ฅผ ๊ธฐ์ค€์œผ๋กœ ์ƒ์„ฑํ•˜๋ฏ€๋กœ ๋งŒ์•ฝ data๊ฐ€ ์—†๋Š” index๊ฐ€ ์žˆ๋‹ค๋ฉด NaN๊ฐ’์ด ์ฑ„์›Œ์ง„๋‹ค.

  • data์™€ index๋ฅผ ๋”ฐ๋กœ ๋ช…์‹œํ•˜๋Š” ๋Œ€์‹  dict ํƒ€์ž…์„ ๋„ฃ์–ด ํ•œ๋ฒˆ์— ์ง€์ •ํ•  ์ˆ˜๋„ ์žˆ๋‹ค.
  • name์œผ๋กœ Series ๊ฐ์ฒด ์ž์ฒด์˜ ์ด๋ฆ„์„ ์ •ํ•ด์ค„ ์ˆ˜ ์žˆ๋‹ค.
  • index.name ์œผ๋กœ ์ธ๋ฑ์Šค์˜ ์ด๋ฆ„์„ ์ง€์ •ํ•ด์ค„ ์ˆ˜ ์žˆ๋‹ค.
  • as_type ์œผ๋กœ ๋ฐ์ดํ„ฐ ํƒ€์ž…์„ ๋ณ€๊ฒฝํ•  ์ˆ˜ ์žˆ๋‹ค.

PYTHON
# ์ธ๋ฑ์Šค ๋ฌธ์ž์—ด๋กœ ๋ฐ”๊พธ๊ธฐ
list_data = [1,2,3,4,5]
list_name = ["a","b","c","d","e"]
example_obj = Series(data = list_data, index=list_name)
example_obj
'''
a 1
b 2
c 3
d 4
e 5
dtype: int64
'''
# dict ํƒ€์ž…์œผ๋กœ ์ƒ์„ฑํ•˜๊ธฐ
dict_data = {"a":1, "b":2, "c":3, "d":4, "e":5}
example_obj = Series(dict_data, dtype=np.float32, name="example_data")
# series ๊ฐ์ฒด ์ด๋ฆ„ ์ง€์ •
example_obj.name = "number"
# index ์ด๋ฆ„ ์ง€์ •
example_obj.index.name = "alphabet"
# ๋ฐ์ดํ„ฐ ํƒ€์ž…(dtype) ๋ณ€๊ฒฝํ•˜๊ธฐ
example_obj = example_obj.astype(int)

Dataframe

Numpy ์˜ 2์ฐจ์› ndarray ์™€ ๋น„์Šทํ•œ ๊ฐ์ฒด์ด๋ฉฐ, Series ์™€ ๋‹ฌ๋ฆฌ (row) index ๋ฟ๋งŒ ์•„๋‹ˆ๋ผ column index๋„ ์žˆ๋‹ค. ๊ฐ ์ปฌ๋Ÿผ์€ ๋‹ค๋ฅธ ๋ฐ์ดํ„ฐ ํƒ€์ž…์„ ๊ฐ€์งˆ ์ˆ˜ ์žˆ์œผ๋ฉฐ, ์ปฌ๋Ÿผ์„ ๋–ผ๊ณ  ๋ถ™์ผ ์ˆ˜ ์žˆ์œผ๋ฏ€๋กœ ๋ฐ์ดํ„ฐํ”„๋ ˆ์ž„์˜ ์ „์ฒด ์‚ฌ์ด์ฆˆ๋Š” mutableํ•˜๋‹ค.

Series ๊ฐ์ฒด๋ฅผ ์ƒ์„ฑํ•  ๋•Œ์ฒ˜๋Ÿผ, Index๋ฅผ ๊ธฐ์ค€์œผ๋กœ ์ƒ์„ฑํ•˜๋ฏ€๋กœ ๋งŒ์•ฝ data๊ฐ€ ์—†๋Š” row ๋˜๋Š” column index๊ฐ€ ์žˆ๋‹ค๋ฉด NaN๊ฐ’์ด ์ฑ„์›Œ์ง„๋‹ค.

  • ์ธ๋ฑ์‹ฑ์œผ๋กœ loc ๊ณผ iloc ์„ ์‚ฌ์šฉํ•  ์ˆ˜ ์žˆ๋‹ค. (Series ๊ฐ์ฒด๋„ ํ†ต์šฉ)
    • loc ์€ ์‹ค์ œ index name๊ณผ ์ผ์น˜ํ•˜๊ฒŒ ์‚ฌ์šฉํ•ด์•ผํ•œ๋‹ค. (์ˆซ์ž, ๋ฌธ์ž์—ด)
    • iloc ์€ index์˜ position ๊ฐ’์„ ์‚ฌ์šฉํ•œ๋‹ค.(0๋ถ€ํ„ฐ n๊นŒ์ง€์˜ ์ˆซ์ž)
    • ๋‘ ์ธ๋ฑ์‹ฑ ๋ชจ๋‘ row์™€ column์„ ๋ชจ๋‘ ์ง€์ •ํ•ด์ค„ ์ˆ˜ ์žˆ๋‹ค.
  • column์— ์ƒˆ๋กœ์šด ๋ฐ์ดํ„ฐ๋ฅผ ํ• ๋‹นํ•  ์ˆ˜ ์žˆ๋‹ค.
    • comparison ๋“ฑ์œผ๋กœ boolean ๊ฐ’์„ ๋„ฃ์„ ์ˆ˜๋„ ์žˆ๋‹ค. (Numpy ์˜ fancy index )
  • df.T ๋กœ row์™€ column ์ธ๋ฑ์Šค๋ฅผ ๋ฐ”๊ฟ€ ์ˆ˜ ์žˆ๋‹ค.
  • values ๋ฅผ ์‚ฌ์šฉํ•˜์—ฌ ๋ฐ์ดํ„ฐ ndarray๋ฅผ ๋ฝ‘์•„๋‚ผ ์ˆ˜ ์žˆ๋‹ค.
  • ์ปฌ๋Ÿผ์„ ์‚ญ์ œํ•  ๋•Œ, ๋‘๊ฐ€์ง€ ๋ฐฉ๋ฒ•์„ ์ด์šฉํ•  ์ˆ˜์žˆ๋‹ค.
    • ์ปฌ๋Ÿผ ์‚ญ์ œ ์‹œ, del df['์ปฌ๋Ÿผ๋ช…'] - ๋ฉ”๋ชจ๋ฆฌ ์ฃผ์†Œ๊นŒ์ง€ ์‚ญ์ œ
    • df.drop('์ปฌ๋Ÿผ๋ช…', axis=n) ์›๋ณธ ๋ฐ์ดํ„ฐ ํ”„๋ ˆ์ž„์€ ๋ณ€๊ฒฝํ•˜์ง€ ์•Š๊ณ , ํ•ด๋‹น ์ปฌ๋Ÿผ์„ ์‚ญ์ œํ•œ ๋ฐ์ดํ„ฐํ”„๋ ˆ์ž„์„ ๋ฐ˜ํ™˜ํ•œ๋‹ค.

PYTHON
raw_data = {'first_name': ['Jason', 'Molly', 'Tina', 'Jake', 'Amy'],
'last_name': ['Miller', 'Jacobson', 'Ali', 'Milner', 'Cooze'],
'age': [42, 52, 36, 24, 73],
'city': ['San Francisco', 'Baltimore', 'Miami', 'Douglas', 'Boston']}
df = pd.DataFrame(raw_data, columns = ['first_name', 'last_name', 'age', 'city'])
df.first_name # == df["first_name"], series ๋ฐ์ดํ„ฐํ˜•
# ์ธ๋ฑ์‹ฑ
s = pd.Series(np.nan, index=[49,48, 1, 2])
s.loc[:1] # ์ธ๋ฑ์Šค ์ด๋ฆ„์ด 1์ธ row ๊นŒ์ง€๋งŒ
'''
49 NaN
48 NaN
1 NaN
'''
s.iloc(:1] # ์ธ๋ฑ์Šค ์œ„์น˜๊ฐ€ 1์ธ row๊นŒ์ง€๋งŒ
'''
49 NaN
'''
# column์— ์ƒˆ๋กœ์šด ๋ฐ์ดํ„ฐ ํ• ๋‹น
df.debt = df.age > 40 # column 'debt'์— comparison์˜ T/F ๊ฐ’ ์ž…๋ ฅํ•˜์—ฌ ๋ถ™์ž„

Selection & Drop

Selection ๋ฐฉ์‹

  • ๋ฐ์ดํ„ฐ์˜ ํƒ€์ž…์„ ์ฒดํ‚นํ•  ๋•Œ head().T ๋ฅผ ์‚ฌ์šฉํ•˜์—ฌ ์ปฌ๋Ÿผ์— ์˜ฌ๋ฐ”๋ฅธ ๊ฐ’์ด ๋“ค์–ด๊ฐ”๋Š”์ง€ ํ™•์ธํ•˜๊ธฐ๋„ ํ•œ๋‹ค.
  • n๊ฐœ์˜ ์ปฌ๋Ÿผ๋ช…์„ ๋ฆฌ์ŠคํŠธ๋กœ ์ง€์ •ํ•˜์—ฌ ํ•ด๋‹น ์ปฌ๋Ÿผ ๋ฐ์ดํ„ฐ๋“ค์„ ๋ฝ‘์•„์˜ฌ ์ˆ˜ ์žˆ๋‹ค.
  • ์ปฌ๋Ÿผ๋ช…์„ ์ง€์ •ํ•˜์ง€ ์•Š์•˜์„ ์‹œ, index number๋กœ ํ•ด๋‹น row๋ฅผ ๋ฝ‘์•„์˜ฌ ์ˆ˜ ์žˆ๋‹ค.
  • Numpy์˜ fancy index์ฒ˜๋Ÿผ, ์กฐ๊ฑด์‹์„ ๋„ฃ์–ด ์‚ฌ์šฉํ•  ์ˆ˜๋„ ์žˆ๋‹ค.
  • Index name ๋ณ€๊ฒฝ๋„ ๊ฐ€๋Šฅํ•˜๋‹ค.

PYTHON
# ์ปฌ๋Ÿผ ๋ฐ์ดํ„ฐ ํƒ€์ž… ํ™•์ธ
df.head(3).T
# ์ปฌ๋Ÿผ๋ช… ์ง€์ •
df["a"].head(2)
df[["a", "b", "c"]].head(3)
# ์ปฌ๋Ÿผ๋ช… ์—†์ด ์‚ฌ์šฉํ•˜๋Š” ์ธ๋ฑ์Šค ์ง€์ •์€ row๋ฅผ ํ‘œ์‹œํ•œ๋‹ค.
df[:3] # == df[[0,1,2]], 0,1,2 row ํ‘œ์‹œ
# ์กฐ๊ฑด์‹ ๋„ฃ์–ด selection
df[20 < "age" < 40]

loc, iloc ์„ ํ™œ์šฉํ•œ selection

loc, iloc์„ ํ™œ์šฉํ•œ selection๋„ ํ™•์ธํ•ด๋ณด์ž.

  • ๊ธฐ๋ณธ ๋ฐฉ์‹
    • ์ปฌ๋Ÿผ๋ช…์„ ๋ช…์‹œํ•˜๊ณ , index number๋ฅผ ๋„ฃ์–ด row์˜ ๊ฐœ์ˆ˜๋ฅผ ์ง€์ •ํ•œ๋‹ค.
  • loc
    • index name์„ ๋„ฃ์–ด row๋ฅผ ์ง€์ •ํ•œ ๋’ค, ์ปฌ๋Ÿผ๋ช…์„ ๋ช…์‹œํ•œ๋‹ค.
  • iloc
    • ์ปฌ๋Ÿผ number๋ฅผ ์ง€์ •ํ•œ ๋’ค, index number๋ฅผ ์ง€์ •ํ•ด row๋„ ๋ช…์‹œํ•œ๋‹ค.

PYTHON
# basic
# ์ปฌ๋Ÿผ๋ช…๊ณผ row index ์ˆซ์ž
df[["name","street"]][:2]
# loc
# row index ๋ช…๊ณผ ์ปฌ๋Ÿผ๋ช…
df.loc[['c','e'],["name","street"]]
# iloc
# ์ปฌ๋Ÿผ ์ˆซ์ž์™€ row index ์ˆซ์ž
df.iloc[:2,:2]

Index ์žฌ๊ตฌ์„ฑ

๋‘ ๊ฐ€์ง€ ๋ฐฉ์‹์ด ์žˆ๋‹ค.

  1. ์ง์ ‘ ์ง€์ •
  2. reset_index ํ•จ์ˆ˜ ์‚ฌ์šฉ

PYTHON
# ์ง์ ‘ ์ง€์ •ํ•ด์ฃผ๊ธฐ
df.index = list(range(0,10))
# ๋””ํดํŠธ ์ธ๋ฑ์Šค๋ฅผ ์ถ”๊ฐ€ํ•œ ๋ฐ์ดํ„ฐ ํ”„๋ ˆ์ž„์„ ๋ฐ˜ํ™˜
# ๊ธฐ์กด ์ธ๋ฑ์Šค๋ฅผ ๋Œ€์ฒดํ•˜๊ณ ์‹ถ๋‹ค๋ฉด drop=True ์˜ต์…˜
df.reset_index(drop=True)
# ๋‹จ, inplace=True ์˜ต์…˜์„ ์‚ฌ์šฉํ•˜๋ฉด ์›๋ณธ ๋ฐ์ดํ„ฐ ํ”„๋ ˆ์ž„ ๋ณ€ํ™˜์„ '์ ์šฉ'์‹œํ‚จ๋‹ค.

์ผ๋ฐ˜์ ์œผ๋กœ ๋ฐ์ดํ„ฐํ”„๋ ˆ์ž„์€ ์›๋ณธ์„ ์œ ์ง€ํ•˜๋Š”๊ฒƒ์ด ์ปจ๋ฒค์…˜์ด์ง€๋งŒ, ๋ถ€๋“์ดํ•˜๊ฒŒ ๋ณ€๊ฒฝํ•ด์•ผํ•  ๊ฒฝ์šฐ inplace=True ์˜ต์…˜์„ ์‚ฌ์šฉํ•˜์—ฌ ๋ณ€ํ™˜์ƒํƒœ๋ฅผ '์ ์šฉ'์‹œํ‚จ๋‹ค.

Drop

  • df.drop(n) ์€ ์ธ๋ฑ์Šค๋ฅผ ์ง€์ •ํ•˜์—ฌ ํ•ด๋‹น row๋ฅผ ์‚ญ์ œํ•œ ๋ฐ์ดํ„ฐํ”„๋ ˆ์ž„์„ ๋ฐ˜ํ™˜ํ•œ๋‹ค.
  • axis=1 ์˜ต์…˜์„ ์ง€์ •ํ•˜์—ฌ column์„ ์‚ญ์ œํ•  ์ˆ˜ ์žˆ๋‹ค.
  • ๋งˆ์ฐฌ๊ฐ€์ง€๋กœ, inplace=True ์˜ต์…˜์„ ์‚ฌ์šฉํ•˜์—ฌ ์›๋ณธ ๋ฐ์ดํ„ฐํ”„๋ ˆ์ž„์„ ์ˆ˜์ •ํ•  ์ˆ˜ ์žˆ๋‹ค.

PYTHON
# 1๋ฒˆ row ์‚ญ์ œ
df.drop(1)
# 'city' column ์‚ญ์ œ
df.drop("city",axis=1)

๋งคํŠธ๋ฆญ์Šค ๋ณ€ํ™˜

์ด์™ธ์—๋„, as_matrix() ํ•จ์ˆ˜๋กœ Numpy ndarray ํ˜•ํƒœ๋กœ ๋ณ€ํ™˜ํ•  ์ˆ˜ ์žˆ๋‹ค.

PYTHON
matrix = df.as_matrix()

Operations

Series Operation

๋ชจ๋“  ์—ฐ์‚ฐ์€ (row) index๋ฅผ ๊ธฐ์ค€์œผ๋กœ ์—ฐ์‚ฐ์„ ์ˆ˜ํ–‰ํ•œ๋‹ค.

๋”ฐ๋ผ์„œ ๊ฒน์น˜๋Š” index๊ฐ€ ์—†์„ ๊ฒฝ์šฐ NaN๊ฐ’์„ ๋ฐ˜ํ™˜ํ•œ๋‹ค.

PYTHON
s1 = Series(range(1,6), index=list("abcde"))
s2 = Series(range(5,11), index=list("bcdefg"))
# ๊ฒน์น˜๋Š” index๋Š” b,c,d,e
s1 + s2 # == s1.add(s2)
'''
a NaN
b 7.0
c 9.0
d 11.0
e 13.0
f NaN
g NaN
dtype: float64
'''

Dataframe Operation

Series ๋Š” index๋งŒ ์ฒดํฌํ–ˆ์ง€๋งŒ, Dataframe์€ column๊ฐ’๊ณผ index๋ฅผ ๊ฐ™์ด ๋ณธ๋‹ค.

๊ฒน์น˜๋Š” ์…€์ด ์—†์„ ๊ฒฝ์šฐ NaN ์„ ๋ฐ˜ํ™˜ํ•˜์ง€๋งŒ, fill_value= ์˜ต์…˜์„ ์‚ฌ์šฉํ•ด ๋‹ค๋ฅธ ๊ฐ’์„ ๋””ํดํŠธ๋กœ ์ง€์ •ํ•  ์ˆ˜ ์žˆ๋‹ค.

PYTHON
df1 = DataFrame(
np.arange(9).reshape(3,3),
columns=list("abc"))
df2 = DataFrame(
np.arange(16).reshape(4,4),
columns=list("abcd"))
# df2.loc[3]๊ณผ df2["d"]๊ฐ€ ๊ฒน์น˜์ง€ ์•Š์Œ
# df1 + df2
df1.add(df2,fill_value=0) # ๊ฒน์น˜์ง€ ์•Š์•„ NaN ๊ฐ’์ด ๋“ค์–ด๊ฐˆ ์œ„์น˜๋ฅผ 0์œผ๋กœ ์ฑ„์›€

Series + Dataframe

Series์™€ Dataframe ์€ ๋‹ค๋ฅธ ํƒ€์ž…์ด๋ฏ€๋กœ ๊ฐ™์€ colum์„ ๊ฐ€์ง„ ๊ฒƒ์ด ์•„๋‹ˆ๋ผ๋ฉด ๋ชจ๋“  ๋ฐ์ดํ„ฐ๊ฐ€ NaN์œผ๋กœ ์ฑ„์›Œ์ ธ ์ œ๋Œ€๋กœ ์—ฐ์‚ฐ์ด ๋˜์ง€ ์•Š๋Š”๋‹ค.

axis= ์˜ต์…˜์„ ๋ช…์‹œํ•ด์ฃผ์–ด์•ผ ํ•ด๋‹น ์ถ•์„ ๊ธฐ์ค€์œผ๋กœ broadcasting์„ ์ˆ˜ํ–‰ํ•œ๋‹ค.

PYTHON
df = DataFrame(
np.arange(16).reshape(4,4),
columns=list("abcd"))
s = Series(
np.arange(10,14),
index=list("abcd"))
s2 = Series(np.arange(10,14))
# Series๊ฐ€ Dataframe์˜ column๊ณผ ๊ฐ™์€ index์„ ๊ฐ€์ ธ ๋ฐ”๋กœ row๋กœ ๋ณ€ํ™˜๋˜๋Š” ๊ฒฝ์šฐ
df + s
# Series๊ฐ€ Dataframe์˜ column๊ณผ ๋‹ค๋ฅธ index์„ ๊ฐ€์ ธ row๋กœ ๋ณ€ํ™˜๋˜์ง€ ์•Š๋Š” ๊ฒฝ์šฐ
df + s2 # ๋ชจ๋“  ๊ฐ’์ด NaN์œผ๋กœ ์ฑ„์›Œ์ง
# axis๋ฅผ ๊ธฐ์ค€์œผ๋กœ row broadcasting ์ˆ˜ํ–‰
df.add(s2, axis=0)

row_broadcasting

lambda, map, apply

map

pandas์˜ Series type ๋ฐ์ดํ„ฐ์—๋Š” map ํ•จ์ˆ˜๋ฅผ ์‚ฌ์šฉํ•  ์ˆ˜ ์žˆ๋‹ค.

์ด ๋•Œ, lambda์™€ ์กฐํ•ฉํ•  ์ˆ˜ ์žˆ๊ณ , function ๋Œ€์‹  dict, sequence ํ˜• ์ž๋ฃŒ ๋“ฑ์œผ๋กœ ๋Œ€์ฒด๋„ ๊ฐ€๋Šฅํ•˜๋‹ค.

๋น„์Šทํ•œ ๊ธฐ๋Šฅ์œผ๋กœ replace ๊ฐ€ ์žˆ๋‹ค.
map ํ•จ์ˆ˜์˜ ๊ธฐ๋Šฅ ์ค‘ ๋ฐ์ดํ„ฐ ๋ณ€ํ™˜๊ธฐ๋Šฅ๋งŒ ๋‹ด๋‹นํ•œ๋‹ค. ์ฆ‰, map ์ฒ˜๋Ÿผ ๋ฐ์ดํ„ฐ์— ํ•จ์ˆ˜๋ฅผ ์ ์šฉํ•  ์ˆ˜ ์—†๊ณ , dict์—์„œ ์ •ํ•ด์ง„ value๊ฐ’์œผ๋กœ ๋ฐ”๊ฟ”์ฃผ๊ธฐ๋งŒ ํ•œ๋‹ค.

PYTHON
# number_df์˜ ๊ฐ’๋“ค์„ ์ œ๊ณฑํ•˜๋Š” mapping
number_df.map(lambda x: x**2).head(5)
# str ์„ฑ๋ณ„๊ฐ’์„ 0,1 ์ฝ”๋“œ๊ฐ’์œผ๋กœ ๋ฐ”๊พธ์–ด sex_code ์ปฌ๋Ÿผ์— ์ถ”๊ฐ€ํ•˜๋Š” mapping
df["sex_code"] = df.sex.map({"male":0, "female":1})
# str ์„ฑ๋ณ„๊ฐ’์„ 0,1 ์ฝ”๋“œ๊ฐ’์œผ๋กœ ๋ณ€ํ™˜ํ•˜์—ฌ ์ถœ๋ ฅ - replace
df.sex.replace({"male":0, "female":1}).head()
# to_replace=๋ณ€ํ™˜ํ•  ํ‚ค๊ฐ’(from), value=๋ณ€ํ™˜ํ•˜์—ฌ ์ €์žฅํ•  ๊ฐ’(to)
# to_replace์— ํ•ด๋‹นํ•˜๋Š” ๊ฐ’์„ value์— ์žˆ๋Š” ๊ฐ’์œผ๋กœ ๋ณ€ํ™˜์‹œํ‚ค๊ฒ ๋‹ค
df["race"].replace(to_replace=key, value=value, inplace=True)

apply

map ํ•จ์ˆ˜์™€ ๋‹ฌ๋ฆฌ, Series ์ „์ฒด ๋‹จ์œ„์— ํ•ด๋‹น ํ•จ์ˆ˜๋ฅผ ์ ์šฉํ•œ๋‹ค.

  • ์ž…๋ ฅ๊ฐ’์„ Dataframe, Series ๋ชจ๋‘ ๋ฐ›์„ ์ˆ˜ ์žˆ๋‹ค.
  • ์Šค์นผ๋ผ ๊ฐ’ ์ด์™ธ์— Series ๊ฐ’์˜ ๋ฐ˜ํ™˜๋„ ๊ฐ€๋Šฅํ•˜๋‹ค.

PYTHON
df_info = df[["earn", "height","age"]]
f = lambda x : x.max() - x.min()
df_info.apply(f)
'''
earn 318047.708444
height 19.870000
age 73.000000
dtype: float64
'''
# Series ๊ฐ’ ๋ฐ˜ํ™˜
def f(x):
return Series([x.min(), x.max(), x.mean()],
index=["min", "max", "mean"])
df_info.apply(f)
'''
earn height age
min -98.580489 57.34000 22.000000
max 317949.127955 77.21000 95.000000
mean 32446.292622 66.59264 45.32849
'''

applymap

apply์™€๋Š” ๋‹ค๋ฅด๊ฒŒ, Series ๋‹จ์œ„๊ฐ€ ์•„๋‹Œ element ๋‹จ์œ„๋กœ ํ•จ์ˆ˜๋ฅผ ์ ์šฉํ•œ๋‹ค.

  • apply๋ฅผ dataframe ์ด ์•„๋‹Œ Series ์— ์ ์šฉ์‹œํ‚ค๋ฉด, applymap๊ณผ ๋™์ผํ•œ ๊ธฐ๋Šฅ์„ ์ˆ˜ํ–‰ํ•œ๋‹ค.
PYTHON
f = lambda x : -x
df_info.applymap(f).head(5)
# ์œ„์˜ ์ฝ”๋“œ๋Š” ์•„๋ž˜ ์ฝ”๋“œ์™€ ๋™์ผํ•œ ๊ธฐ๋Šฅ์„ ์ˆ˜ํ–‰ํ•œ๋‹ค.
# f = lambda x : -x
# df_info["earn"].apply(f).head(5)

Pandas Built-in Function

describe

Numeric Type ๋ฐ์ดํ„ฐ๋“ค์˜ ์š”์•ฝ์ •๋ณด๋ฅผ ๋ณด์—ฌ์ค€๋‹ค.

PYTHON
df.describe()

describe

unique

Series ๋ฐ์ดํ„ฐ์— ์žˆ๋Š” ์œ ์ผํ•œ ๊ฐ’(์ค‘๋ณต์„ ์ œ์™ธํ•œ ๊ฐ’)๋“ค์„ list๋กœ ๋ฐ˜ํ™˜ํ•œ๋‹ค.

PYTHON
df.race.unique()
'''
array(['white', 'asian', 'hispanic', 'black', 'other'], dtype=object)
'''

sum

๊ธฐ๋ณธ์ ์ธ column ๋˜๋Š” row ๊ฐ’์˜ ํ•ฉ์—ฐ์‚ฐ์„ ์ง€์›ํ•œ๋‹ค.

์ด์™ธ์—๋„ ๋น„์Šทํ•œ ๊ธฐ๋Šฅ์˜ ํ•จ์ˆ˜๋“ค์ด ๋งŽ๋‹ค.

  • sub , mean, min, max , count, median , mad, var ๋“ฑ

PYTHON
# ์ปฌ๋Ÿผ๋ณ„ ํ•ฉ
df.sum(axis=0)
# row๋ณ„ ํ•ฉ
df.sum(axis=1)

isnull

column ๋˜๋Š” row ๊ฐ’ ์ค‘ NaN (null) ๊ฐ’์˜ index๋ฅผ ๋ฐ˜ํ™˜ํ•œ๋‹ค.

  • sum ํ•จ์ˆ˜๋ฅผ ์ฒด์ด๋‹ํ•˜์—ฌ Null์ธ ์…€์ด ๋ช‡๊ฐœ ์žˆ๋Š”์ง€ ํ™•์ธํ•  ์ˆ˜๋„ ์žˆ๋‹ค.

PYTHON
# null์ธ data์˜ index๋ฅผ ๋ฐ˜ํ™˜
df.isnull()
# Null์ธ ์…€ ๊ฐœ์ˆ˜ ํ™•์ธ
df.isnull().sum()

sort_values

column ๊ฐ’์„ ๊ธฐ์ค€์œผ๋กœ ๋ฐ์ดํ„ฐ๋ฅผ ์ •๋ ฌํ•œ๋‹ค.

  • ์˜ต์…˜์„ ๋ช…์‹œํ•ด ์˜ค๋ฆ„์ฐจ์ˆœ, ๋‚ด๋ฆผ์ฐจ์ˆœ๋“ฑ์„ ์ง€์ •ํ•  ์ˆ˜ ์žˆ๋‹ค.

PYTHON
# ascending=False ์‹œ ๋‚ด๋ฆผ์ฐจ์ˆœ
df.sort_values(["age","earn"], ascending=True).head()

Correlation & Covariance

์ƒ๊ด€๊ณ„์ˆ˜์™€ ๊ณต๋ถ„์‚ฐ์„ ๊ตฌํ•ด์ค€๋‹ค.

  • corr , cov , corrwith

PYTHON
# df.[์ปฌ๋Ÿผ1].ํ•จ์ˆ˜(df.[์ปฌ๋Ÿผ2])
df.age.corr(df.earn)
df.age.cov(df.earn)
df.corrwith(df.earn)
# ๋ชจ๋“  ์ปฌ๋Ÿผ๋“ค๊ฐ„์˜ ์ƒ๊ด€๊ด€๊ณ„๋ฅผ ๋‹ค ๋ณด์—ฌ์ค€๋‹ค
df.corr()

ํ†ต๊ณ„ํ•จ์ˆ˜

Groupby

ํ†ต๊ณ„ ๋ฐ์ดํ„ฐ๋ฅผ ๋‚ด๊ธฐ ์œ„ํ•œ ํ•จ์ˆ˜๋กœ, ํŠน์ • ์ปฌ๋Ÿผ์˜ ๊ฐ’์„ ์ง€์ •ํ•ด ํ•ด๋‹น ๊ฐ’์„ ๊ฐ€์ง„ row๋“ค์„ SeriesGroupBy ๊ฐ์ฒด๋กœ ๋ฌถ๊ณ , ์ฒด์ด๋‹ ํ†ต๊ณ„ ์—ฐ์‚ฐ์œผ๋กœ ํ†ต๊ณ„ํ™”ํ•˜์—ฌ ๋ณด์—ฌ์ค€๋‹ค.

BASH
# phone_data.csv ํŒŒ์ผ ๋‹ค์šด๋กœ๋“œ
!wget https://www.shanelynn.ie/wp-content/uploads/2015/06/phone_data.csv
PYTHON
df_phone = pd.read_csv('./data/phone_data.csv')
df_phone.dtypes
'''
...
date object
...
'''
# ๋ฌธ์ž์—ด ํƒ€์ž…์˜ date ์ปฌ๋Ÿผ ๊ฐ’์„ ๊ฐ๊ฐ(apply) ํŒŒ์‹ฑํ•˜์—ฌ datetime ํƒ€์ž…์œผ๋กœ ๋ณ€ํ™˜
# parse ํ•จ์ˆ˜์˜ argument์—์„œ '์ผ(day)'๊ฐ€ '์›”(month)'๋ณด๋‹ค ๋จผ์ €๋‚˜์˜จ๋‹ค๋ฉด dayfirst=True
df_phone['date'] = df_phone['date'].apply(dateutil.parser.parse, dayfirst=True)
df_phone.dtypes
'''
date datetime64[ns]
'''
# datetime ํƒ€์ž…์˜ ์›”๋ณ„๋กœ ๋ฌถ์–ด, duration(ํ†ตํ™”์‹œ๊ฐ„)์˜ ํ•ฉ๊ณ„๋ฅผ matplotlib๋กœ ๊ทธ๋ž˜ํ”„ ํ‘œ์‹œ
df_phone.groupby("month")["duration"].sum().plot()
# month์™€ item๋ณ„๋กœ ๋ฌถ์–ด duration count
# unstack์œผ๋กœ row ์ธ๋ฑ์Šค ๊ฐ’๋“ค์„ column ์ธ๋ฑ์Šค๋กœ ์žฌ๊ตฌ์„ฑํ•ด ๋ณผ ์ˆ˜ ์žˆ๋‹ค.
df_phone.groupby(['month','item'])['duration'].count().unstack().plot()
# month๋กœ ๋ฌถ๋˜, ์ธ๋ฑ์Šค๋กœ ์‚ผ์ง€๋Š” ๋ง๊ฒƒ(as_index=False)
# duration์—๋Š” sum์„ ์ ์šฉ์‹œํ‚ฌ๊ฒƒ(agg({"duration":"sum"}))
# aggํ•จ์ˆ˜๋Š” ๊ทธ๋ฃน์— ๋Œ€ํ•˜์—ฌ ์‚ฌ์šฉ์ž๊ฐ€ ์ •์˜ํ•œ ์ง‘๊ณ„ํ•จ์ˆ˜๋ฅผ ์ ์šฉํ•  ์ˆ˜ ์žˆ๋‹ค.
df_phone.groupby("month", as_index=False).agg({"duration":"sum"})
# df_phone.groupby("month").agg({"duration":"sum"}).reset_index # ๋น„์Šทํ•œ ๊ฒฐ๊ณผ
# group ํƒ€์ž… ๊ฐ์ฒด๋ฅผ ๋งŒ๋“ค์—ˆ๋‹ค๋ฉด
group.add_prefix('duration_') # ์ปฌ๋Ÿผ์ด๋ฆ„์— ์ ‘๋‘์‚ฌ ์ถ”๊ฐ€ํ•˜๊ธฐ

์ž์„ธํ•œ ์„ค๋ช…์€ ๋‹ค์Œ ๋ธ”๋กœ๊ทธ๋ฅผ ์ฐธ๊ณ ํ•œ๋‹ค.

[Python Data Analysis]24.๋ฐ์ดํ„ฐ ๊ทธ๋ฃนํ™” ํ•จ์ˆ˜ ์ดํ•ดํ•˜๊ธฐ

Pivot Table

์ปฌ๋Ÿผ์— labeling๊ฐ’์„ ์ถ”๊ฐ€ํ•˜์—ฌ Value์— numeric type ๊ฐ’์„ aggregationํ•˜๋Š” ํ˜•ํƒœ

groupby + unstack์˜ ์กฐํ•ฉ์œผ๋กœ๋„ ์ˆ˜ํ–‰ํ•  ์ˆ˜ ์žˆ์ง€๋งŒ, ์•ก์…€์„ ์‚ฌ์šฉํ•˜๋Š”๊ฒƒ์ด ์ต์ˆ™ํ•˜๋‹ค๋ฉด pivot_table์„ ์‚ฌ์šฉํ•˜๋Š”๊ฒƒ์ด ์ข€ ๋” ์‰ฝ๋‹ค.

BASH
# ๊ฐ ๋ฐ์ดํ„ฐ์˜ value ์œ„์น˜์— ๋“ค์–ด๊ฐˆ ๊ฐ’์€ ["rating"] ์ปฌ๋Ÿผ๊ฐ’๋“ค ์ค‘ ํ•˜๋‚˜๋กœ ์ •ํ•œ๋‹ค.
# critic์˜ ๊ฐ’๋“ค์„ row index๋กœ ํ•˜๊ณ , title์˜ ๊ฐ’๋“ค์„ columns๋กœ ๊ตฌ์„ฑํ•œ๋‹ค.
# ๊ฐ value์— aggfunction "sum"์„ ์ ์šฉํ•˜๊ณ , NaN๊ฐ’์€ 0์œผ๋กœ ์ฑ„์šด๋‹ค.
df_movie.pivot_table(["rating"], index=df_movie.critic, columns=df_movie.title,
aggfunc="sum", fill_value=0)

Crosstab

๋‘ ์ปฌ๋Ÿผ์˜ ๊ต์ฐจ ๋นˆ๋„, ๋น„์œจ, ๋ง์…ˆ ๋“ฑ์„ ๊ตฌํ•  ๋•Œ ์‚ฌ์šฉํ•˜๋Š” ํ•จ์ˆ˜๋กœ, pivot_table์˜ ํŠน์ˆ˜ ํ˜•ํƒœ์ด๋‹ค.

  • User-Item Rating ํ–‰๋ ฌ๋“ฑ์„ ๋งŒ๋“ค ๋•Œ ์‚ฌ์šฉํ•œ๋‹ค.
BASH
# ์„ธ ํ•จ์ˆ˜๋Š” ๊ฐ™์€ ์—ญํ• ์„ ํ•œ๋‹ค.
df_movie.groupby(["critic","title"]).agg({"rating":"first"}).unstack().fillna(0)
df_movie.pivot_table(["rating"], index=df_movie.critic, columns=df_movie.title,
aggfunc="sum", fill_value=0)
pd.crosstab(index=df_movie.critic,columns=df_movie.title,values=df_movie.rating,
aggfunc="first").fillna(0)

Merge & Concat

Merge

SQL์—์„œ ๋งŽ์ด ์‚ฌ์šฉํ•˜๋Š” Merge์™€ ๊ฐ™์€ ๊ธฐ๋Šฅ์œผ๋กœ, ๊ฒน์น˜๋Š” ์ปฌ๋Ÿผ ๋ฐ์ดํ„ฐ๋ฅผ ๊ธฐ์ค€์œผ๋กœ ๋‘๊ฐœ์˜ ๋ฐ์ดํ„ฐํ”„๋ ˆ์ž„์„ ํ•˜๋‚˜๋กœ ํ•ฉ์น˜๋Š” ๊ธฐ๋Šฅ์ด๋‹ค.

  • ๋‘ ๋ฐ์ดํ„ฐํ”„๋ ˆ์ž„์— ๊ฐ™์€ ์ด๋ฆ„์˜ ์ปฌ๋Ÿผ์ด ์žˆ๋‹ค๋ฉด, on='column_name' ์˜ต์…˜์œผ๋กœ ๊ธฐ์ค€์ ์„ ๋งŒ๋“ ๋‹ค.
  • ๋‘ ์ปฌ๋Ÿผ์ด ๋ฐ์ดํ„ฐ๋Š” ๊ฐ™์ง€๋งŒ ์ด๋ฆ„์€ ๋‹ค๋ฅด๋‹ค๋ฉด, left_on='left_column_name' ,right_on='right_column_name' ์œผ๋กœ ๊ธฐ์ค€์ ์„ ๋งŒ๋“ ๋‹ค.

JOIN์˜ ์ข…๋ฅ˜

joins

  • how= ์˜ต์…˜์„ ์ด์šฉํ•ด join์˜ ์ข…๋ฅ˜๋ฅผ ์ง€์ •ํ•ด์ค„ ์ˆ˜ ์žˆ๋‹ค.

    • how=inner : ๋””ํดํŠธ๋กœ, ๋‘ ์ปฌ๋Ÿผ์—์„œ ๊ฒน์น˜๋Š” ๊ฐ’์„ ๊ธฐ์ค€์œผ๋กœ ์ธ๋ฑ์Šค๋ฅผ ๊ตฌ์„ฑํ•œ๋‹ค.
    • how=left(right) : left(right) ์ปฌ๋Ÿผ์˜ ๊ฐ’์„ ๊ธฐ์ค€์œผ๋กœ ์ธ๋ฑ์Šค๋ฅผ ๊ตฌ์„ฑ
    • how=outer : ๋‘ ์ปฌ๋Ÿผ์˜ ๋ชจ๋“  ๊ฐ’์„ ์ธ๋ฑ์Šค๋กœ ์‚ฌ์šฉํ•˜๋ฉฐ, ์—†๋Š” ๊ฐ’์€ NaN์„ ๋„ฃ๋Š”๋‹ค.
  • index ์ปฌ๋Ÿผ์„ ๊ธฐ์ค€์œผ๋กœ ํ…Œ์ด๋ธ”์„ ๊ทธ๋Œ€๋กœ ๋ถ™์ผ์ˆ˜๋„ ์žˆ๋‹ค.

    • right_index=True, left_index=True
BASH
# ๊ฐ™์€ ์ด๋ฆ„์˜ ์ปฌ๋Ÿผ์„ ๊ธฐ์ค€์œผ๋กœ join
pd.merge(df_a, df_b, on='subject_id')
# ๋‹ค๋ฅธ ์ด๋ฆ„์˜ ์ปฌ๋Ÿผ์„ ํ•˜๋‚˜์˜ ์ปฌ๋Ÿผ์œผ๋กœ ๋ณด๊ณ , ์ด๋ฅผ ๊ธฐ์ค€์œผ๋กœ join
pd.merge(df_a, df_b, left_on='subject_id_a', right_on='subject_id_b')
# left join
pd.merge(df_a, df_b, on='subject_id', how='left')
# outer join
pd.merge(df_a, df_b, on='subject_id', how='outer')
# ์ธ๋ฑ์Šค๋ฅผ ๊ธฐ์ค€์œผ๋กœ join
pd.merge(df_a, df_b, right_index=True, left_index=True)

concat

๊ฐ™์€ ํ˜•ํƒœ์˜ ๋ฐ์ดํ„ฐ๋ฅผ ๋ถ™์ด๋Š” ์—ฐ์‚ฐํ•จ์ˆ˜๋‹ค.

  • append ๋ฅผ ์‚ฌ์šฉํ•ด์„œ ๋น„์Šทํ•œ ๊ฒฐ๊ณผ๋ฅผ ๋‚ผ ์ˆ˜ ์žˆ๋‹ค.
  • ๋””ํดํŠธ๋Š” ์„ธ๋กœ๋กœ ๋ถ™์ด์ง€๋งŒ(ํ–‰ ์ถ”๊ฐ€), ๊ฐ€๋กœ๋กœ ๋ถ™์ผ์ˆ˜๋„ ์žˆ๋‹ค(์—ด ์ถ”๊ฐ€).
    • ๋ฌผ๋ก , ์„ธ๋กœ๋กœ ๋ถ™์ผ ๊ฒฝ์šฐ ๊ฐ™์€ ์ปฌ๋Ÿผ์„ ๊ฐ€์ง€๊ณ  ์žˆ์–ด์•ผ ๋ถ™์ผ ์ˆ˜ ์žˆ๋‹ค.
    • axis=1 ์˜ต์…˜์œผ๋กœ ๊ฐ€๋กœ๋กœ ๋ถ™์ผ ์ˆ˜ ์žˆ๋‹ค.
BASH
# ์„ธ๋กœ๋กœ ๋ถ™์ด๊ธฐ
df_new = pd.concat([df_a, df_b])
df_new.reset_index() # ํ•ด์ฃผ์ง€ ์•Š์œผ๋ฉด, ์ธ๋ฑ์Šค๊ฐ€ ๊ทธ๋Œ€๋กœ ๋ถ™์–ด 0,1,2,3,0,1,2, ... ํ˜•ํƒœ๊ฐ€ ๋œ๋‹ค.
#df_a.append(df_b) # ๊ฑฐ์˜ ๊ฐ™์€ ๊ธฐ๋Šฅ์„ ๊ฐ€์ง„๋‹ค.
# ๊ฐ€๋กœ๋กœ ๋ถ™์ด๊ธฐ
df_new = pd.concat([df_a, df_b], axis=1)
df_new.reset_index()

Persistence

Pandas์—์„œ I/O๋ฅผ ์‚ฌ์šฉํ•ด ์˜์†์„ฑ ์žˆ๋Š” ๋ฐ์ดํ„ฐ๋ฅผ ์‚ฌ์šฉํ•  ์ˆ˜ ์žˆ๋‹ค.

Database Connection

๋ฐ์ดํ„ฐ ๋กœ๋”ฉ์‹œ, db connection ๊ธฐ๋Šฅ์„ ์ œ๊ณตํ•œ๋‹ค.

BASH
import sqlite3 # pymysql์„ ๋ฏธ๋ฆฌ ์„ค์น˜ํ•  ๊ฒƒ!
conn = sqlite3.connect(".sqlite_example.db")
cur = conn.cursor()
cur.execute("select * from name limit 5;") # SQL๋ฌธ ์ง์ ‘ ์‚ฌ์šฉ ๊ฐ€๋Šฅ
results = cur.fetchall()
results

XLS Persistence

๋ฐ์ดํ„ฐํ”„๋ ˆ์ž„ ๋‹จ์œ„๋กœ ์—‘์…€๊ฐ’์„ ์ถ”์ถœํ•˜๊ฑฐ๋‚˜ ์“ธ ์ˆ˜ ์žˆ๋‹ค.

openpyxl ๋˜๋Š” XlsxWriter ๋ผ์ด๋ธŒ๋Ÿฌ๋ฆฌ๋ฅผ ์„ค์น˜ํ•ด์•ผํ•œ๋‹ค.

BASH
writer = pd.ExcelWriter('./data/df_routes.xlsx', engine='xlsxwriter')
# ์•ก์…€ ํ˜•ํƒœ๋กœ ์ €์žฅ
df_routes.to_excel(writer, sheet_name='Sheet1')

Pickle Persistence

์—‘์…€๊ณผ ๋™์ผํ•˜๊ฒŒ ํ”ผํดํ˜•ํƒœ๋กœ๋„ ์ฝ๊ณ  ์“ธ ์ˆ˜ ์žˆ๋‹ค.

BASH
df_routes_pickle = pd.read_pickle("./data/df_routes.pickle")
df_routes_pickle.head()

WRITTEN BY

์•ŒํŒŒ์นด์˜ Always Awake Devlog

Seoul