Dataholic: Wrangle data with tidyr

유충현

The tidyr package ships two datasets: who, the counts of new tuberculosis cases published by the World Health Organization (WHO), and population, the related per-country population figures.

Exploring the data

who

population

Preparing the data

Some variable names in the who data frame do not follow the naming format. We fix them before starting.

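As described in the variable notes above, the first code in new_sp_m014 through new_rel_f65 is the method of diagnosis. For the diagnosis method rel, however, the prefix "new" is not separated by an underscore (_) as it is in the other variable names (the raw names look like newrel_m014). To keep the names consistent, we insert the underscore.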
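The renaming described above can be sketched as follows. This is one possible approach, not necessarily the one used originally: it assumes dplyr 1.0.0 or later (for rename_with()) and overwrites a local copy of who so that later steps see the corrected names.

```r
library(dplyr)
library(tidyr)
library(stringr)

# Insert the missing underscore: newrel_m014 -> new_rel_m014, and so on.
# Only the columns whose names start with "newrel" are touched.
who <- tidyr::who %>%
  rename_with(~ str_replace(.x, "^newrel", "new_rel"),
              .cols = starts_with("newrel"))

# Inspect the corrected names
who %>%
  select(starts_with("new_rel")) %>%
  names()
```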

Pivoting warm-up

pivot_wider

Hands-on 1

The pivot_wider() function converts a long-format table into a wide-format table.

Use the approach you learned with the dplyr package.
Use the pivot_wider() function.
Use the rename() function from the dplyr package.

rename_if() is a dplyr function that renames the variables that satisfy a condition. In dplyr 1.0.0 and later it is superseded by rename_with().

# 1. Keep the Republic of Korea rows from 2010 onward
population %>% 
  filter(country %in% "Republic of Korea") %>% 
  filter(year >= 2010)

# A tibble: 4 x 3
  country            year population
  <chr>             <int>      <int>
1 Republic of Korea  2010   48453931
2 Republic of Korea  2011   48732640
3 Republic of Korea  2012   49002683
4 Republic of Korea  2013   49262698

# 2. Spread the years into columns (long -> wide)
population_wide <- population %>% 
  filter(country %in% "Republic of Korea") %>% 
  filter(year >= 2010) %>% 
  pivot_wider(names_from = "year",
              values_from = "population")

population_wide

# A tibble: 1 x 5
  country             `2010`   `2011`   `2012`   `2013`
  <chr>                <int>    <int>    <int>    <int>
1 Republic of Korea 48453931 48732640 49002683 49262698

# 3. Prefix the integer (year) columns with "y" to get syntactic names
population_wide2 <- population_wide %>% 
  rename_if(is.integer, function(x) paste0("y", x))

population_wide2

# A tibble: 1 x 5
  country              y2010    y2011    y2012    y2013
  <chr>                <int>    <int>    <int>    <int>
1 Republic of Korea 48453931 48732640 49002683 49262698
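As an alternative to steps 2 and 3 above, the prefix can be added while pivoting. The sketch below relies on the names_prefix argument of pivot_wider(); the object name population_wide_alt is made up for illustration. Unlike rename_if(is.integer, ...), it does not depend on the value columns being stored as integers, which may differ between tidyr versions.

```r
library(dplyr)
library(tidyr)

# Pivot and prefix the new column names in a single step
population_wide_alt <- tidyr::population %>%
  filter(country == "Republic of Korea", year >= 2010) %>%
  pivot_wider(names_from = year,
              values_from = population,
              names_prefix = "y")

names(population_wide_alt)
```

The result should have one row, with country followed by one y-prefixed column per year.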

pivot_longer

The pivot_longer() function converts a wide-format table into a long-format table.

Hands-on 2

Use the pivot_longer() function.
Use the mutate() function from the dplyr package.

The pivot_longer() function is the inverse of the pivot_wider() function.

# 1. Gather the y2010..y2013 columns back into rows (wide -> long)
population_long <- population_wide2 %>% 
  pivot_longer(y2010:y2013, 
               names_to = "year",
               values_to = "population")

population_long

# A tibble: 4 x 3
  country           year  population
  <chr>             <chr>      <int>
1 Republic of Korea y2010   48453931
2 Republic of Korea y2011   48732640
3 Republic of Korea y2012   49002683
4 Republic of Korea y2013   49262698

# 2. Strip the "y" prefix and convert year back to integer
population_long2 <- population_long %>% 
  mutate(year = stringr::str_replace(year, "y", "") %>% 
           as.integer())

population_long2

# A tibble: 4 x 3
  country            year population
  <chr>             <int>      <int>
1 Republic of Korea  2010   48453931
2 Republic of Korea  2011   48732640
3 Republic of Korea  2012   49002683
4 Republic of Korea  2013   49262698
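The two steps above can also be folded into the pivot itself. The following sketch assumes tidyr 1.1.0 or later, where pivot_longer() accepts names_prefix and names_transform; the object names wide and population_long_alt are made up for illustration, and the wide table is rebuilt here so the block runs on its own.

```r
library(dplyr)
library(tidyr)

# Rebuild a wide table with y-prefixed year columns
wide <- tidyr::population %>%
  filter(country == "Republic of Korea", year >= 2010) %>%
  pivot_wider(names_from = year,
              values_from = population,
              names_prefix = "y")

# Drop the "y" prefix and convert year to integer while pivoting
population_long_alt <- wide %>%
  pivot_longer(cols = starts_with("y"),
               names_to = "year",
               names_prefix = "y",
               names_transform = list(year = as.integer),
               values_to = "population")

population_long_alt
```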

Applied pivoting

Working together with dplyr

The tidyr package is rarely used on its own; it is usually combined with the dplyr package.

Hands-on 3

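- The group code for the diagnosis method of extrapulmonary tuberculosis contains "ep".
- Use the select() function from dplyr to keep only the information on patients with extrapulmonary tuberculosis.
- Use the pivot_longer() function.
- Use the summarise() function for aggregation.
- Use the mutate() function from dplyr to derive the variables gender and age_group.
- You will need the stringr package to manipulate the variable names.
- Use the pivot_wider() function.
- Use the mutate() function from dplyr to derive the variables total and percent.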

stringr::str_detect(group_code, "_m") returns TRUE when group_code contains "_m".

The regular expression "[^[:digit:]]" matches any character that is not a digit.
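A small sketch of the two string operations just described, applied to two group codes taken from the who variable names. The vector and result names (codes, is_male, ages) are made up for illustration.

```r
library(stringr)

codes <- c("new_ep_m014", "new_ep_f65")

# TRUE where the code contains "_m" (male groups)
is_male <- str_detect(codes, "_m")
is_male
# [1]  TRUE FALSE

# Remove every non-digit character, leaving only the age band
ages <- str_remove_all(codes, "[^[:digit:]]")
ages
# [1] "014" "65"
```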

# 1.1 Extrapulmonary (ep) cases for the Republic of Korea in 2010, in long format
ep_long <- who %>% 
  filter(country == "Republic of Korea") %>% 
  filter(year == 2010) %>% 
  select(new_ep_m014:new_ep_f65) %>% 
  pivot_longer(new_ep_m014:new_ep_f65, 
               names_to = "group_code",
               values_to = "cases")

ep_long

# A tibble: 14 x 2
   group_code   cases
   <chr>        <int>
 1 new_ep_m014     62
 2 new_ep_m1524   511
 3 new_ep_m2534   646
 4 new_ep_m3544   690
 5 new_ep_m4554   687
 6 new_ep_m5564   571
 7 new_ep_m65    1088
 8 new_ep_f014     56
 9 new_ep_f1524   439
10 new_ep_f2534   685
11 new_ep_f3544   644
12 new_ep_f4554   780
13 new_ep_f5564   584
14 new_ep_f65    1352

# 1.2 Total number of extrapulmonary cases
ep_long %>% 
  summarise(total_cases = sum(cases, na.rm = TRUE))

# A tibble: 1 x 1
  total_cases
        <int>
1        8795

# 2.1 Derive gender and age_group from group_code, then spread gender into columns
ep_wide <- ep_long %>% 
  mutate(gender = case_when(
    stringr::str_detect(group_code, "_m") ~ "male",
    !stringr::str_detect(group_code, "_m") ~ "female")
  ) %>% 
  mutate(age_group = stringr::str_remove_all(group_code, "[^[:digit:]]")) %>% 
  mutate(age_group = paste("age", age_group, sep = "_")) %>% 
  select(-group_code) %>% 
  pivot_wider(names_from = "gender",
              values_from = "cases")

# 2.2 Add the per-age-group total and its share of all cases (%)
ep_wide %>% 
  mutate(total = male + female) %>% 
  mutate(percent = round(total / sum(total) * 100, 2))

# A tibble: 7 x 5
  age_group  male female total percent
  <chr>     <int>  <int> <int>   <dbl>
1 age_014      62     56   118    1.34
2 age_1524    511    439   950   10.8 
3 age_2534    646    685  1331   15.1 
4 age_3544    690    644  1334   15.2 
5 age_4554    687    780  1467   16.7 
6 age_5564    571    584  1155   13.1 
7 age_65     1088   1352  2440   27.7
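The same table can be produced more compactly by splitting the column names during the pivot. The sketch below is an alternative to the steps above, not the original solution: it uses the names_pattern argument of pivot_longer() to split each name into a gender code and an age band, and the object name ep_wide_alt is made up for illustration.

```r
library(dplyr)
library(tidyr)

ep_wide_alt <- tidyr::who %>%
  filter(country == "Republic of Korea", year == 2010) %>%
  select(starts_with("new_ep_")) %>%
  # Split e.g. "new_ep_m014" into gender = "m" and age_group = "014"
  pivot_longer(everything(),
               names_to = c("gender", "age_group"),
               names_pattern = "new_ep_([mf])(\\d+)",
               values_to = "cases") %>%
  mutate(gender = if_else(gender == "m", "male", "female"),
         age_group = paste0("age_", age_group)) %>%
  pivot_wider(names_from = gender, values_from = cases) %>%
  mutate(total = male + female,
         percent = round(total / sum(total) * 100, 2))

ep_wide_alt
```

Because the regular expression captures the gender and the age band separately, no intermediate group_code column or str_detect() call is needed.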
