diagnose_category() produces information for diagnosing the quality of the categorical variables of a data.frame or tbl_df.

diagnose_category(.data, ...)

# S3 method for data.frame
diagnose_category(
  .data,
  ...,
  top = 10,
  type = c("rank", "n")[2],
  add_character = TRUE,
  add_date = TRUE
)

# S3 method for grouped_df
diagnose_category(
  .data,
  ...,
  top = 10,
  type = c("rank", "n")[2],
  add_character = TRUE,
  add_date = TRUE
)

Arguments

.data

a data.frame or a tbl_df or a grouped_df.

...

one or more unquoted expressions separated by commas. You can treat variable names as if they were positions. Positive values select variables; negative values drop variables. If the first expression is negative, diagnose_category() automatically starts with all variables. These arguments are automatically quoted and evaluated in a context where column names represent column positions. They support unquoting and splicing.

top

an integer. Specifies the number of top rows or ranks to extract. Default is 10.

type

a character string specifying how results are extracted. "rank" extracts the top n ranks by decreasing frequency; if there are ties in rank, more rows than the number specified by the top argument may be returned. "n", the default, extracts only the top n rows by decreasing frequency. If "rank" returns too many rows because of ties, use "n" to limit the number of rows returned.

add_character

logical. Whether to include character variables in the diagnosis of categorical data. The default is TRUE, which includes character variables.

add_date

logical. Whether to include Date and POSIXct variables in the diagnosis of categorical data. The default is TRUE, which includes Date and POSIXct variables.
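
The difference between type = "rank" and type = "n" can be sketched in base R. This is only an illustration of the behaviour described for the top and type arguments, not the package's implementation; the frequencies are made up, and the tie handling (ties.method = "min") is an assumption chosen to match the example output, where every tied level shares the same rank.

```r
# Hypothetical level frequencies for one categorical variable (illustrative only).
freq <- c(a = 5, b = 3, c = 3, d = 3, e = 1)

# Order by decreasing frequency; order() keeps ties in their original order.
freq <- freq[order(-freq)]

# Assumption: tied levels share the smallest rank, as in the example output.
rnk <- rank(-freq, ties.method = "min")

top <- 2

# type = "rank": keep every level whose rank is <= top, so ties can add rows.
by_rank <- names(freq)[rnk <= top]

# type = "n": keep exactly the first `top` rows, regardless of ties.
by_n <- names(freq)[seq_len(min(top, length(freq)))]

by_rank
by_n
```

Here three levels are tied at rank 2, so the "rank" selection keeps four levels while the "n" selection keeps two.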

Value

an object of class tbl_df.

Details

The scope of the diagnosis is the occupancy status of the levels in categorical data. If the occupancy of a single level is close to 100%, consider removing the variable from the forecast model. Conversely, if the occupancy of every level is close to 0%, the variable is likely to be an identifier.
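
The two rules of thumb above can be sketched in base R without the package. The data frame and the cut-offs of 95 and 1 percent are hypothetical; the text only says "close to 100" and "close to 0", so choose thresholds that suit your data.

```r
# Illustrative data: an identifier, a near-constant flag, and an ordinary category.
df <- data.frame(
  id     = sprintf("u%03d", 1:200),
  flag   = c(rep("Y", 198), rep("N", 2)),
  colour = rep(c("red", "green", "blue", "grey"), times = 50),
  stringsAsFactors = FALSE
)

# Percentage of observations held by the most frequent level of each variable.
top_ratio <- vapply(
  df,
  function(x) max(table(x, useNA = "ifany")) / length(x) * 100,
  numeric(1)
)

# Hypothetical cut-offs, not values defined by the package.
near_constant   <- names(top_ratio)[top_ratio >= 95]
identifier_like <- names(top_ratio)[top_ratio <= 1]

near_constant
identifier_like
```

If the largest level holds at most 1% of the observations, every other level does too, which is why checking only the top level is enough to spot an identifier-like variable.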

Categorical diagnostic information

The information derived from the categorical data diagnosis is as follows.

  • variables : variable names

  • levels : level names

  • N : number of observations

  • freq : number of observations at each level

  • ratio : percentage of observations at each level

  • rank : rank of occupancy ratio of levels

See vignette("diagonosis") for an introduction to these concepts.
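
As a rough illustration of what the six columns mean, they can be computed by hand for a single variable in base R. This is a sketch under assumptions, not the function's actual code: the vector x and the variable name "gender" are made-up inputs, missing values are counted as their own level as in the example output, and ties are assumed to share the smallest rank.

```r
# Hypothetical input vector with missing values (illustrative only).
x <- c("Male", "Male", NA, "Female", "Male", NA, "Other", "Male")

# Count each level, treating NA as a level of its own.
tab  <- table(x, useNA = "ifany")
lev  <- names(tab)
freq <- as.integer(tab)

# Sort levels by decreasing frequency.
ord <- order(-freq)

res <- data.frame(
  variables = "gender",                      # variable name
  levels    = lev[ord],                      # level names
  N         = length(x),                     # number of observations
  freq      = freq[ord],                     # observations at each level
  ratio     = freq[ord] / length(x) * 100,   # percentage at each level
  rank      = rank(-freq[ord], ties.method = "min"),  # assumed tie rule
  stringsAsFactors = FALSE
)

res
```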

Examples

# \donttest{
# Diagnosis of categorical variables
diagnose_category(jobchange)
#>              variables                  levels     N  freq        ratio rank
#> 1          enrollee_id                       1 19158     1  0.005219752    1
#> 2          enrollee_id                      10 19158     1  0.005219752    1
#> 3          enrollee_id                   10000 19158     1  0.005219752    1
#> 4          enrollee_id                   10001 19158     1  0.005219752    1
#> 5          enrollee_id                   10002 19158     1  0.005219752    1
#> 6          enrollee_id                   10003 19158     1  0.005219752    1
#> 7          enrollee_id                   10004 19158     1  0.005219752    1
#> 8          enrollee_id                   10005 19158     1  0.005219752    1
#> 9          enrollee_id                   10006 19158     1  0.005219752    1
#> 10         enrollee_id                   10008 19158     1  0.005219752    1
#> 11                city                city_103 19158  4355 22.732017956    1
#> 12                city                 city_21 19158  2702 14.103768661    2
#> 13                city                 city_16 19158  1533  8.001879111    3
#> 14                city                city_114 19158  1336  6.973588057    4
#> 15                city                city_160 19158   845  4.410690051    5
#> 16                city                city_136 19158   586  3.058774402    6
#> 17                city                 city_67 19158   431  2.249712914    7
#> 18                city                 city_75 19158   305  1.592024220    8
#> 19                city                city_102 19158   304  1.586804468    9
#> 20                city                city_104 19158   301  1.571145213   10
#> 21              gender                    Male 19158 13221 69.010335108    1
#> 22              gender                    <NA> 19158  4508 23.530639942    2
#> 23              gender                  Female 19158  1238  6.462052406    3
#> 24              gender                   Other 19158   191  0.996972544    4
#> 25 relevent_experience Has relevent experience 19158 13792 71.990813237    1
#> 26 relevent_experience  No relevent experience 19158  5366 28.009186763    2
#> 27 enrolled_university           no_enrollment 19158 13817 72.121307026    1
#> 28 enrolled_university        Full time course 19158  3757 19.610606535    2
#> 29 enrolled_university        Part time course 19158  1198  6.253262345    3
#> 30 enrolled_university                    <NA> 19158   386  2.014824094    4
#> 31     education_level                Graduate 19158 11598 60.538678359    1
#> 32     education_level                 Masters 19158  4361 22.763336465    2
#> 33     education_level             High School 19158  2017 10.528238856    3
#> 34     education_level                    <NA> 19158   460  2.401085708    4
#> 35     education_level                     Phd 19158   414  2.160977137    5
#> 36     education_level          Primary School 19158   308  1.607683474    6
#> 37    major_discipline                    STEM 19158 14492 75.644639315    1
#> 38    major_discipline                    <NA> 19158  2813 14.683161082    2
#> 39    major_discipline              Humanities 19158   669  3.492013780    3
#> 40    major_discipline                   Other 19158   381  1.988725337    4
#> 41    major_discipline         Business Degree 19158   327  1.706858754    5
#> 42    major_discipline                    Arts 19158   253  1.320597140    6
#> 43    major_discipline                No Major 19158   223  1.164004593    7
#> 44          experience                     >20 19158  3286 17.152103560    1
#> 45          experience                       5 19158  1430  7.464244702    2
#> 46          experience                       4 19158  1403  7.323311410    3
#> 47          experience                       3 19158  1354  7.067543585    4
#> 48          experience                       6 19158  1216  6.347217872    5
#> 49          experience                       2 19158  1127  5.882659985    6
#> 50          experience                       7 19158  1028  5.365904583    7
#> 51          experience                      10 19158   985  5.141455267    8
#> 52          experience                       9 19158   980  5.115356509    9
#> 53          experience                       8 19158   802  4.186240735   10
#> 54        company_size                    <NA> 19158  5938 30.994884643    1
#> 55        company_size                   50-99 19158  3083 16.092493997    2
#> 56        company_size                 100-499 19158  2571 13.419981209    3
#> 57        company_size                  10000+ 19158  2019 10.538678359    4
#> 58        company_size                   10-49 19158  1471  7.678254515    5
#> 59        company_size               1000-4999 19158  1328  6.931830045    6
#> 60        company_size                     <10 19158  1308  6.827435014    7
#> 61        company_size                 500-999 19158   877  4.577722100    8
#> 62        company_size               5000-9999 19158   563  2.938720117    9
#> 63        company_type                 Pvt Ltd 19158  9817 51.242300866    1
#> 64        company_type                    <NA> 19158  6140 32.049274455    2
#> 65        company_type          Funded Startup 19158  1001  5.224971291    3
#> 66        company_type           Public Sector 19158   955  4.984862721    4
#> 67        company_type     Early Stage Startup 19158   603  3.147510179    5
#> 68        company_type                     NGO 19158   521  2.719490552    6
#> 69        company_type                   Other 19158   121  0.631589936    7
#> 70        last_new_job                       1 19158  8040 41.966802380    1
#> 71        last_new_job                      >4 19158  3290 17.172982566    2
#> 72        last_new_job                       2 19158  2900 15.137279465    3
#> 73        last_new_job                   never 19158  2452 12.798830776    4
#> 74        last_new_job                       4 19158  1029  5.371124334    5
#> 75        last_new_job                       3 19158  1024  5.345025577    6
#> 76        last_new_job                    <NA> 19158   423  2.207954901    7
#> 77           job_chnge                      No 19158 14381 75.065246894    1
#> 78           job_chnge                     Yes 19158  4777 24.934753106    2

# Select the variable to diagnose
diagnose_category(jobchange, education_level, company_type)
#>          variables              levels     N  freq      ratio rank
#> 1  education_level            Graduate 19158 11598 60.5386784    1
#> 2  education_level             Masters 19158  4361 22.7633365    2
#> 3  education_level         High School 19158  2017 10.5282389    3
#> 4  education_level                <NA> 19158   460  2.4010857    4
#> 5  education_level                 Phd 19158   414  2.1609771    5
#> 6  education_level      Primary School 19158   308  1.6076835    6
#> 7     company_type             Pvt Ltd 19158  9817 51.2423009    1
#> 8     company_type                <NA> 19158  6140 32.0492745    2
#> 9     company_type      Funded Startup 19158  1001  5.2249713    3
#> 10    company_type       Public Sector 19158   955  4.9848627    4
#> 11    company_type Early Stage Startup 19158   603  3.1475102    5
#> 12    company_type                 NGO 19158   521  2.7194906    6
#> 13    company_type               Other 19158   121  0.6315899    7

# Using pipes ---------------------------------
library(dplyr)

# Diagnosis of all categorical variables
jobchange %>%
  diagnose_category()
#>              variables                  levels     N  freq        ratio rank
#> 1          enrollee_id                       1 19158     1  0.005219752    1
#> 2          enrollee_id                      10 19158     1  0.005219752    1
#> 3          enrollee_id                   10000 19158     1  0.005219752    1
#> 4          enrollee_id                   10001 19158     1  0.005219752    1
#> 5          enrollee_id                   10002 19158     1  0.005219752    1
#> 6          enrollee_id                   10003 19158     1  0.005219752    1
#> 7          enrollee_id                   10004 19158     1  0.005219752    1
#> 8          enrollee_id                   10005 19158     1  0.005219752    1
#> 9          enrollee_id                   10006 19158     1  0.005219752    1
#> 10         enrollee_id                   10008 19158     1  0.005219752    1
#> 11                city                city_103 19158  4355 22.732017956    1
#> 12                city                 city_21 19158  2702 14.103768661    2
#> 13                city                 city_16 19158  1533  8.001879111    3
#> 14                city                city_114 19158  1336  6.973588057    4
#> 15                city                city_160 19158   845  4.410690051    5
#> 16                city                city_136 19158   586  3.058774402    6
#> 17                city                 city_67 19158   431  2.249712914    7
#> 18                city                 city_75 19158   305  1.592024220    8
#> 19                city                city_102 19158   304  1.586804468    9
#> 20                city                city_104 19158   301  1.571145213   10
#> 21              gender                    Male 19158 13221 69.010335108    1
#> 22              gender                    <NA> 19158  4508 23.530639942    2
#> 23              gender                  Female 19158  1238  6.462052406    3
#> 24              gender                   Other 19158   191  0.996972544    4
#> 25 relevent_experience Has relevent experience 19158 13792 71.990813237    1
#> 26 relevent_experience  No relevent experience 19158  5366 28.009186763    2
#> 27 enrolled_university           no_enrollment 19158 13817 72.121307026    1
#> 28 enrolled_university        Full time course 19158  3757 19.610606535    2
#> 29 enrolled_university        Part time course 19158  1198  6.253262345    3
#> 30 enrolled_university                    <NA> 19158   386  2.014824094    4
#> 31     education_level                Graduate 19158 11598 60.538678359    1
#> 32     education_level                 Masters 19158  4361 22.763336465    2
#> 33     education_level             High School 19158  2017 10.528238856    3
#> 34     education_level                    <NA> 19158   460  2.401085708    4
#> 35     education_level                     Phd 19158   414  2.160977137    5
#> 36     education_level          Primary School 19158   308  1.607683474    6
#> 37    major_discipline                    STEM 19158 14492 75.644639315    1
#> 38    major_discipline                    <NA> 19158  2813 14.683161082    2
#> 39    major_discipline              Humanities 19158   669  3.492013780    3
#> 40    major_discipline                   Other 19158   381  1.988725337    4
#> 41    major_discipline         Business Degree 19158   327  1.706858754    5
#> 42    major_discipline                    Arts 19158   253  1.320597140    6
#> 43    major_discipline                No Major 19158   223  1.164004593    7
#> 44          experience                     >20 19158  3286 17.152103560    1
#> 45          experience                       5 19158  1430  7.464244702    2
#> 46          experience                       4 19158  1403  7.323311410    3
#> 47          experience                       3 19158  1354  7.067543585    4
#> 48          experience                       6 19158  1216  6.347217872    5
#> 49          experience                       2 19158  1127  5.882659985    6
#> 50          experience                       7 19158  1028  5.365904583    7
#> 51          experience                      10 19158   985  5.141455267    8
#> 52          experience                       9 19158   980  5.115356509    9
#> 53          experience                       8 19158   802  4.186240735   10
#> 54        company_size                    <NA> 19158  5938 30.994884643    1
#> 55        company_size                   50-99 19158  3083 16.092493997    2
#> 56        company_size                 100-499 19158  2571 13.419981209    3
#> 57        company_size                  10000+ 19158  2019 10.538678359    4
#> 58        company_size                   10-49 19158  1471  7.678254515    5
#> 59        company_size               1000-4999 19158  1328  6.931830045    6
#> 60        company_size                     <10 19158  1308  6.827435014    7
#> 61        company_size                 500-999 19158   877  4.577722100    8
#> 62        company_size               5000-9999 19158   563  2.938720117    9
#> 63        company_type                 Pvt Ltd 19158  9817 51.242300866    1
#> 64        company_type                    <NA> 19158  6140 32.049274455    2
#> 65        company_type          Funded Startup 19158  1001  5.224971291    3
#> 66        company_type           Public Sector 19158   955  4.984862721    4
#> 67        company_type     Early Stage Startup 19158   603  3.147510179    5
#> 68        company_type                     NGO 19158   521  2.719490552    6
#> 69        company_type                   Other 19158   121  0.631589936    7
#> 70        last_new_job                       1 19158  8040 41.966802380    1
#> 71        last_new_job                      >4 19158  3290 17.172982566    2
#> 72        last_new_job                       2 19158  2900 15.137279465    3
#> 73        last_new_job                   never 19158  2452 12.798830776    4
#> 74        last_new_job                       4 19158  1029  5.371124334    5
#> 75        last_new_job                       3 19158  1024  5.345025577    6
#> 76        last_new_job                    <NA> 19158   423  2.207954901    7
#> 77           job_chnge                      No 19158 14381 75.065246894    1
#> 78           job_chnge                     Yes 19158  4777 24.934753106    2

# Positive values select variables
jobchange %>%
  diagnose_category(company_type, job_chnge)
#>      variables              levels     N  freq      ratio rank
#> 1 company_type             Pvt Ltd 19158  9817 51.2423009    1
#> 2 company_type                <NA> 19158  6140 32.0492745    2
#> 3 company_type      Funded Startup 19158  1001  5.2249713    3
#> 4 company_type       Public Sector 19158   955  4.9848627    4
#> 5 company_type Early Stage Startup 19158   603  3.1475102    5
#> 6 company_type                 NGO 19158   521  2.7194906    6
#> 7 company_type               Other 19158   121  0.6315899    7
#> 8    job_chnge                  No 19158 14381 75.0652469    1
#> 9    job_chnge                 Yes 19158  4777 24.9347531    2
 
# Negative values drop variables
jobchange %>%
  diagnose_category(-company_type, -job_chnge)
#>              variables                  levels     N  freq        ratio rank
#> 1          enrollee_id                       1 19158     1  0.005219752    1
#> 2          enrollee_id                      10 19158     1  0.005219752    1
#> 3          enrollee_id                   10000 19158     1  0.005219752    1
#> 4          enrollee_id                   10001 19158     1  0.005219752    1
#> 5          enrollee_id                   10002 19158     1  0.005219752    1
#> 6          enrollee_id                   10003 19158     1  0.005219752    1
#> 7          enrollee_id                   10004 19158     1  0.005219752    1
#> 8          enrollee_id                   10005 19158     1  0.005219752    1
#> 9          enrollee_id                   10006 19158     1  0.005219752    1
#> 10         enrollee_id                   10008 19158     1  0.005219752    1
#> 11                city                city_103 19158  4355 22.732017956    1
#> 12                city                 city_21 19158  2702 14.103768661    2
#> 13                city                 city_16 19158  1533  8.001879111    3
#> 14                city                city_114 19158  1336  6.973588057    4
#> 15                city                city_160 19158   845  4.410690051    5
#> 16                city                city_136 19158   586  3.058774402    6
#> 17                city                 city_67 19158   431  2.249712914    7
#> 18                city                 city_75 19158   305  1.592024220    8
#> 19                city                city_102 19158   304  1.586804468    9
#> 20                city                city_104 19158   301  1.571145213   10
#> 21              gender                    Male 19158 13221 69.010335108    1
#> 22              gender                    <NA> 19158  4508 23.530639942    2
#> 23              gender                  Female 19158  1238  6.462052406    3
#> 24              gender                   Other 19158   191  0.996972544    4
#> 25 relevent_experience Has relevent experience 19158 13792 71.990813237    1
#> 26 relevent_experience  No relevent experience 19158  5366 28.009186763    2
#> 27 enrolled_university           no_enrollment 19158 13817 72.121307026    1
#> 28 enrolled_university        Full time course 19158  3757 19.610606535    2
#> 29 enrolled_university        Part time course 19158  1198  6.253262345    3
#> 30 enrolled_university                    <NA> 19158   386  2.014824094    4
#> 31     education_level                Graduate 19158 11598 60.538678359    1
#> 32     education_level                 Masters 19158  4361 22.763336465    2
#> 33     education_level             High School 19158  2017 10.528238856    3
#> 34     education_level                    <NA> 19158   460  2.401085708    4
#> 35     education_level                     Phd 19158   414  2.160977137    5
#> 36     education_level          Primary School 19158   308  1.607683474    6
#> 37    major_discipline                    STEM 19158 14492 75.644639315    1
#> 38    major_discipline                    <NA> 19158  2813 14.683161082    2
#> 39    major_discipline              Humanities 19158   669  3.492013780    3
#> 40    major_discipline                   Other 19158   381  1.988725337    4
#> 41    major_discipline         Business Degree 19158   327  1.706858754    5
#> 42    major_discipline                    Arts 19158   253  1.320597140    6
#> 43    major_discipline                No Major 19158   223  1.164004593    7
#> 44          experience                     >20 19158  3286 17.152103560    1
#> 45          experience                       5 19158  1430  7.464244702    2
#> 46          experience                       4 19158  1403  7.323311410    3
#> 47          experience                       3 19158  1354  7.067543585    4
#> 48          experience                       6 19158  1216  6.347217872    5
#> 49          experience                       2 19158  1127  5.882659985    6
#> 50          experience                       7 19158  1028  5.365904583    7
#> 51          experience                      10 19158   985  5.141455267    8
#> 52          experience                       9 19158   980  5.115356509    9
#> 53          experience                       8 19158   802  4.186240735   10
#> 54        company_size                    <NA> 19158  5938 30.994884643    1
#> 55        company_size                   50-99 19158  3083 16.092493997    2
#> 56        company_size                 100-499 19158  2571 13.419981209    3
#> 57        company_size                  10000+ 19158  2019 10.538678359    4
#> 58        company_size                   10-49 19158  1471  7.678254515    5
#> 59        company_size               1000-4999 19158  1328  6.931830045    6
#> 60        company_size                     <10 19158  1308  6.827435014    7
#> 61        company_size                 500-999 19158   877  4.577722100    8
#> 62        company_size               5000-9999 19158   563  2.938720117    9
#> 63        last_new_job                       1 19158  8040 41.966802380    1
#> 64        last_new_job                      >4 19158  3290 17.172982566    2
#> 65        last_new_job                       2 19158  2900 15.137279465    3
#> 66        last_new_job                   never 19158  2452 12.798830776    4
#> 67        last_new_job                       4 19158  1029  5.371124334    5
#> 68        last_new_job                       3 19158  1024  5.345025577    6
#> 69        last_new_job                    <NA> 19158   423  2.207954901    7
  
# Top rank levels with top argument
jobchange %>%
  diagnose_category(top = 2)
#>              variables                  levels     N  freq        ratio rank
#> 1          enrollee_id                       1 19158     1  0.005219752    1
#> 2          enrollee_id                      10 19158     1  0.005219752    1
#> 3                 city                city_103 19158  4355 22.732017956    1
#> 4                 city                 city_21 19158  2702 14.103768661    2
#> 5               gender                    Male 19158 13221 69.010335108    1
#> 6               gender                    <NA> 19158  4508 23.530639942    2
#> 7  relevent_experience Has relevent experience 19158 13792 71.990813237    1
#> 8  relevent_experience  No relevent experience 19158  5366 28.009186763    2
#> 9  enrolled_university           no_enrollment 19158 13817 72.121307026    1
#> 10 enrolled_university        Full time course 19158  3757 19.610606535    2
#> 11     education_level                Graduate 19158 11598 60.538678359    1
#> 12     education_level                 Masters 19158  4361 22.763336465    2
#> 13    major_discipline                    STEM 19158 14492 75.644639315    1
#> 14    major_discipline                    <NA> 19158  2813 14.683161082    2
#> 15          experience                     >20 19158  3286 17.152103560    1
#> 16          experience                       5 19158  1430  7.464244702    2
#> 17        company_size                    <NA> 19158  5938 30.994884643    1
#> 18        company_size                   50-99 19158  3083 16.092493997    2
#> 19        company_type                 Pvt Ltd 19158  9817 51.242300866    1
#> 20        company_type                    <NA> 19158  6140 32.049274455    2
#> 21        last_new_job                       1 19158  8040 41.966802380    1
#> 22        last_new_job                      >4 19158  3290 17.172982566    2
#> 23           job_chnge                      No 19158 14381 75.065246894    1
#> 24           job_chnge                     Yes 19158  4777 24.934753106    2
  
# Using pipes & dplyr -------------------------
# Extract levels that account for 60% or more of the observations
jobchange %>%
  diagnose_category() %>%
  filter(ratio >= 60)
#>             variables                  levels     N  freq    ratio rank
#> 1              gender                    Male 19158 13221 69.01034    1
#> 2 relevent_experience Has relevent experience 19158 13792 71.99081    1
#> 3 enrolled_university           no_enrollment 19158 13817 72.12131    1
#> 4     education_level                Graduate 19158 11598 60.53868    1
#> 5    major_discipline                    STEM 19158 14492 75.64464    1
#> 6           job_chnge                      No 19158 14381 75.06525    1

# Every level of enrollee_id has rank 1 because it is a unique identifier.
# Selecting up to rank 3 with type = "rank" therefore returns all records,
# which will probably fill your screen.

# Extract only the top 3 rows;
# the default of the type argument is "n"
jobchange %>% 
  diagnose_category(enrollee_id, top = 3)
#> # A tibble: 3 × 6
#>   variables   levels     N  freq   ratio  rank
#>   <chr>       <chr>  <int> <int>   <dbl> <int>
#> 1 enrollee_id 1      19158     1 0.00522     1
#> 2 enrollee_id 10     19158     1 0.00522     1
#> 3 enrollee_id 10000  19158     1 0.00522     1

# Extract rows with rank less than or equal to 3
jobchange %>% 
  diagnose_category(enrollee_id, top = 3, type = "rank")
#> # A tibble: 19,158 × 6
#>    variables   levels     N  freq   ratio  rank
#>    <chr>       <chr>  <int> <int>   <dbl> <int>
#>  1 enrollee_id 1      19158     1 0.00522     1
#>  2 enrollee_id 10     19158     1 0.00522     1
#>  3 enrollee_id 10000  19158     1 0.00522     1
#>  4 enrollee_id 10001  19158     1 0.00522     1
#>  5 enrollee_id 10002  19158     1 0.00522     1
#>  6 enrollee_id 10003  19158     1 0.00522     1
#>  7 enrollee_id 10004  19158     1 0.00522     1
#>  8 enrollee_id 10005  19158     1 0.00522     1
#>  9 enrollee_id 10006  19158     1 0.00522     1
#> 10 enrollee_id 10008  19158     1 0.00522     1
#> # ℹ 19,148 more rows
 
# extract only 3 rows
jobchange %>% 
  diagnose_category(enrollee_id, top = 3, type = "n")
#> # A tibble: 3 × 6
#>   variables   levels     N  freq   ratio  rank
#>   <chr>       <chr>  <int> <int>   <dbl> <int>
#> 1 enrollee_id 1      19158     1 0.00522     1
#> 2 enrollee_id 10     19158     1 0.00522     1
#> 3 enrollee_id 10000  19158     1 0.00522     1

# Using group_by ------------------------------
# Calculate the diagnosis of 'company_type' variable by 'job_chnge' using group_by()
jobchange %>%
  group_by(job_chnge) %>% 
  diagnose_category(company_type) 
#> # A tibble: 14 × 7
#>    variables    job_chnge levels                  N  freq  ratio  rank
#>    <chr>        <fct>     <chr>               <int> <int>  <dbl> <int>
#>  1 company_type No        Pvt Ltd             14381  8042 55.9       1
#>  2 company_type No        NA                  14381  3756 26.1       2
#>  3 company_type No        Funded Startup      14381   861  5.99      3
#>  4 company_type No        Public Sector       14381   745  5.18      4
#>  5 company_type No        Early Stage Startup 14381   461  3.21      5
#>  6 company_type No        NGO                 14381   424  2.95      6
#>  7 company_type No        Other               14381    92  0.640     7
#>  8 company_type Yes       NA                   4777  2384 49.9       1
#>  9 company_type Yes       Pvt Ltd              4777  1775 37.2       2
#> 10 company_type Yes       Public Sector        4777   210  4.40      3
#> 11 company_type Yes       Early Stage Startup  4777   142  2.97      4
#> 12 company_type Yes       Funded Startup       4777   140  2.93      5
#> 13 company_type Yes       NGO                  4777    97  2.03      6
#> 14 company_type Yes       Other                4777    29  0.607     7
# }