NPDHS data pooling pre-checks

Getting started

This section contains the prerequisite code. Run it at the outset to avoid errors later on. First we load the required packages.

easypackages::libraries(
  # Data i/o
  "here",                 # relative file path
  "rio",                  # file import-export
  
  # Data manipulation
  "janitor",              # data cleaning fns
  "haven",                # stata, sas, spss data io
  "labelled",             # var labelling
  "readxl",               # excel sheets
  # "scales",               # to change formats and units
  "skimr",                # quick data summary
  "broom",                # view model results
  
  # Data analysis
  "DHS.rates",            # demographic rates for dhs-like surveys
  "GeneralOaxaca",        # BO decomposition for non-linear
  "survey",               # apply survey weights
  
  # Analysis output
  "gt",
  # "modelsummary",          # output summary tables
  "gtsummary",            # output summary tables
  "flextable",            # creating tables from objects
  "officer",              # editing in office docs
  
  # R graph related packages
  "ggstats",
  "RColorBrewer",
  # "scales",
  "patchwork",
  
  # Misc packages
  "tidyverse",            # Data manipulation iron man
  "tictoc"                # Code timing
)

Next we turn off scientific notation.

options(scipen = 999)
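As a quick illustration of what this option changes (a minimal sketch, not part of the original pipeline), the snippet below formats a large number with the default penalty and again with the penalty raised. The value 1e10 and the object names are arbitrary examples.

```r
# Temporarily switch to R's default penalty and keep the previous setting
old_opts <- options(scipen = 0)
default_fmt <- format(1e10)   # the default prefers scientific notation here

# A large penalty makes R prefer fixed notation
options(scipen = 999)
fixed_fmt <- format(1e10)

print(c(default = default_fmt, fixed = fixed_fmt))

# Restore whatever setting was active before this snippet ran
options(old_opts)
```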

Next we set the default gtsummary print engine for tables.

theme_gtsummary_printer(print_engine = "flextable")

Now we set the flextable output defaults.

set_flextable_defaults(
  font.size = 11,
  text.align = "left",
  big.mark = "",
  background.color = "white",
  table.layout = "autofit",
  theme_fun = theme_vanilla
)

Document introduction

Here we document the variable codes and labels of variables across all the Nepal Demographic and Health Survey (DHS) datasets. We check the variable labels and codes before running the pooling code in “daprep-v01_npdhs.R”. We pool the following Nepal DHS surveys:

# Creating the table of surveys to be used for pooling
npbr1_tmp_intro |> 
  mutate(n_births = prettyNum(n_births, big.mark = ",")) |> 
  select(c(ctr_name, svy_year, n_births)) |> 
  # Join vars from npir_tmp_intro
  left_join(
    npir1_tmp_intro |> 
      mutate(n_women = prettyNum(n_women, big.mark = ",")) |> 
      select(c(year, n_women)),
    by = join_by(svy_year == year)
  ) |> 
  # Join vars from nphr_tmp_intro
  left_join(
    nphr1_tmp_intro |> 
      mutate(n_households = prettyNum(n_households, big.mark = ",")) |> 
      select(svy_year, n_households),
    by = join_by(svy_year)
  ) |> 
  # Join vars from nppr_tmp_intro
  left_join(
    nppr1_tmp_intro |> 
      mutate(n_persons = prettyNum(n_persons, big.mark = ",")) |> 
      select(svy_year, n_persons),
    by = join_by(svy_year)
  ) |> 
  # convert nested tibble to simple tibble
  unnest(cols = c()) |> 
  mutate(
    ccode = row_number(), 
    .before = ctr_name
  ) |> 
  # convert to flextable object
  qflextable() |> 
  align(align = "left", part = "all") |> 
  autofit()
Table 1: Nepal DHS datasets and their sample size to be used for pooling

ccode | ctr_name | svy_year | n_births | n_women | n_households | n_persons
1 | Nepal | 1996 | 29,156 | 8,429 | 8,082 | 46,576
2 | Nepal | 2001 | 28,955 | 8,726 | 8,602 | 47,523
3 | Nepal | 2006 | 26,394 | 10,793 | 8,707 | 44,057
4 | Nepal | 2011 | 26,615 | 12,674 | 10,826 | 49,791
5 | Nepal | 2016 | 26,028 | 12,862 | 11,040 | 49,064
6 | Nepal | 2022 | 27,613 | 14,845 | 13,786 | 57,278

We use the following variables for the pooled data analysis:

  • Dependent variable
    • infantd = Index child died during infancy period (0-11 months)
  • Main independent variables
    • sibsurv_nmv = Survival status of preceding child (Death scarring)
    • binterval_3c_nmv_opp = Birth interval preceding the index child
  • Independent variables [CHILD LEVEL]
    • cyob10y_opp = Birth cohort of index child
    • bord_c = Birth order of index child
    • sex_fm = Gender of index child
    • season = Season during birth
  • Independent variables [MOTHER/PARENT LEVEL]
    • myob_opp = Birth cohort of mother
    • macb_c_opp = Mother’s age during birth of index child
    • medu_opp = Mother’s level of education
    • fedu_opp = Father’s level of education
  • Independent variables [HOUSEHOLD LEVEL]
    • religion = Religion
    • nat_lang = Native language of respondent
    • wi_qt_opp = Household wealth quintile
    • hhgen_2c_opp = Generations in household
    • hhstruc_opp = Household structure
    • head_sex_fm = Sex of HH head
  • Independent variables [COMMUNITY LEVEL]
    • por = Place of residence of the household
    • ecoreg = Ecological region

Note: (a) Crossed-out names indicate variables that are not included.

Data import

We will directly import the nested tibble here. The code for dataset preparation is in the “daprep-v01_npdhs.R” script file.

# Here we import the tibbles to be used for dataset checking
# Import the npbr nested tibble
npbr1_pre_tmp0 <- read_rds(file = here("website_data", "npbr1_nest0.rds"))
# Import the nphr nested tibble
nphr1_pre_tmp0 <- read_rds(file = here("website_data", "nphr1_nest0.rds"))
# Import the nppr nested tibble
nppr1_pre_tmp0 <- read_rds(file = here("website_data", "nppr1_nest0.rds"))

Nepal BR dataset used for variable creation

Checking the Women’s weight variable before harmonization

We check the formatting of the women’s weight variable, v005, before creating the pooled survey weight. For this we use labelled::look_for().

# First we create the data dictionary of v005 in nested tibble
npbr1_pre_tmp1 <- npbr1_pre_tmp0 |> 
  mutate(lookfor_v005 = map(
    npbr_data,
    \(df) {
      df |> 
        select(v005) |> 
        look_for(details = "full") |> 
        # For correctly viewing the range column in data dictionary
        convert_list_columns_to_character() |> 
        select(-c(levels:n_na))
    }
  ))
npbr1_pre_tmp1
# Now we unnest the tibble and refine the pooled data dictionary 
npbr1_pre_tmp2 <- npbr1_pre_tmp1 |> 
  select(-c(unf, npbr_data, n_births)) |> 
  unnest(cols = c(lookfor_v005)) |> 
  select(-pos) 
# Convert and view the tibble as flextable
npbr1_pre_tmp2 |> 
  qflextable() |> 
  align(align = "left", part = "all") |> 
  autofit()
Table 2: Data dictionary of v005 variable across the npbr rounds

ctr_name | svy_year | variable | label | col_type | missing | unique_values | range
Nepal | 1996 | v005 | sample weight | dbl | 0 | 21 | 412612 - 1538711
Nepal | 2001 | v005 | sample weight | dbl | 0 | 23 | 345841 - 1667756
Nepal | 2006 | v005 | sample weight | dbl | 0 | 260 | 63525 - 5297300
Nepal | 2011 | v005 | women's individual sample weight (6 decimals) | dbl | 0 | 25 | 103855 - 2512923
Nepal | 2016 | v005 | women's individual sample weight (6 decimals) | dbl | 0 | 381 | 125730 - 6581418
Nepal | 2022 | v005 | women's individual sample weight (6 decimals) | dbl | 0 | 473 | 168774 - 3703774

The women’s weight variables are numeric and have no missing values, so we need not reformat them and can use them directly to prepare the pooled survey weight. NOTE: the women’s weights for the Nepal 1996, 2001 and 2011 rounds have few unique values (21, 23 and 25 respectively). This could be because there were fewer sampling units at the secondary stage.
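DHS stores sample weights as whole numbers with six implied decimal places, so a weight is normally divided by 1,000,000 before use. The sketch below shows that rescaling on a toy data frame. The column name wt is a hypothetical choice, the example values are borrowed from the ranges in Table 2, and any further adjustment needed when pooling rounds is not shown.

```r
# Toy data: v005 values borrowed from the ranges reported in Table 2
toy <- data.frame(
  svy_year = c(1996, 1996, 2022, 2022),
  v005     = c(412612, 1538711, 168774, 3703774)
)

# Rescale to the usual weight scale (six implied decimals);
# `wt` is a hypothetical name, not one taken from the pooling script
toy$wt <- toy$v005 / 1e6

print(toy)
```

Keeping v005 untouched and writing the rescaled value to a new column makes it easy to trace a pooled weight back to its source.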

Checking the ID variables before harmonization

Here we check the formatting of the variables from which we will build the ID variables for the pooled Nepal birth history recode (BR) dataset. We will use the following constituent variables: v001, v002, v003, bord, v021, v022, v023 and v024.

# We check the var type of ID vars in all npbr datasets.
# First we create a data dictionary of the npbr datasets in nested tibble.
npbr1_pre_tmp1 <- npbr1_pre_tmp0 |>
  mutate(lookfor_idvars = map(
    npbr_data,
    \(df) {
      df |> 
        select(v001, v002, v003, bord, v021, v022, v023, v024) |> 
        lookfor(details = "full") |> 
        select(-c(levels:n_na)) |> 
        # For correctly viewing the range column in data dictionary
        convert_list_columns_to_character()
    }
  ))
npbr1_pre_tmp1
# Now we unnest the tibble and output the pooled data dictionary 
npbr1_pre_tmp2 <- npbr1_pre_tmp1 |> 
  select(-c(unf, npbr_data, n_births)) |> 
  unnest(cols = c(lookfor_idvars)) |> 
  arrange(pos)

# Convert and view the tibble as flextable
npbr1_pre_tmp2 |> 
  qflextable() |> 
  align(align = "left", part = "all") |> 
  autofit()
Table 3: Data dictionary of variables to be used for ID creation across the npbr rounds

ctr_name | svy_year | pos | variable | label | col_type | missing | unique_values | range
Nepal | 1996 | 1 | v001 | cluster number | dbl | 0 | 253 | 101 - 7502
Nepal | 2001 | 1 | v001 | cluster number | dbl | 0 | 251 | 101 - 7502
Nepal | 2006 | 1 | v001 | cluster number | dbl | 0 | 260 | 101 - 7502
Nepal | 2011 | 1 | v001 | cluster number | dbl | 0 | 289 | 101 - 7502
Nepal | 2016 | 1 | v001 | cluster number | dbl | 0 | 383 | 1 - 383
Nepal | 2022 | 1 | v001 | cluster number | dbl | 0 | 476 | 1 - 476
Nepal | 1996 | 2 | v002 | household number | dbl | 0 | 465 | 1 - 774
Nepal | 2001 | 2 | v002 | household number | dbl | 0 | 548 | 1 - 9006
Nepal | 2006 | 2 | v002 | household number | dbl | 0 | 510 | 1 - 1319
Nepal | 2011 | 2 | v002 | household number | dbl | 0 | 576 | 1 - 1403
Nepal | 2016 | 2 | v002 | household number | dbl | 0 | 382 | 1 - 963
Nepal | 2022 | 2 | v002 | household number | dbl | 0 | 321 | 1 - 505
Nepal | 1996 | 3 | v003 | respondent's line number | dbl | 0 | 26 | 1 - 27
Nepal | 2001 | 3 | v003 | respondent's line number | dbl | 0 | 24 | 1 - 26
Nepal | 2006 | 3 | v003 | respondent's line number | dbl | 0 | 22 | 1 - 29
Nepal | 2011 | 3 | v003 | respondent's line number | dbl | 0 | 21 | 1 - 26
Nepal | 2016 | 3 | v003 | respondent's line number | dbl | 0 | 25 | 1 - 33
Nepal | 2022 | 3 | v003 | respondent's line number | dbl | 0 | 18 | 1 - 21
Nepal | 1996 | 4 | bord | birth order number | dbl | 0 | 16 | 1 - 16
Nepal | 2001 | 4 | bord | birth order number | dbl | 0 | 14 | 1 - 14
Nepal | 2006 | 4 | bord | birth order number | dbl | 0 | 16 | 1 - 16
Nepal | 2011 | 4 | bord | birth order number | dbl | 0 | 14 | 1 - 14
Nepal | 2016 | 4 | bord | birth order number | dbl | 0 | 15 | 1 - 15
Nepal | 2022 | 4 | bord | birth order number | dbl | 0 | 12 | 1 - 12
Nepal | 1996 | 5 | v021 | primary sampling unit | dbl | 0 | 253 | 101 - 7502
Nepal | 2001 | 5 | v021 | primary sampling unit | dbl | 0 | 251 | 101 - 7502
Nepal | 2006 | 5 | v021 | primary sampling unit | dbl | 0 | 260 | 101 - 7502
Nepal | 2011 | 5 | v021 | primary sampling unit | dbl | 0 | 289 | 101 - 7502
Nepal | 2016 | 5 | v021 | primary sampling unit | dbl | 0 | 383 | 1 - 383
Nepal | 2022 | 5 | v021 | primary sampling unit | dbl | 0 | 476 | 1 - 476
Nepal | 1996 | 6 | v022 | sample stratum number | dbl | 0 | 145 | 51 - 3751
Nepal | 2001 | 6 | v022 | sample stratum number | dbl | 0 | 144 | 51 - 3751
Nepal | 2006 | 6 | v022 | sample stratum number | dbl | 0 | 117 | 1 - 118
Nepal | 2011 | 6 | v022 | sample strata for sampling errors | dbl+lbl | 0 | 25 | 1 - 25
Nepal | 2016 | 6 | v022 | sample strata for sampling errors | dbl+lbl | 0 | 14 | 1 - 14
Nepal | 2022 | 6 | v022 | sample strata for sampling errors | dbl+lbl | 0 | 14 | 1 - 14
Nepal | 1996 | 7 | v023 | sample domain | dbl+lbl | 0 | 1 | 0 - 0
Nepal | 2001 | 7 | v023 | sample domain | dbl+lbl | 0 | 13 | 1 - 13
Nepal | 2006 | 7 | v023 | sample domain | dbl+lbl | 0 | 13 | 1 - 13
Nepal | 2011 | 7 | v023 | stratification used in sample design | dbl+lbl | 0 | 13 | 1 - 13
Nepal | 2016 | 7 | v023 | stratification used in sample design | dbl+lbl | 0 | 14 | 1 - 14
Nepal | 2022 | 7 | v023 | stratification used in sample design | dbl+lbl | 0 | 14 | 1 - 14
Nepal | 1996 | 8 | v024 | region | dbl+lbl | 0 | 5 | 1 - 5
Nepal | 2001 | 8 | v024 | region | dbl+lbl | 0 | 5 | 1 - 5
Nepal | 2006 | 8 | v024 | region | dbl+lbl | 0 | 5 | 1 - 5
Nepal | 2011 | 8 | v024 | region | dbl+lbl | 0 | 3 | 1 - 3
Nepal | 2016 | 8 | v024 | province | dbl+lbl | 0 | 7 | 1 - 7
Nepal | 2022 | 8 | v024 | province | dbl+lbl | 0 | 7 | 1 - 7

From the above we can see that v023 and v024 are labelled in every round, while v001, v002, v003, bord and v021 are numeric. v022 is numeric up to 2006 and labelled from 2011 onward; we check it together with the numeric variables. We therefore check the numeric and labelled variables in different ways. Note that although survey year is a constituent ID variable, we have not checked it here, because we expect it to be a 4-digit variable in every round.
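The expectation that survey year has four digits can be verified directly. The following is a minimal sketch, assuming the six survey years listed in Table 1; the object names are illustrative and do not come from the pooling script.

```r
# Survey years of the six rounds listed in Table 1
svy_year <- c(1996, 2001, 2006, 2011, 2016, 2022)

# Count the digits of each year after converting it to integer text
year_digits <- nchar(as.character(as.integer(svy_year)))

# Stop early if any round would break a fixed-width ID
stopifnot(all(year_digits == 4))
year_digits
```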

Numeric ID variables check

First, let’s find out the required length of the numeric ID variables by checking the maximum values of the constituent ID variables across the Nepal DHS datasets. Here we estimate the summary statistics of the numeric constituent variables using skim_without_charts().

# Check the summary stats for ID vars using skimr in each npbr dataset.
# First we estimate the summary stats using skim_without_charts().
npbr1_pre_tmp1 <- npbr1_pre_tmp0 |> 
  mutate(
    skim_id_num = map(
      npbr_data,
      function(df) {
        df |> 
          select(v001, v002, v003, bord, v021, v022) |> 
          skim_without_charts() |> 
          as_tibble() |> 
          select(-c(skim_type, n_missing, complete_rate)) |> 
          rename(
            variable = 1,
            mean = 2,
            sd = 3,
            min = 4,
            p25 = 5,
            p50 = 6,
            p75 = 7,
            max = 8
          )
      }
    )
  )
npbr1_pre_tmp1

Next, we compare the summary statistics of the numeric variables across rounds, ordered by variable name.

# Now we unnest the nested tibble so that we can compare the variable length 
# across the npbr datasets.
npbr1_pre_tmp2 <- npbr1_pre_tmp1 |> 
  select(-c(unf, npbr_data, n_births)) |> 
  unnest(cols = c(skim_id_num)) |> 
  arrange(variable, svy_year) |> 
  # change the decimal places of selected variables
  mutate(
    mean = sprintf("%.1f", mean),
    sd = sprintf("%.1f", sd),
    p75 = sprintf("%.0f", p75)
  )
# Convert the tibble to flextable for easy viewing
npbr1_pre_tmp2 |> 
  qflextable() |> 
  align(align = "left", part = "all") |> 
  autofit()
Table 4: Summary statistics of the numeric ID variables

ctr_name | svy_year | variable | mean | sd | min | p25 | p50 | p75 | max
Nepal | 1996 | bord | 3.2 | 2.1 | 1 | 1 | 3 | 4 | 16
Nepal | 2001 | bord | 3.0 | 2.0 | 1 | 1 | 3 | 4 | 14
Nepal | 2006 | bord | 2.8 | 1.9 | 1 | 1 | 2 | 4 | 16
Nepal | 2011 | bord | 2.5 | 1.7 | 1 | 1 | 2 | 3 | 14
Nepal | 2016 | bord | 2.4 | 1.5 | 1 | 1 | 2 | 3 | 15
Nepal | 2022 | bord | 2.2 | 1.4 | 1 | 1 | 2 | 3 | 12
Nepal | 1996 | v001 | 3895.8 | 2194.2 | 101 | 2001 | 3803 | 5702 | 7502
Nepal | 2001 | v001 | 3812.3 | 2332.9 | 101 | 1706 | 3601 | 5803 | 7502
Nepal | 2006 | v001 | 3914.3 | 2254.5 | 101 | 1804 | 3902 | 5803 | 7502
Nepal | 2011 | v001 | 3990.4 | 2319.5 | 101 | 1901 | 4301 | 5905 | 7502
Nepal | 2016 | v001 | 199.7 | 113.1 | 1 | 95 | 208 | 302 | 383
Nepal | 2022 | v001 | 245.8 | 140.6 | 1 | 116 | 253 | 373 | 476
Nepal | 1996 | v002 | 81.0 | 91.3 | 1 | 26 | 55 | 98 | 774
Nepal | 2001 | v002 | 214.4 | 1009.0 | 1 | 33 | 69 | 128 | 9006
Nepal | 2006 | v002 | 95.4 | 102.0 | 1 | 32 | 68 | 126 | 1319
Nepal | 2011 | v002 | 120.8 | 120.7 | 1 | 44 | 91 | 164 | 1403
Nepal | 2016 | v002 | 83.6 | 71.7 | 1 | 30 | 66 | 122 | 963
Nepal | 2022 | v002 | 79.7 | 66.0 | 1 | 29 | 63 | 116 | 505
Nepal | 1996 | v003 | 2.7 | 2.2 | 1 | 2 | 2 | 2 | 27
Nepal | 2001 | v003 | 2.5 | 1.9 | 1 | 2 | 2 | 2 | 26
Nepal | 2006 | v003 | 2.4 | 1.8 | 1 | 2 | 2 | 2 | 29
Nepal | 2011 | v003 | 2.2 | 1.5 | 1 | 2 | 2 | 2 | 26
Nepal | 2016 | v003 | 2.2 | 1.6 | 1 | 1 | 2 | 2 | 33
Nepal | 2022 | v003 | 2.2 | 1.4 | 1 | 1 | 2 | 2 | 21
Nepal | 1996 | v021 | 3895.8 | 2194.2 | 101 | 2001 | 3803 | 5702 | 7502
Nepal | 2001 | v021 | 3812.3 | 2332.9 | 101 | 1706 | 3601 | 5803 | 7502
Nepal | 2006 | v021 | 3914.3 | 2254.5 | 101 | 1804 | 3902 | 5803 | 7502
Nepal | 2011 | v021 | 3990.4 | 2319.5 | 101 | 1901 | 4301 | 5905 | 7502
Nepal | 2016 | v021 | 199.7 | 113.1 | 1 | 95 | 208 | 302 | 383
Nepal | 2022 | v021 | 245.8 | 140.6 | 1 | 116 | 253 | 373 | 476
Nepal | 1996 | v022 | 1948.2 | 1097.1 | 51 | 1001 | 1902 | 2851 | 3751
Nepal | 2001 | v022 | 1906.4 | 1166.5 | 51 | 853 | 1801 | 2902 | 3751
Nepal | 2006 | v022 | 62.2 | 34.3 | 1 | 31 | 63 | 92 | 118
Nepal | 2011 | v022 | 13.6 | 7.1 | 1 | 8 | 14 | 19 | 25
Nepal | 2016 | v022 | 7.7 | 4.1 | 1 | 4 | 8 | 11 | 14
Nepal | 2022 | v022 | 7.5 | 4.2 | 1 | 4 | 8 | 11 | 14

Now we work out the length to be set for each numeric ID variable, so that the variables concatenate correctly into the ID variables. The required length of each variable is given in the max_digits column. Note that survey year, with 4 digits, is also a constituent ID variable.

# Processing the above nested tibble further
npbr1_pre_tmp3 <- npbr1_pre_tmp2 |> 
  group_by(variable) |> 
  # find the minimum and maximum values across surveys 
  summarize(
    min_val = min(min),
    max_val = max(max)
  ) |> 
  mutate(
    # calculate the num of digits in the maximum values
    max_digits = nchar(as.character(max_val)),
    # convert char var to factor
    variable = fct(
      variable, 
      levels = c("v001", "v002", "v003", "bord", "v021", "v022")
    )
  ) |> 
  # sort the rows by factor levels 
  arrange(variable) |> 
  # add variable labels and relocate it after variable name.
  bind_cols(vlabel = c("cluster number", "household number", 
                       "respondent's line number", "birth order", 
                       "primary sampling unit", "sample strata for se")) |> 
  relocate(vlabel, .after = 1)

# Convert the tibble to flextable for easy viewing
npbr1_pre_tmp3 |> 
  qflextable() |> 
  align(align = "left", part = "all") |> 
  autofit()
Table 5: The maximum length of numeric variables to be set across the npbr rounds for concatenating the ID variables

variable | vlabel | min_val | max_val | max_digits
v001 | cluster number | 1 | 7502 | 4
v002 | household number | 1 | 9006 | 4
v003 | respondent's line number | 1 | 33 | 2
bord | birth order | 1 | 16 | 2
v021 | primary sampling unit | 1 | 7502 | 4
v022 | sample strata for se | 1 | 3751 | 4
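The reason for fixing these lengths can be shown with a short sketch: without zero-padding, different combinations of cluster and household numbers can collapse into the same string. The widths below come from the max_digits column of Table 5. The helper names pad and make_id, and the particular set of variables joined into one ID, are assumptions made for illustration and are not taken from the pooling script.

```r
# Left-pad whole numbers with zeros to a fixed width
pad <- function(x, width) {
  formatC(as.integer(x), width = width, flag = "0", format = "d")
}

# Hypothetical helper: the exact ID composition is an assumption
make_id <- function(svy_year, v001, v002, v003, bord) {
  paste0(pad(svy_year, 4), pad(v001, 4), pad(v002, 4),
         pad(v003, 2), pad(bord, 2))
}

# Without padding, cluster 1 / household 12 and cluster 11 / household 2
# both collapse to the string "112"
unpadded <- paste0(c(1, 11), c(12, 2))

# With padding the two households stay distinct
padded <- paste0(pad(c(1, 11), 4), pad(c(12, 2), 4))

# One example ID built from made-up values
child_id <- make_id(2016, v001 = 12, v002 = 7, v003 = 2, bord = 1)

print(list(unpadded = unpadded, padded = padded, child_id = child_id))
```

Because every component has a fixed width, the resulting ID always has the same number of characters and can be split back into its parts by position.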

Labelled ID variables check

First we check the value labels of the sub-national region variable, v024, across the npbr datasets. Let’s create a nested tibble of v024’s value labels.

# Create the data dictionary for v024 in nested tibble
npbr1_pre_tmp1 <- npbr1_pre_tmp0 |> 
  mutate(lookfor_v024 = map(
    npbr_data,
    \(df) {
      df |> 
        select(v024) |> 
        look_for() |> 
        lookfor_to_long_format() |> 
        select(value_labels)
    }
  ))
npbr1_pre_tmp1

Now we view the value labels of v024 in the table below.

# Now we unnest the tibble and refine the pooled data dictionary 
npbr1_pre_tmp2 <- npbr1_pre_tmp1 |> 
  # First we select the required cols and unnest()
  select(-c(unf, npbr_data, n_births)) |> 
  unnest(cols = c(lookfor_v024)) |> 
  # Next we make the num of value labels same across each round
  mutate(label_num = parse_number(value_labels)) |> 
  complete(ctr_name, svy_year, label_num) |>
  # Next we create col of value labels for each survey round
  pivot_wider(
    names_from = svy_year, 
    values_from = value_labels,
    names_prefix = "npbr_"
  ) |> 
  # Show the variable name in a col
  mutate(var_name = "v024", .before = 2)

# Convert the tibble to flextable for easy viewing
npbr1_pre_tmp2 |> 
  qflextable() |> 
  align(align = "left", part = "all") |> 
  autofit()
Table 6: Data dictionary of v024 across the npbr rounds

ctr_name | var_name | label_num | npbr_1996 | npbr_2001 | npbr_2006 | npbr_2011 | npbr_2016 | npbr_2022
Nepal | v024 | 1 | [1] eastern | [1] eastern | [1] eastern | [1] mountain | [1] province 1 | [1] koshi
Nepal | v024 | 2 | [2] central | [2] central | [2] central | [2] hill | [2] province 2 | [2] madhesh province
Nepal | v024 | 3 | [3] western | [3] western | [3] western | [3] terai | [3] province 3 | [3] bagmati province
Nepal | v024 | 4 | [4] midwestern | [4] mid-western | [4] mid-western |  | [4] province 4 | [4] gandaki province
Nepal | v024 | 5 | [5] farwestern | [5] far-western | [5] far-western |  | [5] province 5 | [5] lumbini province
Nepal | v024 | 6 |  |  |  |  | [6] province 6 | [6] karnali province
Nepal | v024 | 7 |  |  |  |  | [7] province 7 | [7] sudurpashchim province

NOTE: The value labels of the sub-national region variable, v024, differ across survey years. They are the same for npbr 1996, 2001 and 2006 (apart from hyphenation) and differ in each survey round after that.
VERD: In this analysis, we do not use the region variable in the ID variable.
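The comparison in the note above can also be made programmatically. The sketch below is a minimal illustration in base R: the labels are transcribed by hand from Table 6, and the object and function names are hypothetical. It tests each round against the 1996 labels after removing hyphens, so that "midwestern" and "mid-western" count as the same label.

```r
# Value labels of v024 by round, transcribed from Table 6
v024_labels <- list(
  "1996" = c("eastern", "central", "western", "midwestern", "farwestern"),
  "2001" = c("eastern", "central", "western", "mid-western", "far-western"),
  "2006" = c("eastern", "central", "western", "mid-western", "far-western"),
  "2011" = c("mountain", "hill", "terai"),
  "2016" = paste("province", 1:7),
  "2022" = c("koshi", "madhesh province", "bagmati province",
             "gandaki province", "lumbini province", "karnali province",
             "sudurpashchim province")
)

# Normalise spelling so that hyphenation differences are ignored
normalise <- function(x) gsub("-", "", tolower(x), fixed = TRUE)

# Compare every round against the 1996 labels
reference <- normalise(v024_labels[["1996"]])
same_as_1996 <- vapply(
  v024_labels,
  function(x) identical(normalise(x), reference),
  logical(1)
)
same_as_1996
```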


Secondly, we check the value labels of v023, the variable that denotes the stratification used in the sampling design. First we create a nested tibble of v023’s value labels.

# Create the data dictionary for v023 in nested tibble
npbr1_pre_tmp1 <- npbr1_pre_tmp0 |> 
  mutate(lookfor_v023 = map(
    npbr_data,
    \(df) {
      df |> 
        select(v023) |> 
        look_for() |> 
        lookfor_to_long_format() |> 
        select(value_labels)
    }
  ))
npbr1_pre_tmp1

Now we view the value labels of v023 in the table below.

# Now we unnest the tibble and refine the pooled data dictionary
npbr1_pre_tmp2 <- npbr1_pre_tmp1 |> 
  # First we select the required cols and unnest()
  select(-c(unf, npbr_data, n_births)) |> 
  unnest(cols = c(lookfor_v023)) |> 
  # Next we make the num of value labels same across each round
  mutate(label_num = parse_number(value_labels)) |> 
  complete(ctr_name, svy_year, label_num) |>
  # Next we create col of value labels for each survey round
  pivot_wider(
    names_from = svy_year, 
    values_from = value_labels,
    names_prefix = "npbr_"
  ) |>
  # Show the variable name in a col
  mutate(var_name = "v023", .before = 2) 

# Convert the tibble to flextable for easy viewing
npbr1_pre_tmp2 |> 
  qflextable() |> 
  align(align = "left", part = "all") |> 
  autofit()
Table 7: Data dictionary of v023 across the npbr rounds

ctr_name | var_name | label_num | npbr_1996 | npbr_2001 | npbr_2006 | npbr_2011 | npbr_2016 | npbr_2022
Nepal | v023 | 0 | [0] national |  |  |  |  | 
Nepal | v023 | 1 | [1] country specific | [1] eastern mountain | [1] eastern mountain | [1] eastern mountain | [1] province 1 - urban | [1] koshi - urban
Nepal | v023 | 2 |  | [2] central mountain | [2] central mountain | [2] central mountain | [2] province 1 - rural | [2] koshi - rural
Nepal | v023 | 3 |  | [3] western mountain | [3] western mountain | [3] western mountain | [3] province 2 - urban | [3] madhesh province - urban
Nepal | v023 | 4 |  | [4] eastern hill | [4] eastern hill | [4] eastern hill | [4] province 2 - rural | [4] madhesh province - rural
Nepal | v023 | 5 |  | [5] central hill | [5] central hill | [5] central hill | [5] province 3 - urban | [5] bagmati province - urban
Nepal | v023 | 6 |  | [6] western hill | [6] western hill | [6] western hill | [6] province 3 - rural | [6] bagmati province - rural
Nepal | v023 | 7 |  | [7] mid-western hill | [7] mid-western hill | [7] mid-western hill | [7] province 4 - urban | [7] gandaki province - urban
Nepal | v023 | 8 |  | [8] far-western hill | [8] far-western hill | [8] far-western hill | [8] province 4 - rural | [8] gandaki province - rural
Nepal | v023 | 9 |  | [9] eastern terai | [9] eastern terai | [9] eastern terai | [9] province 5 - urban | [9] lumbini province - urban
Nepal | v023 | 10 |  | [10] central terai | [10] central terai | [10] central terai | [10] province 5 - rural | [10] lumbini province - rural
Nepal | v023 | 11 |  | [11] western terai | [11] western terai | [11] western terai | [11] province 6 - urban | [11] karnali province - urban
Nepal | v023 | 12 |  | [12] mid-western terai | [12] mid-western terai | [12] mid-western terai | [12] province 6 - rural | [12] karnali province - rural
Nepal | v023 | 13 |  | [13] far-western terai | [13] far-western terai | [13] far-western terai | [13] province 7 - urban | [13] sudurpashchim province - urban
Nepal | v023 | 14 |  |  |  |  | [14] province 7 - rural | [14] sudurpashchim province - rural

NOTE: The labels of v023 differ across the survey rounds.
VERD: Therefore we cannot use v023 in the ID variable preparation.

Alternatively, we could use the ecological region variable (secoreg) in the ID variable. We will check this in future.

Checking the Birth History variables before harmonization

The birth history variables are central to the objective of this study. We therefore scrutinize all of them before using them to prepare harmonized variables for the pooled dataset.

# We check the birth history vars in all npbr datasets.
# First we create a data dictionary in nested tibble.
npbr1_pre_tmp1 <- npbr1_pre_tmp0 |>
  mutate(lookfor_bhvars = map(
    npbr_data,
    \(df) {
      df |> 
        select(bidx, matches("^b[0-9]+")) |> 
        lookfor(details = "full") |> 
        select(-c(levels:n_na)) |> 
        # For correctly viewing the range column in data dictionary
        convert_list_columns_to_character()
    }
  ))
npbr1_pre_tmp1
# Now we unnest the tibble and refine the pooled data dictionary
npbr1_pre_tmp2 <- npbr1_pre_tmp1 |> 
  select(-c(unf, npbr_data, n_births)) |> 
  unnest(cols = c(lookfor_bhvars)) |> 
  arrange(pos)

# Convert the tibble to flextable for easy viewing
npbr1_pre_tmp2 |> 
  qflextable() |> 
  align(align = "left", part = "all") |> 
  autofit()
Table 8: Data dictionary of birth history variables across the npbr rounds

ctr_name | svy_year | pos | variable | label | col_type | missing | unique_values | range
Nepal | 1996 | 1 | bidx | birth column number | dbl | 0 | 16 | 1 - 16
Nepal | 2001 | 1 | bidx | birth column number | dbl | 0 | 14 | 1 - 14
Nepal | 2006 | 1 | bidx | birth column number | dbl | 0 | 16 | 1 - 16
Nepal | 2011 | 1 | bidx | birth column number | dbl | 0 | 14 | 1 - 14
Nepal | 2016 | 1 | bidx | birth column number | dbl | 0 | 15 | 1 - 15
Nepal | 2022 | 1 | bidx | birth column number | dbl | 0 | 12 | 1 - 12
Nepal | 1996 | 2 | b0 | child is twin | dbl+lbl | 0 | 4 | 0 - 3
Nepal | 2001 | 2 | b0 | child is twin | dbl+lbl | 0 | 4 | 0 - 3
Nepal | 2006 | 2 | b0 | child is twin | dbl+lbl | 0 | 4 | 0 - 3
Nepal | 2011 | 2 | b0 | child is twin | dbl+lbl | 0 | 3 | 0 - 2
Nepal | 2016 | 2 | b0 | child is twin | dbl+lbl | 0 | 4 | 0 - 3
Nepal | 2022 | 2 | b0 | child is twin | dbl+lbl | 0 | 4 | 0 - 3
Nepal | 1996 | 3 | b1 | month of birth | dbl | 0 | 12 | 1 - 12
Nepal | 2001 | 3 | b1 | month of birth | dbl | 0 | 12 | 1 - 12
Nepal | 2006 | 3 | b1 | month of birth | dbl | 0 | 12 | 1 - 12
Nepal | 2011 | 3 | b1 | month of birth | dbl+lbl | 0 | 12 | 1 - 12
Nepal | 2016 | 3 | b1 | month of birth | dbl | 0 | 12 | 1 - 12
Nepal | 2022 | 3 | b1 | month of birth | dbl | 0 | 12 | 1 - 12
Nepal | 1996 | 4 | b2 | year of birth | dbl | 0 | 38 | 16 - 53
Nepal | 2001 | 4 | b2 | year of birth | dbl | 0 | 36 | 2023 - 2058
Nepal | 2006 | 4 | b2 | year of birth | dbl | 0 | 38 | 2026 - 2063
Nepal | 2011 | 4 | b2 | year of birth | dbl | 0 | 38 | 2030 - 2068
Nepal | 2016 | 4 | b2 | year of birth | dbl | 0 | 37 | 2036 - 2073
Nepal | 2022 | 4 | b2 | year of birth | dbl | 0 | 38 | 2042 - 2079
Nepal | 1996 | 5 | b3 | date of birth (cmc) | dbl | 0 | 424 | 198 - 638
Nepal | 2001 | 5 | b3 | date of birth (cmc) | dbl | 0 | 410 | 1479 - 1898
Nepal | 2006 | 5 | b3 | date of birth (cmc) | dbl | 0 | 415 | 1523 - 1960
Nepal | 2011 | 5 | b3 | date of birth (cmc) | dbl | 0 | 413 | 1561 - 2018
Nepal | 2016 | 5 | b3 | date of birth (cmc) | dbl | 0 | 417 | 1637 - 2085
Nepal | 2022 | 5 | b3 | date of birth (cmc) | dbl | 0 | 413 | 1710 - 2150
Nepal | 1996 | 6 | b4 | sex of child | dbl+lbl | 0 | 2 | 1 - 2
Nepal | 2001 | 6 | b4 | sex of child | dbl+lbl | 0 | 2 | 1 - 2
Nepal | 2006 | 6 | b4 | sex of child | dbl+lbl | 0 | 2 | 1 - 2
Nepal | 2011 | 6 | b4 | sex of child | dbl+lbl | 0 | 2 | 1 - 2
Nepal | 2016 | 6 | b4 | sex of child | dbl+lbl | 0 | 2 | 1 - 2
Nepal | 2022 | 6 | b4 | sex of child | dbl+lbl | 0 | 2 | 1 - 2
Nepal | 1996 | 7 | b5 | child is alive | dbl+lbl | 0 | 2 | 0 - 1
Nepal | 2001 | 7 | b5 | child is alive | dbl+lbl | 0 | 2 | 0 - 1
Nepal | 2006 | 7 | b5 | child is alive | dbl+lbl | 0 | 2 | 0 - 1
Nepal | 2011 | 7 | b5 | child is alive | dbl+lbl | 0 | 2 | 0 - 1
Nepal | 2016 | 7 | b5 | child is alive | dbl+lbl | 0 | 2 | 0 - 1
Nepal | 2022 | 7 | b5 | child is alive | dbl+lbl | 0 | 2 | 0 - 1
Nepal | 1996 | 8 | b6 | age at death | dbl+lbl | 23610 | 85 | 100 - 330
Nepal | 2001 | 8 | b6 | age at death | dbl+lbl | 24407 | 81 | 100 - 326
Nepal | 2006 | 8 | b6 | age at death | dbl+lbl | 23028 | 108 | 100 - 324
Nepal | 2011 | 8 | b6 | age at death | dbl+lbl | 23920 | 86 | 100 - 330
Nepal | 2016 | 8 | b6 | age at death | dbl+lbl | 23906 | 78 | 100 - 328
Nepal | 2022 | 8 | b6 | age at death | dbl+lbl | 25749 | 80 | 100 - 328
Nepal | 1996 | 9 | b7 | age at death (months-imputed) | dbl | 23608 | 61 | 0 - 360
Nepal | 2001 | 9 | b7 | age at death (months-imputed) | dbl | 24407 | 54 | 0 - 312
Nepal | 2006 | 9 | b7 | age at death (months-imputed) | dbl | 23028 | 80 | 0 - 288
Nepal | 2011 | 9 | b7 | age at death (months, imputed) | dbl | 23920 | 54 | 0 - 360
Nepal | 2016 | 9 | b7 | age at death (months, imputed) | dbl | 23906 | 48 | 0 - 336
Nepal | 2022 | 9 | b7 | age at death (months, imputed) | dbl | 25749 | 50 | 0 - 336
Nepal | 1996 | 10 | b8 | current age of child | dbl | 5548 | 38 | 0 - 36
Nepal | 2001 | 10 | b8 | current age of child | dbl | 4548 | 36 | 0 - 34
Nepal | 2006 | 10 | b8 | current age of child | dbl | 3366 | 37 | 0 - 35
Nepal | 2011 | 10 | b8 | current age of child | dbl | 2695 | 36 | 0 - 34
Nepal | 2016 | 10 | b8 | current age of child | dbl | 2122 | 38 | 0 - 36
Nepal | 2022 | 10 | b8 | current age of child | dbl | 1864 | 37 | 0 - 35
Nepal | 1996 | 11 | b9 | who child lives with | dbl+lbl | 5548 | 3 | 0 - 4
Nepal | 2001 | 11 | b9 | child lives with whom | dbl+lbl | 4548 | 3 | 0 - 4
Nepal | 2006 | 11 | b9 | child lives with whom | dbl+lbl | 3366 | 3 | 0 - 4
Nepal | 2011 | 11 | b9 | child lives with whom | dbl+lbl | 2695 | 3 | 0 - 4
Nepal | 2016 | 11 | b9 | child lives with whom | dbl+lbl | 2122 | 3 | 0 - 4
Nepal | 2022 | 11 | b9 | child lives with whom | dbl+lbl | 1864 | 3 | 0 - 4
Nepal | 1996 | 12 | b10 | completeness of information | dbl+lbl | 0 | 4 | 1 - 5
Nepal | 2001 | 12 | b10 | completeness of information | dbl+lbl | 0 | 5 | 1 - 8
Nepal | 2006 | 12 | b10 | completeness of information | dbl+lbl | 0 | 6 | 1 - 8
Nepal | 2011 | 12 | b10 | completeness of information | dbl+lbl | 0 | 4 | 1 - 8
Nepal | 2016 | 12 | b10 | completeness of information | dbl+lbl | 0 | 2 | 0 - 3
Nepal | 2022 | 12 | b10 | completeness of information | dbl+lbl | 0 | 5 | 0 - 6
Nepal | 1996 | 13 | b11 | preceding birth interval | dbl | 7515 | 152 | 6 - 197
Nepal | 2001 | 13 | b11 | preceding birth interval | dbl | 7805 | 154 | 9 - 235
Nepal | 2006 | 13 | b11 | preceding birth interval | dbl | 7825 | 151 | 9 - 319
Nepal | 2011 | 13 | b11 | preceding birth interval (months) | dbl | 8849 | 163 | 9 - 293
Nepal | 2016 | 13 | b11 | preceding birth interval (months) | dbl | 9269 | 170 | 6 - 221
Nepal | 2022 | 13 | b11 | preceding birth interval (months) | dbl | 10784 | 185 | 6 - 249
Nepal | 1996 | 14 | b12 | succeeding birth interval | dbl | 7534 | 152 | 6 - 197
Nepal | 2001 | 14 | b12 | succeeding birth interval | dbl | 7835 | 154 | 9 - 235
Nepal | 2006 | 14 | b12 | succeeding birth interval | dbl | 7868 | 151 | 9 - 319
Nepal | 2011 | 14 | b12 | succeeding birth interval (months) | dbl | 8876 | 163 | 9 - 293
Nepal | 2016 | 14 | b12 | succeeding birth interval (months) | dbl | 9320 | 170 | 6 - 221
Nepal | 2022 | 14 | b12 | succeeding birth interval (months) | dbl | 10837 | 185 | 6 - 249
Nepal | 1996 | 15 | b13 | flag for age at death | dbl+lbl | 23608 | 8 | 0 - 8
Nepal | 2001 | 15 | b13 | flag for age at death | dbl+lbl | 24407 | 4 | 0 - 7
Nepal | 2006 | 15 | b13 | flag for age at death | dbl+lbl | 23028 | 5 | 0 - 9
Nepal | 2011 | 15 | b13 | flag for age at death | dbl+lbl | 23920 | 4 | 0 - 6
Nepal | 2016 | 15 | b13 | flag for age at death | dbl+lbl | 23906 | 2 | 0 - 0
Nepal | 2022 | 15 | b13 | flag for age at death | dbl+lbl | 25749 | 2 | 0 - 0
Nepal | 1996 | 16 | b14 | birth interval >= 4 years | dbl+lbl | 7037 | 3 | 0 - 1
Nepal | 2001 | 16 | b15 | live birth between births -na | dbl | 28955 | 1 | 
Nepal | 2006 | 16 | b15 | na-live birth between births | dbl+lbl | 26394 | 1 | 
Nepal | 2011 | 16 | b15 | live birth between births | dbl+lbl | 8168 | 2 | 0 - 0
Nepal | 2016 | 16 | b15 | live birth between births | dbl+lbl | 8430 | 2 | 0 - 0
Nepal | 2022 | 16 | b15 | live birth between births | dbl+lbl | 0 | 2 | 0 - 1
Nepal | 1996 | 17 | b15 | live birth between births | dbl+lbl | 25954 | 3 | 0 - 1
Nepal | 2001 | 17 | b16 | child's line number in household | dbl+lbl | 4548 | 30 | 0 - 28
Nepal | 2006 | 17 | b16 | child's line number in household | dbl+lbl | 3366 | 30 | 0 - 30
Nepal | 2011 | 17 | b16 | child's line number in household | dbl+lbl | 2695 | 31 | 0 - 31
Nepal | 2016 | 17 | b16 | child's line number in household | dbl+lbl | 2122 | 33 | 0 - 37
Nepal | 2022 | 17 | b16 | child's line number in household | dbl+lbl | 1864 | 23 | 0 - 23
Nepal | 2001 | 18 | b0_92 | child is twin | dbl+lbl | 0 | 4 | 0 - 3
Nepal | 2006 | 18 | b0_x | child is twin | dbl+lbl | 0 | 4 | 0 - 3
Nepal | 2016 | 18 | b17 | day of birth | dbl | 0 | 32 | 1 - 32
Nepal | 2022 | 18 | b17 | day of birth | dbl | 0 | 32 | 1 - 32
Nepal | 2001 | 19 | b1_92 | month of birth/ending of pregnancy | dbl | 0 | 12 | 1 - 12
Nepal | 2006 | 19 | b1_x | month of birth/ending of pregnancy | dbl | 0 | 12 | 1 - 12
Nepal | 2016 | 19 | b18 | century day code of birth (cdc) | dbl | 0 | 9472 | 13291 - 26925
Nepal | 2022 | 19 | b18 | century day code of birth (cdc) | dbl | 0 | 9720 | 15525 - 28916
Nepal | 2001 | 20 | b2_92 | year of birth/end of pregnancy | dbl | 0 | 36 | 2023 - 2058
Nepal | 2006 | 20 | b2_x | year of birth/end of pregnancy | dbl | 0 | 38 | 2026 - 2063
Nepal | 2016 | 20 | b19 | current age of child in months (months since birth for dead children) | dbl | 0 | 415 | 0 - 442
Nepal | 2022 | 20 | b19 | current age of child in months (months since birth for dead children) | dbl | 0 | 410 | 0 - 434
Nepal | 2001 | 21 | b3_92 | date of birth/end of pregnancy (cmc) | dbl | 0 | 410 | 1479 - 1898
Nepal | 2006 | 21 | b3_x | date of birth/end of pregnancy (cmc) | dbl | 0 | 415 | 1523 - 1960
Nepal | 2016 | 21 | b20 | duration of pregnancy | dbl | 20476 | 7 | 6 - 11
Nepal | 2022 | 21 | b20 | duration of pregnancy in months | dbl | 0 | 6 | 5 - 10
Nepal | 2001 | 22 | b4_92 | sex of child | dbl+lbl | 0 | 2 | 1 - 2
Nepal | 2006 | 22 | b4_x | sex of child | dbl+lbl | 0 | 2 | 1 - 2
Nepal | 2022 | 22 | b21 | duration of pregnancy | dbl | 0 | 10 | 131 - 210
Nepal | 2001 | 23 | b5_92 | child is alive | dbl+lbl | 0 | 2 | 0 - 1
Nepal | 2006 | 23 | b5_x | child is alive | dbl+lbl | 0 | 2 | 0 - 1
Nepal | 2001 | 24 | b6_92 | age at death | dbl+lbl | 24407 | 81 | 100 - 326
Nepal | 2006 | 24 | b6_x | age at death | dbl+lbl | 23028 | 108 | 100 - 324
Nepal | 2001 | 25 | b7_92 | age at death (months-imputed) | dbl | 24407 | 54 | 0 - 312
Nepal | 2006 | 25 | b7_x | age at death (months-imputed) | dbl | 23028 | 80 | 0 - 288
Nepal | 2001 | 26 | b8_92 | current age of child | dbl | 4548 | 36 | 0 - 34
Nepal | 2006 | 26 | b8_x | current age of child | dbl | 3366 | 37 | 0 - 35
Nepal | 2001 | 27 | b9_92 | child lives with whom | dbl+lbl | 4548 | 3 | 0 - 4
Nepal | 2006 | 27 | b9_x | child lives with whom | dbl+lbl | 3366 | 3 | 0 - 4
Nepal | 2001 | 28 | b10_92 | completeness of information | dbl+lbl | 0 | 5 | 1 - 8
Nepal | 2006 | 28 | b10_x | completeness of information | dbl+lbl | 0 | 6 | 1 - 8
Nepal | 2001 | 29 | b11_92 | preceding birth interval | dbl | 7315 | 149 | 9 - 235
Nepal | 2006 | 29 | b11_x | preceding birth interval | dbl | 7315 | 149 | 9 - 260
Nepal | 2001 | 30 | b12_92 | succeeding birth interval | dbl | 7454 | 157 | 2 - 235
Nepal | 2006 | 30 | b12_x | succeeding birth interval | dbl | 7326 | 157 | 3 - 260

Nepal

2001

31

b13_92

flag for age at death

dbl+lbl

24407

4

0 - 7

Nepal

2006

31

b13_x

flag for age at death

dbl+lbl

23028

5

0 - 9

Nepal

2001

32

b16_92

child's line number in household

dbl+lbl

4548

30

0 - 28

Nepal

2006

32

b16_x

child's line number in household

dbl+lbl

3366

30

0 - 30

The table above gives an overall snapshot of the birth history variables. The variables b1-b13 are common to all six npbr datasets. Notably, npbr 2001 and 2006 have some extra variables (suffixed _92 in 2001 and _x in 2006) that are not available in other rounds. Next, we look in more detail at the labelled variables that are common across the npbr rounds, to see whether the value labels of the common birth history variables are consistent across the datasets.

b0 - child is twin

We check the value labels of the b0 variable, which denotes whether the child is a twin. First we create a nested tibble of b0’s value labels.

# Create the data dictionary for b0 in nested tibble
npbr1_pre_tmp1 <- npbr1_pre_tmp0 |> 
  mutate(lookfor_b0 = map(
    npbr_data,
    \(df) {
      df |> 
        select(b0) |> 
        look_for() |> 
        lookfor_to_long_format() |> 
        select(value_labels)
    }
  ))
npbr1_pre_tmp1
# Now we unnest the tibble and refine the pooled data dictionary
npbr1_pre_tmp2 <- npbr1_pre_tmp1 |> 
  # First we select the required cols and unnest()
  select(-c(unf, npbr_data, n_births)) |> 
  unnest(cols = c(lookfor_b0)) |> 
  # Next we make the num of value labels same across each round
  mutate(label_num = parse_number(value_labels)) |> 
  complete(ctr_name, svy_year, label_num) |>
  # Next we create col of value labels for each survey round
  pivot_wider(
    names_from = svy_year, 
    values_from = value_labels,
    names_prefix = "npbr_"
  ) |>
  # Show the variable name in a col
  mutate(var_name = "b0", .before = 2)

# Convert the tibble to flextable for easy viewing
npbr1_pre_tmp2 |> 
  qflextable() |> 
  align(align = "left", part = "all") |> 
  autofit()
Table 9: Data dictionary of b0 across the npbr rounds

| ctr_name | var_name | label_num | npbr_1996 | npbr_2001 | npbr_2006 | npbr_2011 | npbr_2016 | npbr_2022 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Nepal | b0 | 0 | [0] single birth | [0] single birth | [0] single birth | [0] single birth | [0] single birth | [0] single birth |
| Nepal | b0 | 1 | [1] 1st of multiple | [1] 1st of multiple | [1] 1st of multiple | [1] 1st of multiple | [1] 1st of multiple | [1] 1st of multiple |
| Nepal | b0 | 2 | [2] 2nd of multiple | [2] 2nd of multiple | [2] 2nd of multiple | [2] 2nd of multiple | [2] 2nd of multiple | [2] 2nd of multiple |
| Nepal | b0 | 3 | [3] 3rd of multiple | [3] 3rd of multiple | [3] 3rd of multiple | [3] 3rd of multiple | [3] 3rd of multiple | [3] 3rd of multiple |
| Nepal | b0 | 4 | [4] 4th of multiple | [4] 4th of multiple | [4] 4th of multiple | [4] 4th of multiple | [4] 4th of multiple | [4] 4th of multiple |
| Nepal | b0 | 5 | [5] 5th of multiple | [5] 5th of multiple | [5] 5th of multiple | [5] 5th of multiple | [5] 5th of multiple | [5] 5th of multiple |

The table above shows the value labels of b0. They are the same across all the npbr datasets.

b1 - month of birth

We see that the b1 variable has value labels only for npbr 2011. Therefore, we check the value labels of the variable during this round.

# Create the data dictionary for b1 in npbr 2011
npbr1_pre_tmp0$npbr_data$npbr_2011 |> 
  select(b1) |> 
  look_for(details = "full") |> 
  lookfor_to_long_format() |> 
  convert_list_columns_to_character() |> 
  select(-c(pos, levels, class:n_na)) |> 
  qflextable() |> 
  autofit()
Table 10: Data dictionary of b1 in npbr 2011

| variable | label | col_type | missing | value_labels | unique_values | range |
| --- | --- | --- | --- | --- | --- | --- |
| b1 | month of birth | dbl+lbl | 0 | [1] baisakh | 12 | 1 - 12 |
| b1 | month of birth | dbl+lbl | 0 | [2] jestha | 12 | 1 - 12 |
| b1 | month of birth | dbl+lbl | 0 | [3] ashad | 12 | 1 - 12 |
| b1 | month of birth | dbl+lbl | 0 | [4] srawan | 12 | 1 - 12 |
| b1 | month of birth | dbl+lbl | 0 | [5] bhadra | 12 | 1 - 12 |
| b1 | month of birth | dbl+lbl | 0 | [6] aswin | 12 | 1 - 12 |
| b1 | month of birth | dbl+lbl | 0 | [7] kartik | 12 | 1 - 12 |
| b1 | month of birth | dbl+lbl | 0 | [8] mangsir | 12 | 1 - 12 |
| b1 | month of birth | dbl+lbl | 0 | [9] poush | 12 | 1 - 12 |
| b1 | month of birth | dbl+lbl | 0 | [10] magh | 12 | 1 - 12 |
| b1 | month of birth | dbl+lbl | 0 | [11] falgun | 12 | 1 - 12 |
| b1 | month of birth | dbl+lbl | 0 | [12] chaitra | 12 | 1 - 12 |

Note that the birth months are months of the Hindu calendar. Their boundaries do not coincide with the months of the English (Gregorian) calendar, which creates a problem later when we prepare the season-of-birth variable.

SOL: We can re-create the birth month variable from b3 - child’s dob (in cmc). Since a CMC is defined as (year - 1900) * 12 + month, the birth month is ((b3 - 1) mod 12) + 1 rather than the plain remainder of b3 divided by 12.
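A minimal sketch of this recovery, assuming b3 follows the standard DHS century month code definition, CMC = (year - 1900) * 12 + month. The helper names `cmc_to_month()` and `cmc_to_year()` are hypothetical and the input values are toy numbers, not npbr data. The month comes out in whatever calendar the CMC was built in, so this step alone does not turn a Nepali-calendar date into a Gregorian one.

```r
# Hypothetical helpers; both assume CMC = (year - 1900) * 12 + month.
# Shift by one so that the twelfth month maps to 12, not 0.
cmc_to_month <- function(cmc) ((cmc - 1) %% 12) + 1
cmc_to_year  <- function(cmc) 1900 + (cmc - 1) %/% 12

# Toy CMC values (not taken from the npbr data)
cmc <- c(1201, 1212, 1213)

data.frame(
  cmc   = cmc,
  month = cmc_to_month(cmc),
  year  = cmc_to_year(cmc)
)
```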

b4 - sex of child

We check the value labels of the b4 variable, which gives the sex of the child. First we create a nested tibble of b4’s value labels.

# Create the data dictionary for b4 in nested tibble
npbr1_pre_tmp1 <- npbr1_pre_tmp0 |> 
  mutate(lookfor_b4 = map(
    npbr_data,
    \(df) {
      df |> 
        select(b4) |> 
        look_for() |> 
        lookfor_to_long_format() |> 
        select(value_labels)
    }
  ))
npbr1_pre_tmp1
# Now we unnest the tibble and refine the pooled data dictionary
npbr1_pre_tmp2 <- npbr1_pre_tmp1 |> 
  # First we select the required cols and unnest()
  select(-c(unf, npbr_data, n_births)) |> 
  unnest(cols = c(lookfor_b4)) |> 
  # Next we make the num of value labels same across each round
  mutate(label_num = parse_number(value_labels)) |> 
  complete(ctr_name, svy_year, label_num) |>
  # Next we create col of value labels for each survey round
  pivot_wider(
    names_from = svy_year, 
    values_from = value_labels,
    names_prefix = "npbr_"
  ) |>
  # Show the variable name in a col
  mutate(var_name = "b4", .before = 2)

# Convert the tibble to flextable for easy viewing
npbr1_pre_tmp2 |> 
  qflextable() |> 
  align(align = "left", part = "all") |> 
  autofit()
Table 11: Data dictionary of b4 across the npbr rounds

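| ctr_name | var_name | label_num | npbr_1996 | npbr_2001 | npbr_2006 | npbr_2011 | npbr_2016 | npbr_2022 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Nepal | b4 | 1 | [1] male | [1] male | [1] male | [1] male | [1] male | [1] male |
| Nepal | b4 | 2 | [2] female | [2] female | [2] female | [2] female | [2] female | [2] female |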

The table above shows the value labels of b4. They are the same across all the npbr datasets.

b5 - child is alive

We check the value labels of the b5 variable, which gives the survival status of the child. First we create a nested tibble of b5’s value labels.

# Create the data dictionary for b5 in nested tibble
npbr1_pre_tmp1 <- npbr1_pre_tmp0 |> 
  mutate(lookfor_b5 = map(
    npbr_data,
    \(df) {
      df |> 
        select(b5) |> 
        look_for() |> 
        lookfor_to_long_format() |> 
        select(value_labels)
    }
  ))
npbr1_pre_tmp1
# Now we unnest the tibble and refine the pooled data dictionary
npbr1_pre_tmp2 <- npbr1_pre_tmp1 |> 
  # First we select the required cols and unnest()
  select(-c(unf, npbr_data, n_births)) |> 
  unnest(cols = c(lookfor_b5)) |> 
  # Next we make the num of value labels same across each round
  mutate(label_num = parse_number(value_labels)) |> 
  complete(ctr_name, svy_year, label_num) |>
  # Next we create col of value labels for each survey round
  pivot_wider(
    names_from = svy_year, 
    values_from = value_labels,
    names_prefix = "npbr_"
  ) |>
  # Show the variable name in a col
  mutate(var_name = "b5", .before = 2)

# Convert the tibble to flextable for easy viewing
npbr1_pre_tmp2 |> 
  qflextable() |> 
  align(align = "left", part = "all") |> 
  autofit()
Table 12: Data dictionary of b5 across the npbr rounds

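| ctr_name | var_name | label_num | npbr_1996 | npbr_2001 | npbr_2006 | npbr_2011 | npbr_2016 | npbr_2022 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Nepal | b5 | 0 | [0] no | [0] no | [0] no | [0] no | [0] no | [0] no |
| Nepal | b5 | 1 | [1] yes | [1] yes | [1] yes | [1] yes | [1] yes | [1] yes |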

The above table shows that the value labels for the survival status of the child are the same across all the npbr datasets.

b6 - age at death

We check the value labels of the b6 variable, which shows the age at death of children. Note that this variable has many missing values in every npbr round because it is recorded only for children who have died. First we create a nested tibble of b6’s value labels.

# Create the data dictionary for b6 in nested tibble
npbr1_pre_tmp1 <- npbr1_pre_tmp0 |> 
  mutate(lookfor_b6 = map(
    npbr_data,
    \(df) {
      df |> 
        select(b6) |> 
        look_for() |> 
        lookfor_to_long_format() |> 
        select(value_labels)
    }
  ))
npbr1_pre_tmp1
# Now we unnest the tibble and refine the pooled data dictionary
npbr1_pre_tmp2 <- npbr1_pre_tmp1 |> 
  # First we select the required cols and unnest()
  select(-c(unf, npbr_data, n_births)) |> 
  unnest(cols = c(lookfor_b6)) |> 
  # Next we make the num of value labels same across each round
  mutate(label_num = parse_number(value_labels)) |> 
  complete(ctr_name, svy_year, label_num) |>
  # Next we create col of value labels for each survey round
  pivot_wider(
    names_from = svy_year, 
    values_from = value_labels,
    names_prefix = "npbr_"
  ) |>
  # Show the variable name in a col
  mutate(var_name = "b6", .before = 2)

# Convert the tibble to flextable for easy viewing
npbr1_pre_tmp2 |> 
  qflextable() |> 
  align(align = "left", part = "all") |> 
  autofit()
Table 13: Data dictionary of b6 across the npbr rounds

ctr_name

var_name

label_num

npbr_1996

npbr_2001

npbr_2006

npbr_2011

npbr_2016

npbr_2022

Nepal

b6

100

[100] died on day of birth

[100] died on day of birth

[100] died on day of birth

Nepal

b6

101

[101] days: 1

[101] days: 1

[101] days: 1

Nepal

b6

199

[199] days: number missing

[199] days: number missing

[199] days: number missing

Nepal

b6

201

[201] months: 1

[201] months: 1

[201] months: 1

Nepal

b6

299

[299] months: number missing

[299] months: number missing

[299] months: number missing

Nepal

b6

301

[301] years: 1

[301] years: 1

[301] years: 1

Nepal

b6

399

[399] years: number missing

[399] years: number missing

[399] years: number missing

Nepal

b6

997

[997] inconsistent

[997] inconsistent

[997] inconsistent

[997] inconsistent

[997] inconsistent

[997] inconsistent

Nepal

b6

998

[998] don't know

[998] don't know

[998] don't know

[998] don't know

[998] don't know

[998] don't know

The above table shows that the value labels of age at death fall into two groups: npbr 1996, 2001 and 2006 share one set of labels, and npbr 2011, 2016 and 2022 share another. Codes 997 (inconsistent) and 998 (don't know) are labelled in all six rounds.
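Whichever labelling a round uses, the codes in Table 13 suggest that b6 packs a unit and a count into one value: the hundreds digit gives the unit (1 = days, 2 = months, 3 = years) and the last two digits give the count, with 199, 299 and 399 marking a missing count. The sketch below unpacks the variable under that reading. `decode_b6()` is a hypothetical helper, the input is a toy vector, and b6 is assumed to have been converted to a plain numeric vector first.

```r
# Hypothetical helper: split b6 into a unit and a count
decode_b6 <- function(b6) {
  # NA (no death recorded), 997 (inconsistent) and 998 (don't know)
  special <- is.na(b6) | b6 >= 997
  unit_code <- ifelse(special, NA_real_, b6 %/% 100)
  count <- ifelse(special, NA_real_, b6 %% 100)
  # 199, 299 and 399 mean the unit is known but the number is missing
  count[!is.na(count) & count == 99] <- NA_real_
  data.frame(
    b6 = b6,
    unit = c("days", "months", "years")[unit_code],
    count = count,
    stringsAsFactors = FALSE
  )
}

# Toy values covering each kind of code shown in Table 13
decode_b6(c(100, 101, 201, 299, 301, 998, NA))
```

Keeping the unit and the count in separate columns makes it straightforward to check the result against b7 (age at death in months, imputed) before pooling.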

b9 - child lives with whom

We check the value labels of the b9 variable, which records who the child lives with. First we create a nested tibble of b9’s value labels.

# Create the data dictionary for b9 in nested tibble
npbr1_pre_tmp1 <- npbr1_pre_tmp0 |> 
  mutate(lookfor_b9 = map(
    npbr_data,
    \(df) {
      df |> 
        select(b9) |> 
        look_for() |> 
        lookfor_to_long_format() |> 
        select(value_labels)
    }
  ))
npbr1_pre_tmp1
# Now we unnest the tibble and refine the pooled data dictionary
npbr1_pre_tmp2 <- npbr1_pre_tmp1 |> 
  # First we select the required cols and unnest()
  select(-c(unf, npbr_data, n_births)) |> 
  unnest(cols = c(lookfor_b9)) |> 
  # Next we make the num of value labels same across each round
  mutate(label_num = parse_number(value_labels)) |> 
  complete(ctr_name, svy_year, label_num) |>
  # Next we create col of value labels for each survey round
  pivot_wider(
    names_from = svy_year, 
    values_from = value_labels,
    names_prefix = "npbr_"
  ) |>
  # Show the variable name in a col
  mutate(var_name = "b9", .before = 2)

# Convert the tibble to flextable for easy viewing
npbr1_pre_tmp2 |> 
  qflextable() |> 
  align(align = "left", part = "all") |> 
  autofit()
Table 14: Data dictionary of b9 across the npbr rounds

| ctr_name | var_name | label_num | npbr_1996 | npbr_2001 | npbr_2006 | npbr_2011 | npbr_2016 | npbr_2022 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Nepal | b9 | 0 | [0] respondent | [0] respondent | [0] respondent | [0] respondent | [0] respondent | [0] respondent |
| Nepal | b9 | 1 | [1] father | [1] father | [1] father | [1] father | [1] father | [1] father |
| Nepal | b9 | 2 | [2] other relative | [2] other relative | [2] other relative | [2] other relative | [2] other relative | [2] other relative |
| Nepal | b9 | 3 | [3] someone else | [3] someone else | [3] someone else | [3] someone else | [3] someone else | [3] someone else |
| Nepal | b9 | 4 | [4] lives elsewhere | [4] lives elsewhere | [4] lives elsewhere | [4] lives elsewhere | [4] lives elsewhere | [4] lives elsewhere |

We can see in the above table that the value labels of b9 are the same across all the npbr datasets.

b10 - completeness of information

We check the value labels of the b10 variable, which gives the completeness of the birth history information. First we create a nested tibble of b10’s value labels.

# Create the data dictionary for b10 in nested tibble
npbr1_pre_tmp1 <- npbr1_pre_tmp0 |> 
  mutate(lookfor_b10 = map(
    npbr_data,
    \(df) {
      df |> 
        select(b10) |> 
        look_for() |> 
        lookfor_to_long_format() |> 
        select(value_labels)
    }
  ))
npbr1_pre_tmp1
# Now we unnest the tibble and refine the pooled data dictionary
npbr1_pre_tmp2 <- npbr1_pre_tmp1 |> 
  # First we select the required cols and unnest()
  select(-c(unf, npbr_data, n_births)) |> 
  unnest(cols = c(lookfor_b10)) |> 
  # Next we make the num of value labels same across each round
  mutate(label_num = parse_number(value_labels)) |> 
  complete(ctr_name, svy_year, label_num) |>
  # Next we create col of value labels for each survey round
  pivot_wider(
    names_from = svy_year, 
    values_from = value_labels,
    names_prefix = "npbr_"
  ) |>
  # Show the variable name in a col
  mutate(var_name = "b10", .before = 2) 

# Convert the tibble to flextable for easy viewing
npbr1_pre_tmp2 |> 
  qflextable() |>
  align(align = "left", part = "all") |> 
  autofit()
Table 15: Data dictionary of b10 across the npbr rounds

ctr_name

var_name

label_num

npbr_1996

npbr_2001

npbr_2006

npbr_2011

npbr_2016

npbr_2022

Nepal

b10

0

[0] month, year and day

[0] month, year and day

Nepal

b10

1

[1] month and year

[1] month and year

[1] month and year

[1] month and year - information complete

[1] month and year - information complete

[1] month and year - information complete

Nepal

b10

2

[2] month and age -y imp

[2] month and age -y imp

[2] month and age -y imp

[2] month and age - year imputed

[2] month and age - year imputed

[2] month and age - year imputed

Nepal

b10

3

[3] year and age - m imp

[3] year and age - m imp

[3] year and age - m imp

[3] year and age - month imputed

[3] year and age - month imputed

[3] year and age - month imputed

Nepal

b10

4

[4] y & age - y ignored

[4] y & age - y ignored

[4] y & age - y ignored

[4] year and age - year ignored

[4] year and age - year ignored

[4] year and age - year ignored

Nepal

b10

5

[5] year - a, m imp

[5] year - a, m imp

[5] year - a, m imp

[5] year - age/month imputed

[5] year - age/month imputed

[5] year - age/month imputed

Nepal

b10

6

[6] age - y, m imp

[6] age - y, m imp

[6] age - y, m imp

[6] age - year/month imputed

[6] age - year/month imputed

[6] age - year/month imputed

Nepal

b10

7

[7] month - a, y imp

[7] month - a, y imp

[7] month - a, y imp

[7] month - age/year imputed

[7] month - age/year imputed

[7] month - age/year imputed

Nepal

b10

8

[8] none - all imp

[8] none - all imp

[8] none - all imp

[8] none - all imputed

[8] none - all imputed

[8] none - all imputed

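We can see in the above table that the label texts of b10 fall into two groups: npbr 1996, 2001 and 2006 use abbreviated wording, while npbr 2011, 2016 and 2022 use the spelled-out wording. Codes 1 to 8 carry the same meaning in every round. Code 0 (month, year and day) is labelled in only two rounds, presumably npbr 2016 and 2022, the rounds that also record the day of birth (b17).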
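A comparison like this can also be made programmatically, which separates rounds whose codes differ from rounds where only the wording differs. The sketch below is a minimal base R illustration on toy vectors. It assumes value labels sit in the `labels` attribute, which is where haven keeps them for imported files; `get_labels()`, the `round_*` names and the toy vectors are hypothetical, and only a few b10-style codes are shown.

```r
# Hypothetical helper: read value labels from the "labels" attribute
get_labels <- function(x) attr(x, "labels", exact = TRUE)

# Toy stand-ins for b10 in three rounds (label texts abridged from Table 15)
b10_rounds <- list(
  round_a = structure(c(1, 2), labels = c(
    "month and year" = 1,
    "month and age -y imp" = 2
  )),
  round_b = structure(c(1, 2), labels = c(
    "month and year - information complete" = 1,
    "month and age - year imputed" = 2
  )),
  round_c = structure(c(0, 1, 2), labels = c(
    "month, year and day" = 0,
    "month and year - information complete" = 1,
    "month and age - year imputed" = 2
  ))
)

# Use the first round as the reference
ref <- get_labels(b10_rounds[[1]])

# Same set of codes as the reference round?
same_codes <- sapply(
  b10_rounds,
  \(x) setequal(unname(get_labels(x)), unname(ref))
)
# Same codes and same label text as the reference round?
same_text <- sapply(b10_rounds, \(x) identical(get_labels(x), ref))

data.frame(round = names(b10_rounds), same_codes, same_text, row.names = NULL)
```

A round that matches on codes but not on text only needs its labels rewritten, whereas a round that fails the code check needs a recode decision before pooling.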

Checking the Common independent variables before harmonization

Next we start documenting the common independent variables. First we check the data dictionary of the common independent variables, then we check them variable by variable.

# We check the common independent vars in all npbr datasets.
# First we create the data dictionary in nested tibble.
npbr1_pre_tmp1 <- npbr1_pre_tmp0 |>
  mutate(lookfor_comindvars = map(
    npbr_data,
    \(df) {
      df |> 
        # select the common independent variables
        select(v106, v011, v501, v701, v025, v151, v152) |> 
        lookfor(details = "full") |> 
        select(-c(levels:n_na)) |> 
        # For correctly viewing the range column in data dictionary
        convert_list_columns_to_character()
    }
  ))
npbr1_pre_tmp1
# Now we unnest the tibble and refine the pooled data dictionary 
npbr1_pre_tmp2 <- npbr1_pre_tmp1 |> 
  select(-c(unf, npbr_data, n_births)) |> 
  unnest(cols = c(lookfor_comindvars)) |> 
  arrange(pos)

# Convert the tibble to flextable for easy viewing
npbr1_pre_tmp2 |> 
  qflextable() |> 
  align(align = "left", part = "all") |> 
  autofit()
Table 16: Data dictionary of common independent variables across the npbr rounds

| ctr_name | svy_year | pos | variable | label | col_type | missing | unique_values | range |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Nepal | 1996 | 1 | v106 | highest educational level | dbl+lbl | 0 | 4 | 0 - 3 |
| Nepal | 2001 | 1 | v106 | highest educational level | dbl+lbl | 0 | 4 | 0 - 3 |
| Nepal | 2006 | 1 | v106 | highest educational level | dbl+lbl | 0 | 4 | 0 - 3 |
| Nepal | 2011 | 1 | v106 | highest educational level | dbl+lbl | 0 | 5 | 0 - 8 |
| Nepal | 2016 | 1 | v106 | highest educational level | dbl+lbl | 0 | 4 | 0 - 3 |
| Nepal | 2022 | 1 | v106 | highest educational level | dbl+lbl | 0 | 4 | 0 - 3 |
| Nepal | 1996 | 2 | v011 | date of birth (cmc) | dbl | 0 | 414 | 35 - 456 |
| Nepal | 2001 | 2 | v011 | date of birth (cmc) | dbl | 0 | 409 | 1296 - 1708 |
| Nepal | 2006 | 2 | v011 | date of birth (cmc) | dbl | 0 | 409 | 1356 - 1771 |
| Nepal | 2011 | 2 | v011 | date of birth (cmc) | dbl | 0 | 407 | 1415 - 1830 |
| Nepal | 2016 | 2 | v011 | date of birth (cmc) | dbl | 0 | 407 | 1480 - 1903 |
| Nepal | 2022 | 2 | v011 | date of birth (cmc) | dbl | 0 | 410 | 1546 - 1962 |
| Nepal | 1996 | 3 | v501 | current marital status | dbl+lbl | 0 | 4 | 1 - 5 |
| Nepal | 2001 | 3 | v501 | current marital status | dbl+lbl | 0 | 4 | 1 - 5 |
| Nepal | 2006 | 3 | v501 | current marital status | dbl+lbl | 0 | 4 | 1 - 5 |
| Nepal | 2011 | 3 | v501 | current marital status | dbl+lbl | 0 | 5 | 0 - 5 |
| Nepal | 2016 | 3 | v501 | current marital status | dbl+lbl | 0 | 6 | 0 - 5 |
| Nepal | 2022 | 3 | v501 | current marital status | dbl+lbl | 0 | 6 | 0 - 5 |
| Nepal | 1996 | 4 | v701 | partner's education level | dbl+lbl | 27 | 5 | 0 - 3 |
| Nepal | 2001 | 4 | v701 | partner's education level | dbl+lbl | 0 | 5 | 0 - 8 |
| Nepal | 2006 | 4 | v701 | partner's education level | dbl+lbl | 0 | 5 | 0 - 8 |
| Nepal | 2011 | 4 | v701 | husband/partner's education level | dbl+lbl | 3 | 6 | 0 - 8 |
| Nepal | 2016 | 4 | v701 | husband/partner's education level | dbl+lbl | 942 | 6 | 0 - 8 |
| Nepal | 2022 | 4 | v701 | husband/partner's education level | dbl+lbl | 1198 | 6 | 0 - 8 |
| Nepal | 1996 | 5 | v025 | type of place of residence | dbl+lbl | 0 | 2 | 1 - 2 |
| Nepal | 2001 | 5 | v025 | type of place of residence | dbl+lbl | 0 | 2 | 1 - 2 |
| Nepal | 2006 | 5 | v025 | type of place of residence | dbl+lbl | 0 | 2 | 1 - 2 |
| Nepal | 2011 | 5 | v025 | type of place of residence | dbl+lbl | 0 | 2 | 1 - 2 |
| Nepal | 2016 | 5 | v025 | type of place of residence | dbl+lbl | 0 | 2 | 1 - 2 |
| Nepal | 2022 | 5 | v025 | type of place of residence | dbl+lbl | 0 | 2 | 1 - 2 |
| Nepal | 1996 | 6 | v151 | sex of household head | dbl+lbl | 0 | 2 | 1 - 2 |
| Nepal | 2001 | 6 | v151 | sex of household head | dbl+lbl | 0 | 2 | 1 - 2 |
| Nepal | 2006 | 6 | v151 | sex of household head | dbl+lbl | 0 | 2 | 1 - 2 |
| Nepal | 2011 | 6 | v151 | sex of household head | dbl+lbl | 0 | 2 | 1 - 2 |
| Nepal | 2016 | 6 | v151 | sex of household head | dbl+lbl | 0 | 2 | 1 - 2 |
| Nepal | 2022 | 6 | v151 | sex of household head | dbl+lbl | 0 | 2 | 1 - 2 |
| Nepal | 1996 | 7 | v152 | age of household head | dbl | 0 | 81 | 12 - 95 |
| Nepal | 2001 | 7 | v152 | age of household head | dbl+lbl | 0 | 77 | 14 - 95 |
| Nepal | 2006 | 7 | v152 | age of household head | dbl+lbl | 0 | 75 | 16 - 96 |
| Nepal | 2011 | 7 | v152 | age of household head | dbl+lbl | 0 | 76 | 16 - 95 |
| Nepal | 2016 | 7 | v152 | age of household head | dbl+lbl | 0 | 73 | 17 - 95 |
| Nepal | 2022 | 7 | v152 | age of household head | dbl+lbl | 0 | 76 | 17 - 95 |

The table above gives an overall snapshot of the common independent variables. Most of the variables have a different number of unique values across the six npbr datasets. Only v025 and v151 have the same number of unique values in every round. Next, we look in more detail at the labelled variables among these common variables, to see whether their value labels and codes are consistent across the npbr datasets.

v106 - Mother’s education level

We check the value labels of the v106 variable, which denotes the highest education level of the mother. First we create a nested tibble of v106’s value labels.

# Create the data dictionary for v106 in nested tibble
npbr1_pre_tmp1 <- npbr1_pre_tmp0 |> 
  mutate(lookfor_v106 = map(
    npbr_data,
    \(df) {
      df |> 
        select(v106) |> 
        look_for() |> 
        lookfor_to_long_format() |> 
        select(value_labels)
    }
  ))
npbr1_pre_tmp1
# Now we unnest the tibble and refine the pooled data dictionary
npbr1_pre_tmp2 <- npbr1_pre_tmp1 |> 
  # First we select the required cols and unnest()
  select(-c(unf, npbr_data, n_births)) |> 
  unnest(cols = c(lookfor_v106)) |> 
  # Next we make the num of value labels same across each round
  mutate(label_num = parse_number(value_labels)) |> 
  complete(ctr_name, svy_year, label_num) |>
  # Next we create col of value labels for each survey round
  pivot_wider(
    names_from = svy_year, 
    values_from = value_labels,
    names_prefix = "npbr_"
  ) |>
  # Show the variable name in a col
  mutate(var_name = "v106", .before = 2)

# Convert the tibble to flextable for easy viewing
npbr1_pre_tmp2 |> 
  qflextable() |> 
  align(align = "left", part = "all") |> 
  autofit()
Table 17: Data dictionary of v106 across the npbr rounds

ctr_name

var_name

label_num

npbr_1996

npbr_2001

npbr_2006

npbr_2011

npbr_2016

npbr_2022

Nepal

v106

0

[0] no education

[0] no education

[0] no education

[0] no education

[0] no education

[0] no education

Nepal

v106

1

[1] primary or less

[1] primary

[1] primary

[1] primary

[1] primary

[1] basic

Nepal

v106

2

[2] some secondary

[2] secondary

[2] secondary

[2] secondary

[2] secondary

[2] secondary

Nepal

v106

3

[3] slc and above

[3] higher

[3] higher

[3] higher

[3] higher

[3] higher

Nepal

v106

4

[4]

Nepal

v106

8

[8] don't know

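The value labels of v106 are mostly similar across rounds, with three exceptions visible in the table. npbr 1996 uses different wording for codes 1 to 3 (primary or less, some secondary, slc and above). npbr 2022 labels code 1 as basic instead of primary. Codes 4 (blank label) and 8 (don't know) each appear in a single round, most likely npbr 2011, the only round whose v106 range extends to 8 in Table 16.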

v011 - Date of birth (in CMC)

The v011 variable, which holds the mother's date of birth in CMC, is a numeric variable. Let’s look at the range of these values in more detail, including a check for outliers. First let’s create a nested tibble of the summary statistics of the v011 variable.

# Create the summary statistics for v011 in nested tibble
npbr1_pre_tmp1 <- npbr1_pre_tmp0 |> 
  mutate(skim_v011 = map(
    npbr_data,
    \(df) {
      df |> 
        select(v011) |> 
        skim_without_charts() |> 
        as_tibble() |> 
        select(-c(skim_type, complete_rate)) |> 
        rename(
          variable = 1,
          n_miss = 2,
          mean = 3,
          sd = 4,
          min = 5,
          p25 = 6,
          p50 = 7,
          p75 = 8,
          max = 9
        )
    }
  ))
npbr1_pre_tmp1
# Now we unnest the tibble and refine the pooled data dictionary
npbr1_pre_tmp2 <- npbr1_pre_tmp1 |> 
  # First we select the required cols and unnest()
  select(-c(unf, npbr_data, n_births)) |> 
  unnest(cols = c(skim_v011)) |> 
  # Make variable values have one decimal point 
  mutate(
    mean = sprintf("%.1f", mean),
    sd = sprintf("%.1f", sd)
  )

# Convert the tibble to flextable for easy viewing
npbr1_pre_tmp2 |> 
  qflextable() |> 
  align(align = "left", part = "all") |> 
  autofit()
Table 18: Summary statistics of v011 across the npbr rounds

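| ctr_name | svy_year | variable | n_miss | mean | sd | min | p25 | p50 | p75 | max |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Nepal | 1996 | v011 | 0 | 206.0 | 97.9 | 35 | 126 | 203 | 286 | 456 |
| Nepal | 2001 | v011 | 0 | 1465.4 | 97.1 | 1296 | 1384 | 1465 | 1541 | 1708 |
| Nepal | 2006 | v011 | 0 | 1525.7 | 98.4 | 1356 | 1444 | 1520 | 1607 | 1771 |
| Nepal | 2011 | v011 | 0 | 1583.6 | 95.0 | 1415 | 1505 | 1580 | 1660 | 1830 |
| Nepal | 2016 | v011 | 0 | 1646.9 | 96.7 | 1480 | 1565 | 1642 | 1723 | 1903 |
| Nepal | 2022 | v011 | 0 | 1711.0 | 96.7 | 1546 | 1631 | 1706 | 1787 | 1962 |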
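The minimum and maximum in Table 18 are easier to judge once they are turned into years. The sketch below does that for the values reported in the table, assuming the standard DHS definition CMC = (year - 1900) * 12 + month. The helper `cmc_to_year()` and the column names are hypothetical.

```r
# Min and max of v011 copied from Table 18
v011_range <- data.frame(
  svy_year = c(1996, 2001, 2006, 2011, 2016, 2022),
  min_cmc  = c(35, 1296, 1356, 1415, 1480, 1546),
  max_cmc  = c(456, 1708, 1771, 1830, 1903, 1962)
)

# Hypothetical helper; assumes CMC = (year - 1900) * 12 + month
cmc_to_year <- function(cmc) 1900 + (cmc - 1) %/% 12

v011_range$min_year <- cmc_to_year(v011_range$min_cmc)
v011_range$max_year <- cmc_to_year(v011_range$max_cmc)

# Gap between the survey year and the implied birth year of the youngest mother
v011_range$gap_youngest <- v011_range$svy_year - v011_range$max_year

v011_range
```

In this sketch the 1996 round stands apart from the other five, which suggests its CMC values are on a different base and need attention before pooling. The implied years for the later rounds also run ahead of the survey years, which would be consistent with dates recorded in the Nepali calendar; that reading is an assumption to confirm against the DHS documentation for each round.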

v501 - Mother’s marital status

We check the value labels of the v501 variable, which gives the current marital status of the mother. First we create a nested tibble of v501’s value labels.

# Create the data dictionary for v501 in nested tibble
npbr1_pre_tmp1 <- npbr1_pre_tmp0 |> 
  mutate(lookfor_v501 = map(
    npbr_data,
    \(df) {
      df |> 
        select(v501) |> 
        look_for() |> 
        lookfor_to_long_format() |> 
        select(value_labels)
    }
  ))
npbr1_pre_tmp1
# Now we unnest the tibble and refine the pooled data dictionary
npbr1_pre_tmp2 <- npbr1_pre_tmp1 |> 
  # First we select the required cols and unnest()
  select(-c(unf, npbr_data, n_births)) |> 
  unnest(cols = c(lookfor_v501)) |> 
  # Next we make the num of value labels same across each round
  mutate(label_num = parse_number(value_labels)) |> 
  complete(ctr_name, svy_year, label_num) |>
  # Next we create col of value labels for each survey round
  pivot_wider(
    names_from = svy_year, 
    values_from = value_labels,
    names_prefix = "npbr_"
  ) |>
  # Show the variable name in a col
  mutate(var_name = "v501", .before = 2)

# Convert the tibble to flextable for easy viewing
npbr1_pre_tmp2 |> 
  qflextable() |> 
  align(align = "left", part = "all") |> 
  autofit()
Table 19: Data dictionary of v501 across the npbr rounds

| ctr_name | var_name | label_num | npbr_1996 | npbr_2001 | npbr_2006 | npbr_2011 | npbr_2016 | npbr_2022 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Nepal | v501 | 0 | [0] never married | [0] never married | [0] never married | [0] never in union | [0] never in union | [0] never in union |
| Nepal | v501 | 1 | [1] married | [1] married | [1] married | [1] married | [1] married | [1] married |
| Nepal | v501 | 2 | [2] living together | [2] living together | [2] living together | [2] living with partner | [2] living with partner | [2] living with partner |
| Nepal | v501 | 3 | [3] widowed | [3] widowed | [3] widowed | [3] widowed | [3] widowed | [3] widowed |
| Nepal | v501 | 4 | [4] divorced | [4] divorced | [4] divorced | [4] divorced | [4] divorced | [4] divorced |
| Nepal | v501 | 5 | [5] not living together | [5] not living together | [5] not living together | [5] no longer living together/separated | [5] no longer living together/separated | [5] no longer living together/separated |

All the npbr rounds have 6 value labels (codes 0 to 5). The npbr 1996, 2001 and 2006 rounds share one set of label texts, and npbr 2011, 2016 and 2022 share another. The wording differs for codes 0, 2 and 5 only.

v701 - Husband/Partner’s education level

We check the value labels of the v701 variable, which gives the education level of the mother's husband or partner. First we create a nested tibble of v701’s value labels.

# Create the data dictionary for v701 in nested tibble
npbr1_pre_tmp1 <- npbr1_pre_tmp0 |> 
  mutate(lookfor_v701 = map(
    npbr_data,
    \(df) {
      df |> 
        select(v701) |> 
        look_for() |> 
        lookfor_to_long_format() |> 
        select(value_labels)
    }
  ))
npbr1_pre_tmp1
# Now we unnest the tibble and refine the pooled data dictionary
npbr1_pre_tmp2 <- npbr1_pre_tmp1 |> 
  # First we select the required cols and unnest()
  select(-c(unf, npbr_data, n_births)) |> 
  unnest(cols = c(lookfor_v701)) |> 
  # Next we make the num of value labels same across each round
  mutate(label_num = parse_number(value_labels)) |> 
  complete(ctr_name, svy_year, label_num) |>
  # Next we create col of value labels for each survey round
  pivot_wider(
    names_from = svy_year, 
    values_from = value_labels,
    names_prefix = "npbr_"
  ) |>
  # Show the variable name in a col
  mutate(var_name = "v701", .before = 2)

# Convert the tibble to flextable for easy viewing
npbr1_pre_tmp2 |> 
  qflextable() |> 
  align(align = "left", part = "all") |> 
  autofit()
Table 20: Data dictionary of v701 across the npbr rounds

| ctr_name | var_name | label_num | npbr_1996 | npbr_2001 | npbr_2006 | npbr_2011 | npbr_2016 | npbr_2022 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Nepal | v701 | 0 | [0] no education | [0] no education | [0] no education | [0] no education | [0] no education | [0] no education |
| Nepal | v701 | 1 | [1] primary | [1] primary | [1] primary | [1] primary | [1] primary | [1] basic |
| Nepal | v701 | 2 | [2] secondary | [2] secondary | [2] secondary | [2] secondary | [2] secondary | [2] secondary |
| Nepal | v701 | 3 | [3] higher | [3] higher | [3] higher | [3] higher | [3] higher | [3] higher |
| Nepal | v701 | 8 | [8] don't know | [8] don't know | [8] don't know | [8] don't know | [8] don't know | [8] don't know |

All the npbr rounds have 5 value labels (codes 0 to 3 and 8). The label texts are identical across rounds except for code 1, which npbr 2022 labels as basic instead of primary.

v025 - Type of place of residence

We check the value labels of the v025 variable, which shows whether a household belongs to a rural or an urban PSU. First we create a nested tibble of v025’s value labels.

# Create the data dictionary for v025 in nested tibble
npbr1_pre_tmp1 <- npbr1_pre_tmp0 |> 
  mutate(lookfor_v025 = map(
    npbr_data,
    \(df) {
      df |> 
        select(v025) |> 
        look_for() |> 
        lookfor_to_long_format() |> 
        select(value_labels)
    }
  ))
npbr1_pre_tmp1
# Now we unnest the tibble and refine the pooled data dictionary
npbr1_pre_tmp2 <- npbr1_pre_tmp1 |> 
  # First we select the required cols and unnest()
  select(c(ctr_name, svy_year, lookfor_v025)) |> 
  unnest(cols = c(lookfor_v025)) |> 
  # Next we make the num of value labels same across each round
  mutate(label_num = parse_number(value_labels)) |> 
  complete(ctr_name, svy_year, label_num) |>
  # Next we create col of value labels for each survey round
  pivot_wider(
    names_from = svy_year, 
    values_from = value_labels,
    names_prefix = "npbr_"
  ) |>
  # Show the variable name in a col
  mutate(var_name = "v025", .before = 2)

# Convert the tibble to flextable for easy viewing
npbr1_pre_tmp2 |> 
  qflextable() |> 
  align(align = "left", part = "all") |> 
  autofit()
Table 21: Data dictionary of v025 across the npbr rounds

| ctr_name | var_name | label_num | npbr_1996 | npbr_2001 | npbr_2006 | npbr_2011 | npbr_2016 | npbr_2022 |
|---|---|---|---|---|---|---|---|---|
| Nepal | v025 | 1 | [1] urban | [1] urban | [1] urban | [1] urban | [1] urban | [1] urban |
| Nepal | v025 | 2 | [2] rural | [2] rural | [2] rural | [2] rural | [2] rural | [2] rural |

The value labels and codes for v025 are the same across all the npbr rounds.

v151 - Sex of household head

We check the value labels of the v151 variable, which gives the sex of the household head. First we create a nested tibble of v151’s value labels.

# Create the data dictionary for v151 in nested tibble
npbr1_pre_tmp1 <- npbr1_pre_tmp0 |> 
  mutate(lookfor_v151 = map(
    npbr_data,
    \(df) {
      df |> 
        select(v151) |> 
        look_for() |> 
        lookfor_to_long_format() |> 
        select(value_labels)
    }
  ))
npbr1_pre_tmp1
# Now we unnest the tibble and refine the pooled data dictionary
npbr1_pre_tmp2 <- npbr1_pre_tmp1 |> 
  # First we select the required cols and unnest()
  select(-c(unf, npbr_data, n_births)) |> 
  unnest(cols = c(lookfor_v151)) |> 
  # Next we make the num of value labels same across each round
  mutate(label_num = parse_number(value_labels)) |> 
  complete(ctr_name, svy_year, label_num) |>
  # Next we create col of value labels for each survey round
  pivot_wider(
    names_from = svy_year, 
    values_from = value_labels,
    names_prefix = "npbr_"
  ) |>
  # Show the variable name in a col
  mutate(var_name = "v151", .before = 2)

# Convert the tibble to flextable for easy viewing
npbr1_pre_tmp2 |> 
  qflextable() |> 
  align(align = "left", part = "all") |> 
  autofit()
Table 22: Data dictionary of v151 across the npbr rounds

| ctr_name | var_name | label_num | npbr_1996 | npbr_2001 | npbr_2006 | npbr_2011 | npbr_2016 | npbr_2022 |
|---|---|---|---|---|---|---|---|---|
| Nepal | v151 | 1 | [1] male | [1] male | [1] male | [1] male | [1] male | [1] male |
| Nepal | v151 | 2 | [2] female | [2] female | [2] female | [2] female | [2] female | [2] female |

The value labels and codes for v151 are the same across all the npbr rounds.

v152 - Age of household head

Interestingly, we see v152 (a continuous variable) has value labels for all rounds except npbr 1996. Therefore, we check the value labels of v152 for those rounds. First we create a nested tibble of v152’s value labels.

# Create the data dictionary for v152 in nested tibble
npbr1_pre_tmp1 <- npbr1_pre_tmp0 |> 
  filter(svy_year != 1996) |> 
  mutate(lookfor_v152 = map(
    npbr_data,
    \(df) {
      df |> 
        select(v152) |> 
        look_for() |> 
        lookfor_to_long_format() |> 
        select(value_labels)
    }
  ))
npbr1_pre_tmp1
# Now we unnest the tibble and refine the pooled data dictionary
npbr1_pre_tmp2 <- npbr1_pre_tmp1 |> 
  # First we select the required cols and unnest()
  select(-c(unf, npbr_data, n_births)) |> 
  unnest(cols = c(lookfor_v152)) |> 
  # Next we make the num of value labels same across each round
  mutate(label_num = parse_number(value_labels)) |> 
  complete(ctr_name, svy_year, label_num) |>
  # Next we create col of value labels for each survey round
  pivot_wider(
    names_from = svy_year, 
    values_from = value_labels,
    names_prefix = "npbr_"
  ) |>
  # Show the variable name in a col
  mutate(var_name = "v152", .before = 2)

# Convert the tibble to flextable for easy viewing
npbr1_pre_tmp2 |> 
  qflextable() |> 
  align(align = "left", part = "all") |> 
  autofit()
Table 23: Data dictionary of v152 across the npbr rounds

| ctr_name | var_name | label_num | npbr_2001 | npbr_2006 | npbr_2011 | npbr_2016 | npbr_2022 |
|---|---|---|---|---|---|---|---|
| Nepal | v152 | 97 | [97] 97+ | [97] 97+ | [97] 97+ | [97] 97+ | [97] 97+ |
| Nepal | v152 | 98 | [98] dk | [98] dk | [98] don't know | [98] don't know | [98] don't know |

We can see that the value labels of v152 cover only the top-coded age (97+) and the “don't know” response. Since v152 has no missing values across the npbr rounds, we need not be concerned about them.
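Because the labelled codes 97 and 98 are stored as ordinary numbers, it can be worth confirming that no observation actually sits on them before treating v152 as a plain continuous variable. The sketch below illustrates such a check on a made-up vector; `v152`, `special_codes` and the counts are illustrative only and do not come from the npbr data.

```r
# Toy stand-in for one round's v152 column (values are made up)
v152 <- c(25, 34, 61, 47, 95, 52)

# Codes that carry value labels in Table 23: 97 = "97+", 98 = "don't know"
special_codes <- c(97, 98)

# Count observations that sit on a labelled code, and those that are missing
n_special <- sum(v152 %in% special_codes)
n_missing <- sum(is.na(v152))

c(special = n_special, missing = n_missing)
```

If either count were non-zero in a real round, those observations would need to be recoded (for example to `NA` for 98) before computing means or age groups.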

Checking the Social group variables before harmonization

Now we document the social group variables before harmonizing them. On manually checking the full data dictionaries of each npbr dataset, we find the following variables: religion, ethnicity, and native language. First we check the pooled data dictionary of these social group variables. Then we check them variable by variable.

# We check the social group vars in all npbr datasets.
# First we create the data dictionary in nested tibble.
npbr1_pre_tmp1 <- npbr1_pre_tmp0 |>
  mutate(lookfor_socgrp = map(
    npbr_data,
    \(df) {
      df |> 
        # select the social group variables
        select(
          v130, v131, 
          matches("slang[nr]|snlang|slnative|v045c")
        ) |> 
        lookfor(details = "full") |> 
        select(-c(levels:n_na)) |> 
        # For correctly viewing the range column in data dictionary
        convert_list_columns_to_character()
    }
  ))
npbr1_pre_tmp1
# Now we unnest the tibble and refine the pooled data dictionary 
npbr1_pre_tmp2 <- npbr1_pre_tmp1 |> 
  select(-c(unf, npbr_data, n_births)) |> 
  unnest(cols = c(lookfor_socgrp)) |> 
  arrange(pos)

# Convert the tibble to flextable for easy viewing
npbr1_pre_tmp2 |> 
  qflextable() |> 
  align(align = "left", part = "all") |> 
  autofit()
Table 24: Data dictionary of social group variables across the npbr rounds

| ctr_name | svy_year | pos | variable | label | col_type | missing | unique_values | range |
|---|---|---|---|---|---|---|---|---|
| Nepal | 1996 | 1 | v130 | religion | dbl+lbl | 55 | 6 | 1 - 5 |
| Nepal | 2001 | 1 | v130 | religion | dbl+lbl | 0 | 5 | 1 - 6 |
| Nepal | 2006 | 1 | v130 | religion | dbl+lbl | 0 | 6 | 1 - 6 |
| Nepal | 2011 | 1 | v130 | religion | dbl+lbl | 0 | 6 | 1 - 96 |
| Nepal | 2016 | 1 | v130 | religion | dbl+lbl | 0 | 6 | 1 - 96 |
| Nepal | 2022 | 1 | v130 | religion | dbl+lbl | 0 | 6 | 1 - 96 |
| Nepal | 1996 | 2 | v131 | ethnicity | dbl+lbl | 0 | 13 | 0 - 12 |
| Nepal | 2001 | 2 | v131 | ethnicity | dbl+lbl | 0 | 56 | 1 - 96 |
| Nepal | 2006 | 2 | v131 | ethnicity | dbl+lbl | 0 | 75 | 1 - 96 |
| Nepal | 2011 | 2 | v131 | ethnicity | dbl+lbl | 0 | 11 | 1 - 996 |
| Nepal | 2016 | 2 | v131 | ethnicity | dbl+lbl | 0 | 11 | 1 - 96 |
| Nepal | 2022 | 2 | v131 | ethnicity | dbl+lbl | 0 | 11 | 1 - 96 |
| Nepal | 1996 | 3 | slangn | native language of respondent | dbl+lbl | 5 | 6 | 1 - 5 |
| Nepal | 2001 | 3 | slangr | home language of respondent | dbl+lbl | 0 | 5 | 1 - 5 |
| Nepal | 2006 | 3 | snlang | native language of respondent | dbl+lbl | 0 | 5 | 1 - 5 |
| Nepal | 2011 | 3 | slnative | native language of respondent | dbl+lbl | 0 | 4 | 1 - 6 |
| Nepal | 2016 | 3 | v045c | native language of respondent | dbl+lbl | 0 | 5 | 1 - 5 |
| Nepal | 2022 | 3 | v045c | native language of respondent | dbl+lbl | 0 | 5 | 1 - 6 |

The above table gives an overall snapshot of the social group variables. All the variables are of labelled class across all the npbr datasets. For every variable, the number of unique values differs across the six npbr datasets. Note that the religion and native language variables have some missing values in the npbr 1996 dataset (55 and 5 respectively). Next, we look at the variables individually to match the value labels across the npbr datasets.

v130 - Religion of hh head

We check the value labels of the first social group variable v130, which gives the religion of household head. First we create a nested tibble of v130’s value labels.

# Create the data dictionary for v130 in nested tibble
npbr1_pre_tmp1 <- npbr1_pre_tmp0 |> 
  mutate(lookfor_v130 = map(
    npbr_data,
    \(df) {
      df |> 
        select(v130) |> 
        look_for() |> 
        lookfor_to_long_format() |> 
        select(value_labels)
    }
  ))
npbr1_pre_tmp1
# Now we unnest the tibble and refine the pooled data dictionary
npbr1_pre_tmp2 <- npbr1_pre_tmp1 |> 
  # First we select the required cols and unnest()
  select(-c(unf, npbr_data, n_births)) |> 
  unnest(cols = c(lookfor_v130)) |> 
  # Next we make the num of value labels same across each round
  mutate(label_num = parse_number(value_labels)) |> 
  complete(ctr_name, svy_year, label_num) |>
  # Next we create col of value labels for each survey round
  pivot_wider(
    names_from = svy_year, 
    values_from = value_labels,
    names_prefix = "npbr_"
  ) |>
  # Show the variable name in a col
  mutate(var_name = "v130", .before = 2)

# Convert the tibble to flextable for easy viewing
npbr1_pre_tmp2 |> 
  qflextable() |> 
  align(align = "left", part = "all") |> 
  autofit()
Table 25: Data dictionary of v130 across the npbr rounds

ctr_name

var_name

label_num

npbr_1996

npbr_2001

npbr_2006

npbr_2011

npbr_2016

npbr_2022

Nepal

v130

1

[1] hindu

[1] hindu

[1] hindu

[1] hindu

[1] hindu

[1] hindu

Nepal

v130

2

[2] buddhist

[2] buddhist

[2] buddhist

[2] buddhist

[2] buddhist

[2] buddhist

Nepal

v130

3

[3] muslim

[3] muslim

[3] mulsim

[3] muslim

[3] muslim

[3] muslim

Nepal

v130

4

[4] christian

[4] christian

[4] kirat

[4] kirat

[4] kirat

[4] kirat

Nepal

v130

5

[5] other

[5] christian

[5] christian

[5] christian

[5] christian

Nepal

v130

6

[6] other

[6] other

Nepal

v130

96

[96] other

[96] other

[96] other

Evidently, the value labels and codes for v130 are not the same across the npbr rounds. Only the first three value labels, “hindu”, “buddhist” and “muslim”, and their codes are the same across all the npbr rounds (npbr 2006 misspells the third label as “mulsim”). Therefore, we will work with these labels for harmonization.
NOTE: The labels “christian” and “other” are also present, but their codes vary across the npbr rounds.
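One way to sidestep the shifting codes is to harmonize on the label text rather than on the numeric code. The following is a rough sketch under several assumptions: the two lookups are read off Table 25 for npbr 1996 and npbr 2006 and should be re-checked against the data, the function name `harmonise_religion` is hypothetical, and collapsing “kirat” into “other” is an illustrative choice rather than the approach used in the pooling script.

```r
# Code-to-label lookups for two rounds, read off Table 25 (re-check before use)
lab_1996 <- c("1" = "hindu", "2" = "buddhist", "3" = "muslim",
              "4" = "christian", "5" = "other")
lab_2006 <- c("1" = "hindu", "2" = "buddhist", "3" = "mulsim",
              "4" = "kirat", "5" = "christian", "6" = "other")

harmonise_religion <- function(code, lookup) {
  # Translate the round-specific code into its label text
  txt <- unname(lookup[as.character(code)])
  # Repair the misspelling seen in the npbr 2006 labels
  txt[txt %in% "mulsim"] <- "muslim"
  # Labels not shared by every round are collapsed into "other" (an assumption)
  common <- c("hindu", "buddhist", "muslim", "christian")
  txt[!is.na(txt) & !(txt %in% common)] <- "other"
  factor(txt, levels = c(common, "other"))
}

# Code 4 means "christian" in 1996 but "kirat" in 2006
harmonise_religion(c(1, 4, 5, NA), lab_1996)
harmonise_religion(c(3, 4, 5, 6), lab_2006)
```

Matching on text keeps “christian” comparable even though it is code 4 in some rounds and code 5 in others.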

v131 - Ethnicity of hh head

Next, we check the value labels of the v131 variable, which gives the ethnicity of the household head. First we create a nested tibble of v131’s value labels.

# Create the data dictionary for v131 in nested tibble
npbr1_pre_tmp1 <- npbr1_pre_tmp0 |> 
  mutate(lookfor_v131 = map(
    npbr_data,
    \(df) {
      df |> 
        select(v131) |> 
        look_for() |> 
        lookfor_to_long_format() |> 
        select(value_labels)
    }
  ))
npbr1_pre_tmp1
# Now we unnest the tibble and refine the pooled data dictionary
npbr1_pre_tmp2 <- npbr1_pre_tmp1 |> 
  # First we select the required cols and unnest()
  select(-c(unf, npbr_data, n_births)) |> 
  unnest(cols = c(lookfor_v131)) |> 
  # Next we make the num of value labels same across each round
  mutate(label_num = parse_number(value_labels)) |> 
  complete(ctr_name, svy_year, label_num) |>
  # Next we create col of value labels for each survey round
  pivot_wider(
    names_from = svy_year, 
    values_from = value_labels,
    names_prefix = "npbr_"
  ) |>
  # Show the variable name in a col
  mutate(var_name = "v131", .before = 2)

# Convert the tibble to flextable for easy viewing
npbr1_pre_tmp2 |> 
  qflextable() |> 
  align(align = "left", part = "all") |> 
  autofit()
Table 26: Data dictionary of v131 across the npbr rounds

ctr_name

var_name

label_num

npbr_1996

npbr_2001

npbr_2006

npbr_2011

npbr_2016

npbr_2022

Nepal

v131

0

[0] brahmin

Nepal

v131

1

[1] chhetri

[1] yadav ahir

[1] chhetri

[1] hill brahmin

[1] hill brahmin

[1] hill brahmin

Nepal

v131

2

[2] newar

[2] kayastha

[2] brahmin

[2] hill chhetri

[2] hill chhetri

[2] hill chhetri

Nepal

v131

3

[3] gurung

[3] kumhar

[3] magar

[3] terai brahmin/chhetri

[3] terai brahmin/chhetri

[3] terai brahmin/chhetri

Nepal

v131

4

[4] magar

[4] baniya

[4] tharu

[4] other terai caste

[4] other terai caste

[4] other terai caste

Nepal

v131

5

[5] tamang

[5] dhobi

[5] tamang

[5] hill dalit

[5] hill dalit

[5] hill dalit

Nepal

v131

6

[6] rai, limbu

[6] sundhi kalwar

[6] newar

[6] terai dalit

[6] terai dalit

[6] terai dalit

Nepal

v131

7

[7] muslim, churaute

[7] kurmi

[7] muslim

[7] newar

[7] newar

[7] newar

Nepal

v131

8

[8] tharu, rajbanshi

[8] brahman

[8] kami

[8] hill janajati

[8] hill janajati

[8] hill janajati

Nepal

v131

9

[9] yadav, ahir

[9] rajput

[9] yadav

[9] terai janajati

[9] terai janajati

[9] terai janajati

Nepal

v131

10

[10] occupational

[10] tharu

[10] rai

[10] muslim

[10] muslim

[10] muslim

Nepal

v131

11

[11] other hill origin

[11] teli

[11] gurung

Nepal

v131

12

[12] other terai origin

[12] kushwaha

[12] damai/dholi

Nepal

v131

13

[13] musalman

[13] limbu

Nepal

v131

14

[14] haluwai

[14] thakuri

Nepal

v131

15

[15] malaha

[15] sharki

Nepal

v131

16

[16] rajbanshi

[16] teli

Nepal

v131

17

[17] dhimal

[17] chamar

Nepal

v131

18

[18] gangai

[18] koiri

Nepal

v131

19

[19] marwadi

[19] kurmi

Nepal

v131

20

[20] bangali

[20] sanyasi

Nepal

v131

21

[21] dhanuk

[21] dhanuk

Nepal

v131

22

[22] shikha

[22] mushahar

Nepal

v131

23

[23] dushad

[23] dushad

Nepal

v131

24

[24] chamar

[24] sherpa

Nepal

v131

25

[25] khatwe

[25] sonar

Nepal

v131

26

[26] bhumihar

[26] kewat

Nepal

v131

27

[27] kewat

[27] brahmin (terai)

Nepal

v131

28

[28] rajbhar

[28] baniya

Nepal

v131

29

[29] kanu

[29] gharti/bhujel

Nepal

v131

30

[30] tarai others

[30] malaha

Nepal

v131

31

[31] brahman

[31] kalwar

Nepal

v131

32

[32] chhetri

[32] kumal

Nepal

v131

33

[33] thakuri

[33] hajam

Nepal

v131

34

[34] sanyashi

[34] kanu

Nepal

v131

35

[35] newar

[35] rajbanshi

Nepal

v131

36

[36] limbu

[36] sunuwar

Nepal

v131

37

[37] rai

[37] sundi

Nepal

v131

38

[38] gurung

[38] lohar

Nepal

v131

39

[39] thakali

[39] tatma

Nepal

v131

40

[40] tamang

[40] khatwe

Nepal

v131

41

[41] magar

[41] dhobi

Nepal

v131

42

[42] danuwar

[42] majhi

Nepal

v131

43

[43] jirel

[43] nuniya

Nepal

v131

44

[44] majhi

[44] kumhar

Nepal

v131

45

[45] sunuwar

[45] dunuwar

Nepal

v131

46

[46] gaine

[46] chepang/praja

Nepal

v131

47

[47] chepang

[47] haluwai

Nepal

v131

48

[48] kumhal

[48] rajput

Nepal

v131

49

[49] churaute (pahadi musalman)

[49] kayastha

Nepal

v131

50

[50] bote

[50] badahi

Nepal

v131

51

[51] lepcha

[51] marwadi

Nepal

v131

52

[52] raute

[52] santhal/satar

Nepal

v131

53

[53] darai

[53] dangad/jhangad

Nepal

v131

54

[54] raji

[54] bantar

Nepal

v131

55

[55] thami

[55] barai

Nepal

v131

56

[56] damai

[56] kahar

Nepal

v131

57

[57] kami

[57] gangai

Nepal

v131

58

[58] sharki

[58] lodha

Nepal

v131

59

[59] badi

[59] rajbhar

Nepal

v131

60

[60] pahadi others

[60] thami

Nepal

v131

61

[61] sherpa

[61] dhimal

Nepal

v131

62

[62] mugrali/humli/kar bhote

[62] bhote

Nepal

v131

63

[63] himali others

[63] bing/binda

Nepal

v131

64

[64] bhedihar/gaderi

Nepal

v131

65

[65] nurang

Nepal

v131

66

[66] yakha

Nepal

v131

67

[67] darai

Nepal

v131

68

[68] tajpuriya

Nepal

v131

69

[69] thakali

Nepal

v131

70

[70] chidimar

Nepal

v131

71

[71] pahadi

Nepal

v131

72

[72] mali

Nepal

v131

73

[73] bangali

Nepal

v131

74

[74] chantel

Nepal

v131

75

[75] dom

Nepal

v131

76

[76] kamar

Nepal

v131

77

[77] bote

Nepal

v131

78

[78] dbrahmu/baramu

Nepal

v131

79

[79] gainai

Nepal

v131

80

[80] jirel

Nepal

v131

81

[81] aadibasi

Nepal

v131

82

[82] dura

Nepal

v131

83

[83] churaute

Nepal

v131

84

[84] badi

Nepal

v131

85

[85] meche

Nepal

v131

86

[86] lepcha

Nepal

v131

87

[87] halkhor

Nepal

v131

88

[88] panjabi/sihk

Nepal

v131

89

[89] kisan

Nepal

v131

90

[90] bhumihar

Nepal

v131

91

[91] kushawa

Nepal

v131

92

[92] hayu

Nepal

v131

93

[93] koche

Nepal

v131

94

[94] dhuniya

Nepal

v131

95

[95] walung

Nepal

v131

96

[96] others

[96] other caste*

[96] other

[96] other

Nepal

v131

98

[98] do not know

Nepal

v131

996

[996] other

Similar to v130, the value labels and codes for v131 differ across the npbr rounds. Notably, npbr 2001 and 2006 have more than 60 ethnicity categories. Unfortunately, we do not know how to group these categories. Therefore, we might not use this variable as a social group characteristic.

Native language of hh respondent

Next, we check the value labels of the native language of hh respondent variable. The name of this variable differs across the npbr datasets. First we create a nested tibble of the value labels.

# Create the data dictionary in nested tibble
npbr1_pre_tmp1 <- npbr1_pre_tmp0 |> 
  mutate(lookfor_lang = map(
    npbr_data,
    \(df) {
      df |> 
        select(matches("slang[nr]|snlang|slnative|v045c")) |> 
        look_for() |> 
        lookfor_to_long_format() |> 
        select(value_labels)
    }
  ))
npbr1_pre_tmp1
# Now we unnest the tibble and refine the pooled data dictionary
npbr1_pre_tmp2 <- npbr1_pre_tmp1 |> 
  # First we select the required cols and unnest()
  select(-c(unf, npbr_data, n_births)) |> 
  unnest(cols = c(lookfor_lang)) |> 
  # Next we make the num of value labels same across each round
  mutate(label_num = parse_number(value_labels)) |> 
  complete(ctr_name, svy_year, label_num) |>
  # Next we create col of value labels for each survey round
  pivot_wider(
    names_from = svy_year, 
    values_from = value_labels,
    names_prefix = "npbr_"
  ) |>
  # Show the variable name in a col
  mutate(var_name = "Native language", .before = 2)

# Convert the tibble to flextable for easy viewing
npbr1_pre_tmp2 |> 
  qflextable() |> 
  align(align = "left", part = "all") |> 
  autofit()
Table 27: Data dictionary of native language of respondent variable across the npbr rounds

ctr_name

var_name

label_num

npbr_1996

npbr_2001

npbr_2006

npbr_2011

npbr_2016

npbr_2022

Nepal

Native language

1

[1] nepali

[1] nepali

[1] nepali

[1] nepali

[1] english

[1] english

Nepal

Native language

2

[2] bhojpuri

[2] bhojpuri

[2] bhojpuri

[2] bhojpuri

[2] nepali

[2] nepali

Nepal

Native language

3

[3] maithili

[3] maithili

[3] maithili

[3] maithili

[3] maithili

[3] maithali

Nepal

Native language

4

[4] tharu

[4] tharu

[4] tharu

[4] bhojpuri

[4] bhojpuri

Nepal

Native language

5

[5] other

[5] other

[5] other

[5] english

[5] other

Nepal

Native language

6

[6] other

[6] other

Nepal

Native language

9

[9] missing

The value labels are the same for npbr 1996 and 2001, and then they vary for the other datasets. The value labels “nepali”, “bhojpuri” and “maithili” are present across all the npbr rounds, but their codes differ (npbr 2022 spells the last one “maithali”). Therefore, we will use these labels for harmonization.
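Since the language variable carries a different name in each round (slangn, slangr, snlang, slnative, v045c), a first harmonization step is to give it one common name. The sketch below reuses the regular expression from the code above on two made-up data frames; `rename_lang` and the target name `native_lang` are hypothetical.

```r
# Same pattern as used above to pick the language column in each round
lang_pattern <- "slang[nr]|snlang|slnative|v045c"

rename_lang <- function(df, new_name = "native_lang") {
  hit <- grep(lang_pattern, names(df))
  # Exactly one language column is expected per round
  stopifnot(length(hit) == 1)
  names(df)[hit] <- new_name
  df
}

# Toy rounds with made-up values
rounds <- list(
  `1996` = data.frame(v130 = c(1, 2), slangn = c(1, 3)),
  `2016` = data.frame(v130 = c(1, 3), v045c = c(2, 3))
)

renamed <- lapply(rounds, rename_lang)
sapply(renamed, names)
```

Renaming alone does not make the codes comparable; the codes still need to be mapped through each round's own labels afterwards.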

Nepal HH dataset use for variable creation

Checking the ID variables before harmonization

Here we check the formatting of the constituent variables with which we will prepare the ID variables for the pooled Nepal household recode (hr) dataset. We will use the following constituent variables for creating the ID variables for the pooled dataset:

# We check the var type of ID vars in all nphr datasets.
# First we create a data dictionary of the nphr datasets in nested tibble.
nphr1_pre_tmp1 <- nphr1_pre_tmp0 |>
  mutate(lookfor_idvars = map(
    nphr_data,
    \(df) {
      df |> 
        select(hv001, hv002) |> 
        lookfor(details = "full") |> 
        select(-c(levels:n_na)) |> 
        # For correctly viewing the range column in data dictionary
        convert_list_columns_to_character()
    }
  ))
nphr1_pre_tmp1
# Now we unnest the tibble and output the pooled data dictionary 
nphr1_pre_tmp2 <- nphr1_pre_tmp1 |> 
  select(c(ctr_name, svy_year, lookfor_idvars)) |> 
  unnest(cols = c(lookfor_idvars)) |> 
  arrange(pos)

# Convert and view the tibble as flextable
nphr1_pre_tmp2 |> 
  qflextable() |> 
  align(align = "left", part = "all") |> 
  autofit()
Table 30: Data dictionary of variables to be used for ID creation across the nphr rounds

| ctr_name | svy_year | pos | variable | label | col_type | missing | unique_values | range |
|---|---|---|---|---|---|---|---|---|
| Nepal | 1996 | 1 | hv001 | cluster number | dbl | 0 | 253 | 101 - 7502 |
| Nepal | 2001 | 1 | hv001 | cluster number | dbl | 0 | 251 | 101 - 7502 |
| Nepal | 2006 | 1 | hv001 | cluster number | dbl | 0 | 260 | 101 - 7502 |
| Nepal | 2011 | 1 | hv001 | cluster number | dbl | 0 | 289 | 101 - 7502 |
| Nepal | 2016 | 1 | hv001 | cluster number | dbl | 0 | 383 | 1 - 383 |
| Nepal | 2022 | 1 | hv001 | cluster number | dbl | 0 | 476 | 1 - 476 |
| Nepal | 1996 | 2 | hv002 | household number | dbl | 0 | 488 | 1 - 774 |
| Nepal | 2001 | 2 | hv002 | household number | dbl | 0 | 605 | 1 - 9006 |
| Nepal | 2006 | 2 | hv002 | household number | dbl | 0 | 568 | 1 - 1319 |
| Nepal | 2011 | 2 | hv002 | household number | dbl | 0 | 636 | 1 - 1403 |
| Nepal | 2016 | 2 | hv002 | household number | dbl | 0 | 427 | 1 - 963 |
| Nepal | 2022 | 2 | hv002 | household number | dbl | 0 | 338 | 1 - 506 |

From the above table we can see that both hv001 and hv002 are of numeric class with no missing values. These variables can be used for preparing the ID variables after finding the number of digits in their largest value. Note that survey year is also a constituent ID variable of 4 digits and we need not check it.

# We thought to process the above nested tibble further by decomposing the 
# "range" col into min and max values using separate_wider_regex().
# However, we hit a roadblock as pattern did not identify the max values in 
# some nphr rounds correctly
nphr1_pre_tmp3 <- nphr1_pre_tmp0 |> 
  # Generate the summary stats for id vars
  mutate(skim_idvars = map(nphr_data, \(df) {
    df |> 
      select(hv001, hv002) |> 
      skim_without_charts()
  })) |> 
  # Pool the summary stats for all nphr rounds
  select(c(ctr_name, svy_year, skim_idvars)) |> 
  unnest(cols = c(skim_idvars)) |> 
  arrange(skim_variable, svy_year) |> 
  # Group and generate the max and min values for each variable
  group_by(variable = skim_variable) |> 
  summarize(
    min_val = min(numeric.p0),
    max_val = max(numeric.p100)
  ) |> 
  # calculate the num of digits in the maximum values
  mutate(
    max_digits = nchar(as.character(max_val))
  ) |>
  # add variable labels and relocate it after variable name
  bind_cols(vlabel = c("cluster number", "household number")) |>
  relocate(vlabel, .after = 1)

# Convert the tibble to flextable for easy viewing
nphr1_pre_tmp3 |>
  qflextable() |>
  align(align = "left", part = "all") |>
  autofit()
Table 31: The maximum length of constituent ID variables to be set across the nphr rounds

| variable | vlabel | min_val | max_val | max_digits |
|---|---|---|---|---|
| hv001 | cluster number | 1 | 7502 | 4 |
| hv002 | household number | 1 | 9006 | 4 |

The above table gives the required length of the constituent ID variables, so that we can correctly concatenate them to create the ID variables. The required length of each constituent variable is given in the max_digits column. Note that survey year is also a constituent ID variable of 4 digits.
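To illustrate why the fixed widths matter, the sketch below builds a household ID by zero-padding each constituent to the widths in the max_digits column (4 digits each) and compares it with a naive concatenation. The function name `make_hh_id` and the example values are hypothetical; the actual ID construction in the pooling script may differ.

```r
# Zero-pad survey year, cluster and household number to 4 digits each
make_hh_id <- function(svy_year, hv001, hv002) {
  sprintf("%04d%04d%04d",
          as.integer(svy_year), as.integer(hv001), as.integer(hv002))
}

make_hh_id(c(1996, 2022), c(101, 476), c(1, 506))
# "199601010001" "202204760506"

# Without padding, cluster 1 / household 11 and cluster 11 / household 1 collide
naive_a <- paste0(2016, 1, 11)
naive_b <- paste0(2016, 11, 1)
naive_a == naive_b

# With padding the two households stay distinct
make_hh_id(2016, 1, 11) != make_hh_id(2016, 11, 1)
```

The resulting ID is a 12-character string; keeping it as character avoids losing leading zeros or precision.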

Checking HH-level variables before harmonization

Here we check the ecological region and wealth quintile variables before harmonizing them. Note in Nepal 1996 and 2001 the wealth quintile variables are provided in separate datasets. Therefore we join those variables to the hh file before proceeding with the checking.
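The join mentioned above could look roughly like the following sketch. It assumes that the separate wealth index file identifies households with a key column named `whhid` that matches `hhid` in the household file; both key names and all values in the toy data are assumptions and should be checked against the actual files.

```r
# Toy household file and toy wealth index file (all values are made up)
hh <- data.frame(
  hhid  = c("     101  1", "     101  2", "     102  1"),
  hv009 = c(4, 6, 3)
)
wi <- data.frame(
  whhid    = c("     101  1", "     101  2", "     102  1"),
  wlthind5 = c(1, 3, 5)
)

# Left join: keep every household, attach the wealth quintile where available
hh_wi <- merge(hh, wi, by.x = "hhid", by.y = "whhid", all.x = TRUE)

# The join should neither drop nor duplicate households
stopifnot(nrow(hh_wi) == nrow(hh))

# Households left without a wealth quintile after the join
sum(is.na(hh_wi$wlthind5))
```

Checking the row count and the number of unmatched households right after the join catches key formatting problems, such as differing whitespace padding, early.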

Upon manually checking the full data dictionaries we find the variable names. Now we will check the data dictionary of these hh-level variables. Then we will check their value labels variable wise.

# We check the hh-level vars in all nphr datasets.
# First we create the data dictionary in nested tibble.
nphr1_pre_tmp1 <- nphr1_pre_tmp0 |>
  mutate(lookfor_hhvars = map(
    nphr_data,
    \(df) {
      df |> 
        # select the common independent variables
        select(
          matches("^wlthind5$|^hv270$"), 
          matches("shez|shreg1|shecoreg")
        ) |> 
        lookfor(details = "full") |> 
        select(-c(levels:n_na)) |> 
        # For correctly viewing the range column in data dictionary
        convert_list_columns_to_character()
    }
  ))
nphr1_pre_tmp1
# Now we unnest the tibble and refine the pooled data dictionary 
nphr1_pre_tmp2 <- nphr1_pre_tmp1 |> 
  select(c(svy_year, lookfor_hhvars)) |> 
  unnest(cols = c(lookfor_hhvars)) |> 
  arrange(pos, svy_year)

# Convert the tibble to flextable for easy viewing
nphr1_pre_tmp2 |> 
  qflextable() |> 
  align(align = "left", part = "all") |> 
  autofit()
Table 32: Data dictionary of hh-level variables across the nphr rounds

| svy_year | pos | variable | label | col_type | missing | unique_values | range |
|---|---|---|---|---|---|---|---|
| 1996 | 1 | wlthind5 | quintiles of wealth index | dbl+lbl | 0 | 5 | 1 - 5 |
| 2001 | 1 | wlthind5 | quintiles of wealth index | dbl+lbl | 0 | 5 | 1 - 5 |
| 2006 | 1 | hv270 | wealth index | dbl+lbl | 0 | 5 | 1 - 5 |
| 2011 | 1 | hv270 | wealth index | dbl+lbl | 0 | 5 | 1 - 5 |
| 2016 | 1 | hv270 | wealth index combined | dbl+lbl | 0 | 5 | 1 - 5 |
| 2022 | 1 | hv270 | wealth index combined | dbl+lbl | 0 | 5 | 1 - 5 |
| 1996 | 2 | shez | hh ecozone | dbl+lbl | 0 | 3 | 0 - 2 |
| 2001 | 2 | shreg1 | ecological region | dbl+lbl | 0 | 3 | 1 - 3 |
| 2006 | 2 | shreg1 | ecological zone | dbl+lbl | 0 | 3 | 1 - 3 |
| 2011 | 2 | shecoreg | ecological region | dbl+lbl | 0 | 3 | 1 - 3 |
| 2016 | 2 | shecoreg | ecological zone | dbl+lbl | 0 | 3 | 1 - 3 |
| 2022 | 2 | shecoreg | ecological region | dbl+lbl | 0 | 3 | 1 - 3 |

The above table gives an overall snapshot of the hh-level variables. All the variables are of labelled class and have the same number of unique values across all the nphr datasets. Note that the ecological region variable uses a different set of codes in the nphr 1996 dataset (0 - 2 instead of 1 - 3). Next, we compare the value labels of the variables across the nphr datasets.

Ecological region variable

Next, we check the value labels of the ecological region variable. The name of this variable differs across the nphr datasets. First we create a nested tibble of the value labels.

# Create the data dictionary in nested tibble
nphr1_pre_tmp1 <- nphr1_pre_tmp0 |> 
  mutate(lookfor_ecoreg = map(
    nphr_data,
    \(df) {
      df |> 
        select(matches("shez|shreg1|shecoreg")) |> 
        look_for() |> 
        lookfor_to_long_format() |> 
        select(value_labels)
    }
  ))
nphr1_pre_tmp1
# Now we unnest the tibble and refine the pooled data dictionary
nphr1_pre_tmp2 <- nphr1_pre_tmp1 |> 
  # First we select the required cols and unnest()
  select(c(ctr_name, svy_year, lookfor_ecoreg)) |> 
  unnest(cols = c(lookfor_ecoreg)) |> 
  # Next we make the num of value labels same across each round
  mutate(label_num = parse_number(value_labels)) |> 
  complete(ctr_name, svy_year, label_num) |>
  # Next we create col of value labels for each survey round
  pivot_wider(
    names_from = svy_year, 
    values_from = value_labels,
    names_prefix = "nphr_"
  ) |>
  # Show the variable name in a col
  mutate(var_name = "Ecological region", .before = 2)

# Convert the tibble to flextable for easy viewing
nphr1_pre_tmp2 |> 
  qflextable() |> 
  align(align = "left", part = "all") |> 
  autofit()
Table 33: Data dictionary of ecological region variable across the nphr rounds

| ctr_name | var_name | label_num | nphr_1996 | nphr_2001 | nphr_2006 | nphr_2011 | nphr_2016 | nphr_2022 |
|---|---|---|---|---|---|---|---|---|
| Nepal | Ecological region | 0 | [0] mountain | | | | | |
| Nepal | Ecological region | 1 | [1] hill | [1] mountain | [1] mountain | [1] mountain | [1] mountain | [1] mountain |
| Nepal | Ecological region | 2 | [2] terai | [2] hill | [2] hill | [2] hill | [2] hill | [2] hill |
| Nepal | Ecological region | 3 | | [3] terai | [3] terai | [3] terai | [3] terai | [3] terai |

Clearly, the value label texts are the same for all nphr rounds. However, the value label codes are different in nphr 1996 (codes 0-2) when compared to the rest of the nphr rounds (codes 1-3). Therefore, we need to be mindful of this during harmonization.
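A possible way to handle the shifted codes is to add 1 to the nphr 1996 values before attaching a common set of labels. This is only a sketch: the function name `harmonise_ecoreg` is hypothetical, the input values are made up, and it assumes the 1996 codes map one-to-one onto the later codes as Table 33 suggests.

```r
harmonise_ecoreg <- function(code, svy_year) {
  # nphr 1996 stores the regions as 0-2; later rounds use 1-3
  shifted <- ifelse(svy_year == 1996, code + 1, code)
  factor(shifted, levels = 1:3, labels = c("mountain", "hill", "terai"))
}

# Two households from 1996 and two from 2011 (made-up values)
harmonise_ecoreg(
  code     = c(0, 2, 1, 3),
  svy_year = c(1996, 1996, 2011, 2011)
)
```

Note how code 1 means “hill” in 1996 but “mountain” in 2011; pooling the raw codes without the shift would silently misclassify the 1996 households.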

Wealth index quintile variable

Next, we check the value labels of the household wealth quintile variable. The name of this variable differs across the nphr datasets. First we create a nested tibble of the value labels.

# Create the data dictionary in nested tibble
nphr1_pre_tmp1 <- nphr1_pre_tmp0 |> 
  mutate(lookfor_wiqt = map(
    nphr_data,
    \(df) {
      df |> 
        select(matches("^wlthind5$|^hv270$")) |> 
        look_for() |> 
        lookfor_to_long_format() |> 
        select(value_labels)
    }
  ))
nphr1_pre_tmp1
# Now we unnest the tibble and refine the pooled data dictionary
nphr1_pre_tmp2 <- nphr1_pre_tmp1 |> 
  # First we select the required cols and unnest()
  select(c(ctr_name, svy_year, lookfor_wiqt)) |> 
  unnest(cols = c(lookfor_wiqt)) |> 
  # Next we make the num of value labels same across each round
  mutate(label_num = parse_number(value_labels)) |> 
  complete(ctr_name, svy_year, label_num) |>
  # Next we create col of value labels for each survey round
  pivot_wider(
    names_from = svy_year, 
    values_from = value_labels,
    names_prefix = "nphr_"
  ) |>
  # Show the variable name in a col
  mutate(var_name = "Wealth index quintiles", .before = 2)

# Convert the tibble to flextable for easy viewing
nphr1_pre_tmp2 |> 
  qflextable() |> 
  align(align = "left", part = "all") |> 
  autofit()
Table 34: Data dictionary of wealth quintiles across the nphr rounds

| ctr_name | var_name | label_num | nphr_1996 | nphr_2001 | nphr_2006 | nphr_2011 | nphr_2016 | nphr_2022 |
|---|---|---|---|---|---|---|---|---|
| Nepal | Wealth index quintiles | 1 | [1] lowest quintile | [1] lowest quintile | [1] poorest | [1] poorest | [1] poorest | [1] poorest |
| Nepal | Wealth index quintiles | 2 | [2] second quintile | [2] second quintile | [2] poorer | [2] poorer | [2] poorer | [2] poorer |
| Nepal | Wealth index quintiles | 3 | [3] middle quintile | [3] middle quintile | [3] middle | [3] middle | [3] middle | [3] middle |
| Nepal | Wealth index quintiles | 4 | [4] fourth quintile | [4] fourth quintile | [4] richer | [4] richer | [4] richer | [4] richer |
| Nepal | Wealth index quintiles | 5 | [5] highest quintile | [5] highest quintile | [5] richest | [5] richest | [5] richest | [5] richest |

Clearly, the value label codes are same in all nphr rounds. However, the value label texts are different in nphr 1996 and 2001, compared to the nphr 2006, 2011, 2016 and 2022 rounds. Therefore, we need to be mindful of this during harmonization.

Nepal PR dataset use for family structure variables creation

Checking the ID variables before harmonization

Here we check the formatting of the constituent variables with which we will prepare the ID variables for the pooled Nepal person recode (pr) dataset. We will use the following constituent variables for creating the ID variables for the pooled dataset:

# We check the var type of ID vars in all nppr datasets.
# First we create a data dictionary of the nppr datasets in nested tibble.
nppr1_pre_tmp1 <- nppr1_pre_tmp0 |>
  mutate(lookfor_idvars = map(nppr_data, \(df) {
    df |> 
      select(hv001, hv002, hvidx) |> 
      lookfor(details = "full") |> 
      select(-c(levels:n_na)) |> 
      # For correctly viewing the range column in data dictionary
      convert_list_columns_to_character()
  }))
nppr1_pre_tmp1
# Now we unnest the tibble and output the pooled data dictionary 
nppr1_pre_tmp2 <- nppr1_pre_tmp1 |> 
  select(c(ctr_name, svy_year, lookfor_idvars)) |> 
  unnest(cols = c(lookfor_idvars)) |> 
  arrange(pos)

# Convert and view the tibble as flextable
nppr1_pre_tmp2 |> 
  qflextable() |> 
  align(align = "left", part = "all") |> 
  autofit()
Table 35: Data dictionary of variables to be used for ID creation across the nppr rounds

| ctr_name | svy_year | pos | variable | label | col_type | missing | unique_values | range |
|---|---|---|---|---|---|---|---|---|
| Nepal | 1996 | 1 | hv001 | cluster number | dbl | 0 | 253 | 101 - 7502 |
| Nepal | 2001 | 1 | hv001 | cluster number | dbl | 0 | 251 | 101 - 7502 |
| Nepal | 2006 | 1 | hv001 | cluster number | dbl | 0 | 260 | 101 - 7502 |
| Nepal | 2011 | 1 | hv001 | cluster number | dbl | 0 | 289 | 101 - 7502 |
| Nepal | 2016 | 1 | hv001 | cluster number | dbl | 0 | 383 | 1 - 383 |
| Nepal | 2022 | 1 | hv001 | cluster number | dbl | 0 | 476 | 1 - 476 |
| Nepal | 1996 | 2 | hv002 | household number | dbl | 0 | 488 | 1 - 774 |
| Nepal | 2001 | 2 | hv002 | household number | dbl | 0 | 605 | 1 - 9006 |
| Nepal | 2006 | 2 | hv002 | household number | dbl | 0 | 568 | 1 - 1319 |
| Nepal | 2011 | 2 | hv002 | household number | dbl | 0 | 636 | 1 - 1403 |
| Nepal | 2016 | 2 | hv002 | household number | dbl | 0 | 427 | 1 - 963 |
| Nepal | 2022 | 2 | hv002 | household number | dbl | 0 | 338 | 1 - 506 |
| Nepal | 1996 | 3 | hvidx | line number | dbl | 0 | 42 | 1 - 42 |
| Nepal | 2001 | 3 | hvidx | line number | dbl | 0 | 29 | 1 - 29 |
| Nepal | 2006 | 3 | hvidx | line number | dbl | 0 | 30 | 1 - 30 |
| Nepal | 2011 | 3 | hvidx | line number | dbl | 0 | 31 | 1 - 31 |
| Nepal | 2016 | 3 | hvidx | line number | dbl | 0 | 38 | 1 - 38 |
| Nepal | 2022 | 3 | hvidx | line number | dbl | 0 | 26 | 1 - 26 |

The table above shows that all three constituent ID variables are numeric and have no missing values. They can therefore be used directly to prepare the ID variables, once we know the number of digits in their largest values. Survey year is also a constituent ID variable, but it always has 4 digits, so we need not check it.
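The "range" column in the table above is stored as text (for example "101 - 7502"). As a minimal sketch, assuming every entry has the form "min - max" with non-negative whole numbers, the two values can be separated with a base R regular expression. The vector range_chr is a hypothetical stand-in holding a few ranges copied from the table, and all object names are illustrative.

```r
# A few "range" strings copied from the data dictionary (illustrative input)
range_chr <- c("101 - 7502", "1 - 383", "1 - 9006", "1 - 42")

# Capture the digits before and after the separating hyphen
pattern <- "^([0-9]+)[[:space:]]*-[[:space:]]*([0-9]+)$"
parts <- regmatches(range_chr, regexec(pattern, range_chr))

# Element 1 of each match is the full string; elements 2 and 3 are the groups.
# A non-matching string yields NA rather than an error.
range_tbl <- data.frame(
  range   = range_chr,
  min_val = as.numeric(vapply(parts, `[`, character(1), 2)),
  max_val = as.numeric(vapply(parts, `[`, character(1), 3))
)

# Number of digits needed to hold the largest value
range_tbl$max_digits <- nchar(as.character(range_tbl$max_val))
range_tbl
```

Anchoring the pattern at both ends means a malformed range shows up as NA in min_val and max_val instead of being silently mis-split.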

# We first tried to split the "range" col of the above data dictionary into 
# min and max values using separate_wider_regex(). However, the pattern did 
# not identify the max values correctly in some nppr rounds, so we compute 
# the min and max values from the skimr summary stats instead.
nppr1_pre_tmp3 <- nppr1_pre_tmp0 |> 
  # Generate the summary stats for id vars
  mutate(skim_idvars = map(nppr_data, \(df) {
    df |> 
      select(hv001, hv002, hvidx) |> 
      skim_without_charts()
  })) |> 
  # Pool the summary stats for all nppr rounds
  select(c(ctr_name, svy_year, skim_idvars)) |> 
  unnest(cols = c(skim_idvars)) |> 
  arrange(skim_variable, svy_year) |> 
  # Group and generate the max and min values for each variable
  group_by(variable = skim_variable) |> 
  summarize(
    min_val = min(numeric.p0),
    max_val = max(numeric.p100)
  ) |> 
  # calculate the num of digits in the maximum values
  mutate(
    max_digits = nchar(as.character(max_val))
  ) |>
  # add variable labels and relocate it after variable name
  bind_cols(vlabel = c("cluster number", "household number", "Persons line number")) |>
  relocate(vlabel, .after = 1)

# Convert the tibble to flextable for easy viewing
nppr1_pre_tmp3 |>
  qflextable() |>
  align(align = "left", part = "all") |>
  autofit()
Table 36: The maximum length of constituent ID variables to be set across the nppr rounds

| variable | vlabel | min_val | max_val | max_digits |
|---|---|---|---|---|
| hv001 | cluster number | 1 | 7502 | 4 |
| hv002 | household number | 1 | 9006 | 4 |
| hvidx | Persons line number | 1 | 42 | 2 |

The above table gives the length to be set for each constituent ID variable, so that we can concatenate them correctly when creating the ID variables. The required lengths are given in the max_digits column. Note that survey year is also a constituent ID variable, with 4 digits.
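As a minimal sketch of how these lengths could be used, the code below zero-pads each constituent to its max_digits width and concatenates them. The data frame pr_toy and the names hh_id and person_id are hypothetical and only stand in for the pooled data; the actual ID construction in the pooling script may differ.

```r
# Toy rows standing in for the pooled nppr data (values are illustrative)
pr_toy <- data.frame(
  svy_year = c(1996L, 2001L, 2022L),
  hv001    = c(101L, 7502L, 476L),
  hv002    = c(1L, 9006L, 506L),
  hvidx    = c(1L, 29L, 26L)
)

# Zero-pad to the widths from max_digits: year 4, cluster 4, household 4
pr_toy$hh_id <- sprintf(
  "%04d%04d%04d", pr_toy$svy_year, pr_toy$hv001, pr_toy$hv002
)
# Append the 2-digit line number to identify each household member
pr_toy$person_id <- sprintf("%s%02d", pr_toy$hh_id, pr_toy$hvidx)
pr_toy

# Why fixed widths matter: without padding, different pairs can collide
unpadded <- c(paste0(1, 11), paste0(11, 1))
padded   <- c(sprintf("%04d%04d", 1L, 11L), sprintf("%04d%04d", 11L, 1L))
```

With fixed widths every person ID has 14 characters, so each constituent can be recovered from its position in the string.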

Checking Family structure variables before harmonization

Here we check the family structure variables before harmonizing them. The variable names were collected by manually checking the full data dictionaries. We now check the data dictionary of these hh-level variables, focusing on the variable types.

# We check the family structure vars in all nppr datasets.
# First we create the data dictionary in nested tibble.
nppr1_pre_tmp1 <- nppr1_pre_tmp0 |>
  mutate(lookfor_famstrvars = map(nppr_data, \(df) {
    df |> 
      # select the family structure variables
      select(c(hv101, hv102, hv103, hv104, hv105)) |> 
      lookfor(details = "full") |> 
      select(-c(levels:n_na)) |> 
      # For correctly viewing the range column in data dictionary
      convert_list_columns_to_character()
  }))
nppr1_pre_tmp1
# Now we unnest the tibble and refine the pooled data dictionary 
nppr1_pre_tmp2 <- nppr1_pre_tmp1 |> 
  select(c(svy_year, lookfor_famstrvars)) |> 
  unnest(cols = c(lookfor_famstrvars)) |> 
  arrange(pos, svy_year)

# Convert the tibble to flextable for easy viewing
nppr1_pre_tmp2 |> 
  qflextable() |> 
  align(align = "left", part = "all") |> 
  autofit()
Table 37: Data dictionary of family structure vars across the nppr rounds

| svy_year | pos | variable | label | col_type | missing | unique_values | range |
|---|---|---|---|---|---|---|---|
| 1996 | 1 | hv101 | relationship to head | dbl+lbl | 2 | 13 | 1 - 12 |
| 2001 | 1 | hv101 | relationship to head | dbl+lbl | 0 | 12 | 1 - 12 |
| 2006 | 1 | hv101 | relationship to head | dbl+lbl | 0 | 14 | 1 - 15 |
| 2011 | 1 | hv101 | relationship to head | dbl+lbl | 0 | 12 | 1 - 12 |
| 2016 | 1 | hv101 | relationship to head | dbl+lbl | 0 | 14 | 1 - 15 |
| 2022 | 1 | hv101 | relationship to head | dbl+lbl | 0 | 14 | 1 - 15 |
| 1996 | 2 | hv102 | usual resident | dbl+lbl | 0 | 2 | 0 - 1 |
| 2001 | 2 | hv102 | usual resident | dbl+lbl | 0 | 2 | 0 - 1 |
| 2006 | 2 | hv102 | usual resident | dbl+lbl | 0 | 2 | 0 - 1 |
| 2011 | 2 | hv102 | usual resident | dbl+lbl | 0 | 2 | 0 - 1 |
| 2016 | 2 | hv102 | usual resident | dbl+lbl | 0 | 2 | 0 - 1 |
| 2022 | 2 | hv102 | usual resident (househods with no de jure members) | dbl+lbl | 0 | 2 | 0 - 1 |
| 1996 | 3 | hv103 | slept last night | dbl+lbl | 8 | 3 | 0 - 1 |
| 2001 | 3 | hv103 | slept last night | dbl+lbl | 0 | 2 | 0 - 1 |
| 2006 | 3 | hv103 | slept last night | dbl+lbl | 0 | 2 | 0 - 1 |
| 2011 | 3 | hv103 | slept last night | dbl+lbl | 0 | 2 | 0 - 1 |
| 2016 | 3 | hv103 | slept last night | dbl+lbl | 0 | 2 | 0 - 1 |
| 2022 | 3 | hv103 | stayed last night | dbl+lbl | 0 | 2 | 0 - 1 |
| 1996 | 4 | hv104 | sex of household member | dbl+lbl | 0 | 2 | 1 - 2 |
| 2001 | 4 | hv104 | sex of household member | dbl+lbl | 0 | 2 | 1 - 2 |
| 2006 | 4 | hv104 | sex of household member | dbl+lbl | 0 | 2 | 1 - 2 |
| 2011 | 4 | hv104 | sex of household member | dbl+lbl | 0 | 2 | 1 - 2 |
| 2016 | 4 | hv104 | sex of household member | dbl+lbl | 0 | 2 | 1 - 2 |
| 2022 | 4 | hv104 | sex of household member | dbl+lbl | 0 | 2 | 1 - 2 |
| 1996 | 5 | hv105 | age of household members | dbl+lbl | 0 | 99 | 0 - 98 |
| 2001 | 5 | hv105 | age of household members | dbl+lbl | 2 | 100 | 0 - 98 |
| 2006 | 5 | hv105 | age of household members | dbl+lbl | 0 | 97 | 0 - 96 |
| 2011 | 5 | hv105 | age of household members | dbl+lbl | 0 | 96 | 0 - 95 |
| 2016 | 5 | hv105 | age of household members | dbl+lbl | 0 | 96 | 0 - 95 |
| 2022 | 5 | hv105 | age of household members | dbl+lbl | 0 | 97 | 0 - 98 |

The above table gives an overall snapshot of the family structure variables. Interestingly, all the variables, including the age of hh members (a continuous var), are of labelled class. The relationship to head and de facto resident variables have a few missing values in nppr 1996 (2 and 8 respectively), and the age variable has 2 missing values in nppr 2001. Note that, of the three variables of interest hv101-hv103, two (hv101 and hv103) have different numbers of unique values across the nppr rounds. Next, we compare the value labels of the individual variables across the nppr datasets.

hv101 - Relationship to head

Next, we check the value labels of the relationship to the household head variable. First we create a nested tibble of the value labels.

# Create the data dictionary in nested tibble
nppr1_pre_tmp1 <- nppr1_pre_tmp0 |> 
  mutate(lookfor_hv101 = map(nppr_data, \(df) {
    df |> 
      select(hv101) |> 
      look_for() |> 
      lookfor_to_long_format() |> 
      select(value_labels)
  }))
nppr1_pre_tmp1
# Now we unnest the tibble and refine the pooled data dictionary
nppr1_pre_tmp2 <- nppr1_pre_tmp1 |> 
  # First we select the required cols and unnest()
  select(c(ctr_name, svy_year, lookfor_hv101)) |> 
  unnest(cols = c(lookfor_hv101)) |> 
  # Next we make the num of value labels same across each round
  mutate(label_num = parse_number(value_labels)) |> 
  complete(ctr_name, svy_year, label_num) |>
  # Next we create col of value labels for each survey round
  pivot_wider(
    names_from = svy_year, 
    values_from = value_labels,
    names_prefix = "nppr_"
  ) |>
  # Show the variable name in a col
  mutate(var_name = "hv101", .before = 2)

# Convert the tibble to flextable for easy viewing
nppr1_pre_tmp2 |> 
  qflextable() |> 
  align(align = "left", part = "all") |> 
  autofit()
Table 38: Data dictionary of relationship to head variable across the nppr rounds

ctr_name

var_name

label_num

nppr_1996

nppr_2001

nppr_2006

nppr_2011

nppr_2016

nppr_2022

Nepal

hv101

1

[1] head

[1] head

[1] head

[1] head

[1] head

[1] head

Nepal

hv101

2

[2] wife or husband

[2] wife or husband

[2] wife or husband

[2] wife or husband

[2] wife or husband

[2] wife or husband

Nepal

hv101

3

[3] son/daughter

[3] son/daughter

[3] son/daughter

[3] son/daughter

[3] son/daughter

[3] son/daughter

Nepal

hv101

4

[4] son/daughter-in-law

[4] son/daughter-in-law

[4] son/daughter-in-law

[4] son/daughter-in-law

[4] son/daughter-in-law

[4] son/daughter-in-law

Nepal

hv101

5

[5] grandchild

[5] grandchild

[5] grandchild

[5] grandchild

[5] grandchild

[5] grandchild

Nepal

hv101

6

[6] parent

[6] parent

[6] parent

[6] parent

[6] parent

[6] parent

Nepal

hv101

7

[7] parent-in-law

[7] parent-in-law

[7] parent-in-law

[7] parent-in-law

[7] parent-in-law

[7] parent-in-law

Nepal

hv101

8

[8] brother/sister

[8] brother/sister

[8] brother/sister

[8] brother/sister

[8] brother/sister

[8] brother/sister

Nepal

hv101

9

[9] co-spouse

[9] co-spouse

[9] co-spouse

[9] co-spouse

[9] co-spouse

[9] co-spouse

Nepal

hv101

10

[10] other relative

[10] other relative

[10] other relative

[10] other relative

[10] other relative

[10] other relative

Nepal

hv101

11

[11] adopted/foster child

[11] adopted/foster child

[11] adopted/foster child

[11] adopted/foster child

[11] adopted/foster child

[11] adopted/foster child

Nepal

hv101

12

[12] not related

[12] not related

[12] not related

[12] not related

[12] not related

[12] not related

Nepal

hv101

13

[13] niece/nephew by blood

[13] niece/nephew by blood

[13] niece/nephew

[13] niece/nephew by blood

Nepal

hv101

14

[14] niece/nephew by marriage

[14] niece/nephew by marriage

[14] niece/nephew by marriage

[14] niece/nephew by marriage

Nepal

hv101

15

[15] brother-in-law/sister-in-law

[15] brother/sister in law

[15] brother/sister in law

Nepal

hv101

98

[98] dk

[98] dk

[98] dk

[98] don't know

[98] don't know

[98] don't know

The above table shows that the value label texts vary across the nppr rounds, and that codes 13 to 15 are labelled only in some rounds. To harmonize the relationship to head variable we can use the following value labels:

  • 1 head
  • 2 spouse
  • 3 child
  • 4 child-in-law
  • 5 grandchild
  • 6 parent
  • 7 parent-in-law
  • 8 sibling
  • 9 others

Here, we merge the “wife or husband” and “co-spouse” categories into the “spouse” category, and the “son/daughter” and “adopted/foster child” categories into the “child” category.
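The recoding described above can be sketched with a named lookup vector in base R. The mapping of codes 1 to 9 and 11 follows the text; sending codes 10 and 12 to 15 to "others" and leaving code 98 (don't know) as missing are assumptions made for this illustration, not decisions documented above. The object names (hv101_map, hv101_raw, hv101_h) are hypothetical.

```r
# Source code -> harmonized code.
# ASSUMPTION: 10 and 12-15 are grouped under 9 "others"; 98 is left unmapped.
hv101_map <- c(
  `1` = 1, `2` = 2, `3` = 3, `4` = 4, `5` = 5, `6` = 6, `7` = 7, `8` = 8,
  `9` = 2,   # co-spouse -> spouse
  `10` = 9,  # other relative -> others
  `11` = 3,  # adopted/foster child -> child
  `12` = 9, `13` = 9, `14` = 9, `15` = 9
)

harmonized_labels <- c(
  "head", "spouse", "child", "child-in-law", "grandchild",
  "parent", "parent-in-law", "sibling", "others"
)

# Toy vector of raw hv101 codes (illustrative values only)
hv101_raw <- c(1, 2, 9, 3, 11, 13, 15, 98, NA)

# Codes absent from the lookup (here 98 and NA) become NA
hv101_code <- unname(hv101_map[as.character(hv101_raw)])
hv101_h <- factor(hv101_code, levels = 1:9, labels = harmonized_labels)
table(hv101_h, useNA = "ifany")
```

Keeping the mapping in one lookup vector makes the grouping decisions easy to review and change in a single place.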

hv102 - de jure/usual resident

Next, we check the value labels of the de jure resident variable. This variable indicates whether a household member is a usual resident of the household. First we create a nested tibble of the value labels.

# Create the data dictionary in nested tibble
nppr1_pre_tmp1 <- nppr1_pre_tmp0 |> 
  mutate(lookfor_hv102 = map(nppr_data, \(df) {
    df |> 
      select(hv102) |> 
      look_for() |> 
      lookfor_to_long_format() |> 
      select(value_labels)
  }))
nppr1_pre_tmp1
# Now we unnest the tibble and refine the pooled data dictionary
nppr1_pre_tmp2 <- nppr1_pre_tmp1 |> 
  # First we select the required cols and unnest()
  select(c(ctr_name, svy_year, lookfor_hv102)) |> 
  unnest(cols = c(lookfor_hv102)) |> 
  # Next we make the num of value labels same across each round
  mutate(label_num = parse_number(value_labels)) |> 
  complete(ctr_name, svy_year, label_num) |>
  # Next we create col of value labels for each survey round
  pivot_wider(
    names_from = svy_year, 
    values_from = value_labels,
    names_prefix = "nppr_"
  ) |>
  # Show the variable name in a col
  mutate(var_name = "hv102", .before = 2)

# Convert the tibble to flextable for easy viewing
nppr1_pre_tmp2 |> 
  qflextable() |> 
  align(align = "left", part = "all") |> 
  autofit()
Table 39: Data dictionary of the de jure resident variable across the nppr rounds

| ctr_name | var_name | label_num | nppr_1996 | nppr_2001 | nppr_2006 | nppr_2011 | nppr_2016 | nppr_2022 |
|---|---|---|---|---|---|---|---|---|
| Nepal | hv102 | 0 | [0] no | [0] no | [0] no | [0] no | [0] no | [0] no |
| Nepal | hv102 | 1 | [1] yes | [1] yes | [1] yes | [1] yes | [1] yes | [1] yes |

The above table shows that hv102 has the same value label texts and codes across the nppr rounds. Therefore, we can use this variable directly after converting to factor type.
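A minimal sketch of the factor conversion, using base R on a toy vector so that it runs without the survey files. The vector hv102_raw is hypothetical. With the labelled columns in the actual data, haven::as_factor() or labelled::to_factor() would be the more natural choices, since they read the labels from the variable itself.

```r
# Toy 0/1 codes standing in for hv102 (illustrative values only)
hv102_raw <- c(1, 0, 1, 1, NA)

# Attach the labels shown in the table: 0 = no, 1 = yes
hv102_fct <- factor(hv102_raw, levels = c(0, 1), labels = c("no", "yes"))

# Quick check of the converted variable, including any missing values
table(hv102_fct, useNA = "ifany")
```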

hv103 - de facto resident

Next, we check the value labels of the de facto resident variable. In DHS, this variable indicates whether a household member slept in the household last night. First we create a nested tibble of the value labels.

# Create the data dictionary in nested tibble
nppr1_pre_tmp1 <- nppr1_pre_tmp0 |> 
  mutate(lookfor_hv103 = map(nppr_data, \(df) {
    df |> 
      select(hv103) |> 
      look_for() |> 
      lookfor_to_long_format() |> 
      select(value_labels)
  }))
nppr1_pre_tmp1
# Now we unnest the tibble and refine the pooled data dictionary
nppr1_pre_tmp2 <- nppr1_pre_tmp1 |> 
  # First we select the required cols and unnest()
  select(c(ctr_name, svy_year, lookfor_hv103)) |> 
  unnest(cols = c(lookfor_hv103)) |> 
  # Next we make the num of value labels same across each round
  mutate(label_num = parse_number(value_labels)) |> 
  complete(ctr_name, svy_year, label_num) |>
  # Next we create col of value labels for each survey round
  pivot_wider(
    names_from = svy_year, 
    values_from = value_labels,
    names_prefix = "nppr_"
  ) |>
  # Show the variable name in a col
  mutate(var_name = "hv103", .before = 2)

# Convert the tibble to flextable for easy viewing
nppr1_pre_tmp2 |> 
  qflextable() |> 
  align(align = "left", part = "all") |> 
  autofit()
Table 40: Data dictionary of the de facto resident variable across the nppr rounds

| ctr_name | var_name | label_num | nppr_1996 | nppr_2001 | nppr_2006 | nppr_2011 | nppr_2016 | nppr_2022 |
|---|---|---|---|---|---|---|---|---|
| Nepal | hv103 | 0 | [0] no | [0] no | [0] no | [0] no | [0] no | [0] no |
| Nepal | hv103 | 1 | [1] yes | [1] yes | [1] yes | [1] yes | [1] yes | [1] yes |

The above table shows that hv103 has the same value label texts and codes across the nppr rounds. Therefore, we can use this variable directly after converting to factor type.
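The visual comparison above can also be done programmatically. The sketch below assumes the value labels of each round are available as named numeric vectors (in the actual data, something like labelled::val_labels() on the hv103 column could supply them); the list hv103_labels is typed in by hand from the table and is purely illustrative.

```r
# Value labels per round, typed in from the table above (illustrative)
hv103_labels <- list(
  nppr_1996 = c(no = 0, yes = 1),
  nppr_2001 = c(no = 0, yes = 1),
  nppr_2006 = c(no = 0, yes = 1),
  nppr_2011 = c(no = 0, yes = 1),
  nppr_2016 = c(no = 0, yes = 1),
  nppr_2022 = c(no = 0, yes = 1)
)

# Compare every round against the first one; identical() checks both
# the codes and the label texts
reference <- hv103_labels[[1]]
same_as_first <- vapply(
  hv103_labels, function(x) identical(x, reference), logical(1)
)
all(same_as_first)

# The same check flags a change in label text, as seen for code 98 of hv101
text_differs <- !identical(c(dk = 98), c(`don't know` = 98))
```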

START FROM HERE

TASK:

  • Handling multiple births in death scarring vars may not be necessary.
  • Preceding birth interval construction has changed with DHS-7. We could re-construct it.

TO BE CONTINUED …
