AFDHS data pooling pre-checks

Getting started

First we load the required packages.

easypackages::libraries(
  # Data i/o
  "here",                 # relative file path
  "rio",                  # file import-export
  
  # Data manipulation
  "janitor",              # data cleaning fns
  "haven",                # stata, sas, spss data io
  "labelled",             # var labelling
  "readxl",               # excel sheets
  # "scales",               # to change formats and units
  "skimr",                # quick data summary
  "broom",                # view model results
  
  # Data analysis
  "DHS.rates",            # demographic rates for dhs-like surveys
  "GeneralOaxaca",        # BO decomposition for non-linear
  "survey",               # apply survey weights
  
  # Analysis output
  "gt",
  # "modelsummary",          # output summary tables
  "gtsummary",            # output summary tables
  "flextable",            # creating tables from objects
  "officer",              # editing in office docs
  
  # R graph related packages
  "ggstats",
  "RColorBrewer",
  # "scales",
  "patchwork",
  
  # Misc packages
  "tidyverse",            # Data manipulation iron man
  "tictoc"                # Code timing
)

Next we turn off scientific notation.

options(scipen = 999)
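As a quick illustration of what this option changes, the sketch below formats the same number under R's default penalty and under `scipen = 999`. The value 100000000 is an arbitrary example, not a figure from the survey data.

```r
# Temporarily use R's default penalty, remembering the previous setting
old_opts <- options(scipen = 0)
default_fmt <- format(100000000)   # scientific notation: "1e+08"

# A large penalty makes R prefer fixed notation
options(scipen = 999)
fixed_fmt <- format(100000000)     # fixed notation: "100000000"

# Restore whatever setting was active before this sketch
options(old_opts)

c(default = default_fmt, fixed = fixed_fmt)
```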

Next we set the default gtsummary print engine for tables.

theme_gtsummary_printer(print_engine = "flextable")

Now we set the flextable output defaults.

set_flextable_defaults(
  font.size = 11,
  text.align = "left",
  big.mark = "",
  background.color = "white",
  table.layout = "autofit",
  theme_fun = theme_vanilla
)

Document introduction

Here we document the codes and labels of variables in the Afghanistan Demographic and Health Survey (DHS) datasets. Note that as of June 2025 only one round, the Afghanistan 2015 DHS, is available. Although an Afghanistan DHS 2020 was scheduled, it was delayed during COVID-19 and the data were never released. Also, the Afghanistan DHS round will be used only in the pooled South Asia DHS dataset. Therefore, we check the variable labels and codes of the 2015 round and, where required, compare them with India’s DHS variables. Based on these checks, we run the data harmonization code in “daprep-v01_afdhs.R”.

We pool the following Afghanistan DHS surveys:

# Creating the table of surveys to be used for pooling
afbr1_tmp_intro |> 
  mutate(n_births = prettyNum(n_births, big.mark = ",")) |> 
  select(c(ctr_name, svy_year, n_births)) |> 
  # Join vars from afir_tmp_intro
  left_join(
    afir1_tmp_intro |> 
      mutate(n_women = prettyNum(n_women, big.mark = ",")) |> 
      select(c(year, n_women)),
    by = join_by(svy_year == year)
  ) |> 
  # Join vars from afhr_tmp_intro
  left_join(
    afhr1_tmp_intro |> 
      mutate(n_households = prettyNum(n_households, big.mark = ",")) |> 
      select(svy_year, n_households),
    by = join_by(svy_year)
  ) |> 
  # Join vars from afpr_tmp_intro
  left_join(
    afpr1_tmp_intro |> 
      mutate(n_persons = prettyNum(n_persons, big.mark = ",")) |> 
      select(svy_year, n_persons),
    by = join_by(svy_year)
  ) |> 
  # convert nested tibble to simple tibble
  unnest(cols = c()) |> 
  mutate(
    ccode = row_number(), 
    .before = ctr_name
  ) |> 
  # convert to flextable object
  qflextable() |> 
  align(align = "left", part = "all") |> 
  autofit()

Table 1: Afghanistan DHS datasets to be used for pooling and their sample sizes

ccode	ctr_name	svy_year	n_births	n_women	n_households	n_persons
1	Afghanistan	2015	125,715	29,461	24,395	203,708

We use the following variables for the pooled data analysis:

Dependent variable
- infantd = Index child died during infancy period (0-11 months)
Main Independent variable
- sibsurv_nmv = Survival status of preceding child (Death scarring)
- binterval_3c_nmv_opp = Birth interval preceding to index child
Independent variables [CHILD LEVEL]
- cyob10y_opp = Birth cohort of index child
- bord_c = Birth order of index child
- sex_fm = Sex of index child
- season = Season during birth
Independent variables [MOTHER/PARENT LEVEL]
- ~~myob_opp = Birth cohort of mother~~
- macb_c_opp = Mother’s age during birth of index child
- medu_opp = Mother’s Level of education
- fedu_opp = Father’s level of education
Independent variables [HOUSEHOLD LEVEL]
- religion = Religion
- nat_lang = Native language of respondent
- wi_qt_opp = Household wealth quintile
- ~~hhgen_2c_opp = Generations in household~~
- hhstruc_opp = Household structure
- head_sex_fm = Sex of HH head
Independent variables [COMMUNITY LEVEL]
- por = Place of residence of the household
- ecoreg = Ecological region

Note: (a) Struck-through names indicate variables that are not included.

Data import

We directly import the nested tibbles here. The code for dataset preparation is in the “daprep-v01_afdhs.R” script file.

# Here we import the tibbles to be used for dataset checking
# Import the afbr nested tibble
afbr1_pre_tmp0 <- read_rds(file = here("website_data", "afbr1_nest0.rds"))
# Import the afhr nested tibble
afhr1_pre_tmp0 <- read_rds(file = here("website_data", "afhr1_nest0.rds"))
# Import the afpr nested tibble
afpr1_pre_tmp0 <- read_rds(file = here("website_data", "afpr1_nest0.rds"))

Afghanistan BR dataset used for variable creation

Checking the Women’s weight variable before harmonization

We check the formatting of the women’s weight variable, v005, before creating the pooled survey weight. For this we use labelled::look_for().

# First we create the data dictionary of v005 in nested tibble
afbr1_pre_tmp1 <- afbr1_pre_tmp0 |> 
  mutate(lookfor_v005 = map(afbr_data, \(df) {
    df |> 
      select(v005) |> 
      look_for(details = "full") |> 
      # For correctly viewing the range column in data dictionary
      convert_list_columns_to_character() |> 
      select(-c(levels:n_na))
  }))
afbr1_pre_tmp1

# Now we unnest the tibble and refine the pooled data dictionary 
afbr1_pre_tmp2 <- afbr1_pre_tmp1 |> 
  select(-c(unf, afbr_data, n_births)) |> 
  unnest(cols = c(lookfor_v005)) |> 
  select(-pos) 
# Convert and view the tibble as flextable
afbr1_pre_tmp2 |> 
  qflextable() |> 
  align(align = "left", part = "all") |> 
  autofit()

Table 2: Data dictionary of v005 variable across the afbr rounds

ctr_name	svy_year	variable	label	col_type	missing	unique_values	range
Afghanistan	2015	v005	women's individual sample weight (6 decimals)	dbl	0	953	25644 - 21472656

The women’s weight variable is of numeric class and has no missing values. Therefore, we need not reformat it and can use it directly to prepare the pooled survey weight.
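The label of v005 states that the weight is stored with six implied decimals, so the stored integers are usually divided by 1,000,000 before use. Below is a minimal sketch of that rescaling, using only the range endpoints from Table 2 as inputs. The name `wt` is an assumption for illustration; the actual pooled weight is built in the data preparation script.

```r
# Range endpoints of v005 reported in Table 2 (not full survey data)
v005 <- c(25644, 21472656)

# Rescale: the variable label states six implied decimal places
wt <- v005 / 1e6
wt
```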

Checking the ID variables before harmonization

Here we check the formatting of the variables from which we will prepare the ID variables for the pooled Afghanistan birth history recode (BR) dataset. We use the following constituent variables to create the ID variables for the pooled dataset:

# We check the var type of ID vars in all afbr datasets.
# First we create a data dictionary of the afbr datasets in nested tibble.
afbr1_pre_tmp1 <- afbr1_pre_tmp0 |>
  mutate(lookfor_idvars = map(afbr_data, \(df) {
    df |> 
      select(v001, v002, v003, bord, v021, v022, v023, v024) |> 
      lookfor(details = "full") |> 
      select(-c(levels:n_na)) |> 
      # For correctly viewing the range column in data dictionary
      convert_list_columns_to_character()
  }))
afbr1_pre_tmp1

# Now we unnest the tibble and output the pooled data dictionary 
afbr1_pre_tmp2 <- afbr1_pre_tmp1 |> 
  select(-c(unf, afbr_data, n_births)) |> 
  unnest(cols = c(lookfor_idvars)) |> 
  arrange(pos)

# Convert and view the tibble as flextable
afbr1_pre_tmp2 |> 
  qflextable() |> 
  align(align = "left", part = "all") |> 
  autofit()

Table 3: Data dictionary of variables to be used for ID creation across the afbr rounds

ctr_name	svy_year	pos	variable	label	col_type	unique_values	range
Afghanistan	2015	1	v001	cluster number	dbl	956	1 - 999
Afghanistan	2015	2	v002	household number	dbl	344	1 - 436
Afghanistan	2015	3	v003	respondent's line number	dbl	40	1 - 41
Afghanistan	2015	4	bord	birth order number	dbl	17	1 - 17
Afghanistan	2015	5	v021	primary sampling unit	dbl	956	1 - 999
Afghanistan	2015	6	v022	sample strata for sampling errors	dbl+lbl	63	1 - 68
Afghanistan	2015	7	v023	stratification used in sample design	dbl+lbl	63	1 - 68
Afghanistan	2015	8	v024	region	dbl+lbl	34	1 - 34

From the above we can see that v022, v023 and v024 are of labelled class (dbl+lbl), while the rest are of numeric class. We check v022 together with the numeric variables below and inspect the value labels of v023 and v024 separately. Note that although survey year is a constituent ID variable, we have not checked it, because it is by construction a 4-digit value.

Numeric ID variables check

First, let’s find the required length of each numeric ID variable by checking the maximum values of the constituent ID variables across the Afghanistan DHS datasets. Here we estimate the summary statistics of the numeric constituent variables using skim_without_charts().

# Check the summary stats for ID vars using skimr in each afbr dataset.
# First we estimate the summary stats using skim_without_charts().
afbr1_pre_tmp1 <- afbr1_pre_tmp0 |> 
  mutate(skim_id_num = map(afbr_data, function(df) {
    df |> 
      select(v001, v002, v003, bord, v021, v022) |> 
      skim_without_charts() |> 
      as_tibble() |> 
      select(-c(skim_type, n_missing, complete_rate)) |> 
      rename(
        variable = 1,
        mean = 2,
        sd = 3,
        min = 4,
        p25 = 5,
        p50 = 6,
        p75 = 7,
        max = 8
      )
  }))
afbr1_pre_tmp1

Next, we compare the summary statistics of the numeric variables, variable by variable.

# Now we unnest the nested tibble so that we can compare the variable length 
# across the afbr datasets.
afbr1_pre_tmp2 <- afbr1_pre_tmp1 |> 
  select(-c(unf, afbr_data, n_births)) |> 
  unnest(cols = c(skim_id_num)) |> 
  arrange(variable, svy_year) |> 
  # change the decimal places of selected variables
  mutate(
    mean = sprintf("%.1f", mean),
    sd = sprintf("%.1f", sd),
    p75 = sprintf("%.0f", p75)
  )
# Convert the tibble to flextable for easy viewing
afbr1_pre_tmp2 |> 
  qflextable() |> 
  align(align = "left", part = "all") |> 
  autofit()

Table 4: Summary statistics of the numeric ID variables

ctr_name	svy_year	variable	mean	sd	min	p25	p50	p75	max
Afghanistan	2015	bord	3.6	2.4	1	2	3	5	17
Afghanistan	2015	v001	499.3	285.1	1	260	477	749	999
Afghanistan	2015	v002	75.6	59.5	1	29	61	109	436
Afghanistan	2015	v003	3.1	2.8	1	2	2	2	41
Afghanistan	2015	v021	499.3	285.1	1	260	477	749	999
Afghanistan	2015	v022	41.8	18.2	1	31	46	56	68

Now we determine the length to which each numeric ID variable must be set, so that we can concatenate them correctly to create the ID variables. The required lengths are given in the max_digits column. Note that survey year, a 4-digit value, is also a constituent ID variable.

# Processing the above nested tibble further
afbr1_pre_tmp3 <- afbr1_pre_tmp2 |> 
  group_by(variable) |> 
  # find the minimum and maximum values across surveys 
  summarize(
    min_val = min(min),
    max_val = max(max)
  ) |> 
  mutate(
    # calculate the num of digits in the maximum values
    max_digits = nchar(as.character(max_val)),
    # convert char var to factor
    variable = fct(
      variable, 
      levels = c("v001", "v002", "v003", "bord", "v021", "v022")
    )
  ) |> 
  # sort the rows by factor levels 
  arrange(variable) |> 
  # add variable labels and relocate it after variable name.
  bind_cols(vlabel = c("cluster number", "household number", 
                       "respondent's line number", "birth order", 
                       "primary sampling unit", "sample strata for se")) |> 
  relocate(vlabel, .after = 1)

# Convert the tibble to flextable for easy viewing
afbr1_pre_tmp3 |> 
  qflextable() |> 
  align(align = "left", part = "all") |> 
  autofit()

Table 5: The maximum length of numeric variables to be set across the afbr rounds for concatenating the ID variables

variable	vlabel	min_val	max_val	max_digits
v001	cluster number	1	999	3
v002	household number	1	436	3
v003	respondent's line number	1	41	2
bord	birth order	1	17	2
v021	primary sampling unit	1	999	3
v022	sample strata for se	1	68	2
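The widths in Table 5 matter because each component has to be zero-padded to a fixed length before concatenation; otherwise different combinations can produce the same string. The sketch below illustrates the idea in base R. The choice and order of components in `birth_id` are assumptions for illustration only; the actual ID variables are defined in the data preparation script.

```r
# Field widths taken from the max_digits column of Table 5
id_widths <- c(v001 = 3, v002 = 3, v003 = 2, bord = 2)

# Toy records with illustrative values (not actual survey rows)
toy <- data.frame(
  svy_year = c(2015, 2015),
  v001 = c(1, 999),
  v002 = c(12, 436),
  v003 = c(2, 41),
  bord = c(1, 17)
)

# Left-pad an integer code with zeros to a fixed width
pad_zero <- function(x, width) {
  formatC(as.integer(x), width = width, flag = "0")
}

# Concatenate survey year and the padded components into one ID string
# (hypothetical composition, chosen only to demonstrate the padding)
toy$birth_id <- paste0(
  toy$svy_year,
  pad_zero(toy$v001, id_widths[["v001"]]),
  pad_zero(toy$v002, id_widths[["v002"]]),
  pad_zero(toy$v003, id_widths[["v003"]]),
  pad_zero(toy$bord, id_widths[["bord"]])
)

# Without padding, distinct component values can collapse to the same string
collision <- paste0(1, 12) == paste0(11, 2)

toy$birth_id
```

With fixed widths every ID has the same number of characters (here 4 + 3 + 3 + 2 + 2 = 14), which also makes malformed IDs easy to spot.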

Labelled ID variables check

First we check the labels of the sub-national region variable, v024, across the afbr datasets. Let’s create a nested tibble of v024’s value labels.

# Create the data dictionary for v024 in nested tibble
afbr1_pre_tmp1 <- afbr1_pre_tmp0 |> 
  mutate(lookfor_v024 = map(afbr_data, \(df) {
    df |> 
      select(v024) |> 
      look_for() |> 
      lookfor_to_long_format() |> 
      select(value_labels)
  }))
afbr1_pre_tmp1

Now we view the value labels of v024 in the table below.

# Now we unnest the tibble and refine the pooled data dictionary 
afbr1_pre_tmp2 <- afbr1_pre_tmp1 |> 
  # First we select the required cols and unnest()
  select(-c(unf, afbr_data, n_births)) |> 
  unnest(cols = c(lookfor_v024)) |> 
  # Next we make the num of value labels same across each round
  mutate(label_num = parse_number(value_labels)) |> 
  complete(ctr_name, svy_year, label_num) |>
  # Next we create col of value labels for each survey round
  pivot_wider(
    names_from = svy_year, 
    values_from = value_labels,
    names_prefix = "afbr_"
  ) |> 
  # Show the variable name in a col
  mutate(var_name = "v024", .before = 2)

# Convert the tibble to flextable for easy viewing
afbr1_pre_tmp2 |> 
  qflextable() |> 
  align(align = "left", part = "all") |> 
  autofit()

Table 6: Data dictionary of v024 across the afbr rounds

ctr_name	var_name	label_num	afbr_2015
Afghanistan	v024	1	[1] kabul
Afghanistan	v024	2	[2] kapisa
Afghanistan	v024	3	[3] parwan
Afghanistan	v024	4	[4] wardak
Afghanistan	v024	5	[5] logar
Afghanistan	v024	6	[6] nangarhar
Afghanistan	v024	7	[7] laghman
Afghanistan	v024	8	[8] panjsher
Afghanistan	v024	9	[9] baghlan
Afghanistan	v024	10	[10] bamyan
Afghanistan	v024	11	[11] ghazni
Afghanistan	v024	12	[12] paktika
Afghanistan	v024	13	[13] paktya
Afghanistan	v024	14	[14] khost
Afghanistan	v024	15	[15] kunarha
Afghanistan	v024	16	[16] nooristan
Afghanistan	v024	17	[17] badakhshan
Afghanistan	v024	18	[18] takhar
Afghanistan	v024	19	[19] kunduz
Afghanistan	v024	20	[20] samangan
Afghanistan	v024	21	[21] balkh
Afghanistan	v024	22	[22] sar-e-pul
Afghanistan	v024	23	[23] ghor
Afghanistan	v024	24	[24] daykundi
Afghanistan	v024	25	[25] urozgan
Afghanistan	v024	26	[26] zabul
Afghanistan	v024	27	[27] kandahar
Afghanistan	v024	28	[28] jawzjan
Afghanistan	v024	29	[29] faryab
Afghanistan	v024	30	[30] helmand
Afghanistan	v024	31	[31] badghis
Afghanistan	v024	32	[32] herat
Afghanistan	v024	33	[33] farah
Afghanistan	v024	34	[34] nimroz

NOTE: In line with our data preparation framework for other South Asian DHS datasets, we will not use the region var in preparing the ID var.
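The `complete()` and `pivot_wider()` steps used above have no visible effect with a single survey round, but they matter once several rounds with different label sets are compared. The sketch below shows the same steps on a toy example with two hypothetical rounds; the country name, years, and labels are invented for illustration.

```r
library(dplyr)
library(tidyr)
library(readr)

# Toy value labels in the "[code] label" format (hypothetical rounds)
toy_labels <- tibble::tribble(
  ~ctr_name, ~svy_year, ~value_labels,
  "Toyland", 2010,      "[1] male",
  "Toyland", 2010,      "[2] female",
  "Toyland", 2015,      "[1] male",
  "Toyland", 2015,      "[2] female",
  "Toyland", 2015,      "[9] missing"
)

wide <- toy_labels |>
  # Extract the numeric code from the label text
  mutate(label_num = parse_number(value_labels)) |>
  # Add a row for every code in every round, filling absent labels with NA
  complete(ctr_name, svy_year, label_num) |>
  # One column of labels per survey round
  pivot_wider(
    names_from = svy_year,
    values_from = value_labels,
    names_prefix = "svy_"
  )

wide
```

Code 9 exists only in the later toy round, so its cell for the earlier round is `NA`, which makes differences between rounds easy to see in the printed table.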

Second, we check the labels of v023, which denotes the stratification used in the sample design. First we create a nested tibble of v023’s value labels.

# Create the data dictionary for v023 in nested tibble
afbr1_pre_tmp1 <- afbr1_pre_tmp0 |> 
  mutate(lookfor_v023 = map(afbr_data, \(df) {
    df |> 
      select(v023) |> 
      look_for() |> 
      lookfor_to_long_format() |> 
      select(value_labels)
  }))
afbr1_pre_tmp1

Now we view the value labels of v023 in the table below.

# Now we unnest the tibble and refine the pooled data dictionary
afbr1_pre_tmp2 <- afbr1_pre_tmp1 |> 
  # First we select the required cols and unnest()
  select(-c(unf, afbr_data, n_births)) |> 
  unnest(cols = c(lookfor_v023)) |> 
  # Next we make the num of value labels same across each round
  mutate(label_num = parse_number(value_labels)) |> 
  complete(ctr_name, svy_year, label_num) |>
  # Next we create col of value labels for each survey round
  pivot_wider(
    names_from = svy_year, 
    values_from = value_labels,
    names_prefix = "afbr_"
  ) |>
  # Show the variable name in a col
  mutate(var_name = "v023", .before = 2) 

# Convert the tibble to flextable for easy viewing
afbr1_pre_tmp2 |> 
  qflextable() |> 
  align(align = "left", part = "all") |> 
  autofit()

Table 7: Data dictionary of v023 across the afbr rounds

ctr_name	var_name	label_num	afbr_2015
Afghanistan	v023	1	[1] kabul urban
Afghanistan	v023	2	[2] kapisa urban
Afghanistan	v023	3	[3] parwan urban
Afghanistan	v023	4	[4] wardak urban
Afghanistan	v023	5	[5] logar urban
Afghanistan	v023	6	[6] nangarhar urban
Afghanistan	v023	7	[7] laghman urban
Afghanistan	v023	8	[8] panjsher urban
Afghanistan	v023	9	[9] baghlan urban
Afghanistan	v023	10	[10] bamyan urban
Afghanistan	v023	11	[11] ghazni urban
Afghanistan	v023	12	[12] paktika urban
Afghanistan	v023	13	[13] paktya urban
Afghanistan	v023	14	[14] khost urban
Afghanistan	v023	15	[15] kunarha urban
Afghanistan	v023	16	[16] nooristan urban
Afghanistan	v023	17	[17] badakhshan urban
Afghanistan	v023	18	[18] takhar urban
Afghanistan	v023	19	[19] kunduz urban
Afghanistan	v023	20	[20] samangan urban
Afghanistan	v023	21	[21] balkh urban
Afghanistan	v023	22	[22] sar-e-pul urban
Afghanistan	v023	23	[23] ghor urban
Afghanistan	v023	24	[24] daykundi urban
Afghanistan	v023	25	[25] urozgan urban
Afghanistan	v023	26	[26] zabul urban
Afghanistan	v023	27	[27] kandahar urban
Afghanistan	v023	28	[28] jawzjan urban
Afghanistan	v023	29	[29] faryab urban
Afghanistan	v023	30	[30] helmand urban
Afghanistan	v023	31	[31] badghis urban
Afghanistan	v023	32	[32] herat urban
Afghanistan	v023	33	[33] farah urban
Afghanistan	v023	34	[34] nimroz urban
Afghanistan	v023	35	[35] kabul rural
Afghanistan	v023	36	[36] kapisa rural
Afghanistan	v023	37	[37] parwan rural
Afghanistan	v023	38	[38] wardak rural
Afghanistan	v023	39	[39] logar rural
Afghanistan	v023	40	[40] nangarhar rural
Afghanistan	v023	41	[41] laghman rural
Afghanistan	v023	42	[42] panjsher rural
Afghanistan	v023	43	[43] baghlan rural
Afghanistan	v023	44	[44] bamyan rural
Afghanistan	v023	45	[45] ghazni rural
Afghanistan	v023	46	[46] paktika rural
Afghanistan	v023	47	[47] paktya rural
Afghanistan	v023	48	[48] khost rural
Afghanistan	v023	49	[49] kunarha rural
Afghanistan	v023	50	[50] nooristan rural
Afghanistan	v023	51	[51] badakhshan rural
Afghanistan	v023	52	[52] takhar rural
Afghanistan	v023	53	[53] kunduz rural
Afghanistan	v023	54	[54] samangan rural
Afghanistan	v023	55	[55] balkh rural
Afghanistan	v023	56	[56] sar-e-pul rural
Afghanistan	v023	57	[57] ghor rural
Afghanistan	v023	58	[58] daykundi rural
Afghanistan	v023	59	[59] urozgan rural
Afghanistan	v023	60	[60] zabul rural
Afghanistan	v023	61	[61] kandahar rural
Afghanistan	v023	62	[62] jawzjan rural
Afghanistan	v023	63	[63] faryab rural
Afghanistan	v023	64	[64] helmand rural
Afghanistan	v023	65	[65] badghis rural
Afghanistan	v023	66	[66] herat rural
Afghanistan	v023	67	[67] farah rural
Afghanistan	v023	68	[68] nimroz rural

NOTE: In line with our data preparation framework for other South Asian DHS datasets, we will not use v023 in the ID variable preparation.
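Tables 6 and 7 together suggest a regular pattern: codes 1 to 34 of v023 are the urban strata in the same province order as v024, and codes 35 to 68 are the corresponding rural strata. Assuming the pattern holds as the labels indicate, the province code and residence type can be recovered from a v023 code. The sketch below is only a consistency check; `split_v023` is a hypothetical helper and is not part of the data preparation script.

```r
# Number of v024 region codes listed in Table 6
n_regions <- 34

# Split a v023 stratum code into region code and residence type,
# assuming codes 1-34 are urban and 35-68 are rural (as in Table 7)
split_v023 <- function(v023) {
  data.frame(
    v023 = v023,
    region_code = (v023 - 1) %% n_regions + 1,
    residence = ifelse(v023 <= n_regions, "urban", "rural"),
    stringsAsFactors = FALSE
  )
}

strata <- split_v023(c(1, 22, 34, 35, 49, 68))
strata
```

In the actual data, the derived columns could be cross-tabulated against v024 and v025 to confirm that the pattern holds for every record.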

Checking the Birth History variables before harmonization

The birth history variables are central to this study’s objective. Therefore, we need to scrutinize all of them before using them to prepare harmonized variables for the pooled dataset.

# We check the birth history vars in all afbr datasets.
# First we create a data dictionary in nested tibble.
afbr1_pre_tmp1 <- afbr1_pre_tmp0 |>
  mutate(lookfor_bhvars = map(afbr_data, \(df) {
    df |> 
      select(bidx, matches("^b[0-9]+")) |> 
      lookfor(details = "full") |> 
      select(-c(levels:n_na)) |> 
      # For correctly viewing the range column in data dictionary
      convert_list_columns_to_character()
  }))
afbr1_pre_tmp1

# Now we unnest the tibble and refine the pooled data dictionary
afbr1_pre_tmp2 <- afbr1_pre_tmp1 |> 
  select(-c(unf, afbr_data, n_births)) |> 
  unnest(cols = c(lookfor_bhvars)) |> 
  arrange(pos)

# Convert the tibble to flextable for easy viewing
afbr1_pre_tmp2 |> 
  qflextable() |> 
  align(align = "left", part = "all") |> 
  autofit()

Table 8: Data dictionary of birth history variables across the afbr rounds

ctr_name	svy_year	pos	variable	label	col_type	missing	unique_values	range
Afghanistan	2015	1	bidx	birth column number	dbl	0	17	1 - 17
Afghanistan	2015	2	b0	child is twin	dbl+lbl	0	4	0 - 3
Afghanistan	2015	3	b1	month of birth	dbl	0	12	1 - 12
Afghanistan	2015	4	b2	year of birth	dbl	0	39	1356 - 1394
Afghanistan	2015	5	b3	date of birth (cmc)	dbl	0	441	675 - 1138
Afghanistan	2015	6	b4	sex of child	dbl+lbl	0	2	1 - 2
Afghanistan	2015	7	b5	child is alive	dbl+lbl	0	2	0 - 1
Afghanistan	2015	8	b6	age at death	dbl+lbl	115605	94	100 - 399
Afghanistan	2015	9	b7	age at death (months, imputed)	dbl	115566	57	0 - 396
Afghanistan	2015	10	b8	current age of child	dbl	10149	40	0 - 38
Afghanistan	2015	11	b9	child lives with whom	dbl+lbl	10149	3	0 - 4
Afghanistan	2015	12	b10	completeness of information	dbl+lbl	0	7	1 - 8
Afghanistan	2015	13	b11	preceding birth interval (months)	dbl	26798	180	9 - 238
Afghanistan	2015	14	b12	succeeding birth interval (months)	dbl	26971	180	9 - 238
Afghanistan	2015	15	b13	flag for age at death	dbl+lbl	115566	6	0 - 8
Afghanistan	2015	16	b15	live birth between births	dbl+lbl	26850	3	0 - 1
Afghanistan	2015	17	b16	child's line number in household	dbl+lbl	10149	48	0 - 48
Afghanistan	2015	18	b17	na - day of birth	dbl	125715	1
Afghanistan	2015	19	b18	na - century day code of birth (cdc)	dbl	125715	1
Afghanistan	2015	20	b19	na - current age of child in months	dbl	125715	1
Afghanistan	2015	21	b20	na - duration of pregnancy	dbl	125715	1

From the above table we get an overall snapshot of the birth history variables. The birth history variables are similar to other South Asian DHS rounds. Next, we look in more detail at the labelled variables that are common across afbr. We would like to see if the value labels of the common birth history variables are similar across the afbr datasets.
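One detail in Table 8 deserves attention before harmonization: b2 (year of birth) ranges from 1356 to 1394, which points to years recorded in the Afghan (Solar Hijri) calendar rather than the Gregorian one. The sketch below checks whether the reported ranges of b2 and b3 are consistent with a century month code computed as (year - 1300) * 12 + month. That base year is an assumption inferred from the ranges in Table 8, not something stated here, so it should be verified against the survey documentation.

```r
# Assumed CMC base year, inferred from the ranges in Table 8 (hypothetical)
cmc_base_year <- 1300

# Recover year and month from a century month code
cmc_to_year  <- function(cmc) cmc_base_year + (cmc - 1) %/% 12
cmc_to_month <- function(cmc) (cmc - 1) %% 12 + 1

# Range endpoints of b3 (date of birth, cmc) from Table 8
b3_range <- c(675, 1138)

implied_years  <- cmc_to_year(b3_range)
implied_months <- cmc_to_month(b3_range)

# Compare with the b2 range reported in Table 8
implied_years
```

If the implied years match the b2 range, any comparison with Gregorian-dated surveys in the pooled dataset would need an explicit calendar conversion.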

b0 - child is twin

We check the value labels of the b0 variable, which denotes whether the child is a twin. First we create a nested tibble of b0’s value labels.

# Create the data dictionary for b0 in nested tibble
afbr1_pre_tmp1 <- afbr1_pre_tmp0 |> 
  mutate(lookfor_b0 = map(afbr_data, \(df) {
    df |> 
      select(b0) |> 
      look_for() |> 
      lookfor_to_long_format() |> 
      select(value_labels)
  }))
afbr1_pre_tmp1

# Now we unnest the tibble and refine the pooled data dictionary
afbr1_pre_tmp2 <- afbr1_pre_tmp1 |> 
  # First we select the required cols and unnest()
  select(-c(unf, afbr_data, n_births)) |> 
  unnest(cols = c(lookfor_b0)) |> 
  # Next we make the num of value labels same across each round
  mutate(label_num = parse_number(value_labels)) |> 
  complete(ctr_name, svy_year, label_num) |>
  # Next we create col of value labels for each survey round
  pivot_wider(
    names_from = svy_year, 
    values_from = value_labels,
    names_prefix = "afbr_"
  ) |>
  # Show the variable name in a col
  mutate(var_name = "b0", .before = 2)

# Convert the tibble to flextable for easy viewing
afbr1_pre_tmp2 |> 
  qflextable() |> 
  align(align = "left", part = "all") |> 
  autofit()

Table 9: Data dictionary of b0 across the afbr rounds

ctr_name	var_name	label_num	afbr_2015
Afghanistan	b0	0	[0] single birth
Afghanistan	b0	1	[1] 1st of multiple
Afghanistan	b0	2	[2] 2nd of multiple
Afghanistan	b0	3	[3] 3rd of multiple
Afghanistan	b0	4	[4] 4th of multiple
Afghanistan	b0	5	[5] 5th of multiple

The table above shows the value labels of b0, which are similar to those in other South Asian DHS datasets. Although labels are defined up to code 5, Table 8 shows that only codes 0 to 3 occur in the data.

b4 - sex of child

We check the value labels of the b4 variable, which gives the sex of the child. First we create a nested tibble of b4’s value labels.

# Create the data dictionary for b4 in nested tibble
afbr1_pre_tmp1 <- afbr1_pre_tmp0 |> 
  mutate(lookfor_b4 = map(afbr_data, \(df) {
    df |> 
      select(b4) |> 
      look_for() |> 
      lookfor_to_long_format() |> 
      select(value_labels)
  }))
afbr1_pre_tmp1

# Now we unnest the tibble and refine the pooled data dictionary
afbr1_pre_tmp2 <- afbr1_pre_tmp1 |> 
  # First we select the required cols and unnest()
  select(-c(unf, afbr_data, n_births)) |> 
  unnest(cols = c(lookfor_b4)) |> 
  # Next we make the num of value labels same across each round
  mutate(label_num = parse_number(value_labels)) |> 
  complete(ctr_name, svy_year, label_num) |>
  # Next we create col of value labels for each survey round
  pivot_wider(
    names_from = svy_year, 
    values_from = value_labels,
    names_prefix = "afbr_"
  ) |>
  # Show the variable name in a col
  mutate(var_name = "b4", .before = 2)

# Convert the tibble to flextable for easy viewing
afbr1_pre_tmp2 |> 
  qflextable() |> 
  align(align = "left", part = "all") |> 
  autofit()

Table 10: Data dictionary of b4 across the afbr rounds

ctr_name	var_name	label_num	afbr_2015
Afghanistan	b4	1	[1] male
Afghanistan	b4	2	[2] female

We can see the value labels of b4 in the above table. The value labels are similar to other South Asian DHS datasets.

b5 - child is alive

We check the value labels of the b5 variable, which gives the survival status of the child. First we create a nested tibble of b5’s value labels.

# Create the data dictionary for b5 in nested tibble
afbr1_pre_tmp1 <- afbr1_pre_tmp0 |> 
  mutate(lookfor_b5 = map(afbr_data, \(df) {
    df |> 
      select(b5) |> 
      look_for() |> 
      lookfor_to_long_format() |> 
      select(value_labels)
  }))
afbr1_pre_tmp1

# Now we unnest the tibble and refine the pooled data dictionary
afbr1_pre_tmp2 <- afbr1_pre_tmp1 |> 
  # First we select the required cols and unnest()
  select(-c(unf, afbr_data, n_births)) |> 
  unnest(cols = c(lookfor_b5)) |> 
  # Next we make the num of value labels same across each round
  mutate(label_num = parse_number(value_labels)) |> 
  complete(ctr_name, svy_year, label_num) |>
  # Next we create col of value labels for each survey round
  pivot_wider(
    names_from = svy_year, 
    values_from = value_labels,
    names_prefix = "afbr_"
  ) |>
  # Show the variable name in a col
  mutate(var_name = "b5", .before = 2)

# Convert the tibble to flextable for easy viewing
afbr1_pre_tmp2 |> 
  qflextable() |> 
  align(align = "left", part = "all") |> 
  autofit()

Table 11: Data dictionary of b5 across the afbr rounds

ctr_name	var_name	label_num	afbr_2015
Afghanistan	b5	0	[0] no
Afghanistan	b5	1	[1] yes

The above table shows the value labels for the survival status of the child. The value labels are similar to other South Asian DHS datasets.

b6 - age at death

We check the value labels of the b6 variable, which shows the age at death of children. Note that this variable has many missing values because it is recorded only for children who have died. First we create a nested tibble of b6’s value labels.

# Create the data dictionary for b6 in nested tibble
afbr1_pre_tmp1 <- afbr1_pre_tmp0 |> 
  mutate(lookfor_b6 = map(afbr_data, \(df) {
    df |> 
      select(b6) |> 
      look_for() |> 
      lookfor_to_long_format() |> 
      select(value_labels)
  }))
afbr1_pre_tmp1

# Now we unnest the tibble and refine the pooled data dictionary
afbr1_pre_tmp2 <- afbr1_pre_tmp1 |> 
  # First we select the required cols and unnest()
  select(-c(unf, afbr_data, n_births)) |> 
  unnest(cols = c(lookfor_b6)) |> 
  # Next we make the num of value labels same across each round
  mutate(label_num = parse_number(value_labels)) |> 
  complete(ctr_name, svy_year, label_num) |>
  # Next we create col of value labels for each survey round
  pivot_wider(
    names_from = svy_year, 
    values_from = value_labels,
    names_prefix = "afbr_"
  ) |>
  # Show the variable name in a col
  mutate(var_name = "b6", .before = 2)

# Convert the tibble to flextable for easy viewing
afbr1_pre_tmp2 |> 
  qflextable() |> 
  align(align = "left", part = "all") |> 
  autofit()

Table 12: Data dictionary of b6 across the afbr rounds

ctr_name	var_name	label_num	afbr_2015
Afghanistan	b6	100	[100] died on day of birth
Afghanistan	b6	101	[101] days: 1
Afghanistan	b6	199	[199] days: number missing
Afghanistan	b6	201	[201] months: 1
Afghanistan	b6	299	[299] months: number missing
Afghanistan	b6	301	[301] years: 1
Afghanistan	b6	399	[399] years: number missing
Afghanistan	b6	997	[997] inconsistent
Afghanistan	b6	998	[998] don't know

We can see the value labels of b6 in the above table. The value labels are similar to the recent rounds of other South Asian DHS datasets.
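The labels in Table 12 suggest that b6 packs two pieces of information into one code: the hundreds digit gives the unit (1 = days, 2 = months, 3 = years) and the last two digits give the number of units, with 99 meaning the number is missing. Below is a minimal sketch of unpacking such codes under that reading of the labels. `decode_b6` is a hypothetical helper and the input values are illustrative, not survey rows.

```r
# Unpack a b6 code into unit and number, assuming the structure in Table 12
decode_b6 <- function(b6) {
  unit_code <- b6 %/% 100          # hundreds digit: 1, 2 or 3
  number <- b6 %% 100              # last two digits
  # Codes outside 1-3 (e.g. 997, 998) and NA inputs yield an NA unit
  unit <- c("days", "months", "years")[unit_code]
  # 99 means "number missing"; special codes carry no usable number
  number <- ifelse(is.na(unit) | number == 99, NA_real_, number)
  data.frame(b6 = b6, unit = unit, number = number, stringsAsFactors = FALSE)
}

decoded <- decode_b6(c(100, 105, 203, 299, 301, 997, NA))
decoded
```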

b9 - child lives with whom

We check the value labels of the b9 variable, which gives information on whom the child lives with. First we create a nested tibble of b9’s value labels.

# Create the data dictionary for b9 in nested tibble
afbr1_pre_tmp1 <- afbr1_pre_tmp0 |> 
  mutate(lookfor_b9 = map(afbr_data, \(df) {
    df |> 
      select(b9) |> 
      look_for() |> 
      lookfor_to_long_format() |> 
      select(value_labels)
  }))
afbr1_pre_tmp1

# Now we unnest the tibble and refine the pooled data dictionary
afbr1_pre_tmp2 <- afbr1_pre_tmp1 |> 
  # First we select the required cols and unnest()
  select(-c(unf, afbr_data, n_births)) |> 
  unnest(cols = c(lookfor_b9)) |> 
  # Next we make the num of value labels same across each round
  mutate(label_num = parse_number(value_labels)) |> 
  complete(ctr_name, svy_year, label_num) |>
  # Next we create col of value labels for each survey round
  pivot_wider(
    names_from = svy_year, 
    values_from = value_labels,
    names_prefix = "afbr_"
  ) |>
  # Show the variable name in a col
  mutate(var_name = "b9", .before = 2)

# Convert the tibble to flextable for easy viewing
afbr1_pre_tmp2 |> 
  qflextable() |> 
  align(align = "left", part = "all") |> 
  autofit()

Table 13: Data dictionary of b9 across the afbr rounds

ctr_name	var_name	label_num	afbr_2015
Afghanistan	b9	0	[0] respondent
Afghanistan	b9	1	[1] father
Afghanistan	b9	2	[2] other relative
Afghanistan	b9	3	[3] someone else
Afghanistan	b9	4	[4] lives elsewhere

We can see the value labels of b9 in the above table. The value labels are similar to the DHS datasets of other South Asian countries.

b10 - completeness of information

We check the value labels of the b10 variable, which gives the completeness of birth history information. First we create a nested tibble of b10’s value labels.

# Create the data dictionary for b10 in nested tibble
afbr1_pre_tmp1 <- afbr1_pre_tmp0 |> 
  mutate(lookfor_b10 = map(afbr_data, \(df) {
    df |> 
      select(b10) |> 
      look_for() |> 
      lookfor_to_long_format() |> 
      select(value_labels)
  }))
afbr1_pre_tmp1

# Now we unnest the tibble and refine the pooled data dictionary
afbr1_pre_tmp2 <- afbr1_pre_tmp1 |> 
  # First we select the required cols and unnest()
  select(-c(unf, afbr_data, n_births)) |> 
  unnest(cols = c(lookfor_b10)) |> 
  # Next we make the num of value labels same across each round
  mutate(label_num = parse_number(value_labels)) |> 
  complete(ctr_name, svy_year, label_num) |>
  # Next we create col of value labels for each survey round
  pivot_wider(
    names_from = svy_year, 
    values_from = value_labels,
    names_prefix = "afbr_"
  ) |>
  # Show the variable name in a col
  mutate(var_name = "b10", .before = 2) 

# Convert the tibble to flextable for easy viewing
afbr1_pre_tmp2 |> 
  qflextable() |>
  align(align = "left", part = "all") |> 
  autofit()

Table 14: Data dictionary of b10 across the afbr rounds

ctr_name	var_name	label_num	afbr_2015
Afghanistan	b10	1	[1] month and year - information complete
Afghanistan	b10	2	[2] month and age - year imputed
Afghanistan	b10	3	[3] year and age - month imputed
Afghanistan	b10	4	[4] year and age - year ignored
Afghanistan	b10	5	[5] year - age/month imputed
Afghanistan	b10	6	[6] age - year/month imputed
Afghanistan	b10	7	[7] month - age/year imputed
Afghanistan	b10	8	[8] none - all imputed

We can see the value labels of b10 in the above table. The value labels are similar to the recent rounds of other South Asian DHS datasets.

Checking the Common independent variables before harmonization

Next we start documenting the common independent variables. First we will check the data dictionary of the common independent variables. Then we will check them variable wise.

# We check the common independent vars in all afbr datasets.
# First we create the data dictionary in nested tibble.
afbr1_pre_tmp1 <- afbr1_pre_tmp0 |>
  mutate(lookfor_comindvars = map(afbr_data, \(df) {
    df |> 
      # select the common independent variables
      select(v106, v011, v501, v701, v025, v151, v152, v190) |> 
      lookfor(details = "full") |> 
      select(-c(levels:n_na)) |> 
      # For correctly viewing the range column in data dictionary
      convert_list_columns_to_character()
  }))
afbr1_pre_tmp1

# Now we unnest the tibble and refine the pooled data dictionary 
afbr1_pre_tmp2 <- afbr1_pre_tmp1 |> 
  select(-c(unf, afbr_data, n_births)) |> 
  unnest(cols = c(lookfor_comindvars)) |> 
  arrange(pos)

# Convert the tibble to flextable for easy viewing
afbr1_pre_tmp2 |> 
  qflextable() |> 
  align(align = "left", part = "all") |> 
  autofit()

Table 15: Data dictionary of common independent variables across the afbr rounds

ctr_name	svy_year	pos	variable	label	col_type	missing	unique_values	range
Afghanistan	2015	1	v106	highest educational level	dbl+lbl	0	4	0 - 3
Afghanistan	2015	2	v011	date of birth (cmc)	dbl	0	414	533 - 953
Afghanistan	2015	3	v501	current marital status	dbl+lbl	0	4	1 - 5
Afghanistan	2015	4	v701	husband/partner's education level	dbl+lbl	279	6	0 - 8
Afghanistan	2015	5	v025	type of place of residence	dbl+lbl	0	2	1 - 2
Afghanistan	2015	6	v151	sex of household head	dbl+lbl	0	2	1 - 2
Afghanistan	2015	7	v152	age of household head	dbl+lbl	1	84	11 - 95
Afghanistan	2015	8	v190	wealth index combined	dbl+lbl	0	5	1 - 5

From the above table we get an overall snapshot of the common independent variables. Next, we look at the labelled variables among these common variables in more detail. We would like to see whether the value labels and codes of the common independent variables in afbr 2015 are similar to those in other South Asian DHS datasets.

v106 - Mother’s education level

We check the value labels of v106 variable that denotes the highest education level of mother. First we create a nested tibble of v106’s value labels.

# Create the data dictionary for v106 in nested tibble
afbr1_pre_tmp1 <- afbr1_pre_tmp0 |> 
  mutate(lookfor_v106 = map(afbr_data, \(df) {
    df |> 
      select(v106) |> 
      look_for() |> 
      lookfor_to_long_format() |> 
      select(value_labels)
  }))
afbr1_pre_tmp1

# Now we unnest the tibble and refine the pooled data dictionary
afbr1_pre_tmp2 <- afbr1_pre_tmp1 |> 
  # First we select the required cols and unnest()
  select(-c(unf, afbr_data, n_births)) |> 
  unnest(cols = c(lookfor_v106)) |> 
  # Next we make the num of value labels same across each round
  mutate(label_num = parse_number(value_labels)) |> 
  complete(ctr_name, svy_year, label_num) |>
  # Next we create col of value labels for each survey round
  pivot_wider(
    names_from = svy_year, 
    values_from = value_labels,
    names_prefix = "afbr_"
  ) |>
  # Show the variable name in a col
  mutate(var_name = "v106", .before = 2)

# Convert the tibble to flextable for easy viewing
afbr1_pre_tmp2 |> 
  qflextable() |> 
  align(align = "left", part = "all") |> 
  autofit()

Table 16: Data dictionary of v106 across the afbr rounds

ctr_name	var_name	label_num	afbr_2015
Afghanistan	v106	0	[0] no education
Afghanistan	v106	1	[1] primary
Afghanistan	v106	2	[2] secondary
Afghanistan	v106	3	[3] higher

We can see the value labels of v106 in the above table. The value labels are similar to the recent rounds of DHS datasets of other South Asian countries.
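The `label_num` column in these dictionaries comes from parsing the leading code out of strings such as `[0] no education`. As a minimal base R sketch of that step (the helper name `extract_label_code()` is hypothetical and is not part of the workflow above, which uses `parse_number()`), the code can be pulled out with a regular expression:

```r
# Value-label strings in the "[code] label" form shown in Table 16
value_labels <- c("[0] no education", "[1] primary", "[2] secondary", "[3] higher")

# Hypothetical helper: extract the numeric code between the square brackets
extract_label_code <- function(x) {
  as.numeric(sub("^\\[(-?[0-9]+)\\].*$", "\\1", x))
}

label_num <- extract_label_code(value_labels)
label_num
```

Anchoring the pattern on the brackets keeps digits that appear later in a label, as in `[97] 97+`, from affecting the extracted code.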

v011 - Date of birth (in CMC)

The v011 variable, which holds the mother’s date of birth in century month code (CMC), is a numeric variable. Let’s check the range of its values in more detail, for example to spot outliers. First let’s create a nested tibble of the summary statistics of the v011 variable.

# Create the summary statistics for v011 in nested tibble
afbr1_pre_tmp1 <- afbr1_pre_tmp0 |> 
  mutate(skim_v011 = map(afbr_data, \(df) {
    df |> 
      select(v011) |> 
      skim_without_charts() |> 
      as_tibble() |> 
      select(-c(skim_type, complete_rate)) |> 
      rename(
        variable = 1,
        n_miss = 2,
        mean = 3,
        sd = 4,
        min = 5,
        p25 = 6,
        p50 = 7,
        p75 = 8,
        max = 9
      )
  }))
afbr1_pre_tmp1

# Now we unnest the tibble and refine the pooled data dictionary
afbr1_pre_tmp2 <- afbr1_pre_tmp1 |> 
  # First we select the required cols and unnest()
  select(-c(unf, afbr_data, n_births)) |> 
  unnest(cols = c(skim_v011)) |> 
  # Make variable values have one decimal point 
  mutate(
    mean = sprintf("%.1f", mean),
    sd = sprintf("%.1f", sd)
  )

# Convert the tibble to flextable for easy viewing
afbr1_pre_tmp2 |> 
  qflextable() |> 
  align(align = "left", part = "all") |> 
  autofit()

Table 17: Data dictionary of v011 across the afbr rounds

ctr_name	svy_year	variable	n_miss	mean	sd	min	p25	p50	p75	max
Afghanistan	2015	v011	0	706.1	95.4	533	630	705	783	953
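
To read the CMC values in Table 17, it helps to decode them into a year and a month. A CMC counts months from January of a base year; the standard DHS convention uses 1900. The sketch below assumes a hypothetical helper `cmc_to_ym()` with a `base_year` argument. The base year of 1300 used in the example is an inference, not something the data dictionary states: it is the value under which the v011 range (533 - 953) matches the v010 range of respondent's year of birth reported in Table 28 (1344 - 1379).

```r
# Hypothetical helper: decode a century month code into year and month.
# base_year = 1900 is the standard DHS convention; other values are assumptions.
cmc_to_ym <- function(cmc, base_year = 1900) {
  year  <- base_year + (cmc - 1) %/% 12
  month <- cmc - 12 * (year - base_year)
  data.frame(cmc = cmc, year = year, month = month)
}

# Minimum and maximum of v011 from Table 17, decoded with an assumed base of 1300
v011_range <- cmc_to_ym(c(533, 953), base_year = 1300)
v011_range
```

If the decoded years did not match v010, that would point to a different base year or to a coding problem worth checking before harmonization.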

v501 - Mother’s marital status

We check the value labels of v501 variable which gives the current marital status of mother. First we create a nested tibble of v501’s value labels.

# Create the data dictionary for v501 in nested tibble
afbr1_pre_tmp1 <- afbr1_pre_tmp0 |> 
  mutate(lookfor_v501 = map(afbr_data, \(df) {
    df |> 
      select(v501) |> 
      look_for() |> 
      lookfor_to_long_format() |> 
      select(value_labels)
  }))
afbr1_pre_tmp1

# Now we unnest the tibble and refine the pooled data dictionary
afbr1_pre_tmp2 <- afbr1_pre_tmp1 |> 
  # First we select the required cols and unnest()
  select(-c(unf, afbr_data, n_births)) |> 
  unnest(cols = c(lookfor_v501)) |> 
  # Next we make the num of value labels same across each round
  mutate(label_num = parse_number(value_labels)) |> 
  complete(ctr_name, svy_year, label_num) |>
  # Next we create col of value labels for each survey round
  pivot_wider(
    names_from = svy_year, 
    values_from = value_labels,
    names_prefix = "afbr_"
  ) |>
  # Show the variable name in a col
  mutate(var_name = "v501", .before = 2)

# Convert the tibble to flextable for easy viewing
afbr1_pre_tmp2 |> 
  qflextable() |> 
  align(align = "left", part = "all") |> 
  autofit()

Table 18: Data dictionary of v501 across the afbr rounds

ctr_name	var_name	label_num	afbr_2015
Afghanistan	v501	0	[0] never in union
Afghanistan	v501	1	[1] married
Afghanistan	v501	2	[2] living with partner
Afghanistan	v501	3	[3] widowed
Afghanistan	v501	4	[4] divorced
Afghanistan	v501	5	[5] no longer living together/separated

We can see that v501 has 6 value labels in the above table. The value labels are similar to the other recent South Asian DHS datasets.

v701 - Husband/Partner’s education level

We check the value labels of v701 variable which gives the education level of the mother’s husband/partner. First we create a nested tibble of v701’s value labels.

# Create the data dictionary for v701 in nested tibble
afbr1_pre_tmp1 <- afbr1_pre_tmp0 |> 
  mutate(lookfor_v701 = map(afbr_data, \(df) {
    df |> 
      select(v701) |> 
      look_for() |> 
      lookfor_to_long_format() |> 
      select(value_labels)
  }))
afbr1_pre_tmp1

# Now we unnest the tibble and refine the pooled data dictionary
afbr1_pre_tmp2 <- afbr1_pre_tmp1 |> 
  # First we select the required cols and unnest()
  select(-c(unf, afbr_data, n_births)) |> 
  unnest(cols = c(lookfor_v701)) |> 
  # Next we make the num of value labels same across each round
  mutate(label_num = parse_number(value_labels)) |> 
  complete(ctr_name, svy_year, label_num) |>
  # Next we create col of value labels for each survey round
  pivot_wider(
    names_from = svy_year, 
    values_from = value_labels,
    names_prefix = "afbr_"
  ) |>
  # Show the variable name in a col
  mutate(var_name = "v701", .before = 2)

# Convert the tibble to flextable for easy viewing
afbr1_pre_tmp2 |> 
  qflextable() |> 
  align(align = "left", part = "all") |> 
  autofit()

Table 19: Data dictionary of v701 across the afbr rounds

ctr_name	var_name	label_num	afbr_2015
Afghanistan	v701	0	[0] no education
Afghanistan	v701	1	[1] primary
Afghanistan	v701	2	[2] secondary
Afghanistan	v701	3	[3] higher
Afghanistan	v701	8	[8] don't know

In the above table v701 has 5 value labels. The value labels are similar to the recent rounds of DHS datasets of other South Asian countries.

v025 - Type of place of residence

We check the value labels of v025 variable which shows if a household belongs to rural or urban psu. First we create a nested tibble of v025’s value labels.

# Create the data dictionary for v025 in nested tibble
afbr1_pre_tmp1 <- afbr1_pre_tmp0 |> 
  mutate(lookfor_v025 = map(afbr_data, \(df) {
    df |> 
      select(v025) |> 
      look_for() |> 
      lookfor_to_long_format() |> 
      select(value_labels)
  }))
afbr1_pre_tmp1

# Now we unnest the tibble and refine the pooled data dictionary
afbr1_pre_tmp2 <- afbr1_pre_tmp1 |> 
  # First we select the required cols and unnest()
  select(c(ctr_name, svy_year, lookfor_v025)) |> 
  unnest(cols = c(lookfor_v025)) |> 
  # Next we make the num of value labels same across each round
  mutate(label_num = parse_number(value_labels)) |> 
  complete(ctr_name, svy_year, label_num) |>
  # Next we create col of value labels for each survey round
  pivot_wider(
    names_from = svy_year, 
    values_from = value_labels,
    names_prefix = "afbr_"
  ) |>
  # Show the variable name in a col
  mutate(var_name = "v025", .before = 2)

# Convert the tibble to flextable for easy viewing
afbr1_pre_tmp2 |> 
  qflextable() |> 
  align(align = "left", part = "all") |> 
  autofit()

Table 20: Data dictionary of v025 across the afbr rounds

ctr_name	var_name	label_num	afbr_2015
Afghanistan	v025	1	[1] urban
Afghanistan	v025	2	[2] rural

We can see that v025 has 2 value labels in the above table. The value labels are similar to the recent DHS datasets of other South Asian countries.

v151 - Sex of household head

We check the value labels of v151 variable which gives the sex of the household head. First we create a nested tibble of v151’s value labels, then pivot wide and compare.

# Create the data dictionary for v151 in nested tibble
afbr1_pre_tmp1 <- afbr1_pre_tmp0 |> 
  mutate(lookfor_v151 = map(afbr_data, \(df) {
    df |> 
      select(v151) |> 
      look_for() |> 
      lookfor_to_long_format() |> 
      select(value_labels)
  }))
afbr1_pre_tmp1

# Now we unnest the tibble and refine the pooled data dictionary
afbr1_pre_tmp2 <- afbr1_pre_tmp1 |> 
  # First we select the required cols and unnest()
  select(-c(unf, afbr_data, n_births)) |> 
  unnest(cols = c(lookfor_v151)) |> 
  # Next we make the num of value labels same across each round
  mutate(label_num = parse_number(value_labels)) |> 
  complete(ctr_name, svy_year, label_num) |>
  # Next we create col of value labels for each survey round
  pivot_wider(
    names_from = svy_year, 
    values_from = value_labels,
    names_prefix = "afbr_"
  ) |>
  # Show the variable name in a col
  mutate(var_name = "v151", .before = 2)

# Convert the tibble to flextable for easy viewing
afbr1_pre_tmp2 |> 
  qflextable() |> 
  align(align = "left", part = "all") |> 
  autofit()

Table 21: Data dictionary of v151 across the afbr rounds

ctr_name	var_name	label_num	afbr_2015
Afghanistan	v151	1	[1] male
Afghanistan	v151	2	[2] female

The value labels and codes for v151 are similar to the other South Asian DHS datasets.

v152 - Age of household head

Interestingly, we see v152 (a continuous variable) has value labels. Therefore, we check them. First we create a nested tibble of v152’s value labels.

# Create the data dictionary for v152 in nested tibble
afbr1_pre_tmp1 <- afbr1_pre_tmp0 |> 
  mutate(lookfor_v152 = map(afbr_data, \(df) {
    df |> 
      select(v152) |> 
      look_for() |> 
      lookfor_to_long_format() |> 
      select(value_labels)
  }))
afbr1_pre_tmp1

# Now we unnest the tibble and refine the pooled data dictionary
afbr1_pre_tmp2 <- afbr1_pre_tmp1 |> 
  # First we select the required cols and unnest()
  select(-c(unf, afbr_data, n_births)) |> 
  unnest(cols = c(lookfor_v152)) |> 
  # Next we make the num of value labels same across each round
  mutate(label_num = parse_number(value_labels)) |> 
  complete(ctr_name, svy_year, label_num) |>
  # Next we create col of value labels for each survey round
  pivot_wider(
    names_from = svy_year, 
    values_from = value_labels,
    names_prefix = "afbr_"
  ) |>
  # Show the variable name in a col
  mutate(var_name = "v152", .before = 2)

# Convert the tibble to flextable for easy viewing
afbr1_pre_tmp2 |> 
  qflextable() |> 
  align(align = "left", part = "all") |> 
  autofit()

Table 22: Data dictionary of v152 across the afbr rounds

ctr_name	var_name	label_num	afbr_2015
Afghanistan	v152	97	[97] 97+
Afghanistan	v152	98	[98] don't know

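We can see that the value labels of v152 cover only the top-coded (97+) and don't know (98) categories. Since the observed range of v152 is 11 - 95 (Table 15), neither code occurs in the data, so we need not be concerned about them. Note that v152 has one missing value.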
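
Whether labelled special codes actually occur in a continuous variable can be checked directly. The sketch below uses a small made-up vector (`hh_head_age` is illustrative and is not the afbr data) and the two codes from Table 22:

```r
# Special codes carried as value labels on v152 (Table 22)
labelled_codes <- c(97, 98)

# Illustrative ages only; in practice this would be the v152 column
hh_head_age <- c(11, 35, 42, 60, 95, NA)

# Which special codes are present, and how many observations use them
codes_in_use <- intersect(labelled_codes, hh_head_age)
n_coded      <- sum(hh_head_age %in% labelled_codes)
n_missing    <- sum(is.na(hh_head_age))

codes_in_use
c(n_coded = n_coded, n_missing = n_missing)
```

An empty `codes_in_use` means the variable can be treated as plain numeric once the labels are dropped.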

v190 - Wealth quintile of household

We check the value labels of v190 which gives the wealth index quintile of the household. First we create a nested tibble of v190’s value labels, then pivot wide and compare.

# Create the data dictionary for v190 in nested tibble
afbr1_pre_tmp1 <- afbr1_pre_tmp0 |> 
  mutate(lookfor_v190 = map(afbr_data, \(df) {
    df |> 
      select(v190) |> 
      look_for() |> 
      lookfor_to_long_format() |> 
      select(value_labels)
  }))
afbr1_pre_tmp1

# Now we unnest the tibble and refine the pooled data dictionary
afbr1_pre_tmp2 <- afbr1_pre_tmp1 |> 
  # First we select the required cols and unnest()
  select(-c(unf, afbr_data, n_births)) |> 
  unnest(cols = c(lookfor_v190)) |> 
  # Next we make the num of value labels same across each round
  mutate(label_num = parse_number(value_labels)) |> 
  complete(ctr_name, svy_year, label_num) |>
  # Next we create col of value labels for each survey round
  pivot_wider(
    names_from = svy_year, 
    values_from = value_labels,
    names_prefix = "afbr_"
  ) |>
  # Show the variable name in a col
  mutate(var_name = "v190", .before = 2)

# Convert the tibble to flextable for easy viewing
afbr1_pre_tmp2 |> 
  qflextable() |> 
  align(align = "left", part = "all") |> 
  autofit()

Table 23: Data dictionary of v190 across the afbr rounds

ctr_name	var_name	label_num	afbr_2015
Afghanistan	v190	1	[1] poorest
Afghanistan	v190	2	[2] poorer
Afghanistan	v190	3	[3] middle
Afghanistan	v190	4	[4] richer
Afghanistan	v190	5	[5] richest

We can see from the above table that the value labels of v190 are similar to the recent rounds of DHS datasets of other South Asian countries.

Checking the Social group variables before harmonization

Now we document the social group variables and then harmonize them. Upon manually checking the full data dictionaries of each afbr dataset we find the following variables - ethnicity, language of interview and native language of respondent. First we will check the data dictionary of these social group variables. Then we will check them variable wise.

# We check the social group vars in all afbr datasets.
# First we create the data dictionary in nested tibble.
afbr1_pre_tmp1 <- afbr1_pre_tmp0 |>
  mutate(lookfor_socgrp = map(afbr_data, \(df) {
    df |> 
      # select the social group variables
      select(v131, v045b, v045c) |> 
      lookfor(details = "full") |> 
      select(-c(levels:n_na)) |> 
      # For correctly viewing the range column in data dictionary
      convert_list_columns_to_character()
  }))
afbr1_pre_tmp1

# Now we unnest the tibble and refine the pooled data dictionary 
afbr1_pre_tmp2 <- afbr1_pre_tmp1 |> 
  select(-c(unf, afbr_data, n_births)) |> 
  unnest(cols = c(lookfor_socgrp)) |> 
  arrange(pos)

# Convert the tibble to flextable for easy viewing
afbr1_pre_tmp2 |> 
  qflextable() |> 
  align(align = "left", part = "all") |> 
  autofit()

Table 24: Data dictionary of social group variables across the afbr rounds

ctr_name	svy_year	pos	variable	label	col_type	missing	unique_values	range
Afghanistan	2015	1	v131	ethnicity	dbl+lbl	227	10	1 - 96
Afghanistan	2015	2	v045b	language of interview	dbl+lbl	0	3	1 - 6
Afghanistan	2015	3	v045c	native language of respondent	dbl+lbl	0	3	1 - 6

The above table gives an overall snapshot of the social group variables. All the variables are of labelled class. Note that the ethnicity variable has a few missing values (227) in the afbr dataset. Next, we look at the variables individually to check their value labels.

v131 - Ethnicity of hh head

Next, we check the value labels of the v131 variable, which gives the ethnicity of household head. First we create a nested tibble of v131’s value labels.

# Create the data dictionary for v131 in nested tibble
afbr1_pre_tmp1 <- afbr1_pre_tmp0 |> 
  mutate(lookfor_v131 = map(afbr_data, \(df) {
    df |> 
      select(v131) |> 
      look_for() |> 
      lookfor_to_long_format() |> 
      select(value_labels)
  }))
afbr1_pre_tmp1

# Now we unnest the tibble and refine the pooled data dictionary
afbr1_pre_tmp2 <- afbr1_pre_tmp1 |> 
  # First we select the required cols and unnest()
  select(-c(unf, afbr_data, n_births)) |> 
  unnest(cols = c(lookfor_v131)) |> 
  # Next we make the num of value labels same across each round
  mutate(label_num = parse_number(value_labels)) |> 
  complete(ctr_name, svy_year, label_num) |>
  # Next we create col of value labels for each survey round
  pivot_wider(
    names_from = svy_year, 
    values_from = value_labels,
    names_prefix = "afbr_"
  ) |>
  # Show the variable name in a col
  mutate(var_name = "v131", .before = 2)

# Convert the tibble to flextable for easy viewing
afbr1_pre_tmp2 |> 
  qflextable() |> 
  align(align = "left", part = "all") |> 
  autofit()

Table 25: Data dictionary of v131 across the afbr rounds

ctr_name	var_name	label_num	afbr_2015
Afghanistan	v131	1	[1] pashtun
Afghanistan	v131	2	[2] tajik
Afghanistan	v131	3	[3] hazara
Afghanistan	v131	4	[4] uzbek
Afghanistan	v131	5	[5] turkmen
Afghanistan	v131	6	[6] nuristani
Afghanistan	v131	7	[7] baloch
Afghanistan	v131	8	[8] pashai
Afghanistan	v131	96	[96] other

The afbr 2015 has 9 ethnicity categories which are different from ethnicity variables in other South Asian countries like Nepal. Unfortunately, we do not know how to harmonize these categories. Therefore, we might not use this variable as a social group characteristic in the pooled dataset.

Language of interview

Next, we check the value labels of the language of interview variable. First we create a nested tibble of the value labels.

# Create the data dictionary in nested tibble
afbr1_pre_tmp1 <- afbr1_pre_tmp0 |> 
  mutate(lookfor_lang = map(afbr_data, \(df) {
    df |> 
      select(v045b) |> 
      look_for() |> 
      lookfor_to_long_format() |> 
      select(value_labels)
  }))
afbr1_pre_tmp1

# Now we unnest the tibble and refine the pooled data dictionary
afbr1_pre_tmp2 <- afbr1_pre_tmp1 |> 
  # First we select the required cols and unnest()
  select(-c(unf, afbr_data, n_births)) |> 
  unnest(cols = c(lookfor_lang)) |> 
  # Next we make the num of value labels same across each round
  mutate(label_num = parse_number(value_labels)) |> 
  complete(ctr_name, svy_year, label_num) |>
  # Next we create col of value labels for each survey round
  pivot_wider(
    names_from = svy_year, 
    values_from = value_labels,
    names_prefix = "afbr_"
  ) |>
  # Show the variable name in a col
  mutate(var_name = "v045b", .before = 2)

# Convert the tibble to flextable for easy viewing
afbr1_pre_tmp2 |> 
  qflextable() |> 
  align(align = "left", part = "all") |> 
  autofit()

Table 26: Data dictionary of language of interview variable across the afbr rounds

ctr_name	var_name	label_num	afbr_2015
Afghanistan	v045b	1	[1] dari
Afghanistan	v045b	2	[2] pashto
Afghanistan	v045b	6	[6] other

The language of interview variable has 3 categories which are similar to the native language of respondent variable.

Native language of hh respondent

Next, we check the value labels of the native language of hh respondent variable. The name of this variable can differ across DHS datasets; in afbr 2015 it is v045c. First we create a nested tibble of the value labels.

# Create the data dictionary in nested tibble
afbr1_pre_tmp1 <- afbr1_pre_tmp0 |> 
  mutate(lookfor_lang = map(afbr_data, \(df) {
    df |> 
      select(v045c) |> 
      look_for() |> 
      lookfor_to_long_format() |> 
      select(value_labels)
  }))
afbr1_pre_tmp1

# Now we unnest the tibble and refine the pooled data dictionary
afbr1_pre_tmp2 <- afbr1_pre_tmp1 |> 
  # First we select the required cols and unnest()
  select(-c(unf, afbr_data, n_births)) |> 
  unnest(cols = c(lookfor_lang)) |> 
  # Next we make the num of value labels same across each round
  mutate(label_num = parse_number(value_labels)) |> 
  complete(ctr_name, svy_year, label_num) |>
  # Next we create col of value labels for each survey round
  pivot_wider(
    names_from = svy_year, 
    values_from = value_labels,
    names_prefix = "afbr_"
  ) |>
  # Show the variable name in a col
  mutate(var_name = "v045c", .before = 2)

# Convert the tibble to flextable for easy viewing
afbr1_pre_tmp2 |> 
  qflextable() |> 
  align(align = "left", part = "all") |> 
  autofit()

Table 27: Data dictionary of native language of respondent variable across the afbr rounds

ctr_name	var_name	label_num	afbr_2015
Afghanistan	v045c	1	[1] dari
Afghanistan	v045c	2	[2] pashto
Afghanistan	v045c	6	[6] other

The native language of respondent variable has 3 categories which are similar to the language of interview variable.

Correcting year-related variables

The year-related variables might have different formatting in each survey. Therefore, we need to check and harmonize them before appending the datasets.

# First we create the data dictionary of year-related vars in nested tibble
afbr1_pre_tmp1 <- afbr1_pre_tmp0 |> 
  mutate(lookfor_year = map(afbr_data, \(df) {
    df |> 
      select(c(b2, v007, v010)) |> 
      look_for(details = "full") |> 
      # For correctly viewing the range column in data dictionary
      convert_list_columns_to_character() |> 
      select(-c(levels:n_na))
  }))
afbr1_pre_tmp1

# Now we unnest the tibble and refine the pooled data dictionary 
afbr1_pre_tmp2 <- afbr1_pre_tmp1 |> 
  select(-c(unf, afbr_data, n_births)) |> 
  unnest(cols = c(lookfor_year)) |> 
  arrange(pos)
# Convert and view the tibble as flextable
afbr1_pre_tmp2 |> 
  qflextable() |> 
  align(align = "left", part = "all") |> 
  autofit()

Table 28: Data dictionary of year-related variables across the afbr rounds

ctr_name	svy_year	pos	variable	label	col_type	unique_values	range
Afghanistan	2015	1	b2	year of birth	dbl	39	1356 - 1394
Afghanistan	2015	2	v007	year of interview	dbl	1	1394 - 1394
Afghanistan	2015	3	v010	respondent's year of birth	dbl	36	1344 - 1379

We find that none of the year variables are in Gregorian years. Take the year of interview variable: its only value is 1394, although the survey round is 2015. The values are consistent with the Solar Hijri calendar used in Afghanistan, in which 1394 corresponds to 2015-16. Therefore, we need to convert these variables before appending the datasets.
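
As a rough sketch of such a conversion, assuming the values follow the Solar Hijri calendar (the year starts around 21 March), a year can be mapped to a Gregorian year from the year and month alone. The helper `sh_to_gregorian_year()` is hypothetical and approximate: month 10 straddles the Gregorian new year, and its first ten or so days actually fall in late December. It is not the conversion used in the harmonization script.

```r
# Hypothetical helper: approximate Gregorian year for a Solar Hijri year and month.
# Months 1-9 fall between late March and late December (offset 621);
# months 10-12 fall mostly between January and mid March (offset 622).
sh_to_gregorian_year <- function(sh_year, sh_month) {
  stopifnot(all(sh_month %in% 1:12))
  offset <- ifelse(sh_month <= 9, 621, 622)
  sh_year + offset
}

# Year of interview 1394 (Table 28) for the first, ninth and twelfth month
greg_years <- sh_to_gregorian_year(1394, c(1, 9, 12))
greg_years
```

A day-level conversion would need the day of the month as well, which the year variables in Table 28 do not carry.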

Afghanistan PR dataset use for family structure variables creation

Checking the ID variables before harmonization

Here we check the formatting of the constituent variables with which we will prepare the ID variables for the pooled Afghanistan person recode (PR) dataset. We will use the following constituent variables for creating the ID variables for the pooled dataset: cluster number (hv001), household number (hv002) and line number (hvidx), along with the survey year.

# We check the var type of ID vars in all afpr datasets.
# First we create a data dictionary of the afpr datasets in nested tibble.
afpr1_pre_tmp1 <- afpr1_pre_tmp0 |>
  mutate(lookfor_idvars = map(afpr_data, \(df) {
    df |> 
      select(hv001, hv002, hvidx) |> 
      lookfor(details = "full") |> 
      select(-c(levels:n_na)) |> 
      # For correctly viewing the range column in data dictionary
      convert_list_columns_to_character()
  }))
afpr1_pre_tmp1

# Now we unnest the tibble and output the pooled data dictionary 
afpr1_pre_tmp2 <- afpr1_pre_tmp1 |> 
  select(c(ctr_name, svy_year, lookfor_idvars)) |> 
  unnest(cols = c(lookfor_idvars)) |> 
  arrange(pos)

# Convert and view the tibble as flextable
afpr1_pre_tmp2 |> 
  qflextable() |> 
  align(align = "left", part = "all") |> 
  autofit()

Table 29: Data dictionary of variables to be used for ID creation across the afpr rounds

ctr_name	svy_year	pos	variable	label	col_type	unique_values	range
Afghanistan	2015	1	hv001	cluster number	dbl	956	1 - 999
Afghanistan	2015	2	hv002	household number	dbl	352	1 - 952
Afghanistan	2015	3	hvidx	line number	dbl	48	1 - 48

From the above table we can see that all three constituent ID variables are of numeric class with no missing values. These variables can be used directly for preparing the ID variables once we know the number of digits in their largest values. Note that survey year is also a constituent ID variable, of 4 digits, and we need not check it.

# We thought to process the above nested tibble further by decomposing the 
# "range" col into min and max values using separate_wider_regex().
# However, we hit a roadblock as pattern did not identify the max values in 
# some afpr rounds correctly
afpr1_pre_tmp3 <- afpr1_pre_tmp0 |> 
  # Generate the summary stats for id vars
  mutate(skim_idvars = map(afpr_data, \(df) {
    df |> 
      select(hv001, hv002, hvidx) |> 
      skim_without_charts()
  })) |> 
  # Pool the summary stats for all afpr rounds
  select(c(ctr_name, svy_year, skim_idvars)) |> 
  unnest(cols = c(skim_idvars)) |> 
  arrange(skim_variable, svy_year) |> 
  # Group and generate the max and min values for each variable
  group_by(variable = skim_variable) |> 
  summarize(
    min_val = min(numeric.p0),
    max_val = max(numeric.p100)
  ) |> 
  # calculate the num of digits in the maximum values
  mutate(
    max_digits = nchar(as.character(max_val))
  ) |>
  # add variable labels and relocate it after variable name
  bind_cols(vlabel = c("cluster number", "household number", "Persons line number")) |>
  relocate(vlabel, .after = 1)

# Convert the tibble to flextable for easy viewing
afpr1_pre_tmp3 |>
  qflextable() |>
  align(align = "left", part = "all") |>
  autofit()

Table 30: The maximum length of constituent ID variables to be set across the afpr rounds

variable	vlabel	min_val	max_val	max_digits
hv001	cluster number	1	999	3
hv002	household number	1	952	3
hvidx	Persons line number	1	48	2

The above table gives the required length of each constituent ID variable, so that we can correctly concatenate them to create the ID variables. The required length of each variable is given in the max_digits column. Note that survey year is also a constituent ID variable, of 4 digits.
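
The widths in Table 30 matter because each component has to be zero-padded before concatenation; otherwise different combinations can collapse into the same string. A minimal sketch, assuming a hypothetical helper `make_person_id()` and an ordering of survey year, cluster, household and line number (the actual ID layout in the harmonization script may differ):

```r
# Hypothetical helper: build a fixed-width person ID from its components.
# Widths follow Table 30: 4 (year) + 3 (hv001) + 3 (hv002) + 2 (hvidx) = 12.
make_person_id <- function(svy_year, hv001, hv002, hvidx) {
  paste0(
    sprintf("%04d", as.integer(svy_year)),
    sprintf("%03d", as.integer(hv001)),
    sprintf("%03d", as.integer(hv002)),
    sprintf("%02d", as.integer(hvidx))
  )
}

# Two illustrative persons at the extremes of the observed ranges
ids <- make_person_id(2015, hv001 = c(1, 999), hv002 = c(952, 1), hvidx = c(48, 1))
ids
```

Keeping the ID as a character string avoids losing leading zeros or running into numeric precision limits on long IDs.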

Checking Family structure variables before harmonization

Here we check the family structure related variables before harmonizing them. The variable names were collected by manually checking the full data dictionaries. Here we will check the data dictionary of these hh-level variables and focus on the variable types.

# We check the family structure vars in all afpr datasets.
# First we create the data dictionary in nested tibble.
afpr1_pre_tmp1 <- afpr1_pre_tmp0 |>
  mutate(lookfor_famstrvars = map(afpr_data, \(df) {
    df |> 
      # select the family structure variables
      select(c(hv101, hv102, hv103, hv104, hv105)) |> 
      lookfor(details = "full") |> 
      select(-c(levels:n_na)) |> 
      # For correctly viewing the range column in data dictionary
      convert_list_columns_to_character()
  }))
afpr1_pre_tmp1

# Now we unnest the tibble and refine the pooled data dictionary 
afpr1_pre_tmp2 <- afpr1_pre_tmp1 |> 
  select(c(svy_year, lookfor_famstrvars)) |> 
  unnest(cols = c(lookfor_famstrvars)) |> 
  arrange(pos, svy_year)

# Convert the tibble to flextable for easy viewing
afpr1_pre_tmp2 |> 
  qflextable() |> 
  align(align = "left", part = "all") |> 
  autofit()

Table 31: Data dictionary of family structure vars across the afpr rounds

svy_year	pos	variable	label	col_type	missing	unique_values	range
2015	1	hv101	relationship to head	dbl+lbl	4	13	1 - 98
2015	2	hv102	usual resident	dbl+lbl	11	3	0 - 1
2015	3	hv103	slept last night	dbl+lbl	96	3	0 - 1
2015	4	hv104	sex of household member	dbl+lbl	0	2	1 - 2
2015	5	hv105	age of household members	dbl+lbl	5	98	0 - 98

The above table gives an overall snapshot of the family structure related variables. Interestingly, all the variables including age of hh members (a continuous var) are of labelled class. The relationship to head, de jure, de facto and age of household member variables have a few missing values in afpr 2015. Next, we check the value labels of each of the variables.

hv101 - Relationship to head

Next, we check the value labels of the relationship to the household head variable. First we create a nested tibble of the value labels.

# Create the data dictionary in nested tibble
afpr1_pre_tmp1 <- afpr1_pre_tmp0 |> 
  mutate(lookfor_hv101 = map(afpr_data, \(df) {
    df |> 
      select(hv101) |> 
      look_for() |> 
      lookfor_to_long_format() |> 
      select(value_labels)
  }))
afpr1_pre_tmp1

# Now we unnest the tibble and refine the pooled data dictionary
afpr1_pre_tmp2 <- afpr1_pre_tmp1 |> 
  # First we select the required cols and unnest()
  select(c(ctr_name, svy_year, lookfor_hv101)) |> 
  unnest(cols = c(lookfor_hv101)) |> 
  # Next we make the num of value labels same across each round
  mutate(label_num = parse_number(value_labels)) |> 
  complete(ctr_name, svy_year, label_num) |>
  # Next we create col of value labels for each survey round
  pivot_wider(
    names_from = svy_year, 
    values_from = value_labels,
    names_prefix = "afpr_"
  ) |>
  # Show the variable name in a col
  mutate(var_name = "hv101", .before = 2)

# Convert the tibble to flextable for easy viewing
afpr1_pre_tmp2 |> 
  qflextable() |> 
  align(align = "left", part = "all") |> 
  autofit()

Table 32: Data dictionary of relationship to head variable across the afpr rounds

ctr_name	var_name	label_num	afpr_2015
Afghanistan	hv101	1	[1] head
Afghanistan	hv101	2	[2] wife or husband
Afghanistan	hv101	3	[3] son/daughter
Afghanistan	hv101	4	[4] son/daughter-in-law
Afghanistan	hv101	5	[5] grandchild
Afghanistan	hv101	6	[6] parent
Afghanistan	hv101	7	[7] parent-in-law
Afghanistan	hv101	8	[8] brother/sister
Afghanistan	hv101	9	[9] co-spouse
Afghanistan	hv101	10	[10] other relative
Afghanistan	hv101	11	[11] adopted/foster child
Afghanistan	hv101	12	[12] not related
Afghanistan	hv101	13	[13] niece/nephew by blood
Afghanistan	hv101	14	[14] niece/nephew by marriage
Afghanistan	hv101	98	[98] don't know

The above table shows that the value label texts in afpr are similar to the DHS datasets of other South Asian countries. To harmonize the relationship to head variable we can use the following value labels -

1 head
2 spouse
3 child
4 child-in-law
5 grandchild
6 parent
7 parent-in-law
8 sibling
9 others

Here, we merge the “wife or husband” and “co-spouse” categories into the “spouse” category, and the “son/daughter” and “adopted/foster child” categories into the “child” category.
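
The recoding described above can be sketched with a lookup vector in base R. The helper `harmonize_hv101()` is hypothetical. Two choices in it are assumptions that the text does not spell out: codes 10, 12, 13 and 14 are sent to “others”, and code 98 (don't know) is left as missing.

```r
# Lookup from afpr 2015 hv101 codes (Table 32) to the harmonized labels.
# Assumption: codes 10, 12, 13, 14 -> "others"; code 98 has no entry and becomes NA.
hv101_map <- c(
  "1" = "head", "2" = "spouse", "3" = "child", "4" = "child-in-law",
  "5" = "grandchild", "6" = "parent", "7" = "parent-in-law", "8" = "sibling",
  "9" = "spouse", "10" = "others", "11" = "child", "12" = "others",
  "13" = "others", "14" = "others"
)

hv101_levels <- c("head", "spouse", "child", "child-in-law", "grandchild",
                  "parent", "parent-in-law", "sibling", "others")

# Hypothetical helper: map raw codes to a factor with the 9 harmonized levels
harmonize_hv101 <- function(x) {
  out <- unname(hv101_map[as.character(as.integer(x))])
  factor(out, levels = hv101_levels)
}

rel_head <- harmonize_hv101(c(1, 2, 9, 3, 11, 10, 98))
rel_head
```

Fixing the factor levels up front keeps the category order identical across countries when the datasets are pooled.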

hv102 - de jure/usual resident

Next, we check the value labels of the de jure resident variable. It indicates whether a household member is a usual resident of the household. First we create a nested tibble of the value labels.

# Create the data dictionary in nested tibble
afpr1_pre_tmp1 <- afpr1_pre_tmp0 |> 
  mutate(lookfor_hv102 = map(afpr_data, \(df) {
    df |> 
      select(hv102) |> 
      look_for() |> 
      lookfor_to_long_format() |> 
      select(value_labels)
  }))
afpr1_pre_tmp1

# Now we unnest the tibble and refine the pooled data dictionary
afpr1_pre_tmp2 <- afpr1_pre_tmp1 |> 
  # First we select the required cols and unnest()
  select(c(ctr_name, svy_year, lookfor_hv102)) |> 
  unnest(cols = c(lookfor_hv102)) |> 
  # Next we make the num of value labels same across each round
  mutate(label_num = parse_number(value_labels)) |> 
  complete(ctr_name, svy_year, label_num) |>
  # Next we create col of value labels for each survey round
  pivot_wider(
    names_from = svy_year, 
    values_from = value_labels,
    names_prefix = "afpr_"
  ) |>
  # Show the variable name in a col
  mutate(var_name = "hv102", .before = 2)

# Convert the tibble to flextable for easy viewing
afpr1_pre_tmp2 |> 
  qflextable() |> 
  align(align = "left", part = "all") |> 
  autofit()

Table 33: Data dictionary of the de jure resident variable across the afpr rounds

ctr_name	var_name	label_num	afpr_2015
Afghanistan	hv102	0	[0] no
Afghanistan	hv102	1	[1] yes

The above table shows that the value label texts and codes of hv102 are similar to those in the DHS surveys of other South Asian countries. Therefore, we can use this variable directly after converting it to factor type.
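As a minimal sketch of the factor conversion mentioned above, the toy example below builds a labelled vector with the same codes as hv102 and converts it with haven::as_factor(). The object names hv102_toy and hv102_fct are hypothetical, and the actual harmonization code may use a different conversion function.

```r
library(haven)

# Toy labelled vector mimicking hv102 (0 = no, 1 = yes)
hv102_toy <- labelled(c(0, 1, 1, 0, 1), labels = c(no = 0, yes = 1))

# Convert to factor, using the value labels as the factor levels
hv102_fct <- as_factor(hv102_toy, levels = "labels")

# Inspect the resulting levels and their frequencies
levels(hv102_fct)
table(hv102_fct)
```

The same call should apply to hv103, since it has the same codes and labels.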

hv103 - de facto resident

Next, we check the value labels of the de facto resident variable. In DHS, this indicates whether a household member slept in the household last night. First, we create a nested tibble of the value labels.

# Create the data dictionary in nested tibble
afpr1_pre_tmp1 <- afpr1_pre_tmp0 |> 
  mutate(lookfor_hv103 = map(afpr_data, \(df) {
    df |> 
      select(hv103) |> 
      look_for() |> 
      lookfor_to_long_format() |> 
      select(value_labels)
  }))
afpr1_pre_tmp1

# Now we unnest the tibble and refine the pooled data dictionary
afpr1_pre_tmp2 <- afpr1_pre_tmp1 |> 
  # First we select the required cols and unnest()
  select(c(ctr_name, svy_year, lookfor_hv103)) |> 
  unnest(cols = c(lookfor_hv103)) |> 
  # Next we make the num of value labels same across each round
  mutate(label_num = parse_number(value_labels)) |> 
  complete(ctr_name, svy_year, label_num) |>
  # Next we create col of value labels for each survey round
  pivot_wider(
    names_from = svy_year, 
    values_from = value_labels,
    names_prefix = "afpr_"
  ) |>
  # Show the variable name in a col
  mutate(var_name = "hv103", .before = 2)

# Convert the tibble to flextable for easy viewing
afpr1_pre_tmp2 |> 
  qflextable() |> 
  align(align = "left", part = "all") |> 
  autofit()

Table 34: Data dictionary of the de facto resident variable across the afpr rounds

ctr_name	var_name	label_num	afpr_2015
Afghanistan	hv103	0	[0] no
Afghanistan	hv103	1	[1] yes

The above table shows that the value label texts and codes of hv103 are similar to those in the DHS surveys of other South Asian countries. Therefore, we can use this variable directly after converting it to factor type.
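The hv101, hv102 and hv103 checks above repeat the same two steps, changing only the variable name. One possible way to reduce the repetition is a small helper function. The sketch below is hypothetical: make_label_dict() and toy_nested are not part of this document. It assumes the nested tibble has the columns ctr_name, svy_year and afpr_data, as afpr1_pre_tmp0 appears to have.

```r
library(dplyr)
library(tidyr)
library(purrr)
library(readr)
library(haven)
library(labelled)

# Hypothetical helper: build the value-label dictionary for one variable
make_label_dict <- function(nested_tbl, dict_var) {
  nested_tbl |>
    # Extract the value labels of the variable from each survey round
    mutate(lookfor = map(afpr_data, \(df) {
      df |>
        select(all_of(dict_var)) |>
        look_for() |>
        lookfor_to_long_format() |>
        select(value_labels)
    })) |>
    select(ctr_name, svy_year, lookfor) |>
    unnest(cols = lookfor) |>
    # Make the number of value labels the same across rounds
    mutate(label_num = parse_number(value_labels)) |>
    complete(ctr_name, svy_year, label_num) |>
    # One column of value labels per survey round
    pivot_wider(
      names_from = svy_year,
      values_from = value_labels,
      names_prefix = "afpr_"
    ) |>
    mutate(var_name = dict_var, .before = 2)
}

# Toy nested tibble standing in for afpr1_pre_tmp0
toy_nested <- tibble(
  ctr_name = "Afghanistan",
  svy_year = 2015,
  afpr_data = list(
    tibble(hv103 = labelled(c(0, 1, 1), labels = c(no = 0, yes = 1)))
  )
)

dict_hv103 <- make_label_dict(toy_nested, "hv103")
dict_hv103
```

With such a helper, each variable check would reduce to one call followed by the flextable conversion.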

START FROM HERE

TASK:

Handling multiple births in death scarring vars may not be necessary.
Preceding birth interval construction has changed with DHS-7. We could re-construct it.

TO BE CONTINUED …