PKDHS data pooling pre-checks

Getting started

Here we show the pre-requisite code sections. Run these at the outset to avoid errors. First we load the required packages.

easypackages::libraries(
  # Data i/o
  "here",                 # relative file path
  "rio",                  # file import-export
  
  # Data manipulation
  "janitor",              # data cleaning fns
  "haven",                # stata, sas, spss data io
  "labelled",             # var labelling
  "readxl",               # excel sheets
  # "scales",               # to change formats and units
  "skimr",                # quick data summary
  "broom",                # view model results
  
  # Data analysis
  "DHS.rates",            # demographic rates for dhs-like surveys
  "GeneralOaxaca",        # BO decomposition for non-linear
  "survey",               # apply survey weights
  
  # Analysis output
  "gt",
  # "modelsummary",          # output summary tables
  "gtsummary",            # output summary tables
  "flextable",            # creating tables from objects
  "officer",              # editing in office docs
  
  # R graph related packages
  "ggstats",
  "RColorBrewer",
  # "scales",
  "patchwork",
  
  # Misc packages
  "tidyverse",            # Data manipulation iron man
  "tictoc"                # Code timing
)

Next we turn off scientific notations.

options(scipen = 999)

Next we set the default gtsummary print engine for tables.

theme_gtsummary_printer(print_engine = "flextable")

Now we set the flextable output defaults.

set_flextable_defaults(
  font.size = 11,
  text.align = "left",
  big.mark = "",
  background.color = "white",
  table.layout = "autofit",
  theme_fun = theme_vanilla
)

Document introduction

Here we document the variable codes and labels of variables across all the Pakistan Demographic and Health Survey (DHS) datasets. We check the variable labels and codes before running the pooling code in “daprep-v01_pkdhs.R”. We pool the following Pakistan DHS surveys:

# Creating the table of surveys to be used for pooling
pkbr1_tmp_intro |> 
  mutate(n_births = prettyNum(n_births, big.mark = ",")) |> 
  select(c(ctr_name, svy_year, n_births)) |> 
  # Join vars from pkir_tmp_intro
  left_join(
    pkir1_tmp_intro |> 
      mutate(n_women = prettyNum(n_women, big.mark = ",")) |> 
      select(c(year, n_women)),
    by = join_by(svy_year == year)
  ) |> 
  # Join vars from pkhr_tmp_intro
  left_join(
    pkhr1_tmp_intro |> 
      mutate(n_households = prettyNum(n_households, big.mark = ",")) |> 
      select(svy_year, n_households),
    by = join_by(svy_year)
  ) |> 
  # Join vars from pkpr_tmp_intro
  left_join(
    pkpr1_tmp_intro |> 
      mutate(n_persons = prettyNum(n_persons, big.mark = ",")) |> 
      select(svy_year, n_persons),
    by = join_by(svy_year)
  ) |> 
  # convert nested tibble to simple tibble
  unnest(cols = c()) |> 
  mutate(
    ccode = row_number(), 
    .before = ctr_name
  ) |> 
  # convert to flextable object
  qflextable() |> 
  align(align = "left", part = "all") |> 
  autofit()

Table 1: Pakistan DHS datasets and their sample size to be used for pooling

ccode	ctr_name	svy_year	n_births	n_women	n_households	n_persons
1	Pakistan	1990	27,369	6,611	7,193	52,358
2	Pakistan	2006	39,049	10,023	95,441	727,493
3	Pakistan	2012	50,238	13,558	12,943	94,169
4	Pakistan	2017	50,495	15,068	14,540	100,869

We use the following variables for the pooled data analysis:

Dependent variable
- infantd = Index child died during infancy period (0-11 months)
Main Independent variable
- sibsurv_nmv = Survival status of preceding child (Death scarring)
- binterval_3c_nmv_opp = Birth interval preceding to index child
Independent variables [CHILD LEVEL]
- cyob10y_opp = Birth cohort of index child
- bord_c = Birth order of index child
- sex_fm = Gender of index child
- season = Season during birth
Independent variables [MOTHER/PARENT LEVEL]
- ~~myob_opp = Birth cohort of mother~~
- macb_c_opp = Mother’s age during birth of index child
- medu_opp = Mother’s Level of education
- fedu_opp = Father’s level of education
Independent variables [HOUSEHOLD LEVEL]
- religion = Religion
- nat_lang = Native language of respondent
- wi_qt_opp = Household wealth quintile
- ~~hhgen_2c_opp = Generations in household~~
- hhstruc_opp = Household structure
- head_sex_fm = Sex of HH head
Independent variables [COMMUNITY LEVEL]
- por = Place of residence of the household
- ecoreg = Ecological region

Note: (a) Crossed names indicates variable not included.

Data import

We will directly import the nested tibble here. The code for dataset preparation is in the “daprep-v01_pkdhs.R” script file.

# Here we import the tibbles to be used for dataset checking
# Import the pkbr nested tibble
pkbr1_pre_tmp0 <- read_rds(file = here("website_data", "pkbr1_nest0.rds"))
# Import the pkhr nested tibble
pkhr1_pre_tmp0 <- read_rds(file = here("website_data", "pkhr1_nest0.rds"))
# Import the pkpr nested tibble
pkpr1_pre_tmp0 <- read_rds(file = here("website_data", "pkpr1_nest0.rds"))

Pakistan BR dataset use for variable creation

Checking the Women’s weight variable before harmonization

We will check the formatting of the v005 women’s weight variable before creating the pooled survey weight. For this we will use the labelled::look_for().

# First we create the data dictionary of v005 in nested tibble
pkbr1_pre_tmp1 <- pkbr1_pre_tmp0 |> 
  mutate(lookfor_v005 = map(
    pkbr_data,
    \(df) {
      df |> 
        select(v005) |> 
        look_for(details = "full") |> 
        # For correctly viewing the range column in data dictionary
        convert_list_columns_to_character() |> 
        select(-c(levels:n_na))
    }
  ))
pkbr1_pre_tmp1

# Now we unnest the tibble and refine the pooled data dictionary 
pkbr1_pre_tmp2 <- pkbr1_pre_tmp1 |> 
  select(-c(unf, pkbr_data, n_births)) |> 
  unnest(cols = c(lookfor_v005)) |> 
  select(-pos) 
# Convert and view the tibble as flextable
pkbr1_pre_tmp2 |> 
  qflextable() |> 
  align(align = "left", part = "all") |> 
  autofit()

Table 2: Data dictionary of v005 variable across the pkbr rounds

ctr_name	svy_year	variable	label	col_type	unique_values	range
Pakistan	1990	v005	sample weight	dbl	407	13412 - 9562829
Pakistan	2006	v005	sample weight	dbl	970	21531 - 4605956
Pakistan	2012	v005	Women's individual sample weight (6 decimals)	dbl	491	10191 - 7939165
Pakistan	2017	v005	women's individual sample weight (6 decimals)	dbl	451	0 - 4095189

The women’s weight variables are in numeric class and have no missing values. However, in Pakistan DHS 2017 some observations have weight 0. Find out why?? Therefore, we need to reformat them. Before we directly use it for preparing the pooled survey weight.

TBD

Checking the ID variables before harmonization

Here we check the formatting of the variables using which we will prepare the ID variables for the pooled Pakistan birth history recode (br) dataset. We will use the following constituent variables for creating the ID variables for the pooled dataset:

# We check the var type of ID vars in all pkbr datasets.
# First we create a data dictionary of the pkbr datasets in nested tibble.
pkbr1_pre_tmp1 <- pkbr1_pre_tmp0 |>
  mutate(lookfor_idvars = map(
    pkbr_data,
    \(df) {
      df |> 
        select(v001, v002, v003, bord, v021, v022, v023, v024) |> 
        lookfor(details = "full") |> 
        select(-c(levels:n_na)) |> 
        # For correctly viewing the range column in data dictionary
        convert_list_columns_to_character()
    }
  ))
pkbr1_pre_tmp1

# Now we unnest the tibble and output the pooled data dictionary 
pkbr1_pre_tmp2 <- pkbr1_pre_tmp1 |> 
  select(-c(unf, pkbr_data, n_births)) |> 
  unnest(cols = c(lookfor_idvars)) |> 
  arrange(pos)

# Convert and view the tibble as flextable
pkbr1_pre_tmp2 |> 
  qflextable() |> 
  align(align = "left", part = "all") |> 
  autofit()

Table 3: Data dictionary of variables to be used for ID creation across the pkbr rounds

ctr_name	svy_year	pos	variable	label	col_type	unique_values	range
Pakistan	1990	1	v001	cluster number	dbl	407	1101001 - 4204052
Pakistan	2006	1	v001	cluster number	dbl	970	1 - 980
Pakistan	2012	1	v001	Cluster number	dbl	498	1 - 500
Pakistan	2017	1	v001	cluster number	dbl	561	1 - 580
Pakistan	1990	2	v002	household number	dbl	25	1 - 25
Pakistan	2006	2	v002	household number	dbl	87	1 - 104
Pakistan	2012	2	v002	Household number	dbl	28	1 - 28
Pakistan	2017	2	v002	household number	dbl	29	1 - 29
Pakistan	1990	3	v003	respondent's line number	dbl	31	1 - 53
Pakistan	2006	3	v003	respondent's line number	dbl	35	1 - 43
Pakistan	2012	3	v003	Respondent's line number	dbl	39	1 - 47
Pakistan	2017	3	v003	respondent's line number	dbl	31	1 - 40
Pakistan	1990	4	bord	birth order number	dbl	16	1 - 16
Pakistan	2006	4	bord	birth order number	dbl	16	1 - 16
Pakistan	2012	4	bord	Birth order number	dbl	19	1 - 19
Pakistan	2017	4	bord	birth order number	dbl	15	1 - 15
Pakistan	1990	5	v021	primary sampling unit	dbl	407	1 - 407
Pakistan	2006	5	v021	primary sampling unit	dbl	970	1 - 980
Pakistan	2012	5	v021	Primary sampling unit	dbl	498	1 - 500
Pakistan	2017	5	v021	primary sampling unit	dbl	561	1 - 580
Pakistan	1990	6	v022	sample stratum number	dbl	180	1 - 183
Pakistan	2006	6	v022	sample stratum number	dbl	483	1 - 483
Pakistan	2012	6	v022	Sample strata for sampling errors	dbl	78	1 - 78
Pakistan	2017	6	v022	sample strata for sampling errors	dbl+lbl	16	1 - 16
Pakistan	1990	7	v023	sample domain	dbl+lbl	12	11 - 43
Pakistan	2006	7	v023	sample domain	dbl+lbl	4	1 - 4
Pakistan	2012	7	v023	Stratification used in sample design	dbl+lbl	10	1 - 10
Pakistan	2017	7	v023	stratification used in sample design	dbl+lbl	16	1 - 16
Pakistan	1990	8	v024	region	dbl+lbl	4	1 - 4
Pakistan	2006	8	v024	region	dbl+lbl	4	1 - 4
Pakistan	2012	8	v024	Region	dbl+lbl	6	1 - 6
Pakistan	2017	8	v024	region	dbl+lbl	8	1 - 8

From the above we can see that v023 and v024 are of labelled class, while the rest are in numeric class. Therefore, we will check the numeric and labelled variables in different ways. Note that although survey year is a constituent ID variable we have not checked it. It is imperative that survey year would be a 4-digit variable.

Numeric ID variables check

First, let’s find out the required length of the numeric ID variables by checking the maximum values of the constituent ID variable across the Pakistan DHS datasets. Here we estimate the summary stats of numeric constituent variables using skim_without_charts().

# Check the summary stats for ID vars using skimr in each pkbr dataset.
# First we estimate the summary stats using skim_without_charts().
pkbr1_pre_tmp1 <- pkbr1_pre_tmp0 |> 
  mutate(
    skim_id_num = map(
      pkbr_data,
      function(df) {
        df |> 
          select(v001, v002, v003, bord, v021, v022) |> 
          skim_without_charts() |> 
          as_tibble() |> 
          select(-c(skim_type, n_missing, complete_rate)) |> 
          rename(
            variable = 1,
            mean = 2,
            sd = 3,
            min = 4,
            p25 = 5,
            p50 = 6,
            p75 = 7,
            max = 8
          )
      }
    )
  )
pkbr1_pre_tmp1

Next, we check the summary stats of numeric variables by variable name-wise.

# Now we unnest the nested tibble so that we can compare the variable length 
# across the pkbr datasets.
pkbr1_pre_tmp2 <- pkbr1_pre_tmp1 |> 
  select(-c(unf, pkbr_data, n_births)) |> 
  unnest(cols = c(skim_id_num)) |> 
  arrange(variable, svy_year) |> 
  # change the decimal places of selected variables
  mutate(
    mean = sprintf("%.1f", mean),
    sd = sprintf("%.1f", sd),
    p75 = sprintf("%.0f", p75)
  )
# Convert the tibble to flextable for easy viewing
pkbr1_pre_tmp2 |> 
  qflextable() |> 
  align(align = "left", part = "all") |> 
  autofit()

Table 4: Summary statistics of the numeric ID variables

ctr_name	svy_year	variable	mean	sd	min	p25	p50	p75	max
Pakistan	1990	bord	3.6	2.4	1	2	3	5	16
Pakistan	2006	bord	3.5	2.3	1	2	3	5	16
Pakistan	2012	bord	3.4	2.3	1	2	3	5	19
Pakistan	2017	bord	3.1	2.1	1	1	3	4	15
Pakistan	1990	v001	2366771.9	1031851.2	1101001	1217015	2201003	3201024	4204052
Pakistan	2006	v001	510.9	284.3	1	272	514	759	980
Pakistan	2012	v001	251.6	138.6	1	134	259	370	500
Pakistan	2017	v001	283.9	165.2	1	141	279	414	580
Pakistan	1990	v002	10.3	6.0	1	5	10	15	25
Pakistan	2006	v002	51.7	29.1	1	27	49	77	104
Pakistan	2012	v002	14.4	8.1	1	7	14	21	28
Pakistan	2017	v002	14.3	8.1	1	7	14	21	29
Pakistan	1990	v003	2.9	2.7	1	2	2	2	53
Pakistan	2006	v003	3.3	3.4	1	2	2	2	43
Pakistan	2012	v003	3.2	3.3	1	2	2	2	47
Pakistan	2017	v003	3.0	2.7	1	2	2	3	40
Pakistan	1990	v021	215.0	113.9	1	120	226	312	407
Pakistan	2006	v021	510.9	284.3	1	272	514	759	980
Pakistan	2012	v021	251.6	138.6	1	134	259	370	500
Pakistan	2017	v021	283.9	165.2	1	141	279	414	580
Pakistan	1990	v022	98.8	50.8	1	57	106	141	183
Pakistan	2006	v022	253.8	140.4	1	136	256	377	483
Pakistan	2012	v022	43.5	22.9	1	26	44	64	78
Pakistan	2017	v022	6.8	4.6	1	3	6	10	16

Now we find out the required length of the numeric ID variables to be set, so that we can correctly concatenate them to create the ID variables. The required length of the numeric ID variables are given in max_digits column. Note that survey year is also a constituent ID variable of 4-digits.

# Processing the above nested tibble further
pkbr1_pre_tmp3 <- pkbr1_pre_tmp2 |> 
  group_by(variable) |> 
  # find the minimum and maximum values across surveys 
  summarize(
    min_val = min(min),
    max_val = max(max)
  ) |> 
  mutate(
    # calculate the num of digits in the maximum values
    max_digits = nchar(as.character(max_val)),
    # convert char var to factor
    variable = fct(
      variable, 
      levels = c("v001", "v002", "v003", "bord", "v021", "v022")
    )
  ) |> 
  # sort the rows by factor levels 
  arrange(variable) |> 
  # add variable labels and relocate it after variable name.
  bind_cols(vlabel = c("cluster number", "household number", 
                       "respondent's line number", "birth order", 
                       "primary sampling unit", "sample strata for se")) |> 
  relocate(vlabel, .after = 1)

# Convert the tibble to flextable for easy viewing
pkbr1_pre_tmp3 |> 
  qflextable() |> 
  align(align = "left", part = "all") |> 
  autofit()

Table 5: The maximum length of numeric variables to be set across the pkbr rounds for concatenating the ID variables

variable	vlabel	min_val	max_val	max_digits
v001	cluster number	1	4204052	7
v002	household number	1	104	3
v003	respondent's line number	1	53	2
bord	birth order	1	19	2
v021	primary sampling unit	1	980	3
v022	sample strata for se	1	483	3

Labelled ID variables check

First we check the labels in sub-national region variable coded as v024 across the pkbr datasets. Let’s create a nested tibble of v024’s value labels.

# Create the data dictionary for v024 in nested tibble
pkbr1_pre_tmp1 <- pkbr1_pre_tmp0 |> 
  mutate(lookfor_v024 = map(
    pkbr_data,
    \(df) {
      df |> 
        select(v024) |> 
        look_for() |> 
        lookfor_to_long_format() |> 
        select(value_labels)
    }
  ))
pkbr1_pre_tmp1

Now we view the value labels of v024 in the table below.

# Now we unnest the tibble and refine the pooled data dictionary 
pkbr1_pre_tmp2 <- pkbr1_pre_tmp1 |> 
  # First we select the required cols and unnest()
  select(-c(unf, pkbr_data, n_births)) |> 
  unnest(cols = c(lookfor_v024)) |> 
  # Next we make the num of value labels same across each round
  mutate(label_num = parse_number(value_labels)) |> 
  complete(ctr_name, svy_year, label_num) |>
  # Next we create col of value labels for each survey round
  pivot_wider(
    names_from = svy_year, 
    values_from = value_labels,
    names_prefix = "pkbr_"
  ) |> 
  # Show the variable name in a col
  mutate(var_name = "v024", .before = 2)

# Convert the tibble to flextable for easy viewing
pkbr1_pre_tmp2 |> 
  qflextable() |> 
  align(align = "left", part = "all") |> 
  autofit()

Table 6: Data dictionary of v024 across the pkbr rounds

ctr_name	var_name	label_num	pkbr_1990	pkbr_2006	pkbr_2012	pkbr_2017
Pakistan	v024	1	[1] punjab	[1] punjab	[1] Punjab	[1] punjab
Pakistan	v024	2	[2] sindh	[2] sindh	[2] Sindh	[2] sindh
Pakistan	v024	3	[3] nw frontier	[3] nwfp	[3] Khyber Pakhtunkhwa	[3] kpk
Pakistan	v024	4	[4] balochistan	[4] balochistan	[4] Balochistan	[4] balochistan
Pakistan	v024	5			[5] Gilgit Baltistan	[5] gb
Pakistan	v024	6			[6] Islamabad (ICT)	[6] ict
Pakistan	v024	7				[7] ajk
Pakistan	v024	8				[8] fata

NOTE: The sub-national region var, v024 has different label values in each survey year. It was same for pkbr 1996, 2001 and 2006. After that the label values are different for each survey round.
VERD: In this analysis, we do not use the region var in the ID var.

Secondly, we check the labels in v023 variable that denotes the stratifications used for sampling design. First we create a nested tibble of v023’s value labels.

# Create the data dictionary for v023 in nested tibble
pkbr1_pre_tmp1 <- pkbr1_pre_tmp0 |> 
  mutate(lookfor_v023 = map(
    pkbr_data,
    \(df) {
      df |> 
        select(v023) |> 
        look_for() |> 
        lookfor_to_long_format() |> 
        select(value_labels)
    }
  ))
pkbr1_pre_tmp1

Now we view the value labels of v023 in the table below.

# Now we unnest the tibble and refine the pooled data dictionary
pkbr1_pre_tmp2 <- pkbr1_pre_tmp1 |> 
  # First we select the required cols and unnest()
  select(-c(unf, pkbr_data, n_births)) |> 
  unnest(cols = c(lookfor_v023)) |> 
  # Next we make the num of value labels same across each round
  mutate(label_num = parse_number(value_labels)) |> 
  complete(ctr_name, svy_year, label_num) |>
  # Next we create col of value labels for each survey round
  pivot_wider(
    names_from = svy_year, 
    values_from = value_labels,
    names_prefix = "pkbr_"
  ) |>
  # Show the variable name in a col
  mutate(var_name = "v023", .before = 2) 

# Convert the tibble to flextable for easy viewing
pkbr1_pre_tmp2 |> 
  qflextable() |> 
  align(align = "left", part = "all") |> 
  autofit()

Table 7: Data dictionary of v023 across the pkbr rounds

ctr_name	var_name	label_num	pkbr_1990	pkbr_2006	pkbr_2012	pkbr_2017
Pakistan	v023	1		[1] punjab	[1] Punjab urban	[1] punjab rural
Pakistan	v023	2		[2] sindh	[2] Punjab rural	[2] punjab urban
Pakistan	v023	3		[3] nwfp	[3] Sindh urban	[3] sindh rural
Pakistan	v023	4		[4] balochistan	[4] Sindh rural	[4] sindh urban
Pakistan	v023	5			[5] KPK urban	[5] kpk rural
Pakistan	v023	6			[6] KPK rural	[6] kpk urban
Pakistan	v023	7			[7] Balochistan urban	[7] balochistan rural
Pakistan	v023	8			[8] Balochistan rural	[8] balochistan urban
Pakistan	v023	9			[9] Gilgit Baltistan	[9] gb rural
Pakistan	v023	10			[10] Islamabad (ICT)	[10] gb urban
Pakistan	v023	11	[11] punjab large city			[11] ict rural
Pakistan	v023	12	[12] punjab small city			[12] ict urban
Pakistan	v023	13	[13] punjab rural			[13] ajk rural
Pakistan	v023	14				[14] ajk urban
Pakistan	v023	15				[15] fata rural
Pakistan	v023	16				[16] fata urban
Pakistan	v023	21	[21] sindh large city
Pakistan	v023	22	[22] sindh small city
Pakistan	v023	23	[23] sindh rural
Pakistan	v023	31	[31] nw frontier lrg city
Pakistan	v023	32	[32] nw frontier sml city
Pakistan	v023	33	[33] nw frontier rural
Pakistan	v023	41	[41] balochistan lrg city
Pakistan	v023	42	[42] balochistan sml city
Pakistan	v023	43	[43] balochistan rural

NOTE: The labels of v023 are different across the survey rounds.
VERD: Therefore we cannot use v023 in the ID variable preparation.

Altly, we can use the ecological region variable (secoreg) in the ID var. We will check for this in future.

Checking the Birth History variables before harmonization

Undoubtedly the birth history variables are important for this study objective. Therefore, we need to scrutinize all the birth history variables before using them to prepare harmonized variables for the pooled dataset.

# We check the birth history vars in all pkbr datasets.
# First we create a data dictionary in nested tibble.
pkbr1_pre_tmp1 <- pkbr1_pre_tmp0 |>
  mutate(lookfor_bhvars = map(
    pkbr_data,
    \(df) {
      df |> 
        select(bidx, matches("^b[0-9]+")) |> 
        lookfor(details = "full") |> 
        select(-c(levels:n_na)) |> 
        # For correctly viewing the range column in data dictionary
        convert_list_columns_to_character()
    }
  ))
pkbr1_pre_tmp1

# Now we unnest the tibble and refine the pooled data dictionary
pkbr1_pre_tmp2 <- pkbr1_pre_tmp1 |> 
  select(-c(unf, pkbr_data, n_births)) |> 
  unnest(cols = c(lookfor_bhvars)) |> 
  arrange(pos)

# Convert the tibble to flextable for easy viewing
pkbr1_pre_tmp2 |> 
  qflextable() |> 
  align(align = "left", part = "all") |> 
  autofit()

Table 8: Data dictionary of birth history variables across the pkbr rounds

ctr_name	svy_year	pos	variable	label	col_type	missing	unique_values	range
Pakistan	1990	1	bidx	birth column number	dbl	0	16	1 - 16
Pakistan	2006	1	bidx	birth column number	dbl	0	16	1 - 16
Pakistan	2012	1	bidx	Birth column number	dbl	0	19	1 - 19
Pakistan	2017	1	bidx	birth column number	dbl	0	15	1 - 15
Pakistan	1990	2	b0	child is twin	dbl+lbl	0	4	0 - 3
Pakistan	2006	2	b0	child is twin	dbl+lbl	0	4	0 - 3
Pakistan	2012	2	b0	Child is twin	dbl+lbl	0	4	0 - 3
Pakistan	2017	2	b0	child is twin	dbl+lbl	0	4	0 - 3
Pakistan	1990	3	b1	month of birth	dbl	0	12	1 - 12
Pakistan	2006	3	b1	month of birth	dbl	0	12	1 - 12
Pakistan	2012	3	b1	Month of birth	dbl	0	12	1 - 12
Pakistan	2017	3	b1	month of birth	dbl	0	12	1 - 12
Pakistan	1990	4	b2	year of birth	dbl	0	39	52 - 91
Pakistan	2006	4	b2	year of birth	dbl	0	38	1970 - 2007
Pakistan	2012	4	b2	Year of birth	dbl	0	38	1976 - 2013
Pakistan	2017	4	b2	year of birth	dbl	0	37	1982 - 2018
Pakistan	1990	5	b3	date of birth (cmc)	dbl	0	415	635 - 1098
Pakistan	2006	5	b3	date of birth (cmc)	dbl	0	419	852 - 1286
Pakistan	2012	5	b3	Date of birth (CMC)	dbl	0	430	914 - 1359
Pakistan	2017	5	b3	date of birth (cmc)	dbl	0	413	988 - 1420
Pakistan	1990	6	b4	sex of child	dbl+lbl	0	2	1 - 2
Pakistan	2006	6	b4	sex of child	dbl+lbl	0	2	1 - 2
Pakistan	2012	6	b4	Sex of child	dbl+lbl	0	2	1 - 2
Pakistan	2017	6	b4	sex of child	dbl+lbl	0	2	1 - 2
Pakistan	1990	7	b5	child is alive	dbl+lbl	0	2	0 - 1
Pakistan	2006	7	b5	child is alive	dbl+lbl	0	2	0 - 1
Pakistan	2012	7	b5	Child is alive	dbl+lbl	0	2	0 - 1
Pakistan	2017	7	b5	child is alive	dbl+lbl	0	2	0 - 1
Pakistan	1990	8	b6	age at death	dbl+lbl	24210	83	100 - 397
Pakistan	2006	8	b6	age at death	dbl+lbl	34937	89	100 - 999
Pakistan	2012	8	b6	Age at death	dbl+lbl	45162	85	100 - 998
Pakistan	2017	8	b6	age at death	dbl+lbl	46660	80	100 - 330
Pakistan	1990	9	b7	age at death (months-imputed)	dbl	24194	48	0 - 300
Pakistan	2006	9	b7	age at death (months-imputed)	dbl	34937	57	0 - 396
Pakistan	2012	9	b7	Age at death (months, imputed)	dbl	45162	53	0 - 372
Pakistan	2017	9	b7	age at death (months, imputed)	dbl	46660	53	0 - 360
Pakistan	1990	10	b8	current age of child	dbl	3175	39	0 - 38
Pakistan	2006	10	b8	current age of child	dbl	4112	38	0 - 36
Pakistan	2012	10	b8	Current age of child	dbl	5076	38	0 - 36
Pakistan	2017	10	b8	current age of child	dbl	3835	37	0 - 35
Pakistan	1990	11	b9	who child lives with	dbl+lbl	3183	6	0 - 4
Pakistan	2006	11	b9	child lives with whom	dbl+lbl	4112	3	0 - 4
Pakistan	2012	11	b9	Child lives with whom	dbl+lbl	5076	3	0 - 4
Pakistan	2017	11	b9	child lives with whom	dbl+lbl	3835	3	0 - 4
Pakistan	1990	12	b10	completeness of information	dbl+lbl	0	8	1 - 8
Pakistan	2006	12	b10	completeness of information	dbl+lbl	0	8	1 - 8
Pakistan	2012	12	b10	Completeness of information	dbl+lbl	0	6	1 - 8
Pakistan	2017	12	b10	completeness of information	dbl+lbl	0	8	0 - 8
Pakistan	1990	13	b11	preceding birth interval	dbl	5926	146	7 - 221
Pakistan	2006	13	b11	preceding birth interval	dbl	8856	159	9 - 255
Pakistan	2012	13	b11	Preceding birth interval (months)	dbl	12050	171	9 - 241
Pakistan	2017	13	b11	preceding birth interval (months)	dbl	13235	166	7 - 236
Pakistan	1990	14	b12	succeeding birth interval	dbl	5972	146	7 - 221
Pakistan	2006	14	b12	succeeding birth interval	dbl	8911	159	9 - 255
Pakistan	2012	14	b12	Succeeding birth interval (months)	dbl	12127	171	9 - 241
Pakistan	2017	14	b12	succeeding birth interval (months)	dbl	13325	166	7 - 236
Pakistan	1990	15	b13	flag for age at death	dbl+lbl	24194	8	0 - 8
Pakistan	2006	15	b13	flag for age at death	dbl+lbl	34937	7	0 - 8
Pakistan	2012	15	b13	Flag for age at death	dbl+lbl	45162	4	0 - 8
Pakistan	2017	15	b13	flag for age at death	dbl+lbl	46660	3	0 - 1
Pakistan	2006	16	b15	live birth between births	dbl+lbl	0	3	0 - 9
Pakistan	2012	16	b15	Live birth between births	dbl+lbl	10726	4	0 - 9
Pakistan	2017	16	b15	live birth between births	dbl+lbl	11738	3	0 - 1
Pakistan	2006	17	b16	child's line number in household	dbl+lbl	4112	48	0 - 46
Pakistan	2012	17	b16	Child's line number in household	dbl+lbl	5076	50	0 - 48
Pakistan	2017	17	b16	child's line number in household	dbl+lbl	3835	43	0 - 43
Pakistan	2017	18	b17	day of birth	dbl	0	31	1 - 31
Pakistan	2017	19	b18	century day code of birth (cdc)	dbl	0	10013	30071 - 43195
Pakistan	2017	20	b19	current age of child in months (months since birth for dead children)	dbl	0	412	0 - 429
Pakistan	2017	21	b20	na - duration of pregnancy	dbl	50495	1

From the above table we get an overall snapshot of the birth history variables. We see that the variables b1-b13 are common in all the six pkbr datasets. Notably pkbr 2001 and 2006 have some extra variables that are not available in other rounds. Next, we look at the other labelled variables which are common across pkbr in more details. We would like to see if the value labels of the common birth history variables are similar across the pkbr datasets.

b0 - child is twin

We check the value labels of b0 variable that denotes whether the child is twin. First we create a nested tibble of b0’s value labels.

# Create the data dictionary for b0 in nested tibble
pkbr1_pre_tmp1 <- pkbr1_pre_tmp0 |> 
  mutate(lookfor_b0 = map(
    pkbr_data,
    \(df) {
      df |> 
        select(b0) |> 
        look_for() |> 
        lookfor_to_long_format() |> 
        select(value_labels)
    }
  ))
pkbr1_pre_tmp1

# Now we unnest the tibble and refine the pooled data dictionary
pkbr1_pre_tmp2 <- pkbr1_pre_tmp1 |> 
  # First we select the required cols and unnest()
  select(-c(unf, pkbr_data, n_births)) |> 
  unnest(cols = c(lookfor_b0)) |> 
  # Next we make the num of value labels same across each round
  mutate(label_num = parse_number(value_labels)) |> 
  complete(ctr_name, svy_year, label_num) |>
  # Next we create col of value labels for each survey round
  pivot_wider(
    names_from = svy_year, 
    values_from = value_labels,
    names_prefix = "pkbr_"
  ) |>
  # Show the variable name in a col
  mutate(var_name = "b0", .before = 2)

# Convert the tibble to flextable for easy viewing
pkbr1_pre_tmp2 |> 
  qflextable() |> 
  align(align = "left", part = "all") |> 
  autofit()

Table 9: Data dictionary of b0 across the pkbr rounds

ctr_name	var_name	label_num	pkbr_1990	pkbr_2006	pkbr_2012	pkbr_2017
Pakistan	b0	0	[0] single birth	[0] single birth	[0] Single birth	[0] single birth
Pakistan	b0	1	[1] 1st of multiple	[1] 1st of multiple	[1] 1st of multiple	[1] 1st of multiple
Pakistan	b0	2	[2] 2nd of multiple	[2] 2nd of multiple	[2] 2nd of multiple	[2] 2nd of multiple
Pakistan	b0	3	[3] 3rd of multiple	[3] 3rd of multiple	[3] 3rd of multiple	[3] 3rd of multiple
Pakistan	b0	4	[4] 4th of multiple	[4] 4th of multiple	[4] 4th of multiple	[4] 4th of multiple
Pakistan	b0	5	[5] 5th of multiple	[5] 5th of multiple	[5] 5th of multiple	[5] 5th of multiple

We can see the value labels of b0 in the above table. We see that the value labels are same across all the pkbr datasets.

b4 - sex of child

We check the value labels of b4 variable which gives the sex of the child. First we create a nested tibble of b4’s value labels.

# Create the data dictionary for b4 in nested tibble
pkbr1_pre_tmp1 <- pkbr1_pre_tmp0 |> 
  mutate(lookfor_b4 = map(
    pkbr_data,
    \(df) {
      df |> 
        select(b4) |> 
        look_for() |> 
        lookfor_to_long_format() |> 
        select(value_labels)
    }
  ))
pkbr1_pre_tmp1

# Now we unnest the tibble and refine the pooled data dictionary
pkbr1_pre_tmp2 <- pkbr1_pre_tmp1 |> 
  # First we select the required cols and unnest()
  select(-c(unf, pkbr_data, n_births)) |> 
  unnest(cols = c(lookfor_b4)) |> 
  # Next we make the num of value labels same across each round
  mutate(label_num = parse_number(value_labels)) |> 
  complete(ctr_name, svy_year, label_num) |>
  # Next we create col of value labels for each survey round
  pivot_wider(
    names_from = svy_year, 
    values_from = value_labels,
    names_prefix = "pkbr_"
  ) |>
  # Show the variable name in a col
  mutate(var_name = "b4", .before = 2)

# Convert the tibble to flextable for easy viewing
pkbr1_pre_tmp2 |> 
  qflextable() |> 
  align(align = "left", part = "all") |> 
  autofit()

Table 10: Data dictionary of b4 across the pkbr rounds

ctr_name	var_name	label_num	pkbr_1990	pkbr_2006	pkbr_2012	pkbr_2017
Pakistan	b4	1	[1] male	[1] male	[1] Male	[1] male
Pakistan	b4	2	[2] female	[2] female	[2] Female	[2] female

We can see the value labels of b4 in the above table. The value labels are same across all the pkbr datasets.

b5 - child is alive

We check the value labels of b5 variable which gives the survival status of the child. First we create a nested tibble of b5’s value labels.

# Create the data dictionary for b5 in nested tibble
pkbr1_pre_tmp1 <- pkbr1_pre_tmp0 |> 
  mutate(lookfor_b5 = map(
    pkbr_data,
    \(df) {
      df |> 
        select(b5) |> 
        look_for() |> 
        lookfor_to_long_format() |> 
        select(value_labels)
    }
  ))
pkbr1_pre_tmp1

# Now we unnest the tibble and refine the pooled data dictionary
pkbr1_pre_tmp2 <- pkbr1_pre_tmp1 |> 
  # First we select the required cols and unnest()
  select(-c(unf, pkbr_data, n_births)) |> 
  unnest(cols = c(lookfor_b5)) |> 
  # Next we make the num of value labels same across each round
  mutate(label_num = parse_number(value_labels)) |> 
  complete(ctr_name, svy_year, label_num) |>
  # Next we create col of value labels for each survey round
  pivot_wider(
    names_from = svy_year, 
    values_from = value_labels,
    names_prefix = "pkbr_"
  ) |>
  # Show the variable name in a col
  mutate(var_name = "b5", .before = 2)

# Convert the tibble to flextable for easy viewing
pkbr1_pre_tmp2 |> 
  qflextable() |> 
  align(align = "left", part = "all") |> 
  autofit()

Table 11: Data dictionary of b5 across the pkbr rounds

ctr_name	var_name	label_num	pkbr_1990	pkbr_2006	pkbr_2012	pkbr_2017
Pakistan	b5	0	[0] no	[0] no	[0] No	[0] no
Pakistan	b5	1	[1] yes	[1] yes	[1] Yes	[1] yes

The above table shows that the value labels of survival status of child are same across all the pkbr datasets.

b6 - age at death

We check the value labels of b6 variable which shows the age at death of children. Note that this variable has many missing values across all pkbr rounds as not all children experienced mortality throughout their lifetime. First we create a nested tibble of b6’s value labels.

# Create the data dictionary for b5 in nested tibble
pkbr1_pre_tmp1 <- pkbr1_pre_tmp0 |> 
  mutate(lookfor_b6 = map(
    pkbr_data,
    \(df) {
      df |> 
        select(b6) |> 
        look_for() |> 
        lookfor_to_long_format() |> 
        select(value_labels)
    }
  ))
pkbr1_pre_tmp1

# Now we unnest the tibble and refine the pooled data dictionary
pkbr1_pre_tmp2 <- pkbr1_pre_tmp1 |> 
  # First we select the required cols and unnest()
  select(-c(unf, pkbr_data, n_births)) |> 
  unnest(cols = c(lookfor_b6)) |> 
  # Next we make the num of value labels same across each round
  mutate(label_num = parse_number(value_labels)) |> 
  complete(ctr_name, svy_year, label_num) |>
  # Next we create col of value labels for each survey round
  pivot_wider(
    names_from = svy_year, 
    values_from = value_labels,
    names_prefix = "pkbr_"
  ) |>
  # Show the variable name in a col
  mutate(var_name = "b6", .before = 2)

# Convert the tibble to flextable for easy viewing
pkbr1_pre_tmp2 |> 
  qflextable() |> 
  align(align = "left", part = "all") |> 
  autofit()

Table 12: Data dictionary of b6 across the pkbr rounds

ctr_name	var_name	label_num	pkbr_1990	pkbr_2006	pkbr_2012	pkbr_2017
Pakistan	b6	100	[100] 0 days	[100] 0 days	[100] Died on day of birth	[100] died on day of birth
Pakistan	b6	101			[101] Days: 1	[101] days: 1
Pakistan	b6	199	[199] days missing		[199] Days: number missing	[199] days: number missing
Pakistan	b6	201	[201] one month	[201] 1 month	[201] Months: 1	[201] months: 1
Pakistan	b6	299	[299] months missing		[299] Months: number missing	[299] months: number missing
Pakistan	b6	301	[301] one year	[301] 1 year	[301] Years: 1	[301] years: 1
Pakistan	b6	397	[397] years inconsistent
Pakistan	b6	399			[399] Years: number missing	[399] years: number missing
Pakistan	b6	997	[997] inconsistent	[997] inconsistent	[997] Inconsistent	[997] inconsistent
Pakistan	b6	998	[998] don't know	[998] don't know	[998] Don't know	[998] don't know

The above table shows that the value labels of age at death of child are in two groups. First, they are same for pkbr 1996, 2001 and 2006 and and then for pkbr 2011, 2016 and 2022.

b9 - child lives with whom

We check the value labels of b9 variable which gives info on who the child lives with. First we create a nested tibble of b9’s value labels.

# Create the data dictionary for b9 in nested tibble
pkbr1_pre_tmp1 <- pkbr1_pre_tmp0 |> 
  mutate(lookfor_b9 = map(
    pkbr_data,
    \(df) {
      df |> 
        select(b9) |> 
        look_for() |> 
        lookfor_to_long_format() |> 
        select(value_labels)
    }
  ))
pkbr1_pre_tmp1

# Now we unnest the tibble and refine the pooled data dictionary
pkbr1_pre_tmp2 <- pkbr1_pre_tmp1 |> 
  # First we select the required cols and unnest()
  select(-c(unf, pkbr_data, n_births)) |> 
  unnest(cols = c(lookfor_b9)) |> 
  # Next we make the num of value labels same across each round
  mutate(label_num = parse_number(value_labels)) |> 
  complete(ctr_name, svy_year, label_num) |>
  # Next we create col of value labels for each survey round
  pivot_wider(
    names_from = svy_year, 
    values_from = value_labels,
    names_prefix = "pkbr_"
  ) |>
  # Show the variable name in a col
  mutate(var_name = "b9", .before = 2)

# Convert the tibble to flextable for easy viewing
pkbr1_pre_tmp2 |> 
  qflextable() |> 
  align(align = "left", part = "all") |> 
  autofit()

Table 13: Data dictionary of b9 across the pkbr rounds

ctr_name	var_name	label_num	pkbr_1990	pkbr_2006	pkbr_2012	pkbr_2017
Pakistan	b9	0	[0] respondent	[0] respondent	[0] Respondent	[0] respondent
Pakistan	b9	1	[1] father	[1] father	[1] Father	[1] father
Pakistan	b9	2	[2] other relative	[2] other relative	[2] Other relative	[2] other relative
Pakistan	b9	3	[3] someone else	[3] someone else	[3] Someone else	[3] someone else
Pakistan	b9	4	[4] 15+ & live elsewhere	[4] lives elsewhere	[4] Lives elsewhere	[4] lives elsewhere

We can see in the above table that the value labels of b9 are same across all the pkbr datasets.

b10 - completeness of information

We check the value labels of b10 variable which gives the completeness of birth history information. First we create a nested tibble of b10’s value labels.

# Create the data dictionary for b10 in nested tibble
pkbr1_pre_tmp1 <- pkbr1_pre_tmp0 |> 
  mutate(lookfor_b10 = map(
    pkbr_data,
    \(df) {
      df |> 
        select(b10) |> 
        look_for() |> 
        lookfor_to_long_format() |> 
        select(value_labels)
    }
  ))
pkbr1_pre_tmp1

# Now we unnest the tibble and refine the pooled data dictionary
pkbr1_pre_tmp2 <- pkbr1_pre_tmp1 |> 
  # First we select the required cols and unnest()
  select(-c(unf, pkbr_data, n_births)) |> 
  unnest(cols = c(lookfor_b10)) |> 
  # Next we make the num of value labels same across each round
  mutate(label_num = parse_number(value_labels)) |> 
  complete(ctr_name, svy_year, label_num) |>
  # Next we create col of value labels for each survey round
  pivot_wider(
    names_from = svy_year, 
    values_from = value_labels,
    names_prefix = "pkbr_"
  ) |>
  # Show the variable name in a col
  mutate(var_name = "b10", .before = 2) 

# Convert the tibble to flextable for easy viewing
pkbr1_pre_tmp2 |> 
  qflextable() |>
  align(align = "left", part = "all") |> 
  autofit()

Table 14: Data dictionary of b10 across the pkbr rounds

ctr_name	var_name	label_num	pkbr_1990	pkbr_2006	pkbr_2012	pkbr_2017
Pakistan	b10	0				[0] month, year and day
Pakistan	b10	1	[1] month and year	[1] month and year	[1] Month and year - information complete	[1] month and year - information complete
Pakistan	b10	2	[2] month and age -y imp	[2] month and age -y imp	[2] Month and age - year imputed	[2] month and age - year imputed
Pakistan	b10	3	[3] year and age - m imp	[3] year and age - m imp	[3] Year and age - month imputed	[3] year and age - month imputed
Pakistan	b10	4	[4] y & age - y ignored	[4] y & age - y ignored	[4] Year and age - year ignored	[4] year and age - year ignored
Pakistan	b10	5	[5] year - a, m imp	[5] year - a, m imp	[5] Year - age/month imputed	[5] year - age/month imputed
Pakistan	b10	6	[6] age - y, m imp	[6] age - y, m imp	[6] Age - year/month imputed	[6] age - year/month imputed
Pakistan	b10	7	[7] month - a, y imp	[7] month - a, y imp	[7] Month - age/year imputed	[7] month - age/year imputed
Pakistan	b10	8	[8] none - all imp	[8] none - all imp	[8] None - all imputed	[8] none - all imputed

We can see in the above table that the value labels of b10 are same across pkbr 1996, 2001, 2006 and 2011 datasets. Then they are same for pkbr 2016 and 2022.

Checking the Common independent variables before harmonization

Next we start documenting the common independent variables. First we will check the data dictionary of the common independent variables. Then we will check them variable wise.

# We check the common independent vars in all pkbr datasets.
# First we create the data dictionary in nested tibble.
pkbr1_pre_tmp1 <- pkbr1_pre_tmp0 |>
  mutate(lookfor_comindvars = map(
    pkbr_data,
    \(df) {
      df |> 
        # select the common independent variables
        select(v106, v011, v501, v701, v025, v151, v152) |> 
        lookfor(details = "full") |> 
        select(-c(levels:n_na)) |> 
        # For correctly viewing the range column in data dictionary
        convert_list_columns_to_character()
    }
  ))
pkbr1_pre_tmp1

# Now we unnest the tibble and refine the pooled data dictionary 
pkbr1_pre_tmp2 <- pkbr1_pre_tmp1 |> 
  select(-c(unf, pkbr_data, n_births)) |> 
  unnest(cols = c(lookfor_comindvars)) |> 
  arrange(pos)

# Convert the tibble to flextable for easy viewing
pkbr1_pre_tmp2 |> 
  qflextable() |> 
  align(align = "left", part = "all") |> 
  autofit()

Table 15: Data dictionary of common independent variables across the pkbr rounds

ctr_name	svy_year	pos	variable	label	col_type	missing	unique_values	range
Pakistan	1990	1	v106	highest educational level	dbl+lbl	0	4	0 - 3
Pakistan	2006	1	v106	highest educational level	dbl+lbl	0	4	0 - 3
Pakistan	2012	1	v106	Highest educational level	dbl+lbl	0	4	0 - 3
Pakistan	2017	1	v106	highest educational level	dbl+lbl	0	4	0 - 3
Pakistan	1990	2	v011	date of birth (cmc)	dbl	0	409	493 - 912
Pakistan	2006	2	v011	date of birth (cmc)	dbl	0	409	683 - 1102
Pakistan	2012	2	v011	Date of birth (CMC)	dbl	0	402	755 - 1164
Pakistan	2017	2	v011	date of birth (cmc)	dbl	0	410	817 - 1233
Pakistan	1990	3	v501	current marital status	dbl+lbl	0	4	1 - 5
Pakistan	2006	3	v501	current marital status	dbl+lbl	0	4	1 - 5
Pakistan	2012	3	v501	Current marital status	dbl+lbl	0	4	1 - 5
Pakistan	2017	3	v501	current marital status	dbl+lbl	0	4	1 - 5
Pakistan	1990	4	v701	partner's education level	dbl+lbl	114	6	0 - 8
Pakistan	2006	4	v701	partner's education level	dbl+lbl	0	6	0 - 9
Pakistan	2012	4	v701	Husband/partner's education level	dbl+lbl	0	6	0 - 9
Pakistan	2017	4	v701	husband/partner's education level	dbl+lbl	1872	6	0 - 8
Pakistan	1990	5	v025	type of place of residence	dbl+lbl	0	2	1 - 2
Pakistan	2006	5	v025	type of place of residence	dbl+lbl	0	2	1 - 2
Pakistan	2012	5	v025	Type of place of residence	dbl+lbl	0	2	1 - 2
Pakistan	2017	5	v025	type of place of residence	dbl+lbl	0	2	1 - 2
Pakistan	1990	6	v151	sex of household head	dbl+lbl	0	2	1 - 2
Pakistan	2006	6	v151	sex of household head	dbl+lbl	0	2	1 - 2
Pakistan	2012	6	v151	Sex of household head	dbl+lbl	0	2	1 - 2
Pakistan	2017	6	v151	sex of household head	dbl+lbl	0	2	1 - 2
Pakistan	1990	7	v152	age of household head	dbl+lbl	3	79	12 - 98
Pakistan	2006	7	v152	age of household head	dbl+lbl	0	80	14 - 99
Pakistan	2012	7	v152	Age of household head	dbl+lbl	0	77	14 - 98
Pakistan	2017	7	v152	age of household head	dbl+lbl	0	79	15 - 95

From the above table we get an overall snapshot of the common independent variables. We see that majority of the have different number of value labels across the six pkbr datasets. Only v025 and v151 have the same number of value labels across pkbr rounds. Next, we look at the labelled variables among these common variables in more details. We would like to see if the value labels and codes of the common independent variables are similar across the pkbr datasets.

v106 - Mother’s education level

We check the value labels of v106 variable that denotes the highest education level of mother. First we create a nested tibble of v106’s value labels.

# Create the data dictionary for v106 in nested tibble
pkbr1_pre_tmp1 <- pkbr1_pre_tmp0 |> 
  mutate(lookfor_v106 = map(
    pkbr_data,
    \(df) {
      df |> 
        select(v106) |> 
        look_for() |> 
        lookfor_to_long_format() |> 
        select(value_labels)
    }
  ))
pkbr1_pre_tmp1

# Now we unnest the tibble and refine the pooled data dictionary
pkbr1_pre_tmp2 <- pkbr1_pre_tmp1 |> 
  # First we select the required cols and unnest()
  select(-c(unf, pkbr_data, n_births)) |> 
  unnest(cols = c(lookfor_v106)) |> 
  # Next we make the num of value labels same across each round
  mutate(label_num = parse_number(value_labels)) |> 
  complete(ctr_name, svy_year, label_num) |>
  # Next we create col of value labels for each survey round
  pivot_wider(
    names_from = svy_year, 
    values_from = value_labels,
    names_prefix = "pkbr_"
  ) |>
  # Show the variable name in a col
  mutate(var_name = "v106", .before = 2)

# Convert the tibble to flextable for easy viewing
pkbr1_pre_tmp2 |> 
  qflextable() |> 
  align(align = "left", part = "all") |> 
  autofit()

Table 16: Data dictionary of v106 across the pkbr rounds

ctr_name	var_name	label_num	pkbr_1990	pkbr_2006	pkbr_2012	pkbr_2017
Pakistan	v106	0	[0] no education	[0] no education	[0] No education	[0] no education
Pakistan	v106	1	[1] primary	[1] primary	[1] Primary	[1] primary
Pakistan	v106	2	[2] secondary	[2] secondary	[2] Secondary	[2] secondary
Pakistan	v106	3	[3] higher	[3] higher	[3] Higher	[3] higher

We can see the value labels of v106 are mostly similar except for pkbr 1996 and 2011 datasets.

v011 - Date of birth (in CMC)

The v011 variable, which has the dob of mothers in cmc, is a numeric variable. Let’s check the range of these values in further details such as checking for outliers. First let’s create a nested tibble of the summary statistics of v011 variable.

# Create the summary statistics for v011 in nested tibble
pkbr1_pre_tmp1 <- pkbr1_pre_tmp0 |> 
  mutate(skim_v011 = map(
    pkbr_data,
    \(df) {
      df |> 
        select(v011) |> 
        skim_without_charts() |> 
        as_tibble() |> 
        select(-c(skim_type, complete_rate)) |> 
        rename(
          variable = 1,
          n_miss = 2,
          mean = 3,
          sd = 4,
          min = 5,
          p25 = 6,
          p50 = 7,
          p75 = 8,
          max = 9
        )
    }
  ))
pkbr1_pre_tmp1

# Now we unnest the tibble and refine the pooled data dictionary
pkbr1_pre_tmp2 <- pkbr1_pre_tmp1 |> 
  # First we select the required cols and unnest()
  select(-c(unf, pkbr_data, n_births)) |> 
  unnest(cols = c(skim_v011)) |> 
  # Make variable values have one decimal point 
  mutate(
    mean = sprintf("%.1f", mean),
    sd = sprintf("%.1f", sd)
  )

# Convert the tibble to flextable for easy viewing
pkbr1_pre_tmp2 |> 
  qflextable() |> 
  align(align = "left", part = "all") |> 
  autofit()

Table 17: Data dictionary of v011 across the pkbr rounds

ctr_name	svy_year	variable	mean	sd	min	p25	p50	p75	max
Pakistan	1990	v011	666.1	90.8	493	596	665	732	912
Pakistan	2006	v011	840.1	92.6	683	763	836	915	1102
Pakistan	2012	v011	910.9	91.5	755	837	907	985	1164
Pakistan	2017	v011	979.2	89.9	817	907	978	1049	1233

v501 - Mother’s marital status

We check the value labels of v501 variable which gives the current marital status of mother. First we create a nested tibble of v501’s value labels.

# Create the data dictionary for v501 in nested tibble
pkbr1_pre_tmp1 <- pkbr1_pre_tmp0 |> 
  mutate(lookfor_v501 = map(
    pkbr_data,
    \(df) {
      df |> 
        select(v501) |> 
        look_for() |> 
        lookfor_to_long_format() |> 
        select(value_labels)
    }
  ))
pkbr1_pre_tmp1

# Now we unnest the tibble and refine the pooled data dictionary
pkbr1_pre_tmp2 <- pkbr1_pre_tmp1 |> 
  # First we select the required cols and unnest()
  select(-c(unf, pkbr_data, n_births)) |> 
  unnest(cols = c(lookfor_v501)) |> 
  # Next we make the num of value labels same across each round
  mutate(label_num = parse_number(value_labels)) |> 
  complete(ctr_name, svy_year, label_num) |>
  # Next we create col of value labels for each survey round
  pivot_wider(
    names_from = svy_year, 
    values_from = value_labels,
    names_prefix = "pkbr_"
  ) |>
  # Show the variable name in a col
  mutate(var_name = "v501", .before = 2)

# Convert the tibble to flextable for easy viewing
pkbr1_pre_tmp2 |> 
  qflextable() |> 
  align(align = "left", part = "all") |> 
  autofit()

Table 18: Data dictionary of v501 across the pkbr rounds

ctr_name	var_name	label_num	pkbr_1990	pkbr_2006	pkbr_2012	pkbr_2017
Pakistan	v501	0	[0] never married	[0] never married	[0] Never in union	[0] never in union
Pakistan	v501	1	[1] married	[1] married	[1] Married	[1] married
Pakistan	v501	2	[2] living together	[2] living together	[2] Living with partner	[2] living with partner
Pakistan	v501	3	[3] widowed	[3] widowed	[3] Widowed	[3] widowed
Pakistan	v501	4	[4] divorced	[4] divorced	[4] Divorced	[4] divorced
Pakistan	v501	5	[5] not living together	[5] not living together	[5] No longer living together/separated	[5] no longer living together/separated

All the pkbr rounds have 5 value labels. The pkbr 1996, 2001 and 2006 rounds have a set of similar value label texts. Then pkbr 2011, 2016 and 2022 have another set of similar value labels.

v701 - Husband/Partner’s education level

We check the value labels of v701 variable which gives the current marital status of mother. First we create a nested tibble of v701’s value labels.

# Create the data dictionary for v701 in nested tibble
pkbr1_pre_tmp1 <- pkbr1_pre_tmp0 |> 
  mutate(lookfor_v701 = map(
    pkbr_data,
    \(df) {
      df |> 
        select(v701) |> 
        look_for() |> 
        lookfor_to_long_format() |> 
        select(value_labels)
    }
  ))
pkbr1_pre_tmp1

# Now we unnest the tibble and refine the pooled data dictionary
pkbr1_pre_tmp2 <- pkbr1_pre_tmp1 |> 
  # First we select the required cols and unnest()
  select(-c(unf, pkbr_data, n_births)) |> 
  unnest(cols = c(lookfor_v701)) |> 
  # Next we make the num of value labels same across each round
  mutate(label_num = parse_number(value_labels)) |> 
  complete(ctr_name, svy_year, label_num) |>
  # Next we create col of value labels for each survey round
  pivot_wider(
    names_from = svy_year, 
    values_from = value_labels,
    names_prefix = "pkbr_"
  ) |>
  # Show the variable name in a col
  mutate(var_name = "v701", .before = 2)

# Convert the tibble to flextable for easy viewing
pkbr1_pre_tmp2 |> 
  qflextable() |> 
  align(align = "left", part = "all") |> 
  autofit()

Table 19: Data dictionary of v701 across the pkbr rounds

ctr_name	var_name	label_num	pkbr_1990	pkbr_2006	pkbr_2012	pkbr_2017
Pakistan	v701	0	[0] no education	[0] no education	[0] No education	[0] no education
Pakistan	v701	1	[1] primary	[1] primary	[1] Primary	[1] primary
Pakistan	v701	2	[2] secondary	[2] secondary	[2] Secondary	[2] secondary
Pakistan	v701	3	[3] higher	[3] higher	[3] Higher	[3] higher
Pakistan	v701	8	[8] don't know	[8] don't know level	[8] Don't know	[8] don't know

All the pkbr rounds have 5 value labels. The pkbr 1996, 2001 and 2006 rounds and pkbr 2011, 2016 and 2022 have a similar set of value labels with a difference in wording among them.

v025 - Type of place of residence

We check the value labels of v025 variable which shows if a household belongs to rural or urban psu. First we create a nested tibble of v025’s value labels.

# Create the data dictionary for v025 in nested tibble
pkbr1_pre_tmp1 <- pkbr1_pre_tmp0 |> 
  mutate(lookfor_v025 = map(
    pkbr_data,
    \(df) {
      df |> 
        select(v025) |> 
        look_for() |> 
        lookfor_to_long_format() |> 
        select(value_labels)
    }
  ))
pkbr1_pre_tmp1

# Now we unnest the tibble and refine the pooled data dictionary
pkbr1_pre_tmp2 <- pkbr1_pre_tmp1 |> 
  # First we select the required cols and unnest()
  select(c(ctr_name, svy_year, lookfor_v025)) |> 
  unnest(cols = c(lookfor_v025)) |> 
  # Next we make the num of value labels same across each round
  mutate(label_num = parse_number(value_labels)) |> 
  complete(ctr_name, svy_year, label_num) |>
  # Next we create col of value labels for each survey round
  pivot_wider(
    names_from = svy_year, 
    values_from = value_labels,
    names_prefix = "pkbr_"
  ) |>
  # Show the variable name in a col
  mutate(var_name = "v025", .before = 2)

# Convert the tibble to flextable for easy viewing
pkbr1_pre_tmp2 |> 
  qflextable() |> 
  align(align = "left", part = "all") |> 
  autofit()

Table 20: Data dictionary of v025 across the pkbr rounds

ctr_name	var_name	label_num	pkbr_1990	pkbr_2006	pkbr_2012	pkbr_2017
Pakistan	v025	1	[1] urban	[1] urban	[1] Urban	[1] urban
Pakistan	v025	2	[2] rural	[2] rural	[2] Rural	[2] rural

The values labels and codes for v025 are same across all the pkbr rounds.

v151 - Sex of household head

We check the value labels of v151 variable which gives the gender of the household head. First we create a nested tibble of v151’s value labels.

# Create the data dictionary for v151 in nested tibble
pkbr1_pre_tmp1 <- pkbr1_pre_tmp0 |> 
  mutate(lookfor_v151 = map(
    pkbr_data,
    \(df) {
      df |> 
        select(v151) |> 
        look_for() |> 
        lookfor_to_long_format() |> 
        select(value_labels)
    }
  ))
pkbr1_pre_tmp1

# Now we unnest the tibble and refine the pooled data dictionary
pkbr1_pre_tmp2 <- pkbr1_pre_tmp1 |> 
  # First we select the required cols and unnest()
  select(-c(unf, pkbr_data, n_births)) |> 
  unnest(cols = c(lookfor_v151)) |> 
  # Next we make the num of value labels same across each round
  mutate(label_num = parse_number(value_labels)) |> 
  complete(ctr_name, svy_year, label_num) |>
  # Next we create col of value labels for each survey round
  pivot_wider(
    names_from = svy_year, 
    values_from = value_labels,
    names_prefix = "pkbr_"
  ) |>
  # Show the variable name in a col
  mutate(var_name = "v151", .before = 2)

# Convert the tibble to flextable for easy viewing
pkbr1_pre_tmp2 |> 
  qflextable() |> 
  align(align = "left", part = "all") |> 
  autofit()

Table 21: Data dictionary of v151 across the pkbr rounds

ctr_name	var_name	label_num	pkbr_1990	pkbr_2006	pkbr_2012	pkbr_2017
Pakistan	v151	1	[1] male	[1] male	[1] Male	[1] male
Pakistan	v151	2	[2] female	[2] female	[2] Female	[2] female

The values labels and codes for v151 are same across all the pkbr rounds.

v152 - Age of household head

Interestingly, we see v152 (a continuous variable) has value labels for all rounds except pkbr 1996. Therefore, we check the value labels of v152 for those rounds. First we create a nested tibble of v152’s value labels.

# Create the data dictionary for v152 in nested tibble
pkbr1_pre_tmp1 <- pkbr1_pre_tmp0 |> 
  filter(svy_year != 1996) |> 
  mutate(lookfor_v152 = map(
    pkbr_data,
    \(df) {
      df |> 
        select(v152) |> 
        look_for() |> 
        lookfor_to_long_format() |> 
        select(value_labels)
    }
  ))
pkbr1_pre_tmp1

# Now we unnest the tibble and refine the pooled data dictionary
pkbr1_pre_tmp2 <- pkbr1_pre_tmp1 |> 
  # First we select the required cols and unnest()
  select(-c(unf, pkbr_data, n_births)) |> 
  unnest(cols = c(lookfor_v152)) |> 
  # Next we make the num of value labels same across each round
  mutate(label_num = parse_number(value_labels)) |> 
  complete(ctr_name, svy_year, label_num) |>
  # Next we create col of value labels for each survey round
  pivot_wider(
    names_from = svy_year, 
    values_from = value_labels,
    names_prefix = "pkbr_"
  ) |>
  # Show the variable name in a col
  mutate(var_name = "v152", .before = 2)

# Convert the tibble to flextable for easy viewing
pkbr1_pre_tmp2 |> 
  qflextable() |> 
  align(align = "left", part = "all") |> 
  autofit()

Table 22: Data dictionary of v152 across the pkbr rounds

ctr_name	var_name	label_num	pkbr_1990	pkbr_2006	pkbr_2012	pkbr_2017
Pakistan	v152	95				[95] 95+
Pakistan	v152	97	[97] 97+	[97] 97+	[97] 97+
Pakistan	v152	98	[98] dk	[98] dk	[98] Don't know	[98] don't know

We can see that the value labels of v152 are mostly for missing values. However, since v152 has no missing values across the pkbr rounds, we need not be concerned about them.

Correcting year-related variables

The year-related variables might have different formatting in each survey. Therefore, we need to check and harmonize them before appending the datasets.

# First we create the data dictionary of year-related vars in nested tibble
pkbr1_pre_tmp1 <- pkbr1_pre_tmp0 |> 
  mutate(lookfor_year = map(
    pkbr_data,
    \(df) {
      df |> 
        select(c(b2, v007, v010)) |> 
        look_for(details = "full") |> 
        # For correctly viewing the range column in data dictionary
        convert_list_columns_to_character() |> 
        select(-c(levels:n_na))
    }
  ))
pkbr1_pre_tmp1

# Now we unnest the tibble and refine the pooled data dictionary 
pkbr1_pre_tmp2 <- pkbr1_pre_tmp1 |> 
  select(-c(unf, pkbr_data, n_births)) |> 
  unnest(cols = c(lookfor_year)) |> 
  arrange(pos)
# Convert and view the tibble as flextable
pkbr1_pre_tmp2 |> 
  qflextable() |> 
  align(align = "left", part = "all") |> 
  autofit()

Table 23: Data dictionary of year-related variables across the pkbr rounds

ctr_name	svy_year	pos	variable	label	col_type	unique_values	range
Pakistan	1990	1	b2	year of birth	dbl	39	52 - 91
Pakistan	2006	1	b2	year of birth	dbl	38	1970 - 2007
Pakistan	2012	1	b2	Year of birth	dbl	38	1976 - 2013
Pakistan	2017	1	b2	year of birth	dbl	37	1982 - 2018
Pakistan	1990	2	v007	year of interview	dbl	2	90 - 91
Pakistan	2006	2	v007	year of interview	dbl	2	2006 - 2007
Pakistan	2012	2	v007	Year of interview	dbl	2	2012 - 2013
Pakistan	2017	2	v007	year of interview	dbl	2	2017 - 2018
Pakistan	1990	3	v010	respondent's year of birth	dbl	35	41 - 75
Pakistan	2006	3	v010	respondent's year of birth	dbl	36	1956 - 1991
Pakistan	2012	3	v010	Respondent's year of birth	dbl	35	1962 - 1996
Pakistan	2017	3	v010	respondent's year of birth	dbl	35	1968 - 2002

We find that none of the year variables have the correct values. Just look at the year of interview variable. We see that the year of interview values are higher than the corresponding survey years. Therefore, we need to reformat them before appending the datasets.

Pakistan HH dataset use for variable creation

Checking the ID variables before harmonization

Here we check the formatting of the constituent variables with which we will prepare the ID variables for the pooled Pakistan household recode (hr) dataset. We will use the following constituent variables for creating the ID variables for the pooled dataset:

# We check the var type of ID vars in all pkhr datasets.
# First we create a data dictionary of the pkhr datasets in nested tibble.
pkhr1_pre_tmp1 <- pkhr1_pre_tmp0 |>
  mutate(lookfor_idvars = map(
    pkhr_data,
    \(df) {
      df |> 
        select(hv001, hv002) |> 
        lookfor(details = "full") |> 
        select(-c(levels:n_na)) |> 
        # For correctly viewing the range column in data dictionary
        convert_list_columns_to_character()
    }
  ))
pkhr1_pre_tmp1

# Now we unnest the tibble and output the pooled data dictionary 
pkhr1_pre_tmp2 <- pkhr1_pre_tmp1 |> 
  select(c(ctr_name, svy_year, lookfor_idvars)) |> 
  unnest(cols = c(lookfor_idvars)) |> 
  arrange(pos)

# Convert and view the tibble as flextable
pkhr1_pre_tmp2 |> 
  qflextable() |> 
  align(align = "left", part = "all") |> 
  autofit()

Table 24: Data dictionary of variables to be used for ID creation across the pkhr rounds

ctr_name	svy_year	pos	variable	label	col_type	unique_values	range
Pakistan	1990	1	hv001	cluster number	dbl	407	1101001 - 4204052
Pakistan	2006	1	hv001	cluster number	dbl	972	1 - 980
Pakistan	2012	1	hv001	Cluster number	dbl	498	1 - 500
Pakistan	2017	1	hv001	cluster number	dbl	561	1 - 580
Pakistan	1990	2	hv002	household number	dbl	25	1 - 25
Pakistan	2006	2	hv002	household number	dbl	105	1 - 105
Pakistan	2012	2	hv002	Household number	dbl	28	1 - 28
Pakistan	2017	2	hv002	household number	dbl	29	1 - 29

From the above we can see that both the hv001 and hv002 are of numeric class with no missing values. These variables can be used for preparing the ID variables after finding the maximum length of their largest value. Note that survey year is also a constituent ID variable of 4-digits and we need not check it.

# We thought to process the above nested tibble further by decomposing the 
# "range" col into min and max values using separate_wider_regex().
# However, we hit a roadblock as pattern did not identify the max values in 
# some pkhr rounds correctly
pkhr1_pre_tmp3 <- pkhr1_pre_tmp0 |> 
  # Generate the summary stats for id vars
  mutate(skim_idvars = map(pkhr_data, \(df) {
    df |> 
      select(hv001, hv002) |> 
      skim_without_charts()
  })) |> 
  # Pool the summary stats for all pkhr rounds
  select(c(ctr_name, svy_year, skim_idvars)) |> 
  unnest(cols = c(skim_idvars)) |> 
  arrange(skim_variable, svy_year) |> 
  # Group and generate the max and min values for each variable
  group_by(variable = skim_variable) |> 
  summarize(
    min_val = min(numeric.p0),
    max_val = max(numeric.p100)
  ) |> 
  # calculate the num of digits in the maximum values
  mutate(
    max_digits = nchar(as.character(max_val))
  ) |>
  # add variable labels and relocate it after variable name
  bind_cols(vlabel = c("cluster number", "household number")) |>
  relocate(vlabel, .after = 1)

# Convert the tibble to flextable for easy viewing
pkhr1_pre_tmp3 |>
  qflextable() |>
  align(align = "left", part = "all") |>
  autofit()

Table 25: The maximum length of constituent ID variables to be set across the pkhr rounds

variable	vlabel	min_val	max_val	max_digits
hv001	cluster number	1	4204052	7
hv002	household number	1	105	3

The above table gives the required length of the constituent ID variables to be set, so that we can correctly concatenate them to create the ID variables. The required length of the ID variables are given in max_digits column. Note that survey year is also a constituent ID variable of 4-digits.

Checking HH-level variables before harmonization

Here we check the wealth quintile variables before harmonizing them. Note in Pakistan 1990 the wealth quintile variables are provided in separate datasets. Therefore we join those variables to the hh file before proceeding with the checking.

Upon manually checking the full data dictionaries we find the variable names. Now we will check the data dictionary of these hh-level variables. Then we will check their value labels variable wise.

# We check the hh-level vars in all pkhr datasets.
# First we create the data dictionary in nested tibble.
pkhr1_pre_tmp1 <- pkhr1_pre_tmp0 |>
  mutate(lookfor_hhvars = map(
    pkhr_data,
    \(df) {
      df |> 
        # select the common independent variables
        select(matches("^wlthind5$|^hv270$")) |> 
        lookfor(details = "full") |> 
        select(-c(levels:n_na)) |> 
        # For correctly viewing the range column in data dictionary
        convert_list_columns_to_character()
    }
  ))
pkhr1_pre_tmp1

# Now we unnest the tibble and refine the pooled data dictionary 
pkhr1_pre_tmp2 <- pkhr1_pre_tmp1 |> 
  select(c(svy_year, lookfor_hhvars)) |> 
  unnest(cols = c(lookfor_hhvars)) |> 
  arrange(pos, svy_year)

# Convert the tibble to flextable for easy viewing
pkhr1_pre_tmp2 |> 
  qflextable() |> 
  align(align = "left", part = "all") |> 
  autofit()

Table 26: Data dictionary of hh-level variables across the pkhr rounds

svy_year	pos	variable	label	col_type	missing	unique_values	range
1990	1	wlthind5	quintiles of wealth index	dbl+lbl	0	5	1 - 5
2006	1	hv270	wealth index	dbl+lbl	81597	6	1 - 5
2012	1	hv270	Wealth index	dbl+lbl	0	5	1 - 5
2017	1	hv270	wealth index combined	dbl+lbl	0	5	1 - 5

The above table gives an overall snapshot of the hh-level variables. All the variables are of labelled class and have the same number of value labels across all the pkhr datasets. Note that, the ecological region variable has a different value label code in the pkbr 1996 dataset. Next, we compare the value labels of the variables across the pkhr datasets.

Wealth index quintile variable

Next, we check the value labels of the household wealth quintile variable. The variable names of this variable differs across the pkbr datasets. First we create a nested tibble of the value labels.

# Create the data dictionary in nested tibble
pkhr1_pre_tmp1 <- pkhr1_pre_tmp0 |> 
  mutate(lookfor_wiqt = map(
    pkhr_data,
    \(df) {
      df |> 
        select(matches("^wlthind5$|^hv270$")) |> 
        look_for() |> 
        lookfor_to_long_format() |> 
        select(value_labels)
    }
  ))
pkhr1_pre_tmp1

# Now we unnest the tibble and refine the pooled data dictionary
pkhr1_pre_tmp2 <- pkhr1_pre_tmp1 |> 
  # First we select the required cols and unnest()
  select(c(ctr_name, svy_year, lookfor_wiqt)) |> 
  unnest(cols = c(lookfor_wiqt)) |> 
  # Next we make the num of value labels same across each round
  mutate(label_num = parse_number(value_labels)) |> 
  complete(ctr_name, svy_year, label_num) |>
  # Next we create col of value labels for each survey round
  pivot_wider(
    names_from = svy_year, 
    values_from = value_labels,
    names_prefix = "pkhr_"
  ) |>
  # Show the variable name in a col
  mutate(var_name = "Wealth index quintiles", .before = 2)

# Convert the tibble to flextable for easy viewing
pkhr1_pre_tmp2 |> 
  qflextable() |> 
  align(align = "left", part = "all") |> 
  autofit()

Table 27: Data dictionary of wealth quintiles across the pkhr rounds

ctr_name	var_name	label_num	pkhr_1990	pkhr_2006	pkhr_2012	pkhr_2017
Pakistan	Wealth index quintiles	1	[1] lowest quintile	[1] poorest	[1] Poorest	[1] poorest
Pakistan	Wealth index quintiles	2	[2] second quintile	[2] poorer	[2] Poorer	[2] poorer
Pakistan	Wealth index quintiles	3	[3] middle quintile	[3] middle	[3] Middle	[3] middle
Pakistan	Wealth index quintiles	4	[4] fourth quintile	[4] richer	[4] Richer	[4] richer
Pakistan	Wealth index quintiles	5	[5] highest quintile	[5] richest	[5] Richest	[5] richest

Clearly, the value label codes are same in all pkhr rounds. However, the value label texts are different in pkhr 1996 and 2001, compared to the pkhr 2006, 2011, 2016 and 2022 rounds. Therefore, we need to be mindful of this during harmonization.

Pakistan PR dataset use for family structure variables creation

Checking the ID variables before harmonization

Here we check the formatting of the constituent variables with which we will prepare the ID variables for the pooled Pakistan person recode (pr) dataset. We will use the following constituent variables for creating the ID variables for the pooled dataset:

# We check the var type of ID vars in all pkpr datasets.
# First we create a data dictionary of the pkpr datasets in nested tibble.
pkpr1_pre_tmp1 <- pkpr1_pre_tmp0 |>
  mutate(lookfor_idvars = map(pkpr_data, \(df) {
    df |> 
      select(hv001, hv002, hvidx) |> 
      lookfor(details = "full") |> 
      select(-c(levels:n_na)) |> 
      # For correctly viewing the range column in data dictionary
      convert_list_columns_to_character()
  }))
pkpr1_pre_tmp1

# Now we unnest the tibble and output the pooled data dictionary 
pkpr1_pre_tmp2 <- pkpr1_pre_tmp1 |> 
  select(c(ctr_name, svy_year, lookfor_idvars)) |> 
  unnest(cols = c(lookfor_idvars)) |> 
  arrange(pos)

# Convert and view the tibble as flextable
pkpr1_pre_tmp2 |> 
  qflextable() |> 
  align(align = "left", part = "all") |> 
  autofit()

Table 28: Data dictionary of variables to be used for ID creation across the pkpr rounds

ctr_name	svy_year	pos	variable	label	col_type	unique_values	range
Pakistan	1990	1	hv001	cluster number	dbl	407	1101001 - 4204052
Pakistan	2006	1	hv001	cluster number	dbl	972	1 - 980
Pakistan	2012	1	hv001	Cluster number	dbl	498	1 - 500
Pakistan	2017	1	hv001	cluster number	dbl	561	1 - 580
Pakistan	1990	2	hv002	household number	dbl	25	1 - 25
Pakistan	2006	2	hv002	household number	dbl	105	1 - 105
Pakistan	2012	2	hv002	Household number	dbl	28	1 - 28
Pakistan	2017	2	hv002	household number	dbl	29	1 - 29
Pakistan	1990	3	hvidx	line number	dbl	55	1 - 55
Pakistan	2006	3	hvidx	line number	dbl	60	1 - 60
Pakistan	2012	3	hvidx	Line number	dbl	48	1 - 48
Pakistan	2017	3	hvidx	line number	dbl	44	1 - 44

From the above table we can see that all the three constituent ID variables are of numeric class with no missing values. These variables can directly be used for preparing the ID variables after finding the maximum length of their largest value. Note that survey year is also a constituent ID variable of 4-digits and we need not check it.

# We thought to process the above nested tibble further by decomposing the 
# "range" col into min and max values using separate_wider_regex().
# However, we hit a roadblock as pattern did not identify the max values in 
# some pkpr rounds correctly
pkpr1_pre_tmp3 <- pkpr1_pre_tmp0 |> 
  # Generate the summary stats for id vars
  mutate(skim_idvars = map(pkpr_data, \(df) {
    df |> 
      select(hv001, hv002, hvidx) |> 
      skim_without_charts()
  })) |> 
  # Pool the summary stats for all pkpr rounds
  select(c(ctr_name, svy_year, skim_idvars)) |> 
  unnest(cols = c(skim_idvars)) |> 
  arrange(skim_variable, svy_year) |> 
  # Group and generate the max and min values for each variable
  group_by(variable = skim_variable) |> 
  summarize(
    min_val = min(numeric.p0),
    max_val = max(numeric.p100)
  ) |> 
  # calculate the num of digits in the maximum values
  mutate(
    max_digits = nchar(as.character(max_val))
  ) |>
  # add variable labels and relocate it after variable name
  bind_cols(vlabel = c("cluster number", "household number", "Persons line number")) |>
  relocate(vlabel, .after = 1)

# Convert the tibble to flextable for easy viewing
pkpr1_pre_tmp3 |>
  qflextable() |>
  align(align = "left", part = "all") |>
  autofit()

Table 29: The maximum length of constituent ID variables to be set across the pkpr rounds

variable	vlabel	min_val	max_val	max_digits
hv001	cluster number	1	4204052	7
hv002	household number	1	105	3
hvidx	Persons line number	1	60	2

Checking Family structure variables before harmonization

Here we check the family structure related variables before harmonizing them. The variable names were collected by manually checking the full data dictionaries. Here we will check the data dictionary of these hh-level variables and focus on the variable types.

# We check the family structure vars in all pkpr datasets.
# First we create the data dictionary in nested tibble.
pkpr1_pre_tmp1 <- pkpr1_pre_tmp0 |>
  mutate(lookfor_famstrvars = map(pkpr_data, \(df) {
    df |> 
      # select the common independent variables
      select(c(hv101, hv102, hv103, hv104, hv105)) |> 
      lookfor(details = "full") |> 
      select(-c(levels:n_na)) |> 
      # For correctly viewing the range column in data dictionary
      convert_list_columns_to_character()
  }))
pkpr1_pre_tmp1

# Now we unnest the tibble and refine the pooled data dictionary 
pkpr1_pre_tmp2 <- pkpr1_pre_tmp1 |> 
  select(c(svy_year, lookfor_famstrvars)) |> 
  unnest(cols = c(lookfor_famstrvars)) |> 
  arrange(pos, svy_year)

# Convert the tibble to flextable for easy viewing
pkpr1_pre_tmp2 |> 
  qflextable() |> 
  align(align = "left", part = "all") |> 
  autofit()

Table 30: Data dictionary of family structure vars across the pkpr rounds

svy_year	pos	variable	label	col_type	missing	unique_values	range
1990	1	hv101	relationship to head	dbl+lbl	11	13	1 - 98
2006	1	hv101	relationship to head	dbl+lbl	0	13	1 - 99
2012	1	hv101	Relationship to head	dbl+lbl	0	18	1 - 99
2017	1	hv101	relationship to head	dbl+lbl	0	17	1 - 98
1990	2	hv102	usual resident	dbl+lbl	2	3	0 - 1
2006	2	hv102	usual resident	dbl+lbl	0	3	0 - 9
2012	2	hv102	Usual resident	dbl+lbl	0	3	0 - 9
2017	2	hv102	usual resident	dbl+lbl	1	3	0 - 1
1990	3	hv103	slept last night	dbl+lbl	12	3	0 - 1
2006	3	hv103	slept last night	dbl+lbl	0	3	0 - 9
2012	3	hv103	Slept last night	dbl+lbl	0	3	0 - 9
2017	3	hv103	slept last night	dbl+lbl	2	3	0 - 1
1990	4	hv104	sex of household member	dbl+lbl	2	3	1 - 2
2006	4	hv104	sex of household member	dbl+lbl	0	3	1 - 9
2012	4	hv104	Sex of household member	dbl+lbl	0	2	1 - 2
2017	4	hv104	sex of household member	dbl+lbl	0	2	1 - 2
1990	5	hv105	age of household members	dbl+lbl	23	98	0 - 98
2006	5	hv105	age of household members	dbl+lbl	0	99	0 - 99
2012	5	hv105	Age of household members	dbl+lbl	0	98	0 - 99
2017	5	hv105	age of household members	dbl+lbl	1	98	0 - 98

The above table gives an overall snapshot of the family structure related variables. Interestingly, all the variables including age of hh members (a continuous var) are of labelled class. The relation to head and de facto resident variables have few missing values in pkpr 1996. Note that, the three variables of interest hv101-hv102, two variables hv101 and hv103 have different number of value labels across the pkpr rounds. Next, we compare the value labels of the individual variables across the pkpr datasets.

hv101 - Relationship to head

Next, we check the value labels of the relationship to the household head variable. First we create a nested tibble of the value labels.

# Create the data dictionary in nested tibble
pkpr1_pre_tmp1 <- pkpr1_pre_tmp0 |> 
  mutate(lookfor_hv101 = map(pkpr_data, \(df) {
    df |> 
      select(hv101) |> 
      look_for() |> 
      lookfor_to_long_format() |> 
      select(value_labels)
  }))
pkpr1_pre_tmp1

# Now we unnest the tibble and refine the pooled data dictionary
pkpr1_pre_tmp2 <- pkpr1_pre_tmp1 |> 
  # First we select the required cols and unnest()
  select(c(ctr_name, svy_year, lookfor_hv101)) |> 
  unnest(cols = c(lookfor_hv101)) |> 
  # Next we make the num of value labels same across each round
  mutate(label_num = parse_number(value_labels)) |> 
  complete(ctr_name, svy_year, label_num) |>
  # Next we create col of value labels for each survey round
  pivot_wider(
    names_from = svy_year, 
    values_from = value_labels,
    names_prefix = "pkpr_"
  ) |>
  # Show the variable name in a col
  mutate(var_name = "hv101", .before = 2)

# Convert the tibble to flextable for easy viewing
pkpr1_pre_tmp2 |> 
  qflextable() |> 
  align(align = "left", part = "all") |> 
  autofit()

Table 31: Data dictionary of relationship to head variable across the pkpr rounds

ctr_name	var_name	label_num	pkpr_1990	pkpr_2006	pkpr_2012	pkpr_2017
Pakistan	hv101	1	[1] head	[1] head	[1] Head	[1] head
Pakistan	hv101	2	[2] wife or husband	[2] wife or husband	[2] Wife or husband	[2] wife or husband
Pakistan	hv101	3	[3] son/daughter	[3] son/daughter	[3] Son/daughter	[3] son/daughter
Pakistan	hv101	4	[4] son/daughter-in-law	[4] son/daughter-in-law	[4] Son/daughter-in-law	[4] son/daughter-in-law
Pakistan	hv101	5	[5] grandchild	[5] grandchild	[5] Grandchild	[5] grandchild
Pakistan	hv101	6	[6] parent	[6] parent	[6] Parent	[6] parent
Pakistan	hv101	7	[7] parent-in-law	[7] parent-in-law	[7] Parent-in-law	[7] parent-in-law
Pakistan	hv101	8	[8] brother/sister	[8] brother/sister	[8] Brother/sister	[8] brother/sister
Pakistan	hv101	9	[9] co-spouse	[9] co-spouse		[9] co-spouse
Pakistan	hv101	10	[10] other relative	[10] other relative	[10] Other relative	[10] other relative
Pakistan	hv101	11	[11] adopted/foster child	[11] adopted/foster child	[11] Adopted/foster child	[11] adopted/foster child
Pakistan	hv101	12	[12] not related	[12] not related	[12] Not related	[12] not related
Pakistan	hv101	13		[13] niece/nehew by blood	[13] Niece/nephew by blood	[13] niece/nephew by blood
Pakistan	hv101	14		[14] niece/nephew by marriage		[14] niece/nephew by marriage
Pakistan	hv101	15			[15] Brother/sister-in-law	[15] brother/sister in law
Pakistan	hv101	16			[16] Grand parents	[16] grand parents
Pakistan	hv101	17			[17] Aunts/uncle	[17] aunts/uncle
Pakistan	hv101	18			[18] Domestic servant	[18] domestic servant
Pakistan	hv101	98	[98] dk	[98] dk	[98] Don't know	[98] don't know

The above table shows that the value label texts vary across the pkpr rounds. To harmonize the relationship to head variable we can use the following value labels -

1 head
2 spouse
3 child
4 child-in-law
5 grandchild
6 parent
7 parent-in-law
8 sibling
9 others

Here, we merge the “spouse” and “co-spouse” categories into “spouse” category, and the “son/daughter” and “adopted/foster child” categories into “child” category.

hv102 - de jure/usual resident

Next, we check the value labels of the de jure resident variable. This means if a household member is an usual resident of the household. First we create a nested tibble of the value labels.

# Create the data dictionary in nested tibble
pkpr1_pre_tmp1 <- pkpr1_pre_tmp0 |> 
  mutate(lookfor_hv102 = map(pkpr_data, \(df) {
    df |> 
      select(hv102) |> 
      look_for() |> 
      lookfor_to_long_format() |> 
      select(value_labels)
  }))
pkpr1_pre_tmp1

# Now we unnest the tibble and refine the pooled data dictionary
pkpr1_pre_tmp2 <- pkpr1_pre_tmp1 |> 
  # First we select the required cols and unnest()
  select(c(ctr_name, svy_year, lookfor_hv102)) |> 
  unnest(cols = c(lookfor_hv102)) |> 
  # Next we make the num of value labels same across each round
  mutate(label_num = parse_number(value_labels)) |> 
  complete(ctr_name, svy_year, label_num) |>
  # Next we create col of value labels for each survey round
  pivot_wider(
    names_from = svy_year, 
    values_from = value_labels,
    names_prefix = "pkpr_"
  ) |>
  # Show the variable name in a col
  mutate(var_name = "hv102", .before = 2)

# Convert the tibble to flextable for easy viewing
pkpr1_pre_tmp2 |> 
  qflextable() |> 
  align(align = "left", part = "all") |> 
  autofit()

Table 32: Data dictionary of the De jure resident variable across the pkpr rounds

ctr_name	var_name	label_num	pkpr_1990	pkpr_2006	pkpr_2012	pkpr_2017
Pakistan	hv102	0	[0] no	[0] no	[0] No	[0] no
Pakistan	hv102	1	[1] yes	[1] yes	[1] Yes	[1] yes

The above table shows that hv102 has the same value label texts and codes across the pkpr rounds. Therefore, we can use this variable directly after converting to factor type.

hv103 - de facto resident

Next, we check the value labels of the de facto resident variable. In DHS this means if a household member slept last night in the household. First we create a nested tibble of the value labels.

# Create the data dictionary in nested tibble
pkpr1_pre_tmp1 <- pkpr1_pre_tmp0 |> 
  mutate(lookfor_hv103 = map(pkpr_data, \(df) {
    df |> 
      select(hv103) |> 
      look_for() |> 
      lookfor_to_long_format() |> 
      select(value_labels)
  }))
pkpr1_pre_tmp1

# Now we unnest the tibble and refine the pooled data dictionary
pkpr1_pre_tmp2 <- pkpr1_pre_tmp1 |> 
  # First we select the required cols and unnest()
  select(c(ctr_name, svy_year, lookfor_hv103)) |> 
  unnest(cols = c(lookfor_hv103)) |> 
  # Next we make the num of value labels same across each round
  mutate(label_num = parse_number(value_labels)) |> 
  complete(ctr_name, svy_year, label_num) |>
  # Next we create col of value labels for each survey round
  pivot_wider(
    names_from = svy_year, 
    values_from = value_labels,
    names_prefix = "pkpr_"
  ) |>
  # Show the variable name in a col
  mutate(var_name = "hv103", .before = 2)

# Convert the tibble to flextable for easy viewing
pkpr1_pre_tmp2 |> 
  qflextable() |> 
  align(align = "left", part = "all") |> 
  autofit()

Table 33: Data dictionary of the De facto resident variable across the pkpr rounds

ctr_name	var_name	label_num	pkpr_1990	pkpr_2006	pkpr_2012	pkpr_2017
Pakistan	hv103	0	[0] no	[0] no	[0] No	[0] no
Pakistan	hv103	1	[1] yes	[1] yes	[1] Yes	[1] yes

The above table shows that hv103 has the same value label texts and codes across the pkpr rounds. Therefore, we can use this variable directly after converting to factor type.

START FROM HERE

TASK:

Handling multiple births in death scarring vars may not be necessary.
Preceding birth interval construction has changed with DHS-7. We could re-construct it.

TO BE CONTINUED …

Getting started

Document introduction

Data import

Pakistan BR dataset use for variable creation

Checking the Women’s weight variable before harmonization

TBD

Checking the ID variables before harmonization

Numeric ID variables check

Labelled ID variables check

Checking the Birth History variables before harmonization

b0 - child is twin

b4 - sex of child

b5 - child is alive

b6 - age at death

b9 - child lives with whom

b10 - completeness of information

Checking the Common independent variables before harmonization

v106 - Mother’s education level

v011 - Date of birth (in CMC)

v501 - Mother’s marital status

v701 - Husband/Partner’s education level

v025 - Type of place of residence

v151 - Sex of household head

v152 - Age of household head

Checking the Social group variables before harmonization

Correcting year-related variables

Pakistan HH dataset use for variable creation

Checking the ID variables before harmonization

Checking HH-level variables before harmonization

Wealth index quintile variable

Pakistan PR dataset use for family structure variables creation

Checking the ID variables before harmonization

Checking Family structure variables before harmonization

hv101 - Relationship to head

hv102 - de jure/usual resident

hv103 - de facto resident

START FROM HERE