MVDHS data pooling pre-checks

Getting started

Here we show the pre-requisite code sections. Run these at the outset to avoid errors. First we load the required packages.

easypackages::libraries(
  # Data i/o
  "here",                 # relative file path
  "rio",                  # file import-export
  
  # Data manipulation
  "janitor",              # data cleaning fns
  "haven",                # stata, sas, spss data io
  "labelled",             # var labelling
  "readxl",               # excel sheets
  # "scales",               # to change formats and units
  "skimr",                # quick data summary
  "broom",                # view model results
  
  # Data analysis
  "DHS.rates",            # demographic rates for dhs-like surveys
  "GeneralOaxaca",        # BO decomposition for non-linear
  "survey",               # apply survey weights
  
  # Analysis output
  "gt",
  # "modelsummary",          # output summary tables
  "gtsummary",            # output summary tables
  "flextable",            # creating tables from objects
  "officer",              # editing in office docs
  
  # R graph related packages
  "ggstats",
  "RColorBrewer",
  # "scales",
  "patchwork",
  
  # Misc packages
  "tidyverse",            # Data manipulation iron man
  "tictoc"                # Code timing
)

Next we turn off scientific notation.

options(scipen = 999)

Next we set the default gtsummary print engine for tables.

theme_gtsummary_printer(print_engine = "flextable")

Now we set the flextable output defaults.

set_flextable_defaults(
  font.size = 11,
  text.align = "left",
  big.mark = "",
  background.color = "white",
  table.layout = "autofit",
  theme_fun = theme_vanilla
)

Document introduction

Here we document the variable codes and labels of variables across all the Maldives Demographic and Health Survey (DHS) datasets. We check the variable labels and codes before running the pooling code in “daprep-v01_mvdhs.R”. We pool the following Maldives DHS surveys:

# Creating the table of surveys to be used for pooling
mvbr1_tmp_intro |> 
  mutate(n_births = prettyNum(n_births, big.mark = ",")) |> 
  select(c(ctr_name, svy_year, n_births)) |> 
  # Join vars from mvir_tmp_intro
  left_join(
    mvir1_tmp_intro |> 
      mutate(n_women = prettyNum(n_women, big.mark = ",")) |> 
      select(c(year, n_women)),
    by = join_by(svy_year == year)
  ) |> 
  # Join vars from mvhr_tmp_intro
  left_join(
    mvhr1_tmp_intro |> 
      mutate(n_households = prettyNum(n_households, big.mark = ",")) |> 
      select(svy_year, n_households),
    by = join_by(svy_year)
  ) |> 
  # Join vars from mvpr_tmp_intro
  left_join(
    mvpr1_tmp_intro |> 
      mutate(n_persons = prettyNum(n_persons, big.mark = ",")) |> 
      select(svy_year, n_persons),
    by = join_by(svy_year)
  ) |> 
  # convert nested tibble to simple tibble
  unnest(cols = c()) |> 
  mutate(
    ccode = row_number(), 
    .before = ctr_name
  ) |> 
  # convert to flextable object
  qflextable() |> 
  align(align = "left", part = "all") |> 
  autofit()
Table 1: Maldives DHS datasets and their sample size to be used for pooling

| ccode | ctr_name | svy_year | n_births | n_women | n_households | n_persons |
|---|---|---|---|---|---|---|
| 1 | Maldives | 2009 | 20,136 | 7,131 | 6,443 | 42,050 |
| 2 | Maldives | 2016 | 13,922 | 7,699 | 6,050 | 32,656 |

We use the following variables for the pooled data analysis:

  • Dependent variable
    • infantd = Index child died during infancy period (0-11 months)
  • Main independent variables
    • sibsurv_nmv = Survival status of preceding child (Death scarring)
    • binterval_3c_nmv_opp = Birth interval preceding the index child
  • Independent variables [CHILD LEVEL]
    • cyob10y_opp = Birth cohort of index child
    • bord_c = Birth order of index child
    • sex_fm = Gender of index child
    • season = Season during birth
  • Independent variables [MOTHER/PARENT LEVEL]
    • myob_opp = Birth cohort of mother
    • macb_c_opp = Mother’s age during birth of index child
    • medu_opp = Mother’s Level of education
    • fedu_opp = Father’s level of education
  • Independent variables [HOUSEHOLD LEVEL]
    • religion = Religion
    • nat_lang = Native language of respondent
    • wi_qt_opp = Household wealth quintile
    • hhgen_2c_opp = Generations in household
    • hhstruc_opp = Household structure
    • head_sex_fm = Sex of HH head
  • Independent variables [COMMUNITY LEVEL]
    • por = Place of residence of the household
    • ecoreg = Ecological region

Note: (a) Crossed-out names indicate variables that are not included.

Data import

We will directly import the nested tibble here. The code for dataset preparation is in the “daprep-v01_mvdhs.R” script file.

# Here we import the tibbles to be used for dataset checking
# Import the mvbr nested tibble
mvbr1_pre_tmp0 <- read_rds(file = here("website_data", "mvbr1_nest0.rds"))
# Import the mvhr nested tibble
mvhr1_pre_tmp0 <- read_rds(file = here("website_data", "mvhr1_nest0.rds"))
# Import the mvpr nested tibble
mvpr1_pre_tmp0 <- read_rds(file = here("website_data", "mvpr1_nest0.rds"))

Maldives BR dataset used for variable creation

Checking the Women’s weight variable before harmonization

We will check the formatting of the v005 women’s weight variable before creating the pooled survey weight. For this we will use labelled::look_for().

# First we create the data dictionary of v005 in nested tibble
mvbr1_pre_tmp1 <- mvbr1_pre_tmp0 |> 
  mutate(lookfor_v005 = map(
    mvbr_data,
    \(df) {
      df |> 
        select(v005) |> 
        look_for(details = "full") |> 
        # For correctly viewing the range column in data dictionary
        convert_list_columns_to_character() |> 
        select(-c(levels:n_na))
    }
  ))
mvbr1_pre_tmp1
# Now we unnest the tibble and refine the pooled data dictionary 
mvbr1_pre_tmp2 <- mvbr1_pre_tmp1 |> 
  select(-c(unf, mvbr_data, n_births)) |> 
  unnest(cols = c(lookfor_v005)) |> 
  select(-pos) 
# Convert and view the tibble as flextable
mvbr1_pre_tmp2 |> 
  qflextable() |> 
  align(align = "left", part = "all") |> 
  autofit()
Table 2: Data dictionary of v005 variable across the mvbr rounds

| ctr_name | svy_year | variable | label | col_type | missing | unique_values | range |
|---|---|---|---|---|---|---|---|
| Maldives | 2009 | v005 | sample weight | dbl | 0 | 257 | 99107 - 4890879 |
| Maldives | 2016 | v005 | women's individual sample weight (6 decimals) | dbl | 0 | 173 | 162566 - 5001387 |

The women’s weight variable is numeric (dbl) and has no missing values in either round. It therefore needs no reformatting, and we use it directly to prepare the pooled survey weight.
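DHS recodes store v005 with six implied decimals, which agrees with the 2016 label and the ranges in Table 2. The sketch below shows only the rescaling step on toy values taken from those ranges. The column name `wt` is an assumption for illustration; the actual pooled-weight construction lives in the preparation script and may differ.

```r
# Toy data: v005 values copied from the ranges shown in Table 2
toy <- data.frame(
  svy_year = c(2009, 2009, 2016, 2016),
  v005     = c(99107, 4890879, 162566, 5001387)
)

# v005 carries six implied decimals, so divide by 1e6 to get the weight.
# `wt` is a hypothetical name used only for this illustration.
toy$wt <- toy$v005 / 1e6

toy
```

Any further adjustment for pooling (for example rescaling weights within each round) would be applied on top of this rescaled value.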

Checking the ID variables before harmonization

Here we check the formatting of the variables from which we will build the ID variables for the pooled Maldives birth history recode (BR) dataset. We will use the following constituent variables to create the ID variables for the pooled dataset:

# We check the var type of ID vars in all mvbr datasets.
# First we create a data dictionary of the mvbr datasets in nested tibble.
mvbr1_pre_tmp1 <- mvbr1_pre_tmp0 |>
  mutate(lookfor_idvars = map(
    mvbr_data,
    \(df) {
      df |> 
        select(v001, v002, v003, bord, v021, v022, v023, v024) |> 
        lookfor(details = "full") |> 
        select(-c(levels:n_na)) |> 
        # For correctly viewing the range column in data dictionary
        convert_list_columns_to_character()
    }
  ))
mvbr1_pre_tmp1
# Now we unnest the tibble and output the pooled data dictionary 
mvbr1_pre_tmp2 <- mvbr1_pre_tmp1 |> 
  select(-c(unf, mvbr_data, n_births)) |> 
  unnest(cols = c(lookfor_idvars)) |> 
  arrange(pos)

# Convert and view the tibble as flextable
mvbr1_pre_tmp2 |> 
  qflextable() |> 
  align(align = "left", part = "all") |> 
  autofit()
Table 3: Data dictionary of variables to be used for ID creation across the mvbr rounds

| ctr_name | svy_year | pos | variable | label | col_type | missing | unique_values | range |
|---|---|---|---|---|---|---|---|---|
| Maldives | 2009 | 1 | v001 | cluster number | dbl | 0 | 270 | 1 - 270 |
| Maldives | 2016 | 1 | v001 | cluster number | dbl | 0 | 266 | 1 - 266 |
| Maldives | 2009 | 2 | v002 | household number | dbl | 0 | 98 | 1 - 99 |
| Maldives | 2016 | 2 | v002 | household number | dbl | 0 | 41 | 1 - 42 |
| Maldives | 2009 | 3 | v003 | respondent's line number | dbl | 0 | 21 | 1 - 23 |
| Maldives | 2016 | 3 | v003 | respondent's line number | dbl | 0 | 20 | 1 - 27 |
| Maldives | 2009 | 4 | bord | birth order number | dbl | 0 | 14 | 1 - 14 |
| Maldives | 2016 | 4 | bord | birth order number | dbl | 0 | 13 | 1 - 13 |
| Maldives | 2009 | 5 | v021 | primary sampling unit | dbl | 0 | 270 | 1 - 270 |
| Maldives | 2016 | 5 | v021 | primary sampling unit | dbl | 0 | 266 | 1 - 266 |
| Maldives | 2009 | 6 | v022 | sample stratum number | dbl+lbl | 0 | 21 | 1 - 21 |
| Maldives | 2016 | 6 | v022 | sample strata for sampling errors | dbl+lbl | 0 | 21 | 10 - 39 |
| Maldives | 2009 | 7 | v023 | sample domain | dbl+lbl | 0 | 21 | 10 - 39 |
| Maldives | 2016 | 7 | v023 | stratification used in sample design | dbl+lbl | 0 | 21 | 10 - 39 |
| Maldives | 2009 | 8 | v024 | region | dbl+lbl | 0 | 6 | 1 - 6 |
| Maldives | 2016 | 8 | v024 | region | dbl+lbl | 0 | 6 | 1 - 6 |

From the above we can see that v022, v023 and v024 are of labelled class (dbl+lbl), while the rest are plain numeric. Therefore, we will check the numeric and labelled variables in different ways; v022 is checked together with the numeric variables below. Note that although survey year is a constituent ID variable, we have not checked it, since it is necessarily a 4-digit value.

Numeric ID variables check

First, let’s find out the required length of the numeric ID variables by checking the maximum values of the constituent ID variable across the Maldives DHS datasets. Here we estimate the summary stats of numeric constituent variables using skim_without_charts().

# Check the summary stats for ID vars using skimr in each mvbr dataset.
# First we estimate the summary stats using skim_without_charts().
mvbr1_pre_tmp1 <- mvbr1_pre_tmp0 |> 
  mutate(
    skim_id_num = map(
      mvbr_data,
      function(df) {
        df |> 
          select(v001, v002, v003, bord, v021, v022) |> 
          skim_without_charts() |> 
          as_tibble() |> 
          select(-c(skim_type, n_missing, complete_rate)) |> 
          rename(
            variable = 1,
            mean = 2,
            sd = 3,
            min = 4,
            p25 = 5,
            p50 = 6,
            p75 = 7,
            max = 8
          )
      }
    )
  )
mvbr1_pre_tmp1

Next, we compare the summary stats of the numeric variables, variable by variable.

# Now we unnest the nested tibble so that we can compare the variable length 
# across the mvbr datasets.
mvbr1_pre_tmp2 <- mvbr1_pre_tmp1 |> 
  select(-c(unf, mvbr_data, n_births)) |> 
  unnest(cols = c(skim_id_num)) |> 
  arrange(variable, svy_year) |> 
  # change the decimal places of selected variables
  mutate(
    mean = sprintf("%.1f", mean),
    sd = sprintf("%.1f", sd),
    p75 = sprintf("%.0f", p75)
  )
# Convert the tibble to flextable for easy viewing
mvbr1_pre_tmp2 |> 
  qflextable() |> 
  align(align = "left", part = "all") |> 
  autofit()
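Table 4: Summary statistics of the numeric ID variables

| ctr_name | svy_year | variable | mean | sd | min | p25 | p50 | p75 | max |
|---|---|---|---|---|---|---|---|---|---|
| Maldives | 2009 | bord | 3.0 | 2.0 | 1 | 1 | 2 | 4 | 14 |
| Maldives | 2016 | bord | 2.3 | 1.5 | 1 | 1 | 2 | 3 | 13 |
| Maldives | 2009 | v001 | 141.3 | 71.9 | 1 | 84 | 143 | 199 | 270 |
| Maldives | 2016 | v001 | 139.7 | 71.7 | 1 | 79 | 139 | 201 | 266 |
| Maldives | 2009 | v002 | 28.4 | 18.9 | 1 | 13 | 26 | 41 | 99 |
| Maldives | 2016 | v002 | 13.0 | 7.4 | 1 | 7 | 13 | 19 | 42 |
| Maldives | 2009 | v003 | 2.5 | 2.0 | 1 | 2 | 2 | 3 | 23 |
| Maldives | 2016 | v003 | 2.4 | 1.8 | 1 | 1 | 2 | 3 | 27 |
| Maldives | 2009 | v021 | 141.3 | 71.9 | 1 | 84 | 143 | 199 | 270 |
| Maldives | 2016 | v021 | 139.7 | 71.7 | 1 | 79 | 139 | 201 | 266 |
| Maldives | 2009 | v022 | 10.1 | 6.2 | 1 | 5 | 10 | 15 | 21 |
| Maldives | 2016 | v022 | 27.3 | 7.9 | 10 | 22 | 27 | 34 | 39 |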

Now we find the required length of each numeric ID variable, so that we can concatenate them correctly to create the ID variables. The required length of each variable is given in the max_digits column. Note that survey year is also a constituent ID variable, with 4 digits.

# Processing the above nested tibble further
mvbr1_pre_tmp3 <- mvbr1_pre_tmp2 |> 
  group_by(variable) |> 
  # find the minimum and maximum values across surveys 
  summarize(
    min_val = min(min),
    max_val = max(max)
  ) |> 
  mutate(
    # calculate the num of digits in the maximum values
    max_digits = nchar(as.character(max_val)),
    # convert char var to factor
    variable = fct(
      variable, 
      levels = c("v001", "v002", "v003", "bord", "v021", "v022")
    )
  ) |> 
  # sort the rows by factor levels 
  arrange(variable) |> 
  # add variable labels and relocate it after variable name.
  bind_cols(vlabel = c("cluster number", "household number", 
                       "respondent's line number", "birth order", 
                       "primary sampling unit", "sample strata for se")) |> 
  relocate(vlabel, .after = 1)

# Convert the tibble to flextable for easy viewing
mvbr1_pre_tmp3 |> 
  qflextable() |> 
  align(align = "left", part = "all") |> 
  autofit()
Table 5: The maximum length of numeric variables to be set across the mvbr rounds for concatenating the ID variables

| variable | vlabel | min_val | max_val | max_digits |
|---|---|---|---|---|
| v001 | cluster number | 1 | 270 | 3 |
| v002 | household number | 1 | 99 | 2 |
| v003 | respondent's line number | 1 | 27 | 2 |
| bord | birth order | 1 | 14 | 2 |
| v021 | primary sampling unit | 1 | 270 | 3 |
| v022 | sample strata for se | 1 | 39 | 2 |
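The widths in the max_digits column matter because unpadded concatenation can make two different births collide on the same ID string. A minimal sketch of fixed-width concatenation on toy rows follows. The helper `pad()`, the column `child_id` and the choice and order of components are assumptions for illustration, not necessarily what the preparation script does.

```r
# Toy rows; widths below come from max_digits, plus 4 digits for survey year
toy <- data.frame(
  svy_year = c(2009, 2016),
  v001     = c(7, 266),
  v002     = c(99, 4),
  v003     = c(2, 27),
  bord     = c(14, 1)
)

# formatC() left-pads with zeros so every component has a fixed width.
# `pad()` and `child_id` are hypothetical names used only for this sketch.
pad <- function(x, width) formatC(as.integer(x), width = width, flag = "0")

toy$child_id <- paste0(
  toy$svy_year,      # 4 digits
  pad(toy$v001, 3),  # cluster number
  pad(toy$v002, 2),  # household number
  pad(toy$v003, 2),  # respondent's line number
  pad(toy$bord, 2)   # birth order
)

toy$child_id
```

With fixed widths every ID has the same length (13 characters here), so a quick check on the string length and on duplicated IDs catches most concatenation mistakes.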

Labelled ID variables check

First we check the value labels of the sub-national region variable, v024, across the mvbr datasets. Let’s create a nested tibble of v024’s value labels.

# Create the data dictionary for v024 in nested tibble
mvbr1_pre_tmp1 <- mvbr1_pre_tmp0 |> 
  mutate(lookfor_v024 = map(
    mvbr_data,
    \(df) {
      df |> 
        select(v024) |> 
        look_for() |> 
        lookfor_to_long_format() |> 
        select(value_labels)
    }
  ))
mvbr1_pre_tmp1

Now we view the value labels of v024 in the table below.

# Now we unnest the tibble and refine the pooled data dictionary 
mvbr1_pre_tmp2 <- mvbr1_pre_tmp1 |> 
  # First we select the required cols and unnest()
  select(-c(unf, mvbr_data, n_births)) |> 
  unnest(cols = c(lookfor_v024)) |> 
  # Next we make the num of value labels same across each round
  mutate(label_num = parse_number(value_labels)) |> 
  complete(ctr_name, svy_year, label_num) |>
  # Next we create col of value labels for each survey round
  pivot_wider(
    names_from = svy_year, 
    values_from = value_labels,
    names_prefix = "mvbr_"
  ) |> 
  # Show the variable name in a col
  mutate(var_name = "v024", .before = 2)

# Convert the tibble to flextable for easy viewing
mvbr1_pre_tmp2 |> 
  qflextable() |> 
  align(align = "left", part = "all") |> 
  autofit()
Table 6: Data dictionary of v024 across the mvbr rounds

| ctr_name | var_name | label_num | mvbr_2009 | mvbr_2016 |
|---|---|---|---|---|
| Maldives | v024 | 1 | [1] malé | [1] malé |
| Maldives | v024 | 2 | [2] north | [2] north region |
| Maldives | v024 | 3 | [3] north central | [3] north central |
| Maldives | v024 | 4 | [4] central | [4] central region |
| Maldives | v024 | 5 | [5] south central | [5] south central |
| Maldives | v024 | 6 | [6] south | [6] south region |

NOTE: The sub-national region var, v024, has the same codes (1 to 6) in both survey rounds, but the label wording differs for codes 2, 4 and 6 (for example, “north” in 2009 and “north region” in 2016).
VERD: In this analysis, we do not use the region var in the ID var.
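The comparison of v024 labels can also be done programmatically rather than by eye. A minimal sketch, using label vectors typed in by hand from Table 6; the object names `lab_2009` and `lab_2016` are assumptions for illustration.

```r
# Value labels of v024 per round, copied from Table 6 (names are the codes)
lab_2009 <- c("1" = "malé", "2" = "north", "3" = "north central",
              "4" = "central", "5" = "south central", "6" = "south")
lab_2016 <- c("1" = "malé", "2" = "north region", "3" = "north central",
              "4" = "central region", "5" = "south central",
              "6" = "south region")

# Do both rounds use the same set of codes?
same_codes <- identical(names(lab_2009), names(lab_2016))

# Codes whose label wording differs between the rounds
diff_codes <- names(lab_2009)[lab_2009 != lab_2016]

same_codes
diff_codes
```

The same two checks apply to any labelled variable once its value labels have been extracted per round.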


Secondly, we check the value labels of the v023 variable, which denotes the stratification used in the sampling design. First we create a nested tibble of v023’s value labels.

# Create the data dictionary for v023 in nested tibble
mvbr1_pre_tmp1 <- mvbr1_pre_tmp0 |> 
  mutate(lookfor_v023 = map(
    mvbr_data,
    \(df) {
      df |> 
        select(v023) |> 
        look_for() |> 
        lookfor_to_long_format() |> 
        select(value_labels)
    }
  ))
mvbr1_pre_tmp1

Now we view the value labels of v023 in the table below.

# Now we unnest the tibble and refine the pooled data dictionary
mvbr1_pre_tmp2 <- mvbr1_pre_tmp1 |> 
  # First we select the required cols and unnest()
  select(-c(unf, mvbr_data, n_births)) |> 
  unnest(cols = c(lookfor_v023)) |> 
  # Next we make the num of value labels same across each round
  mutate(label_num = parse_number(value_labels)) |> 
  complete(ctr_name, svy_year, label_num) |>
  # Next we create col of value labels for each survey round
  pivot_wider(
    names_from = svy_year, 
    values_from = value_labels,
    names_prefix = "mvbr_"
  ) |>
  # Show the variable name in a col
  mutate(var_name = "v023", .before = 2) 

# Convert the tibble to flextable for easy viewing
mvbr1_pre_tmp2 |> 
  qflextable() |> 
  align(align = "left", part = "all") |> 
  autofit()
Table 7: Data dictionary of v023 across the mvbr rounds

| ctr_name | var_name | label_num | mvbr_2009 | mvbr_2016 |
|---|---|---|---|---|
| Maldives | v023 | 10 | [10] male | [10] malé |
| Maldives | v023 | 20 | [20] haa alif | [20] north thiladhunmathi (ha) |
| Maldives | v023 | 21 | [21] haa dhaal | [21] south thiladhunmathi (hdh) |
| Maldives | v023 | 22 | [22] shaviyani | [22] north miladhunmadulu (sh) |
| Maldives | v023 | 23 | [23] noonu | [23] south miladhunmadulu (n) |
| Maldives | v023 | 24 | [24] raa | [24] north maalhosmadulu (r) |
| Maldives | v023 | 25 | [25] baa | [25] south maalhosmadulu (b) |
| Maldives | v023 | 26 | [26] lhaviyani | [26] faadhippolhu (lh) |
| Maldives | v023 | 27 | [27] kaafu | [27] malé atoll (k) |
| Maldives | v023 | 28 | [28] alif alif | [28] north ari atoll (aa) |
| Maldives | v023 | 29 | [29] alif dhaal | [29] south ari atoll (adh) |
| Maldives | v023 | 30 | [30] vaavu | [30] felidhe atoll (v) |
| Maldives | v023 | 31 | [31] meemu | [31] mulakatholhu (m) |
| Maldives | v023 | 32 | [32] faafu | [32] north nilandhe atoll (f) |
| Maldives | v023 | 33 | [33] dhaalu | [33] south nilandhe atoll (dh) |
| Maldives | v023 | 34 | [34] thaa | [34] kolhumadulu (th) |
| Maldives | v023 | 35 | [35] lhaamu | [35] hadhdhunmathi (l) |
| Maldives | v023 | 36 | [36] gaaf alif | [36] north huvadhu atoll (ga) |
| Maldives | v023 | 37 | [37] gaaf dhaal | [37] south huvadhu atoll (gdh) |
| Maldives | v023 | 38 | [38] gnaviyani | [38] gnaviyani (gn) |
| Maldives | v023 | 39 | [39] seenu | [39] addu atoll (s) |

NOTE: The labels of v023 are different across the survey rounds.
VERD: Therefore we cannot use v023 in the ID variable preparation.

Checking the Birth History variables before harmonization

The birth history variables are central to the objective of this study. Therefore, we need to scrutinize all of them before using them to prepare harmonized variables for the pooled dataset.

# We check the birth history vars in all mvbr datasets.
# First we create a data dictionary in nested tibble.
mvbr1_pre_tmp1 <- mvbr1_pre_tmp0 |>
  mutate(lookfor_bhvars = map(
    mvbr_data,
    \(df) {
      df |> 
        select(bidx, matches("^b[0-9]+")) |> 
        lookfor(details = "full") |> 
        select(-c(levels:n_na)) |> 
        # For correctly viewing the range column in data dictionary
        convert_list_columns_to_character()
    }
  ))
mvbr1_pre_tmp1
# Now we unnest the tibble and refine the pooled data dictionary
mvbr1_pre_tmp2 <- mvbr1_pre_tmp1 |> 
  select(-c(unf, mvbr_data, n_births)) |> 
  unnest(cols = c(lookfor_bhvars)) |> 
  arrange(pos)

# Convert the tibble to flextable for easy viewing
mvbr1_pre_tmp2 |> 
  qflextable() |> 
  align(align = "left", part = "all") |> 
  autofit()
Table 8: Data dictionary of birth history variables across the mvbr rounds

| ctr_name | svy_year | pos | variable | label | col_type | missing | unique_values | range |
|---|---|---|---|---|---|---|---|---|
| Maldives | 2009 | 1 | bidx | birth column number | dbl | 0 | 14 | 1 - 14 |
| Maldives | 2016 | 1 | bidx | birth column number | dbl | 0 | 13 | 1 - 13 |
| Maldives | 2009 | 2 | b0 | child is twin | dbl+lbl | 0 | 3 | 0 - 2 |
| Maldives | 2016 | 2 | b0 | child is twin | dbl+lbl | 0 | 3 | 0 - 2 |
| Maldives | 2009 | 3 | b1 | month of birth | dbl | 0 | 12 | 1 - 12 |
| Maldives | 2016 | 3 | b1 | month of birth | dbl | 0 | 12 | 1 - 12 |
| Maldives | 2009 | 4 | b2 | year of birth | dbl | 0 | 38 | 1972 - 2009 |
| Maldives | 2016 | 4 | b2 | year of birth | dbl | 0 | 38 | 1980 - 2017 |
| Maldives | 2009 | 5 | b3 | date of birth (cmc) | dbl | 0 | 420 | 867 - 1316 |
| Maldives | 2016 | 5 | b3 | date of birth (cmc) | dbl | 0 | 425 | 972 - 1415 |
| Maldives | 2009 | 6 | b4 | sex of child | dbl+lbl | 0 | 2 | 1 - 2 |
| Maldives | 2016 | 6 | b4 | sex of child | dbl+lbl | 0 | 2 | 1 - 2 |
| Maldives | 2009 | 7 | b5 | child is alive | dbl+lbl | 0 | 2 | 0 - 1 |
| Maldives | 2016 | 7 | b5 | child is alive | dbl+lbl | 0 | 2 | 0 - 1 |
| Maldives | 2009 | 8 | b6 | age at death | dbl+lbl | 18847 | 84 | 100 - 998 |
| Maldives | 2016 | 8 | b6 | age at death | dbl+lbl | 13452 | 67 | 100 - 328 |
| Maldives | 2009 | 9 | b7 | age at death (months-imputed) | dbl | 18836 | 53 | 0 - 360 |
| Maldives | 2016 | 9 | b7 | age at death (months, imputed) | dbl | 13452 | 44 | 0 - 336 |
| Maldives | 2009 | 10 | b8 | current age of child | dbl | 1300 | 38 | 0 - 36 |
| Maldives | 2016 | 10 | b8 | current age of child | dbl | 470 | 38 | 0 - 36 |
| Maldives | 2009 | 11 | b9 | child lives with whom | dbl+lbl | 1300 | 3 | 0 - 4 |
| Maldives | 2016 | 11 | b9 | child lives with whom | dbl+lbl | 470 | 3 | 0 - 4 |
| Maldives | 2009 | 12 | b10 | completeness of information | dbl+lbl | 0 | 7 | 1 - 8 |
| Maldives | 2016 | 12 | b10 | completeness of information | dbl+lbl | 0 | 4 | 0 - 4 |
| Maldives | 2009 | 13 | b11 | preceding birth interval | dbl | 6149 | 185 | 9 - 241 |
| Maldives | 2016 | 13 | b11 | preceding birth interval (months) | dbl | 5451 | 193 | 8 - 266 |
| Maldives | 2009 | 14 | b12 | succeeding birth interval | dbl | 6158 | 185 | 9 - 241 |
| Maldives | 2016 | 14 | b12 | succeeding birth interval (months) | dbl | 5465 | 193 | 8 - 266 |
| Maldives | 2009 | 15 | b13 | flag for age at death | dbl+lbl | 18836 | 6 | 0 - 8 |
| Maldives | 2016 | 15 | b13 | flag for age at death | dbl+lbl | 13452 | 2 | 0 - 0 |
| Maldives | 2009 | 16 | b15 | live birth between births | dbl+lbl | 6245 | 3 | 0 - 1 |
| Maldives | 2016 | 16 | b15 | live birth between births | dbl+lbl | 5408 | 3 | 0 - 1 |
| Maldives | 2009 | 17 | b16 | child's line number in household | dbl+lbl | 1300 | 25 | 0 - 23 |
| Maldives | 2016 | 17 | b16 | child's line number in household | dbl+lbl | 470 | 24 | 0 - 23 |
| Maldives | 2016 | 18 | b17 | day of birth | dbl | 0 | 31 | 1 - 31 |
| Maldives | 2016 | 19 | b18 | century day code of birth (cdc) | dbl | 0 | 7851 | 29558 - 43060 |
| Maldives | 2016 | 20 | b19 | current age of child in months (months since birth for dead children) | dbl | 0 | 411 | 0 - 432 |
| Maldives | 2016 | 21 | b20 | duration of pregnancy | dbl | 10474 | 8 | 4 - 10 |

From the above table we get an overall snapshot of the birth history variables. We see that bidx, b0-b13, b15 and b16 are common to both mvbr datasets. Notably, mvbr 2016 has four extra variables (b17-b20) that are not available in the 2009 round. Next, we look in more detail at the labelled variables that are common across the mvbr rounds, to see whether their value labels are similar across the datasets.
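The split into common and round-specific variables can be verified with simple set operations on the variable names. A minimal sketch, with the name vectors typed in from Table 8; in practice they would come from names() of each dataset, and the object names here are assumptions for illustration.

```r
# Birth history variable names per round, as listed in Table 8
vars_2009 <- c("bidx", paste0("b", c(0:13, 15, 16)))
vars_2016 <- c("bidx", paste0("b", c(0:13, 15:20)))

common    <- intersect(vars_2009, vars_2016)  # present in both rounds
only_2016 <- setdiff(vars_2016, vars_2009)    # present in 2016 only
only_2009 <- setdiff(vars_2009, vars_2016)    # present in 2009 only

length(common)
only_2016
only_2009
```

Only the variables in `common` can be harmonized directly; anything in the round-specific sets needs either a derived equivalent or exclusion from the pooled dataset.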

b0 - child is twin

We check the value labels of b0 variable that denotes whether the child is twin. First we create a nested tibble of b0’s value labels.

# Create the data dictionary for b0 in nested tibble
mvbr1_pre_tmp1 <- mvbr1_pre_tmp0 |> 
  mutate(lookfor_b0 = map(
    mvbr_data,
    \(df) {
      df |> 
        select(b0) |> 
        look_for() |> 
        lookfor_to_long_format() |> 
        select(value_labels)
    }
  ))
mvbr1_pre_tmp1
# Now we unnest the tibble and refine the pooled data dictionary
mvbr1_pre_tmp2 <- mvbr1_pre_tmp1 |> 
  # First we select the required cols and unnest()
  select(-c(unf, mvbr_data, n_births)) |> 
  unnest(cols = c(lookfor_b0)) |> 
  # Next we make the num of value labels same across each round
  mutate(label_num = parse_number(value_labels)) |> 
  complete(ctr_name, svy_year, label_num) |>
  # Next we create col of value labels for each survey round
  pivot_wider(
    names_from = svy_year, 
    values_from = value_labels,
    names_prefix = "mvbr_"
  ) |>
  # Show the variable name in a col
  mutate(var_name = "b0", .before = 2)

# Convert the tibble to flextable for easy viewing
mvbr1_pre_tmp2 |> 
  qflextable() |> 
  align(align = "left", part = "all") |> 
  autofit()
Table 9: Data dictionary of b0 across the mvbr rounds

| ctr_name | var_name | label_num | mvbr_2009 | mvbr_2016 |
|---|---|---|---|---|
| Maldives | b0 | 0 | [0] single birth | [0] single birth |
| Maldives | b0 | 1 | [1] 1st of multiple | [1] 1st of multiple |
| Maldives | b0 | 2 | [2] 2nd of multiple | [2] 2nd of multiple |
| Maldives | b0 | 3 | [3] 3rd of multiple | [3] 3rd of multiple |
| Maldives | b0 | 4 | [4] 4th of multiple | [4] 4th of multiple |
| Maldives | b0 | 5 | [5] 5th of multiple | [5] 5th of multiple |

The above table shows the value labels of b0. They are the same in both mvbr datasets.

b4 - sex of child

We check the value labels of b4 variable which gives the sex of the child. First we create a nested tibble of b4’s value labels.

# Create the data dictionary for b4 in nested tibble
mvbr1_pre_tmp1 <- mvbr1_pre_tmp0 |> 
  mutate(lookfor_b4 = map(
    mvbr_data,
    \(df) {
      df |> 
        select(b4) |> 
        look_for() |> 
        lookfor_to_long_format() |> 
        select(value_labels)
    }
  ))
mvbr1_pre_tmp1
# Now we unnest the tibble and refine the pooled data dictionary
mvbr1_pre_tmp2 <- mvbr1_pre_tmp1 |> 
  # First we select the required cols and unnest()
  select(-c(unf, mvbr_data, n_births)) |> 
  unnest(cols = c(lookfor_b4)) |> 
  # Next we make the num of value labels same across each round
  mutate(label_num = parse_number(value_labels)) |> 
  complete(ctr_name, svy_year, label_num) |>
  # Next we create col of value labels for each survey round
  pivot_wider(
    names_from = svy_year, 
    values_from = value_labels,
    names_prefix = "mvbr_"
  ) |>
  # Show the variable name in a col
  mutate(var_name = "b4", .before = 2)

# Convert the tibble to flextable for easy viewing
mvbr1_pre_tmp2 |> 
  qflextable() |> 
  align(align = "left", part = "all") |> 
  autofit()
Table 10: Data dictionary of b4 across the mvbr rounds

| ctr_name | var_name | label_num | mvbr_2009 | mvbr_2016 |
|---|---|---|---|---|
| Maldives | b4 | 1 | [1] male | [1] male |
| Maldives | b4 | 2 | [2] female | [2] female |

The above table shows the value labels of b4. They are the same in both mvbr datasets.

b5 - child is alive

We check the value labels of b5 variable which gives the survival status of the child. First we create a nested tibble of b5’s value labels.

# Create the data dictionary for b5 in nested tibble
mvbr1_pre_tmp1 <- mvbr1_pre_tmp0 |> 
  mutate(lookfor_b5 = map(
    mvbr_data,
    \(df) {
      df |> 
        select(b5) |> 
        look_for() |> 
        lookfor_to_long_format() |> 
        select(value_labels)
    }
  ))
mvbr1_pre_tmp1
# Now we unnest the tibble and refine the pooled data dictionary
mvbr1_pre_tmp2 <- mvbr1_pre_tmp1 |> 
  # First we select the required cols and unnest()
  select(-c(unf, mvbr_data, n_births)) |> 
  unnest(cols = c(lookfor_b5)) |> 
  # Next we make the num of value labels same across each round
  mutate(label_num = parse_number(value_labels)) |> 
  complete(ctr_name, svy_year, label_num) |>
  # Next we create col of value labels for each survey round
  pivot_wider(
    names_from = svy_year, 
    values_from = value_labels,
    names_prefix = "mvbr_"
  ) |>
  # Show the variable name in a col
  mutate(var_name = "b5", .before = 2)

# Convert the tibble to flextable for easy viewing
mvbr1_pre_tmp2 |> 
  qflextable() |> 
  align(align = "left", part = "all") |> 
  autofit()
Table 11: Data dictionary of b5 across the mvbr rounds

| ctr_name | var_name | label_num | mvbr_2009 | mvbr_2016 |
|---|---|---|---|---|
| Maldives | b5 | 0 | [0] no | [0] no |
| Maldives | b5 | 1 | [1] yes | [1] yes |

The above table shows that the value labels for the survival status of the child are the same in both mvbr datasets.

b6 - age at death

We check the value labels of the b6 variable, which shows the age at death of children. Note that this variable has many missing values in both mvbr rounds because it is recorded only for children who have died. First we create a nested tibble of b6’s value labels.

# Create the data dictionary for b6 in nested tibble
mvbr1_pre_tmp1 <- mvbr1_pre_tmp0 |> 
  mutate(lookfor_b6 = map(
    mvbr_data,
    \(df) {
      df |> 
        select(b6) |> 
        look_for() |> 
        lookfor_to_long_format() |> 
        select(value_labels)
    }
  ))
mvbr1_pre_tmp1
# Now we unnest the tibble and refine the pooled data dictionary
mvbr1_pre_tmp2 <- mvbr1_pre_tmp1 |> 
  # First we select the required cols and unnest()
  select(-c(unf, mvbr_data, n_births)) |> 
  unnest(cols = c(lookfor_b6)) |> 
  # Next we make the num of value labels same across each round
  mutate(label_num = parse_number(value_labels)) |> 
  complete(ctr_name, svy_year, label_num) |>
  # Next we create col of value labels for each survey round
  pivot_wider(
    names_from = svy_year, 
    values_from = value_labels,
    names_prefix = "mvbr_"
  ) |>
  # Show the variable name in a col
  mutate(var_name = "b6", .before = 2)

# Convert the tibble to flextable for easy viewing
mvbr1_pre_tmp2 |> 
  qflextable() |> 
  align(align = "left", part = "all") |> 
  autofit()
Table 12: Data dictionary of b6 across the mvbr rounds

ctr_name

var_name

label_num

mvbr_2009

mvbr_2016

Maldives

b6

100

[100] died on day of birth

Maldives

b6

101

[101] days: 1

Maldives

b6

199

[199] days: number missing

Maldives

b6

201

[201] months: 1

Maldives

b6

299

[299] months: number missing

Maldives

b6

301

[301] years: 1

Maldives

b6

399

[399] years: number missing

Maldives

b6

997

[997] inconsistent

[997] inconsistent

Maldives

b6

998

[998] don't know

[998] don't know

The above table shows that only codes 997 (inconsistent) and 998 (don't know) are labelled in both mvbr rounds. The remaining value labels (100, 101, 199, 201, 299, 301 and 399) are present in only one of the two rounds.
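The labels in Table 12 reflect how b6 is coded: the first digit gives the unit (1 = days, 2 = months, 3 = years) and the last two digits give the number of units, with 199, 299 and 399 meaning the number is missing. A minimal sketch of decoding b6 into completed months follows. The helper `b6_to_months()` is hypothetical, and treating any death reported in days as month 0 is an assumption; the datasets already carry b7 (age at death in months, imputed), which would normally be preferred.

```r
# Decode b6 into completed months of age at death.
# `b6_to_months()` is a hypothetical helper for illustration only.
b6_to_months <- function(b6) {
  unit <- b6 %/% 100   # 1 = days, 2 = months, 3 = years
  num  <- b6 %% 100    # number of units; 99 means "number missing"
  out  <- rep(NA_real_, length(b6))
  ok   <- !is.na(b6) & unit %in% 1:3 & num != 99
  is_d <- ok & unit == 1
  is_m <- ok & unit == 2
  is_y <- ok & unit == 3
  out[is_d] <- 0               # assumption: deaths reported in days = month 0
  out[is_m] <- num[is_m]       # months as reported
  out[is_y] <- num[is_y] * 12  # years converted to months
  out                          # 997, 998 and missing-number codes stay NA
}

b6_to_months(c(100, 115, 203, 301, 299, 997, NA))
```

A derived infant-death indicator would then flag decoded ages below 12 months, which is how the 0-11 month window of the dependent variable could be checked against b7.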

b9 - child lives with whom

We check the value labels of b9 variable which gives info on who the child lives with. First we create a nested tibble of b9’s value labels.

# Create the data dictionary for b9 in nested tibble
mvbr1_pre_tmp1 <- mvbr1_pre_tmp0 |> 
  mutate(lookfor_b9 = map(
    mvbr_data,
    \(df) {
      df |> 
        select(b9) |> 
        look_for() |> 
        lookfor_to_long_format() |> 
        select(value_labels)
    }
  ))
mvbr1_pre_tmp1
# Now we unnest the tibble and refine the pooled data dictionary
mvbr1_pre_tmp2 <- mvbr1_pre_tmp1 |> 
  # First we select the required cols and unnest()
  select(-c(unf, mvbr_data, n_births)) |> 
  unnest(cols = c(lookfor_b9)) |> 
  # Next we make the num of value labels same across each round
  mutate(label_num = parse_number(value_labels)) |> 
  complete(ctr_name, svy_year, label_num) |>
  # Next we create col of value labels for each survey round
  pivot_wider(
    names_from = svy_year, 
    values_from = value_labels,
    names_prefix = "mvbr_"
  ) |>
  # Show the variable name in a col
  mutate(var_name = "b9", .before = 2)

# Convert the tibble to flextable for easy viewing
mvbr1_pre_tmp2 |> 
  qflextable() |> 
  align(align = "left", part = "all") |> 
  autofit()
Table 13: Data dictionary of b9 across the mvbr rounds

| ctr_name | var_name | label_num | mvbr_2009 | mvbr_2016 |
|---|---|---|---|---|
| Maldives | b9 | 0 | [0] respondent | [0] respondent |
| Maldives | b9 | 1 | [1] father | [1] father |
| Maldives | b9 | 2 | [2] other relative | [2] other relative |
| Maldives | b9 | 3 | [3] someone else | [3] someone else |
| Maldives | b9 | 4 | [4] lives elsewhere | [4] lives elsewhere |

We can see in the above table that the value labels of b9 are the same in both mvbr datasets.

b10 - completeness of information

We check the value labels of b10 variable which gives the completeness of birth history information. First we create a nested tibble of b10’s value labels.

# Create the data dictionary for b10 in nested tibble
mvbr1_pre_tmp1 <- mvbr1_pre_tmp0 |> 
  mutate(lookfor_b10 = map(
    mvbr_data,
    \(df) {
      df |> 
        select(b10) |> 
        look_for() |> 
        lookfor_to_long_format() |> 
        select(value_labels)
    }
  ))
mvbr1_pre_tmp1
# Now we unnest the tibble and refine the pooled data dictionary
mvbr1_pre_tmp2 <- mvbr1_pre_tmp1 |> 
  # First we select the required cols and unnest()
  select(-c(unf, mvbr_data, n_births)) |> 
  unnest(cols = c(lookfor_b10)) |> 
  # Next we make the num of value labels same across each round
  mutate(label_num = parse_number(value_labels)) |> 
  complete(ctr_name, svy_year, label_num) |>
  # Next we create col of value labels for each survey round
  pivot_wider(
    names_from = svy_year, 
    values_from = value_labels,
    names_prefix = "mvbr_"
  ) |>
  # Show the variable name in a col
  mutate(var_name = "b10", .before = 2) 

# Convert the tibble to flextable for easy viewing
mvbr1_pre_tmp2 |> 
  qflextable() |>
  align(align = "left", part = "all") |> 
  autofit()
Table 14: Data dictionary of b10 across the mvbr rounds

| ctr_name | var_name | label_num | mvbr_2009 | mvbr_2016 |
|---|---|---|---|---|
| Maldives | b10 | 0 |  | [0] month, year and day |
| Maldives | b10 | 1 | [1] month and year | [1] month and year - information complete |
| Maldives | b10 | 2 | [2] month and age -y imp | [2] month and age - year imputed |
| Maldives | b10 | 3 | [3] year and age - m imp | [3] year and age - month imputed |
| Maldives | b10 | 4 | [4] y & age - y ignored | [4] year and age - year ignored |
| Maldives | b10 | 5 | [5] year - a, m imp | [5] year - age/month imputed |
| Maldives | b10 | 6 | [6] age - y, m imp | [6] age - year/month imputed |
| Maldives | b10 | 7 | [7] month - a, y imp | [7] month - age/year imputed |
| Maldives | b10 | 8 | [8] none - all imp | [8] none - all imputed |

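We can see in the above table that codes 1 to 8 of b10 carry the same meaning in both rounds, but mvbr 2009 uses abbreviated label texts (for example, "month and age -y imp") while mvbr 2016 spells them out ("month and age - year imputed"). Code 0 (month, year and day) is labelled only in mvbr 2016.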
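
The comparison above was done by eye. As a rough sketch of how the same check could be automated, the base R snippet below lines up two sets of labels by code and flags codes that are missing from one round or worded differently. The vectors `labels_2009` and `labels_2016` are hypothetical stand-ins holding only a few of the labels from Table 14; they are not objects created elsewhere in this document.

```r
# Hypothetical label vectors keyed by code (a subset of Table 14)
labels_2009 <- c(`1` = "month and year",
                 `2` = "month and age -y imp")
labels_2016 <- c(`0` = "month, year and day",
                 `1` = "month and year - information complete",
                 `2` = "month and age - year imputed")

# All codes that appear in either round
codes <- sort(unique(as.integer(c(names(labels_2009), names(labels_2016)))))

# Line the labels up by code; a code absent from a round becomes NA
label_check <- data.frame(
  code = codes,
  mvbr_2009 = unname(labels_2009[as.character(codes)]),
  mvbr_2016 = unname(labels_2016[as.character(codes)]),
  stringsAsFactors = FALSE
)

# Flag codes present in both rounds, and codes whose texts match exactly
label_check$in_both <- !is.na(label_check$mvbr_2009) & !is.na(label_check$mvbr_2016)
label_check$same_text <- label_check$in_both &
  label_check$mvbr_2009 == label_check$mvbr_2016

label_check
```

Rows where `in_both` is TRUE but `same_text` is FALSE are candidates for a common label text during harmonization.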

Checking the Common independent variables before harmonization

Next we start documenting the common independent variables. First we will check the data dictionary of the common independent variables. Then we will check them variable wise.

# We check the common independent vars in all mvbr datasets.
# First we create the data dictionary in nested tibble.
mvbr1_pre_tmp1 <- mvbr1_pre_tmp0 |>
  mutate(lookfor_comindvars = map(
    mvbr_data,
    \(df) {
      df |> 
        # select the common independent variables
        select(v106, v011, v501, v701, v025, v151, v152, v190) |> 
        lookfor(details = "full") |> 
        select(-c(levels:n_na)) |> 
        # For correctly viewing the range column in data dictionary
        convert_list_columns_to_character()
    }
  ))
mvbr1_pre_tmp1
# Now we unnest the tibble and refine the pooled data dictionary 
mvbr1_pre_tmp2 <- mvbr1_pre_tmp1 |> 
  select(-c(unf, mvbr_data, n_births)) |> 
  unnest(cols = c(lookfor_comindvars)) |> 
  arrange(pos)

# Convert the tibble to flextable for easy viewing
mvbr1_pre_tmp2 |> 
  qflextable() |> 
  align(align = "left", part = "all") |> 
  autofit()
Table 15: Data dictionary of common independent variables across the mvbr rounds

| ctr_name | svy_year | pos | variable | label | col_type | missing | unique_values | range |
|---|---|---|---|---|---|---|---|---|
| Maldives | 2009 | 1 | v106 | highest educational level | dbl+lbl | 0 | 5 | 0 - 8 |
| Maldives | 2016 | 1 | v106 | highest educational level | dbl+lbl | 0 | 4 | 0 - 3 |
| Maldives | 2009 | 2 | v011 | date of birth (cmc) | dbl | 0 | 374 | 711 - 1093 |
| Maldives | 2016 | 2 | v011 | date of birth (cmc) | dbl | 0 | 378 | 797 - 1193 |
| Maldives | 2009 | 3 | v501 | current marital status | dbl+lbl | 0 | 4 | 1 - 5 |
| Maldives | 2016 | 3 | v501 | current marital status | dbl+lbl | 0 | 5 | 0 - 4 |
| Maldives | 2009 | 4 | v701 | partner's education level | dbl+lbl | 333 | 6 | 0 - 8 |
| Maldives | 2016 | 4 | v701 | husband/partner's education level | dbl+lbl | 1192 | 6 | 0 - 8 |
| Maldives | 2009 | 5 | v025 | type of place of residence | dbl+lbl | 0 | 2 | 1 - 2 |
| Maldives | 2016 | 5 | v025 | type of place of residence | dbl+lbl | 0 | 2 | 1 - 2 |
| Maldives | 2009 | 6 | v151 | sex of household head | dbl+lbl | 0 | 2 | 1 - 2 |
| Maldives | 2016 | 6 | v151 | sex of household head | dbl+lbl | 0 | 2 | 1 - 2 |
| Maldives | 2009 | 7 | v152 | age of household head | dbl+lbl | 4 | 74 | 18 - 98 |
| Maldives | 2016 | 7 | v152 | age of household head | dbl+lbl | 0 | 73 | 21 - 95 |
| Maldives | 2009 | 8 | v190 | wealth index | dbl+lbl | 0 | 5 | 1 - 5 |
| Maldives | 2016 | 8 | v190 | wealth index combined | dbl+lbl | 0 | 5 | 1 - 5 |

From the above table we get an overall snapshot of the common independent variables. Among the labelled variables, v106, v501 and v152 have a different number of unique values across the two mvbr datasets, while v701, v025, v151 and v190 have the same number in both rounds. The range of v106 also differs (0 - 8 in mvbr 2009 against 0 - 3 in mvbr 2016), as does that of v501 (1 - 5 against 0 - 4). Next, we look at the labelled variables among these common variables in more detail. We would like to see whether the value labels and codes of the common independent variables are similar across the mvbr datasets.

v106 - Mother’s education level

We check the value labels of v106 variable that denotes the highest education level of mother. First we create a nested tibble of v106’s value labels.

# Create the data dictionary for v106 in nested tibble
mvbr1_pre_tmp1 <- mvbr1_pre_tmp0 |> 
  mutate(lookfor_v106 = map(
    mvbr_data,
    \(df) {
      df |> 
        select(v106) |> 
        look_for() |> 
        lookfor_to_long_format() |> 
        select(value_labels)
    }
  ))
mvbr1_pre_tmp1
# Now we unnest the tibble and refine the pooled data dictionary
mvbr1_pre_tmp2 <- mvbr1_pre_tmp1 |> 
  # First we select the required cols and unnest()
  select(-c(unf, mvbr_data, n_births)) |> 
  unnest(cols = c(lookfor_v106)) |> 
  # Next we make the num of value labels same across each round
  mutate(label_num = parse_number(value_labels)) |> 
  complete(ctr_name, svy_year, label_num) |>
  # Next we create col of value labels for each survey round
  pivot_wider(
    names_from = svy_year, 
    values_from = value_labels,
    names_prefix = "mvbr_"
  ) |>
  # Show the variable name in a col
  mutate(var_name = "v106", .before = 2)

# Convert the tibble to flextable for easy viewing
mvbr1_pre_tmp2 |> 
  qflextable() |> 
  align(align = "left", part = "all") |> 
  autofit()
Table 16: Data dictionary of v106 across the mvbr rounds

| ctr_name | var_name | label_num | mvbr_2009 | mvbr_2016 |
|---|---|---|---|---|
| Maldives | v106 | 0 | [0] no education | [0] no education |
| Maldives | v106 | 1 | [1] primary | [1] primary |
| Maldives | v106 | 2 | [2] secondary | [2] secondary |
| Maldives | v106 | 3 | [3] higher | [3] higher |
| Maldives | v106 | 8 | [8] unknown - certificate |  |

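We can see that the value labels of v106 are the same across both rounds for codes 0 to 3. Code 8 (unknown - certificate) is labelled only in mvbr 2009, which matches the 0 - 8 range of v106 in that round in Table 15.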

v011 - Date of birth (in CMC)

The v011 variable, which holds the date of birth of mothers in century month code (CMC), is a numeric variable. Let’s check the range of these values in further detail, for example to look for outliers. First let’s create a nested tibble of the summary statistics of the v011 variable.

# Create the summary statistics for v011 in nested tibble
mvbr1_pre_tmp1 <- mvbr1_pre_tmp0 |> 
  mutate(skim_v011 = map(
    mvbr_data,
    \(df) {
      df |> 
        select(v011) |> 
        skim_without_charts() |> 
        as_tibble() |> 
        select(-c(skim_type, complete_rate)) |> 
        rename(
          variable = 1,
          n_miss = 2,
          mean = 3,
          sd = 4,
          min = 5,
          p25 = 6,
          p50 = 7,
          p75 = 8,
          max = 9
        )
    }
  ))
mvbr1_pre_tmp1
# Now we unnest the tibble and refine the pooled data dictionary
mvbr1_pre_tmp2 <- mvbr1_pre_tmp1 |> 
  # First we select the required cols and unnest()
  select(-c(unf, mvbr_data, n_births)) |> 
  unnest(cols = c(skim_v011)) |> 
  # Make variable values have one decimal point 
  mutate(
    mean = sprintf("%.1f", mean),
    sd = sprintf("%.1f", sd)
  )

# Convert the tibble to flextable for easy viewing
mvbr1_pre_tmp2 |> 
  qflextable() |> 
  align(align = "left", part = "all") |> 
  autofit()
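Table 17: Data dictionary of v011 across the mvbr rounds

| ctr_name | svy_year | variable | n_miss | mean | sd | min | p25 | p50 | p75 | max |
|---|---|---|---|---|---|---|---|---|---|---|
| Maldives | 2009 | v011 | 0 | 852.4 | 87.5 | 711 | 781 | 842 | 917 | 1093 |
| Maldives | 2016 | v011 | 0 | 951.2 | 87.6 | 797 | 873 | 948 | 1022 | 1193 |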
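
CMC values are hard to judge by eye, so it can help to turn the minimum and maximum into calendar dates. CMC counts months from January 1900, that is cmc = 12 * (year - 1900) + month. The helper `cmc_to_year_month()` below is a hypothetical name introduced only for this illustration; it inverts that formula in base R.

```r
# Convert DHS century month codes to calendar year and month.
# cmc = 12 * (year - 1900) + month, with month running from 1 to 12.
cmc_to_year_month <- function(cmc) {
  year <- 1900 + (cmc - 1) %/% 12
  month <- cmc - 12 * (year - 1900)
  data.frame(cmc = cmc, year = year, month = month)
}

# Minimum and maximum of v011 in mvbr 2009 and mvbr 2016 (from Table 17)
v011_bounds <- cmc_to_year_month(c(711, 1093, 797, 1193))
v011_bounds
```

For these four values the helper gives March 1959 and January 1991 for mvbr 2009, and May 1966 and May 1999 for mvbr 2016.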

v501 - Mother’s marital status

We check the value labels of v501 variable which gives the current marital status of mother. First we create a nested tibble of v501’s value labels.

# Create the data dictionary for v501 in nested tibble
mvbr1_pre_tmp1 <- mvbr1_pre_tmp0 |> 
  mutate(lookfor_v501 = map(
    mvbr_data,
    \(df) {
      df |> 
        select(v501) |> 
        look_for() |> 
        lookfor_to_long_format() |> 
        select(value_labels)
    }
  ))
mvbr1_pre_tmp1
# Now we unnest the tibble and refine the pooled data dictionary
mvbr1_pre_tmp2 <- mvbr1_pre_tmp1 |> 
  # First we select the required cols and unnest()
  select(-c(unf, mvbr_data, n_births)) |> 
  unnest(cols = c(lookfor_v501)) |> 
  # Next we make the num of value labels same across each round
  mutate(label_num = parse_number(value_labels)) |> 
  complete(ctr_name, svy_year, label_num) |>
  # Next we create col of value labels for each survey round
  pivot_wider(
    names_from = svy_year, 
    values_from = value_labels,
    names_prefix = "mvbr_"
  ) |>
  # Show the variable name in a col
  mutate(var_name = "v501", .before = 2)

# Convert the tibble to flextable for easy viewing
mvbr1_pre_tmp2 |> 
  qflextable() |> 
  align(align = "left", part = "all") |> 
  autofit()
Table 18: Data dictionary of v501 across the mvbr rounds

| ctr_name | var_name | label_num | mvbr_2009 | mvbr_2016 |
|---|---|---|---|---|
| Maldives | v501 | 0 | [0] never married | [0] never in union |
| Maldives | v501 | 1 | [1] married | [1] married |
| Maldives | v501 | 2 | [2] living together | [2] living with partner |
| Maldives | v501 | 3 | [3] widowed | [3] widowed |
| Maldives | v501 | 4 | [4] divorced | [4] divorced |
| Maldives | v501 | 5 | [5] not living together | [5] no longer living together/separated |

Both mvbr rounds have 6 value labels for v501 (codes 0 to 5). The codes are the same, but the label texts of codes 0, 2 and 5 are worded differently in mvbr 2009 and mvbr 2016 (for example, "never married" against "never in union").
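
Because only the wording differs, one way to harmonize v501 is to ignore the round-specific texts and attach a single label set by code. The sketch below does this in base R on a toy data frame. The label texts in `v501_labels`, the function `harmonize_v501()` and the data frame `pooled` are assumptions made for illustration, not the choices used in the pooling script.

```r
# Hypothetical common label set for codes 0 to 5
v501_labels <- c("never in union", "married", "living with partner",
                 "widowed", "divorced", "separated")

# Attach the common labels by code, regardless of the survey round
harmonize_v501 <- function(x) {
  factor(x, levels = 0:5, labels = v501_labels)
}

# Toy pooled data with rows from both rounds
pooled <- data.frame(
  svy_year = c(2009, 2009, 2016, 2016),
  v501 = c(1, 5, 0, 4)
)
pooled$v501_h <- harmonize_v501(pooled$v501)
pooled
```

Any code outside 0 to 5 would become NA under this approach, so it is worth tabulating the result against the raw codes before relying on it.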

v701 - Husband/Partner’s education level

We check the value labels of v701 variable which gives the education level of the mother’s husband/partner. First we create a nested tibble of v701’s value labels.

# Create the data dictionary for v701 in nested tibble
mvbr1_pre_tmp1 <- mvbr1_pre_tmp0 |> 
  mutate(lookfor_v701 = map(
    mvbr_data,
    \(df) {
      df |> 
        select(v701) |> 
        look_for() |> 
        lookfor_to_long_format() |> 
        select(value_labels)
    }
  ))
mvbr1_pre_tmp1
# Now we unnest the tibble and refine the pooled data dictionary
mvbr1_pre_tmp2 <- mvbr1_pre_tmp1 |> 
  # First we select the required cols and unnest()
  select(-c(unf, mvbr_data, n_births)) |> 
  unnest(cols = c(lookfor_v701)) |> 
  # Next we make the num of value labels same across each round
  mutate(label_num = parse_number(value_labels)) |> 
  complete(ctr_name, svy_year, label_num) |>
  # Next we create col of value labels for each survey round
  pivot_wider(
    names_from = svy_year, 
    values_from = value_labels,
    names_prefix = "mvbr_"
  ) |>
  # Show the variable name in a col
  mutate(var_name = "v701", .before = 2)

# Convert the tibble to flextable for easy viewing
mvbr1_pre_tmp2 |> 
  qflextable() |> 
  align(align = "left", part = "all") |> 
  autofit()
Table 19: Data dictionary of v701 across the mvbr rounds

| ctr_name | var_name | label_num | mvbr_2009 | mvbr_2016 |
|---|---|---|---|---|
| Maldives | v701 | 0 | [0] no education | [0] no education |
| Maldives | v701 | 1 | [1] primary | [1] primary |
| Maldives | v701 | 2 | [2] secondary | [2] secondary |
| Maldives | v701 | 3 | [3] higher | [3] higher |
| Maldives | v701 | 8 | [8] don't know | [8] don't know |

Both mvbr rounds have 5 value labels for v701 (codes 0 to 3 and 8), and the label texts and codes are the same in mvbr 2009 and mvbr 2016.

v025 - Type of place of residence

We check the value labels of v025 variable which shows if a household belongs to rural or urban psu. First we create a nested tibble of v025’s value labels.

# Create the data dictionary for v025 in nested tibble
mvbr1_pre_tmp1 <- mvbr1_pre_tmp0 |> 
  mutate(lookfor_v025 = map(
    mvbr_data,
    \(df) {
      df |> 
        select(v025) |> 
        look_for() |> 
        lookfor_to_long_format() |> 
        select(value_labels)
    }
  ))
mvbr1_pre_tmp1
# Now we unnest the tibble and refine the pooled data dictionary
mvbr1_pre_tmp2 <- mvbr1_pre_tmp1 |> 
  # First we select the required cols and unnest()
  select(c(ctr_name, svy_year, lookfor_v025)) |> 
  unnest(cols = c(lookfor_v025)) |> 
  # Next we make the num of value labels same across each round
  mutate(label_num = parse_number(value_labels)) |> 
  complete(ctr_name, svy_year, label_num) |>
  # Next we create col of value labels for each survey round
  pivot_wider(
    names_from = svy_year, 
    values_from = value_labels,
    names_prefix = "mvbr_"
  ) |>
  # Show the variable name in a col
  mutate(var_name = "v025", .before = 2)

# Convert the tibble to flextable for easy viewing
mvbr1_pre_tmp2 |> 
  qflextable() |> 
  align(align = "left", part = "all") |> 
  autofit()
Table 20: Data dictionary of v025 across the mvbr rounds

| ctr_name | var_name | label_num | mvbr_2009 | mvbr_2016 |
|---|---|---|---|---|
| Maldives | v025 | 1 | [1] urban | [1] urban |
| Maldives | v025 | 2 | [2] rural | [2] rural |

The value labels and codes for v025 are the same across both mvbr rounds.

v151 - Sex of household head

We check the value labels of v151 variable which gives the gender of the household head. First we create a nested tibble of v151’s value labels, then pivot wide and compare.

# Create the data dictionary for v151 in nested tibble
mvbr1_pre_tmp1 <- mvbr1_pre_tmp0 |> 
  mutate(lookfor_v151 = map(
    mvbr_data,
    \(df) {
      df |> 
        select(v151) |> 
        look_for() |> 
        lookfor_to_long_format() |> 
        select(value_labels)
    }
  ))
mvbr1_pre_tmp1
# Now we unnest the tibble and refine the pooled data dictionary
mvbr1_pre_tmp2 <- mvbr1_pre_tmp1 |> 
  # First we select the required cols and unnest()
  select(-c(unf, mvbr_data, n_births)) |> 
  unnest(cols = c(lookfor_v151)) |> 
  # Next we make the num of value labels same across each round
  mutate(label_num = parse_number(value_labels)) |> 
  complete(ctr_name, svy_year, label_num) |>
  # Next we create col of value labels for each survey round
  pivot_wider(
    names_from = svy_year, 
    values_from = value_labels,
    names_prefix = "mvbr_"
  ) |>
  # Show the variable name in a col
  mutate(var_name = "v151", .before = 2)

# Convert the tibble to flextable for easy viewing
mvbr1_pre_tmp2 |> 
  qflextable() |> 
  align(align = "left", part = "all") |> 
  autofit()
Table 21: Data dictionary of v151 across the mvbr rounds

| ctr_name | var_name | label_num | mvbr_2009 | mvbr_2016 |
|---|---|---|---|---|
| Maldives | v151 | 1 | [1] male | [1] male |
| Maldives | v151 | 2 | [2] female | [2] female |

The value labels and codes for v151 are the same across both mvbr rounds.

v152 - Age of household head

Interestingly, we see that v152 (a continuous variable) has value labels in both mvbr rounds. Therefore, we check the value labels of v152. First we create a nested tibble of v152’s value labels.

# Create the data dictionary for v152 in nested tibble
mvbr1_pre_tmp1 <- mvbr1_pre_tmp0 |> 
  mutate(lookfor_v152 = map(
    mvbr_data,
    \(df) {
      df |> 
        select(v152) |> 
        look_for() |> 
        lookfor_to_long_format() |> 
        select(value_labels)
    }
  ))
mvbr1_pre_tmp1
# Now we unnest the tibble and refine the pooled data dictionary
mvbr1_pre_tmp2 <- mvbr1_pre_tmp1 |> 
  # First we select the required cols and unnest()
  select(-c(unf, mvbr_data, n_births)) |> 
  unnest(cols = c(lookfor_v152)) |> 
  # Next we make the num of value labels same across each round
  mutate(label_num = parse_number(value_labels)) |> 
  complete(ctr_name, svy_year, label_num) |>
  # Next we create col of value labels for each survey round
  pivot_wider(
    names_from = svy_year, 
    values_from = value_labels,
    names_prefix = "mvbr_"
  ) |>
  # Show the variable name in a col
  mutate(var_name = "v152", .before = 2)

# Convert the tibble to flextable for easy viewing
mvbr1_pre_tmp2 |> 
  qflextable() |> 
  align(align = "left", part = "all") |> 
  autofit()
Table 22: Data dictionary of v152 across the mvbr rounds

| ctr_name | var_name | label_num | mvbr_2009 | mvbr_2016 |
|---|---|---|---|---|
| Maldives | v152 | 97 | [97] 97+ | [97] 97+ |
| Maldives | v152 | 98 | [98] dk | [98] don't know |

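We can see that the value labels of v152 mark special codes: 97 is a top code (97+) and 98 denotes don't know. Table 15 shows that v152 has 4 missing values in mvbr 2009 and that its range extends to 98 in that round, so the don't know code is present in the data. It should be set to missing before v152 is used as a continuous variable. In mvbr 2016 the range is 21 - 95 with no missing values, so neither special code occurs.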
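
A minimal base R sketch of that clean-up is shown below, using a made-up vector of ages rather than the survey data. Treating code 98 as missing and keeping 97 as a top-coded age are assumptions about how v152 will be handled downstream.

```r
# Toy ages of household head, including the two special codes
v152 <- c(35, 62, 97, 98, 41)

# Assumption: 98 (don't know) is set to missing, 97 (97+) is kept as is
v152_num <- replace(v152, v152 == 98, NA)

# Flag top-coded ages so they can be inspected or handled separately
top_coded <- !is.na(v152_num) & v152_num == 97

mean(v152_num, na.rm = TRUE)
```

Leaving 98 in place would pull summary statistics such as the mean upwards, which is why the recode comes before any numeric summary.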

v190 - Wealth quintile of household

We check the value labels of v190 variable which gives the wealth quintile of the household. First we create a nested tibble of v190’s value labels, then pivot wide and compare.

# Create the data dictionary for v190 in nested tibble
mvbr1_pre_tmp1 <- mvbr1_pre_tmp0 |> 
  mutate(lookfor_v190 = map(
    mvbr_data,
    \(df) {
      df |> 
        select(v190) |> 
        look_for() |> 
        lookfor_to_long_format() |> 
        select(value_labels)
    }
  ))
mvbr1_pre_tmp1
# Now we unnest the tibble and refine the pooled data dictionary
mvbr1_pre_tmp2 <- mvbr1_pre_tmp1 |> 
  # First we select the required cols and unnest()
  select(-c(unf, mvbr_data, n_births)) |> 
  unnest(cols = c(lookfor_v190)) |> 
  # Next we make the num of value labels same across each round
  mutate(label_num = parse_number(value_labels)) |> 
  complete(ctr_name, svy_year, label_num) |>
  # Next we create col of value labels for each survey round
  pivot_wider(
    names_from = svy_year, 
    values_from = value_labels,
    names_prefix = "mvbr_"
  ) |>
  # Show the variable name in a col
  mutate(var_name = "v190", .before = 2)

# Convert the tibble to flextable for easy viewing
mvbr1_pre_tmp2 |> 
  qflextable() |> 
  align(align = "left", part = "all") |> 
  autofit()
Table 23: Data dictionary of v190 across the mvbr rounds

| ctr_name | var_name | label_num | mvbr_2009 | mvbr_2016 |
|---|---|---|---|---|
| Maldives | v190 | 1 | [1] poorest | [1] poorest |
| Maldives | v190 | 2 | [2] poorer | [2] poorer |
| Maldives | v190 | 3 | [3] middle | [3] middle |
| Maldives | v190 | 4 | [4] richer | [4] richer |
| Maldives | v190 | 5 | [5] richest | [5] richest |

The value labels and codes for v190 are the same across both mvbr rounds.

Checking the Social group variables before harmonization

Now we document the social group variables. Upon manually checking the full data dictionaries of each mvbr dataset, we find that a native language variable exists but is available only in the Maldives 2016 DHS datasets. The religion and ethnicity variables are present but have zero observations. Therefore, we cannot include any social group variables when analyzing the pooled Maldives DHS datasets.
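
The manual check described above could be backed up by a small availability table. The base R sketch below counts non-missing observations per variable and round, returning NA where a variable does not exist in a round. The list `rounds`, its toy contents, the column names `religion`, `ethnicity` and `native_lang`, and the helper `count_non_missing()` are all hypothetical; the real datasets use DHS variable names.

```r
# Toy stand-ins for the two rounds (hypothetical column names and values)
rounds <- list(
  `2009` = data.frame(religion = c(NA, NA, NA), ethnicity = c(NA, NA, NA)),
  `2016` = data.frame(religion = c(NA, NA, NA), ethnicity = c(NA, NA, NA),
                      native_lang = c(1, 2, 1))
)
vars <- c("religion", "ethnicity", "native_lang")

# Number of non-missing observations per variable; NA if the variable is absent
count_non_missing <- function(df, vars) {
  vapply(vars, function(v) {
    if (v %in% names(df)) sum(!is.na(df[[v]])) else NA_integer_
  }, integer(1))
}

# Rows are variables, columns are survey rounds
availability <- sapply(rounds, count_non_missing, vars = vars)
availability
```

A variable is usable for pooling only if its row has a positive count in every column.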

Maldives PR dataset use for family structure variables creation

Checking the ID variables before harmonization

Here we check the formatting of the constituent variables with which we will prepare the ID variables for the pooled Maldives person recode (pr) dataset. We will use the following constituent variables for creating the ID variables for the pooled dataset: hv001 (cluster number), hv002 (household number) and hvidx (line number).

# We check the var type of ID vars in all mvpr datasets.
# First we create a data dictionary of the mvpr datasets in nested tibble.
mvpr1_pre_tmp1 <- mvpr1_pre_tmp0 |>
  mutate(lookfor_idvars = map(mvpr_data, \(df) {
    df |> 
      select(hv001, hv002, hvidx) |> 
      lookfor(details = "full") |> 
      select(-c(levels:n_na)) |> 
      # For correctly viewing the range column in data dictionary
      convert_list_columns_to_character()
  }))
mvpr1_pre_tmp1
# Now we unnest the tibble and output the pooled data dictionary 
mvpr1_pre_tmp2 <- mvpr1_pre_tmp1 |> 
  select(c(ctr_name, svy_year, lookfor_idvars)) |> 
  unnest(cols = c(lookfor_idvars)) |> 
  arrange(pos)

# Convert and view the tibble as flextable
mvpr1_pre_tmp2 |> 
  qflextable() |> 
  align(align = "left", part = "all") |> 
  autofit()
Table 25: Data dictionary of variables to be used for ID creation across the mvpr rounds

| ctr_name | svy_year | pos | variable | label | col_type | missing | unique_values | range |
|---|---|---|---|---|---|---|---|---|
| Maldives | 2009 | 1 | hv001 | cluster number | dbl | 0 | 270 | 1 - 270 |
| Maldives | 2016 | 1 | hv001 | cluster number | dbl | 0 | 266 | 1 - 266 |
| Maldives | 2009 | 2 | hv002 | household number | dbl | 0 | 99 | 1 - 99 |
| Maldives | 2016 | 2 | hv002 | household number | dbl | 0 | 43 | 1 - 43 |
| Maldives | 2009 | 3 | hvidx | line number | dbl | 0 | 32 | 1 - 32 |
| Maldives | 2016 | 3 | hvidx | line number | dbl | 0 | 30 | 1 - 30 |

From the above table we can see that all the three constituent ID variables are of numeric class with no missing values. These variables can directly be used for preparing the ID variables after finding the maximum length of their largest value. Note that survey year is also a constituent ID variable of 4-digits and we need not check it.

# We considered splitting the "range" col of the above tibble into min and max
# values using separate_wider_regex(). However, the pattern did not identify
# the max values correctly in some mvpr rounds, so we compute the min and max
# values from the skim summary statistics instead.
mvpr1_pre_tmp3 <- mvpr1_pre_tmp0 |> 
  # Generate the summary stats for id vars
  mutate(skim_idvars = map(mvpr_data, \(df) {
    df |> 
      select(hv001, hv002, hvidx) |> 
      skim_without_charts()
  })) |> 
  # Pool the summary stats for all mvpr rounds
  select(c(ctr_name, svy_year, skim_idvars)) |> 
  unnest(cols = c(skim_idvars)) |> 
  arrange(skim_variable, svy_year) |> 
  # Group and generate the max and min values for each variable
  group_by(variable = skim_variable) |> 
  summarize(
    min_val = min(numeric.p0),
    max_val = max(numeric.p100)
  ) |> 
  # calculate the num of digits in the maximum values
  mutate(
    max_digits = nchar(as.character(max_val))
  ) |>
  # add variable labels and relocate it after variable name
  bind_cols(vlabel = c("cluster number", "household number", "Persons line number")) |>
  relocate(vlabel, .after = 1)

# Convert the tibble to flextable for easy viewing
mvpr1_pre_tmp3 |>
  qflextable() |>
  align(align = "left", part = "all") |>
  autofit()
Table 26: The maximum length of constituent ID variables to be set across the mvpr rounds

| variable | vlabel | min_val | max_val | max_digits |
|---|---|---|---|---|
| hv001 | cluster number | 1 | 270 | 3 |
| hv002 | household number | 1 | 99 | 2 |
| hvidx | Persons line number | 1 | 32 | 2 |

The above table gives the required length of each constituent ID variable, so that we can correctly concatenate them to create the ID variables. The required length of each variable is given in the max_digits column. Note that survey year is also a constituent ID variable of 4 digits.
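
To show why the fixed widths matter, here is a minimal base R sketch that zero-pads each component to the widths in Table 26 (3, 2 and 2 digits) plus the 4-digit survey year, and concatenates them. The function name `make_person_id()` and the order of the components are assumptions made for this illustration; the pooling script may build its IDs differently.

```r
# Build a fixed-width person ID: year (4) + cluster (3) + household (2) + line (2)
make_person_id <- function(svy_year, hv001, hv002, hvidx) {
  sprintf(
    "%04d%03d%02d%02d",
    as.integer(svy_year), as.integer(hv001),
    as.integer(hv002), as.integer(hvidx)
  )
}

# Largest values from Table 26, and the smallest possible values
ids <- make_person_id(
  svy_year = c(2009, 2016),
  hv001 = c(270, 1),
  hv002 = c(99, 1),
  hvidx = c(32, 1)
)
ids
```

Without padding, cluster 1 with household 12 and cluster 11 with household 2 would both concatenate to "112"; fixed widths keep every ID the same length and therefore unambiguous.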

Checking Family structure variables before harmonization

Here we check the family structure related variables before harmonizing them. The variable names were collected by manually checking the full data dictionaries. Here we will check the data dictionary of these hh-level variables and focus on the variable types.

# We check the family structure vars in all mvpr datasets.
# First we create the data dictionary in nested tibble.
mvpr1_pre_tmp1 <- mvpr1_pre_tmp0 |>
  mutate(lookfor_famstrvars = map(mvpr_data, \(df) {
    df |> 
      # select the family structure variables
      select(c(hv101, hv102, hv103, hv104, hv105)) |> 
      lookfor(details = "full") |> 
      select(-c(levels:n_na)) |> 
      # For correctly viewing the range column in data dictionary
      convert_list_columns_to_character()
  }))
mvpr1_pre_tmp1
# Now we unnest the tibble and refine the pooled data dictionary 
mvpr1_pre_tmp2 <- mvpr1_pre_tmp1 |> 
  select(c(svy_year, lookfor_famstrvars)) |> 
  unnest(cols = c(lookfor_famstrvars)) |> 
  arrange(pos, svy_year)

# Convert the tibble to flextable for easy viewing
mvpr1_pre_tmp2 |> 
  qflextable() |> 
  align(align = "left", part = "all") |> 
  autofit()
Table 27: Data dictionary of family structure vars across the mvpr rounds

| svy_year | pos | variable | label | col_type | missing | unique_values | range |
|---|---|---|---|---|---|---|---|
| 2009 | 1 | hv101 | relationship to head | dbl+lbl | 6 | 15 | 1 - 98 |
| 2016 | 1 | hv101 | relationship to head | dbl+lbl | 0 | 12 | 1 - 98 |
| 2009 | 2 | hv102 | usual resident | dbl+lbl | 18 | 3 | 0 - 1 |
| 2016 | 2 | hv102 | usual resident | dbl+lbl | 0 | 2 | 0 - 1 |
| 2009 | 3 | hv103 | slept last night | dbl+lbl | 22 | 3 | 0 - 1 |
| 2016 | 3 | hv103 | slept last night | dbl+lbl | 0 | 2 | 0 - 1 |
| 2009 | 4 | hv104 | sex of household member | dbl+lbl | 2 | 3 | 1 - 2 |
| 2016 | 4 | hv104 | sex of household member | dbl+lbl | 0 | 2 | 1 - 2 |
| 2009 | 5 | hv105 | age of household members | dbl+lbl | 538 | 96 | 0 - 98 |
| 2016 | 5 | hv105 | age of household members | dbl+lbl | 0 | 96 | 0 - 95 |

The above table gives an overall snapshot of the family structure related variables. Interestingly, all the variables, including age of hh members (a continuous var), are of labelled class. All five variables have some missing values in mvpr 2009 and none in mvpr 2016. Note that hv101, hv102, hv103 and hv104 have a different number of unique values across the two mvpr rounds, while hv105 has 96 unique values in both. Next, we compare the value labels of the individual variables across the mvpr datasets.

hv101 - Relationship to head

Next, we check the value labels of the relationship to the household head variable. First we create a nested tibble of the value labels.

# Create the data dictionary in nested tibble
mvpr1_pre_tmp1 <- mvpr1_pre_tmp0 |> 
  mutate(lookfor_hv101 = map(mvpr_data, \(df) {
    df |> 
      select(hv101) |> 
      look_for() |> 
      lookfor_to_long_format() |> 
      select(value_labels)
  }))
mvpr1_pre_tmp1
# Now we unnest the tibble and refine the pooled data dictionary
mvpr1_pre_tmp2 <- mvpr1_pre_tmp1 |> 
  # First we select the required cols and unnest()
  select(c(ctr_name, svy_year, lookfor_hv101)) |> 
  unnest(cols = c(lookfor_hv101)) |> 
  # Next we make the num of value labels same across each round
  mutate(label_num = parse_number(value_labels)) |> 
  complete(ctr_name, svy_year, label_num) |>
  # Next we create col of value labels for each survey round
  pivot_wider(
    names_from = svy_year, 
    values_from = value_labels,
    names_prefix = "mvpr_"
  ) |>
  # Show the variable name in a col
  mutate(var_name = "hv101", .before = 2)

# Convert the tibble to flextable for easy viewing
mvpr1_pre_tmp2 |> 
  qflextable() |> 
  align(align = "left", part = "all") |> 
  autofit()
Table 28: Data dictionary of relationship to head variable across the mvpr rounds

| ctr_name | var_name | label_num | mvpr_2009 | mvpr_2016 |
|---|---|---|---|---|
| Maldives | hv101 | 1 | [1] head | [1] head |
| Maldives | hv101 | 2 | [2] wife or husband | [2] wife or husband |
| Maldives | hv101 | 3 | [3] son/daughter | [3] son/daughter |
| Maldives | hv101 | 4 | [4] son/daughter-in-law | [4] son/daughter-in-law |
| Maldives | hv101 | 5 | [5] grandchild | [5] grandchild |
| Maldives | hv101 | 6 | [6] parent | [6] parent |
| Maldives | hv101 | 7 | [7] parent-in-law | [7] parent-in-law |
| Maldives | hv101 | 8 | [8] brother/sister | [8] brother/sister |
| Maldives | hv101 | 9 | [9] co-spouse | [9] co-spouse |
| Maldives | hv101 | 10 | [10] other relative | [10] other relative |
| Maldives | hv101 | 11 | [11] adopted/foster child | [11] adopted/foster child |
| Maldives | hv101 | 12 | [12] not related | [12] not related |
| Maldives | hv101 | 13 | [13] niece/nephew by blood | [13] niece/nephew by blood |
| Maldives | hv101 | 14 | [14] niece/nephew by marriage | [14] niece/nephew by marriage |
| Maldives | hv101 | 98 | [98] dk | [98] don't know |

The above table shows that the value labels and codes of hv101 are the same across both mvpr rounds, apart from the wording of code 98 ("dk" against "don't know"). To harmonize the relationship to head variable we can use the following value labels -

  • 1 head
  • 2 spouse
  • 3 child
  • 4 child-in-law
  • 5 grandchild
  • 6 parent
  • 7 parent-in-law
  • 8 sibling
  • 9 others

Here, we merge the “wife or husband” and “co-spouse” categories into the “spouse” category, and the “son/daughter” and “adopted/foster child” categories into the “child” category.
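
The merging rule above can be sketched as a lookup from the original codes to the nine harmonized categories. The base R example below is only an illustration: the names `hv101_map`, `hv101_levels` and `harmonize_hv101()` are hypothetical, and sending codes 10, 12, 13 and 14 to "others" and code 98 to missing are assumptions that the text above does not spell out.

```r
# Lookup from original hv101 codes to harmonized categories.
# Assumption: codes 10, 12, 13 and 14 fall under "others".
hv101_map <- c(
  `1` = "head", `2` = "spouse", `3` = "child", `4` = "child-in-law",
  `5` = "grandchild", `6` = "parent", `7` = "parent-in-law", `8` = "sibling",
  `9` = "spouse", `10` = "others", `11` = "child", `12` = "others",
  `13` = "others", `14` = "others"
)

# Harmonized categories in the order listed above
hv101_levels <- c("head", "spouse", "child", "child-in-law", "grandchild",
                  "parent", "parent-in-law", "sibling", "others")

# Codes without an entry in the lookup (such as 98) become NA
harmonize_hv101 <- function(x) {
  factor(unname(hv101_map[as.character(x)]), levels = hv101_levels)
}

hv101_h <- harmonize_hv101(c(1, 2, 9, 3, 11, 13, 98, NA))
table(hv101_h, useNA = "ifany")
```

Keeping the mapping in one named vector makes the merging decisions easy to review and to change if a different grouping is preferred.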

hv102 - de jure/usual resident

Next, we check the value labels of the de jure resident variable. This indicates whether a household member is a usual resident of the household. First we create a nested tibble of the value labels.

# Create the data dictionary in nested tibble
mvpr1_pre_tmp1 <- mvpr1_pre_tmp0 |> 
  mutate(lookfor_hv102 = map(mvpr_data, \(df) {
    df |> 
      select(hv102) |> 
      look_for() |> 
      lookfor_to_long_format() |> 
      select(value_labels)
  }))
mvpr1_pre_tmp1
# Now we unnest the tibble and refine the pooled data dictionary
mvpr1_pre_tmp2 <- mvpr1_pre_tmp1 |> 
  # First we select the required cols and unnest()
  select(c(ctr_name, svy_year, lookfor_hv102)) |> 
  unnest(cols = c(lookfor_hv102)) |> 
  # Next we make the num of value labels same across each round
  mutate(label_num = parse_number(value_labels)) |> 
  complete(ctr_name, svy_year, label_num) |>
  # Next we create col of value labels for each survey round
  pivot_wider(
    names_from = svy_year, 
    values_from = value_labels,
    names_prefix = "mvpr_"
  ) |>
  # Show the variable name in a col
  mutate(var_name = "hv102", .before = 2)

# Convert the tibble to flextable for easy viewing
mvpr1_pre_tmp2 |> 
  qflextable() |> 
  align(align = "left", part = "all") |> 
  autofit()
Table 29: Data dictionary of the De jure resident variable across the mvpr rounds

| ctr_name | var_name | label_num | mvpr_2009 | mvpr_2016 |
|---|---|---|---|---|
| Maldives | hv102 | 0 | [0] no | [0] no |
| Maldives | hv102 | 1 | [1] yes | [1] yes |

The above table shows that hv102 has the same value label texts and codes across the mvpr rounds. Therefore, we can use this variable directly after converting to factor type.

hv103 - de facto resident

Next, we check the value labels of the de facto resident variable. In DHS this indicates whether a household member slept in the household last night. First we create a nested tibble of the value labels.

# Create the data dictionary in nested tibble
mvpr1_pre_tmp1 <- mvpr1_pre_tmp0 |> 
  mutate(lookfor_hv103 = map(mvpr_data, \(df) {
    df |> 
      select(hv103) |> 
      look_for() |> 
      lookfor_to_long_format() |> 
      select(value_labels)
  }))
mvpr1_pre_tmp1
# Now we unnest the tibble and refine the pooled data dictionary
mvpr1_pre_tmp2 <- mvpr1_pre_tmp1 |> 
  # First we select the required cols and unnest()
  select(c(ctr_name, svy_year, lookfor_hv103)) |> 
  unnest(cols = c(lookfor_hv103)) |> 
  # Next we make the num of value labels same across each round
  mutate(label_num = parse_number(value_labels)) |> 
  complete(ctr_name, svy_year, label_num) |>
  # Next we create col of value labels for each survey round
  pivot_wider(
    names_from = svy_year, 
    values_from = value_labels,
    names_prefix = "mvpr_"
  ) |>
  # Show the variable name in a col
  mutate(var_name = "hv103", .before = 2)

# Convert the tibble to flextable for easy viewing
mvpr1_pre_tmp2 |> 
  qflextable() |> 
  align(align = "left", part = "all") |> 
  autofit()
Table 30: Data dictionary of the De facto resident variable across the mvpr rounds

| ctr_name | var_name | label_num | mvpr_2009 | mvpr_2016 |
|---|---|---|---|---|
| Maldives | hv103 | 0 | [0] no | [0] no |
| Maldives | hv103 | 1 | [1] yes | [1] yes |

The above table shows that hv103 has the same value label texts and codes across the mvpr rounds. Therefore, we can use this variable directly after converting to factor type.

START FROM HERE

TASK:

  • Handling multiple births in death scarring vars may not be necessary.
  • Preceding birth interval construction has changed with DHS-7. We could re-construct it.

TO BE CONTINUED …
