---
title: "Data Dictionary & Use Cases"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{unaids-data-dictionary}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  eval = TRUE,
  echo = TRUE
)
options(rmarkdown.html_vignette.check_title = FALSE)
```

## Introduction

This vignette serves as a data dictionary for the UNAIDS Estimates Data pulled from the [EDMS](https://edms.unaids.org/). It will cover the data structure and variables, as well as provide an overview of the indicators included. Finally, it will touch briefly on common use cases and which data tab to access for those analytics.

```{r setup, message = FALSE}
library(mindthegap)
library(dplyr)
```

## UNAIDS Clean Data Structure

Let's first take a look at the data structure.

```{r, message = FALSE}
df_unaids <- load_unaids(pepfar_only = TRUE)
glimpse(df_unaids)

```

The data are a panel dataset of countries from 1990 to `r mindthegap:::unaids_year` reporting on `r unique(df_unaids$indicator) %>% length()` different indicators. Each of those indicators may be disaggregated by some combination of `age` and `sex`, as seen in the table below. It's also important to note that the values are estimates, each with a lower and upper bound.

Estimates may have been reported with character values, e.g. "\<100" or "\>98". In order to analyze and plot these data, we have removed the character values and noted them in the `estimate_flag` column.
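
To make the idea concrete, here is a minimal sketch of how a reported value such as "<100" could be split into a numeric estimate and a flag. It uses toy values rather than actual UNAIDS data, the `estimate_raw` and `estimate` column names are assumptions for illustration, and the actual encoding of `estimate_flag` in the package may differ.

```{r}
library(dplyr)

# Toy reported values (hypothetical, not actual UNAIDS data)
raw <- data.frame(estimate_raw = c("<100", "1200", ">98", "45"),
                  stringsAsFactors = FALSE)

parsed <- raw %>%
  mutate(
    # keep a leading "<" or ">" as the flag; NA when the value is purely numeric
    estimate_flag = ifelse(grepl("^[<>]", estimate_raw),
                           substr(estimate_raw, 1, 1),
                           NA_character_),
    # drop the leading symbol so the value can be treated as a number
    estimate = as.numeric(sub("^[<>]", "", estimate_raw))
  )

parsed
```

Keeping the flag alongside the numeric value lets you plot or summarise the estimate while still being able to identify values that were reported as a bound rather than a point.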

We have also provided a few additional columns to the dataset:

-   `pepfar`: a logical value that notes whether the country is a PEPFAR country or not
-   `indicator_type`: a character value that tells you whether the estimate is an integer, percentage, rate, or ratio
-   `achv_95_plhiv` and `achv_95_relative`: logical values that identify whether the country in a given year has achieved all three 95 goals (with PLHIV or relative base, see \linkVignette{`relative-plhiv-base`}) based on the point estimate. This achievement is broken down by each age and sex group (All, 0-14, 15+, Female/15+, Male/15+)
-   `achv_epi_control`: a logical value that identifies whether a country in a given year has achieved all three requirements for epidemic control (IMR less than 1 and both new infections and deaths declining, see \linkVignette{`epi_control_plots`}).
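
As a rough illustration of how the three epidemic control requirements combine into a single logical value, consider the sketch below. The data are made up, the column names (`imr`, `infections`, `deaths`, `achv_epi_control_toy`) are hypothetical, and "declining" is assumed here to mean lower than the previous year, which may not match the package's definition.

```{r}
library(dplyr)

# Toy country-level series (hypothetical values)
toy_epi <- data.frame(
  year = 2019:2022,
  imr = c(1.25, 1.07, 0.98, 1.01),
  infections = c(500, 450, 400, 410),
  deaths = c(400, 420, 410, 405)
)

toy_epi_flagged <- toy_epi %>%
  arrange(year) %>%
  mutate(
    # assumption: "declining" means lower than the prior year
    infections_declining = infections < dplyr::lag(infections),
    deaths_declining = deaths < dplyr::lag(deaths),
    # all three requirements must hold in the same year
    achv_epi_control_toy = imr < 1 & infections_declining & deaths_declining
  )

toy_epi_flagged
```

In the real dataset the flag is already provided, so a filter such as `achv_epi_control == TRUE` is all that is needed to pull the country-years that meet the criteria.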

```{r, message = FALSE, echo = FALSE}
tbl <- indicator_map %>% 
  dplyr::full_join(expected_ind %>% 
                     dplyr::mutate(indicator_edms = ifelse(indicator_edms == "Incidence mortality ratio",
                                                           "Incidence mortality ratio (IMR)", indicator_edms)), 
                   by = "indicator_edms") %>% 
  mindthegap:::standarize_agesex() %>% 
  mindthegap:::apply_indicator_type() %>% 
  dplyr::arrange(indicator_type) %>% 
  dplyr::distinct(indicator_type, indicator, age, sex)


tbl <- tbl %>% 
  dplyr::mutate(disaggs = dplyr::case_when(age == "All" & sex == "All" ~ "All",
                                     age == "All" ~ sex,
                                     sex == "All" ~ age,
                                     TRUE ~ stringr::str_glue("{sex}/{age}")),
                disaggs = factor(disaggs, c("All", "0-14", "15+", "15-49", "Female/15+", "Male/15+", 
                                "Female/15-49", "Female/15-24", "Male/15-24",
                                "Female", "Male"))) %>% 
  dplyr::arrange(indicator_type, indicator, disaggs) %>% 
  dplyr::group_by(indicator_type, indicator) %>% 
  dplyr::summarise(disaggs = paste(disaggs, collapse = ", "),
                   .groups = "drop")

type_cnt <- dplyr::count(tbl, indicator_type)


tbl %>% 
  dplyr::select(-indicator_type) %>% 
  dplyr::mutate(indicator = stringr::str_glue("`{indicator}`")) %>% 
  knitr::kable(format = "html",
               col.names = c("", "")) %>% 
  kableExtra::group_rows(index = c("Integer Indicators" = type_cnt$n[type_cnt$indicator_type=="Integer"], 
                                   "Percent Indicators" = type_cnt$n[type_cnt$indicator_type=="Percent"],
                                   "Rate Indicators" = type_cnt$n[type_cnt$indicator_type=="Rate"],
                                   "Ratio Indicators" = type_cnt$n[type_cnt$indicator_type=="Ratio"]
                                   )) %>% 
  kableExtra::kable_styling(bootstrap_options = c("hover"))
```

As you can see, the available disaggregates vary across indicators. For instance, if you were looking at **"Number AIDS Related Deaths"**, there are total and age disaggregates for 0-14 and 15+, but it's not broken down by sex.

As such, it is important to filter for the age/sex/indicator type disaggregates that you are interested in before diving further into your analytics to avoid any potential double counting.
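
The sketch below shows why this matters, using a toy data frame shaped like the disaggregates described above. The values are invented, and the `country`, `year`, and `estimate` column names are assumptions for illustration; `indicator`, `age`, and `sex` follow the dataset.

```{r}
library(dplyr)

# Toy rows for one country-year (hypothetical values)
toy <- data.frame(
  country = "Country A",
  year = 2022,
  indicator = "Number AIDS Related Deaths",
  age = c("All", "0-14", "15+"),
  sex = "All",
  estimate = c(1000, 100, 900),
  stringsAsFactors = FALSE
)

# Summing every row counts each death twice:
# once in the "All" row and once in its age band
naive_total <- sum(toy$estimate)

# Filter to a single disaggregate before aggregating
toy_total <- toy %>%
  filter(indicator == "Number AIDS Related Deaths",
         age == "All",
         sex == "All")

filtered_total <- sum(toy_total$estimate)

c(naive = naive_total, filtered = filtered_total)
```

The same pattern applies to `df_unaids`: pick the indicator and the one age/sex combination you need, then summarise.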