---
title: "Suggested quality control"
author: "Owen Jones"
date: "`r Sys.Date()`"
output:
  rmarkdown::html_vignette:
    toc: true
    depth: 2
vignette: >
  %\VignetteIndexEntry{Suggested quality control}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```


# Introduction

The specific requirements and assumptions for each function in `Rage` varies. Here we provide notes giving an overview of these requirements, which are also applicable to other functions/calculations in packages such as `popbio` and `popdemo`. We make some suggestions of how users should filter their dataset before analysis. To assist with this, the function `Rcompadre::cdb_flag` conducts a series of checks on the matrices in a `compadreDB` object and adds "flags" to facilitate the filtering out of problematic matrices. This function (`cdb_flag`) can automatically be run by `Rcompadre::cdb_fetch` using the argument `flag = TRUE`.

# Types of issue

## Missing data

The most obvious requirement for most of the `Rage` methods is that missing (`NA`) values in matrices prevent calculations using those matrices. Sometimes these `NA` values are in one of the submatrices (i.e., **U**, **F** or **C**) of the matrix model, but other submatrices are complete. For example, there may be `NA` entries in the **F** submatrix, while the **U** matrix remains complete. These issues are flagged with the columns `check_NA_A`,  `check_NA_U`,  `check_NA_F` and  `check_NA_C`.


## Excessive zeros

Submatrices composed entirely of zero values can also be problematic. There may be good biological reasons for this phenomenon. Species that do not reproduce clonally will have zero-value **C** matrices. Another biologically reasonable explanation could be that in the particular focal population in the particular focal year, there was no sexual reproduction recorded, so the **F** matrix was composed entirely of zeros. Nevertheless, zero-value submatrices can cause some calculations to fail and it may be necessary to exclude them. These issues are flagged with the columns `check_zero_F`, `check_zero_C`, `check_zero_U`. 


## Excessive survival

In a biologically reasonable matrix population model, the set of survival and growth transitions (i.e., in the **U** matrix) from a particular stage cannot exceed 1. However, in some cases, errors in the original matrices (including data entry and rounding errors) cause this situation to occur and may persist in the data set. We can check for this error using column sums of the **U** matrix, and may wish to exclude matrices with any column sum greater than 1.  
This issue is examined using the column `SurvivalIssue` which gives the maximum value of the column sums for `matU` . An additional column `check_surv_gte_1` (produced with `cdb_flag`) reports whether any **single** value is greater than or equal to 1.

## Excessive mortality

At the opposite end of the survival spectrum, there may be some matrices where some of the column sums of the **U** matrix are zero, implying that there is no survival from that particular stage. This may be a perfectly valid parameterisation for a particular year/place but is biologically unreasonable in the longer term and users may wish to exclude problematic matrices from their analysis. This issue is indicated by the column `check_zero_U_colsum`.


## Irreducibility and ergodicity 

Several matrix manipulations or calculations require that the MPM (`matA`) be irreducible and ergodic (Stott et al. 2018). Irreducible MPMs are those where parameterised transition rates facilitate pathways from all stages to all other stages. Conversely, reducible MPMs depict incomplete life cycles where pathways from all stages to every other stage are not possible. Ergodic MPMs are those where there is a *single* asymptotic stable state that does not depend on initial stage structure. Conversely, non-ergodic MPMs are those where there are multiple asymptotic stable states, which depend on initial stage structure. MPMs that are reducible and/or non-ergodic are usually biologically unreasonable, both in terms of their life cycle description and their projected dynamics. They cause some calculations in `Rage` (and elsewhere) to fail. Irreducibility is necessary but not sufficient for ergodicity.  These issues are flagged with `check_irreducible` and `check_ergodic`. Even if  `Rage` functions do not fail due to these issues, the fact that they can indicate biologically unreasonable life cycles may mean that users nevertheless wish to exclude reducible, non-ergodic matrices from their analyses.

## Singularity of the U matrix

Matrices are said to be singular if they cannot be inverted. Inversion is required for many matrix calculations and, therefore, singularity can cause some calculations to fail. This issue is flagged with `check_singular_U`. Calculations for `longevity`, `life_expect_mean`, `life_expect_var` and `net_repro_rate` fail with singular matrices, so users may wish to exclude singular matrices when conducting analyses using these functions.

## Matrix split errors

A complete MPM (**A**) can be split into its component submatrices (i.e. **U**, **F** and **C**). The sum of these submatrices should equal the complete MPM (i.e. **A** = **U** + **F** + **C**). Sometimes, however, errors occur so that the submatrices do NOT sum to **A**. Normally, this is caused by rounding errors, but more significant errors are possible. This problem is flagged with `check_component_sum` (only relevant for divided (split) matrices). We recommend that users carefully check their matrices for these errors and correct or exclude them as appropriate.


# Function requirement summaries

It is a general requirement for almost all `Rage` functions that the matrices used as arguments do not include `NA` values. With divided (split) matrices, `NA` values may be present in some submatrices but not others. For example, the **U** matrix may be complete, but the **F** matrix may have `NA` values. In this case, functions that require an **F** matrix will fail, while those that only require a **U** matrix will work.  Users should filter the data to exclude entries with `NA` values in the matrices required for their analysis. The functions `mpm_split`, `mpm_rearrange` and `mpm_standardise` do not require complete `NA`-free matrices.

For functions that use the **U** matrix, we further suggest filtering the data to exclude the biologically unreasonable entries where one or more of the `matU` columns sum to zero, or to greater than 1 (see *Excessive Survival*, above). Alternatively, users could examine the offending matrices and make sensible corrections (e.g. to correct rounding errors).

For functions that use the **F** matrix, and where sexual reproduction is known to occur in the species, we suggest that users consider filtering the data to exclude entries where **F** is entirely zero. This is not always desirable because there are some situations where zero recorded reproduction is biologically reasonable. We suggest a similar approach for the **C** matrix.

# Other issues

When using age-from-stage methods, users should be aware of the issue of convergence to quasi-stationary distribution (see \code{\link{qsd_converge}}). Briefly, All age-from-stage calculations produce age-trajectories that inevitably asymptote as a mathematical consequence of describing the vital rates as functions of discrete stages (Horvitz & Tuljapurkar, 2008). This mathematical artefact can introduce bias into measures obtained using age-from-stage methods. `Rage` provides a convenient and principled way of correcting for this artefact by imposing a lower probability threshold defined by the degree of convergence to the quasi-stationary distribution (see \code{\link{qsd_converge}}). We suggest that users filter out from their analyses matrices that do not pass this threshold criterion.

Users should also be aware of the issue of census type. For populations that reproduce in a pulse once per year. The demographic census may be carried out before or after the reproduction event. There are thus two types of census: Pre- and post-reproductive census. This distinction has potentially important implications for demographic measures because of its effects on measured population structure. For example, the fraction of individuals in the first age class will tend to be larger larger in a post-reproductive census than a pre-reproductive census. There is a column in the com(p)adre metadata (`CensusType`) that is intended to record this information but, because authors of source publications have rarely clearly stated this information, it is very incomplete. For serious analyses we therefore recommend that users carefully collect this information themselves from the source papers.

# Finally...

Although we highlight here a range of issues that could cause problems for MPM analyses we have likely inadvertently omitted some issues. We therefore urge users to carefully consider issues that may pertain to their particular analyses. 

# References

Horvitz, C. C., & Tuljapurkar, S. (2008). Stage dynamics, period survival, and mortality plateaus. The American Naturalist, 172(2), 203–215.

Stott, I., Townley, S., & Carslake, D. (2010). On reducibility and ergodicity of population projection matrix models. Methods in Ecology and Evolution. 1 (3), 242-252