---
title: "A Quick Introduction to iNEXT.4steps via Examples"
author: "Anne Chao and Kai-Hsiang Hu"
date: "`r Sys.Date()`"
output:
rmarkdown::html_vignette:
vignette: >
%\VignetteIndexEntry{A Quick Introduction to iNEXT.4steps via Examples}
%\VignetteEngine{knitr::rmarkdown}
%\VignetteEncoding{UTF-8}
---
```{r, include = FALSE}
knitr::opts_chunk$set(collapse = TRUE, comment = "",
fig.retina = 2,
fig.align = 'center',
fig.width = 12, fig.height = 10.5,
warning = FALSE, message = FALSE)
old <- options("width" = 200, digits = 3)
library(iNEXT.4steps)
library(ggplot2)
data("Data_spider")
data("Data_woody_plant")
```
`iNEXT.4steps` (Four-Step Biodiversity Analysis based on iNEXT) expands `iNEXT` (Chao et al. 2014) to include the estimation of sample completeness and evenness under a unified framework of Hill numbers. `iNEXT.4steps` links sample completeness, diversity estimation, interpolation and extrapolation (`iNEXT`), and evenness in a fully integrated approach. The pertinent background for the four-step methodology is provided in Chao et al. (2020). The four-step procedures are described in the following:
- **Step 1: Assessment of sample completeness profile**
Before performing biodiversity analysis, it is important to first quantify the sample completeness of a biological survey. Chao et al. (2020) generalized the conventional sample completeness to a class of measures parametrized by an order q ≥ 0. When q = 0, sample completeness reduces to the conventional measure of completeness, i.e., the ratio of the observed species richness to the true richness (observed plus undetected). When q = 1, the measure reduces to the sample coverage (the proportion of the total number of individuals in the entire assemblage that belong to detected species), a concept original developed by Alan Turing in his cryptographic analysis during WWII. When q = 2, it represents a generalized sample coverage with each species being proportionally weighted by its squared species abundance (i.e., each individual being proportionally weighted by its species abundance); this measure thus is disproportionally sensitive to highly abundant species. For a general order q ≥ 0 (not necessarily to be an integer) , the sample completeness of order q quantifies the proportion of the assemblage's individuals belonging to detected species, with each individual being proportionally weighted by the (q-1)th power of its abundance. Sample completeness profile depicts its estimate with respect to order q ≥ 0; this profile fully characterizes the sample completeness of a biological survey.
`iNEXT.4steps` features the estimated profile for all orders of q ≥ 0 based on the methodology developed in Chao et al. (2020). All estimates are theoretically between 0 and 1. If the estimated sample completeness profile is a horizontal line at the level of unity for all orders of q ≥ 0, then the survey is complete, implying there is no undetected diversity. In most applications, the estimated profile increases with order q, revealing the existence of undetected diversity. The sample completeness estimate for q = 0 provides an upper bound for the proportion of observed species; its complement represents a lower bound for the proportion of undetected species. This interpretation is mainly because data typically do not contain sufficient information to accurately estimate species richness and only a lower bound of species richness can be well estimated. By contrast, for q ≥ 1, when data are not sparse, the sample completeness value for q ≥ 1 can be very accurately estimated measures. The values for q ≥ 2 typically are very close to unity, signifying that almost all highly abundant species (for abundance data) or highly frequent species (for incidence data) had been detected in the reference sample.
- **STEP 2. Analysis of the size-based rarefaction and extrapolation sampling curves, and the asymptotic diversity profile for 0 ≤ q ≤ 2.**
**(STEP 2a)**. For each dataset, first examine the pattern of the size-based rarefaction and extrapolation sampling curve up to double the reference sample size for q = 0, 1 and 2. If the curve stays at a fixed level (this often occurs for the measures of q = 1 and 2), then our asymptotic estimate presented in Step 2b can be used to accurately infer the true diversity of the entire assemblage. Otherwise, our asymptotic diversity estimate represents only a lower bound (this often occurs for the measures of q = 0).
**(STEP 2b)**. When the true diversity can be accurately inferred, the extent of undetected diversity within each dataset is obtained by comparing the estimated asymptotic diversity profile and empirical profile; the difference in diversity between any two assemblages can be evaluated and tested for significance.
- **STEP 3. Analysis of non-asymptotic coverage-based rarefaction and extrapolation analysis for orders q = 0, 1 and 2.**
When sampling data do not contain sufficient information to accurately infer true diversity, fair comparisons of diversity across multiple assemblages should be made by standardizing the sample coverage (i.e., comparing diversity for a standardized fraction of an assemblage's individuals). This comparison can be done based on seamless integration of coverage-based rarefaction and extrapolation sampling curves up to a maximum coverage (Cmax = the minimum sample coverage among all samples extrapolated to double reference sizes).
- **STEP 4. Assessment of evenness profiles**
Chao and Ricotta (2019) developed five classes of evenness measures parameterized by an order q > 0. (For q = 0, species abundances are disregarded, so it is not meaningful to evaluate evenness among abundances specifically for q = 0. As q tends to 0, all evenness values tend to 1 as a limiting value.) All classes of evenness measures are functions of diversity and species richness, and all are standardized to the range of [0, 1] to adjust for the effect of differing species richness. Evenness profile depicts evenness estimate with respect to order q ≥ 0. Because true species richness typically cannot be accurately estimated, evenness profile typically can only be accurately measured when both diversity and richness are computed at a fixed level of sample coverage up to a maximum coverage Cmax defined in Step 3. `iNEXT.4steps` shows, by default, the relevant statistics and plot for only one class of evenness measure (based on the normalized slope of a diversity profile), but all the five classes are featured.
NOTE 1: Sufficient data are required to perform the 4-step analysis. If there are only a few species in users' data, it is likely that data are too sparse to use `iNEXT.4steps.`
NOTE 2: The analyses in STEPs 2a, 2b and 3 are mainly based on package `iNEXT` available from CRAN. Thus, `iNEXT.4steps` expands `iNEXT` to include the estimation of sample completeness and evenness.
NOTE 3: As with `iNEXT`, `iNEXT.4steps` only deals with taxonomic/species diversity. Researchers who are interested in phylogenetic diversity and functional diversity should use package `iNEXT.3D` available from CRAN and see the relevant paper (Chao et al. 2021) for methodology.
NOTE 4: `iNEXT.4steps` aims to compare within-assemblage diversity. If the goal is to assess the extent of differentiation among assemblages or to infer species compositional shift and abundance changes, users should use `iNEXT.beta3D` available from CRAN and see the relevant paper (Chao et al. 2023) for methodology.
## How to cite
If you publish your work based on results from iNEXT.4steps package, you should make references to the following methodology paper and the package:
- Chao, A., Kubota, Y., Zelený, D., Chiu, C.-H., Li, C.-F., Kusumoto, B., Yasuhara, M., Thorn, S., Wei, C.-L., Costello, M. J. and Colwell, R. K. (2020). Quantifying sample completeness and comparing diversities among assemblages. Ecological Research, 35, 292-314.
- Chao, A. and Hu, K.-H. (2024). The iNEXT.4steps package: Four-Step Biodiversity Analysis based on iNEXT. R package available from CRAN.
## SOFTWARE NEEDED TO RUN iNEXT.4STEPS IN R
- Required: [R](https://cran.r-project.org/)
- Suggested: [RStudio IDE](https://posit.co/products/open-source/rstudio/#Desktop)
## HOW TO RUN INEXT.4STEPS:
The `iNEXT.4steps` package can be downloaded from CRAN or Anne Chao's [iNEXT.4steps_github](https://github.com/AnneChao/iNEXT.4steps). For a first-time installation, some additional packages must be installed and loaded; see package manual.
```{r eval=FALSE}
## install iNEXT.4steps package from CRAN
install.packages("iNEXT.4steps")
## install the latest version from github
install.packages('devtools')
library(devtools)
install_github('AnneChao/iNEXT.4steps')
## import packages
library(iNEXT.4steps)
```
An online version of iNEXT.4steps (https://chao.shinyapps.io/iNEXT_4steps/) is also available for users without an R background.
## DATA INPUT FORMAT
### Species abundance/incidence data format
For `iNEXT.4steps` package, pertinent information on species identity (or any unique identification code) and assemblage affiliation is required to be included in the input data for running iNEXT.4steps, although species identity information is not taken into account in inferring and comparing taxonomic/species diversity. Two types of species taxonomic data are supported:
1. Individual-based abundance data (`datatype = "abundance"`): When there are multiple assemblages, in addition to the assemblage/site names (as column names) and the species names (as row names), species abundance data (reference sample) can be input as a species (in rows) by assemblage (in columns) matrix/data.frame or a list of species abundance vectors. In the special case that there is only one assemblage, all data should be read in one column.
2. Sampling-unit-based incidence data, i.e., Incidence or occurrence data (`datatype = "incidence_raw"`): for each assemblage, input data for a reference sample consist of a species-by-sampling-unit matrix, in addition to the sampling-unit names (as column names) and the species names (as row names). When there are N assemblages, input data consist of N lists of matrices, and each matrix is a species-by-sampling-unit matrix. Each element in the incidence raw matrix is 1 for a detection, and 0 for a non-detection. Input a matrix which combines data for all assemblages is allowed, but the argument `nT` must be specified to indicate the number of sampling units in each assemblage.
For example, the dataset `Data_spider` included in the `iNEXT.4steps` package consists of species sample abundances of two assemblages/sites: "Open" and "Closed". Run the following code to view the data (only the first 15 rows are shown below).
```{r eval=FALSE}
data("Data_spider")
Data_spider
```
```{r echo=FALSE}
data("Data_spider")
Data_spider[1:15,]
```
We use incidence raw data (`Data_woody_plant`) collected from two sites, `"Monsoon"` and `"Upper_cloud"`, as an example. Run the following code to view the data (only the first 6 rows and first 3 columns for each site are shown below).
```{r eval=FALSE}
data("Data_woody_plant")
Data_woody_plant
```
```{r echo=FALSE}
data("Data_woody_plant")
rownames(Data_woody_plant$Upper_cloud)[1] = "Abelia_chinensis_R._Br._var._ionandra"
lapply(Data_woody_plant, function(x) x[1:6, 1:3])
```
## MAIN FUNCTION iNEXT4steps()
We first describe the main function `iNEXT4steps()` with default arguments:
```{r eval=FALSE}
iNEXT4steps(data, q = seq(0, 2, 0.2), datatype = "abundance",
nboot = 30, conf = 0.95, nT = NULL, details = FALSE)
```
The arguments of this function are briefly described below, and will explain details by illustrative examples in later text.
data |
(a) For `datatype = "abundance"`, data can be input as a vector of species abundances (for a single assemblage), matrix/data.frame (species by assemblages), or a list of species abundance vectors. \cr
(b) For `datatype = "incidence_raw"`, data can be input as a list of matrix/data.frame (species by sampling units); data can also be input as a matrix/data.frame by merging all sampling units across assemblages based on species identity; in this case, the number of sampling units (nT, see below) must be input.
|
q |
a numerical vector specifying the orders of q that will be used to compute sample completeness and evenness as well as plot the relevant profiles. Default is seq(0, 2, by = 0.2) . |
datatype |
data type of input data: individual-based abundance data (`datatype = "abundance"`) or species by sampling-units incidence matrix (`datatype = "incidence_raw"`) with all entries being 0 (non-detection) or 1 (detection) |
nboot |
a positive integer specifying the number of bootstrap replications when assessing sampling uncertainty and constructing confidence intervals. Enter 0 to skip the bootstrap procedures. Default is 30. |
conf |
a positive number < 1 specifying the level of confidence interval. Default is 0.95. |
nT |
(required only when datatype = "incidence_raw" and input data in a single matrix/data.frame) a vector of positive integers specifying the number of sampling units in each assemblage. If assemblage names are not specified (i.e., names(nT) = NULL ), then assemblages are automatically named as "Assemblage1", "Assemblage2",..., etc. |
details |
a logical variable to indicate whether the detailed numerical values for each step are displayed. Default is `FALSE`. |
The output of `iNEXT4steps` will have three parts (if `details = TRUE`): `$summary`, `$figure`, and `$details`. It may take some time to compute when data size is large or `nboot` is large.
## iNEXT.4steps VIA EXAMPLES
First, we use the data `Data_spider` to illustrate the complete 4-step analysis.
### EXAMPLE 1: Complete 4 steps for abundance data
In the spider data, species abundances of epigeal spiders were recorded in two forest stands ("closed" and "open"). In the open forest, there were 1760 individuals representing 74 species, whereas in the closed forest, there were 1411 individuals representing 44 species. In the pooled habitat, a total of 3171 individuals belonging to 85 species are recorded.
Run the following code to obtain the numerical output and six figures including five individual figures (for STEPs 1, 2a, 2b, 3 and 4, respectively) and a complete set of five plots. (Here only show the complete set of five plot; all five individual plots are omitted.)
```{r align="center",out.width="100%"}
data(Data_spider)
Four_Steps_out1 <- iNEXT4steps(data = Data_spider, datatype = "abundance")
Four_Steps_out1$summary
Four_Steps_out1$figure[[6]]
```
`$summary`: numerical tables for STEPs 1, 2b, 3 and 4.
* `Assemblage` = the assemblage names.
* `qTD` = 'Species richness' represents the taxonomic diversity of order q=0; 'Shannon diversity' represents the taxonomic diversity of order q=1, 'Simpson diversity' represents the taxonomic diversity of order q=2.
* `TD_obs` = the observed taxonomic diversity value of order q.
* `TD_asy` = the estimated asymptotic diversity value of order q.
* `s.e.` = the bootstrap standard error of the estimated asymptotic diversity of order q.
* `qTD.LCL`, `qTD.UCL` = the bootstrap lower and upper confidence limits for the estimated asymptotic diversity of order q at the specified level in the setting (with a default value of 0.95).
* `Pielou J'` = a widely used evenness measure based on Shannon entropy.
`$figure`: six figures including five individual figures (for STEPS 1, 2a, 2b, 3 and 4 respectively) and a complete set of five plots.
`$details`: (only when `details = TRUE`). The numerical output for plotting all figures.
### EXAMPLE 2: Complete 4 steps for incidence data
In the “Woody_plant” data, species incidence-raw data were recorded in two forest vegetation types (“Monsoon” and “Upper_cloud” forest). In the monsoon forest, 329 species and 6814 incidences were recorded in 191 plots. In the upper cloud forest, 239 species and 3371 incidences were recorded in 153 plots (each 20×20-m plot is regarded as a sampling unit). Because spatial clustering prevails in woody plants, individual plants cannot be regarded as independent sampling units, violating the basic sampling assumptions for the model based on abundance data. Thus, it is statistically preferable to use incidence data to avoid this violation.
Run the following code to obtain the numerical output and six figures including five individual figures (for STEPs 1, 2a, 2b, 3 and 4, respectively) and a complete set of five plots. (Here only show the complete set of five plot; all five individual plots are omitted.)
```{r align="center", out.width="100%"}
data(Data_woody_plant)
Four_Steps_out2 <- iNEXT4steps(data = Data_woody_plant, datatype = "incidence_raw")
Four_Steps_out2$summary
Four_Steps_out2$figure[[6]]
```
## Completeness and ggCompleteness: MAIN FUNCTIONS FOR STEP 1
Function `Completeness()` computes sample completeness estimates of orders q = 0 to q = 2 in increments of 0.2 (by default), and function `ggCompleteness` is used to plot the corresponding sample completeness profiles. These two functions are specifically for users who only require sample completeness estimates and profiles.
The two functions with arguments are described below:
```{r eval=FALSE}
Completeness(data, q = seq(0, 2, 0.2), datatype = "abundance", nboot = 30,
conf = 0.95, nT = NULL)
```
```{r eval=FALSE}
ggCompleteness(output)
```
All the arguments in these two functions are the same as those in the main fnction `iNEXT4steps` for details.
### Sample completeness estimates and profiles for abundance data
Run the following code to obtain sample completeness estimates based on the abundance data `Data_spider`:
```{r}
data(Data_spider)
SC_out1 <- Completeness(data = Data_spider, datatype = "abundance")
SC_out1
```
* `Order.q`: the order of sample completeness.
* `Estimate.SC`: the estimated sample completeness of order q.
* `s.e.`: standard error of sample completeness estimate.
* `SC.LCL`, `SC.UCL`: the bootstrap lower and upper confidence limits for the sample completeness of order q at the specified level (with a default value of 0.95).
* `Assemblage`: the assemblage name.
Run the following code to plot sample completeness profiles for q between 0 to 2, along with confidence intervals.
```{r out.width="70%",fig.height=8}
ggCompleteness(SC_out1)
```
### Sample completeness estimates and profiles for incidence data
Similar procedures can be applied to incidence data `Data_woody_plant` to infer sample completeness.
```{r}
data(Data_woody_plant)
SC_out2 <- Completeness(data = Data_woody_plant, datatype = "incidence_raw")
SC_out2
```
Run the following code to plot sample completeness profiles for q between 0 to 2, along with confidence intervals.
```{r out.width="70%",fig.height=8}
ggCompleteness(SC_out2)
```
## Evenness and ggevenness: MAIN FUNCTIONS FOR STEP 4
`Evenness()` computes standardized (or observed) evenness of order q = 0 to q = 2 in increments of 0.2 (by default) based on five classes of evenness measures, and function `ggevenness` is used to plot the corresponding evenness profiles. These two functions are specifically for users who only require evenness estimates and profiles. The two functions with arguments are described below:
```{r eval=FALSE}
Evenness(data, q = seq(0, 2, 0.2), datatype = "abundance", method = "Estimated",
nboot = 30, conf = 0.95, nT = NULL, E.class = 1:5, SC = NULL)
```
```{r eval=FALSE}
ggEvenness(output)
```
There are only three arguments that are not used in the main function `iNEXT4steps`; these three arguments are described below (see `iNEXT4steps` for other arguments)
method |
a binary selection of method with "Estimated" (evenness is computed under a standardized coverage value) or "Observed" (evenness is computed for the observed data). |
E.class |
an integer vector between 1 to 5 specifying which class(es) of evenness measures are selected; default is 1:5 (select all five classes). |
SC |
(required only when method = "Estimated" ) a standardized coverage value for calculating estimated evenness. If SC=NULL , then this function computes the diversity estimates for the minimum sample coverage among all samples extrapolated to double reference sizes (Cmax). |
Two simple examples for demonstrating functions `Evenness` and `ggEvenness` are given below.
### Evenness estimates and profiles for abundance data with default standardized coverage value (Cmax)
The dataset `Data_spider` is used to estimate evenness at the default standardized sample coverage (`SC = NULL`). Evenness estimates of order q = 0 to q = 2 in increments of 0.2 (by default) will be computed based on five classes of evenness measures. (For q = 0, species abundances are disregarded, so it is not meaningful to evaluate evenness among abundances specifically for q = 0. As q tends to 0, all evenness values tend to 1 as a limiting value.) Function `ggevenness` is used to plot the corresponding evenness profiles. Here only the evenness estimates are shown for the first class of evenness measures. The corresponding numerical tables for the other four classes of evenness measures are omitted. Users also can use argument `E.class` to specify which class (e.g., `E.class= 3`) or classes are required.
```{r, eval=FALSE}
data(Data_spider)
Even_out1_est <- Evenness(data = Data_spider, datatype = "abundance",
method = "Estimated", SC = NULL, E.class = 1:5)
Even_out1_est
```
```{r, echo=FALSE}
data(Data_spider)
Even_out1_est <- Evenness(data = Data_spider, datatype = "abundance",
method = "Estimated", SC = NULL, E.class = 1:5)
Even_out1_est[1]
```
* `Order.q`: the order of evenness.
* `Evenness`: the computed evenness value of order q.
* `s.e.`: standard error of evenness value.
* `Even.LCL`, `Even.UCL`: the bootstrap lower and upper confidence limits for the evenness of order q at the specified level (with a default value of `0.95`).
* `Assemblage`: the assemblage name.
* `Method`: `"Estimated"` or `"Observed"`.
* `SC`: the standardized coverage value under which evenness values are computed (only for `method = "Estimated"`)
The following commands plot the evenness profiles for all five classes of even measures, along with their confidence intervals.
```{r out.width="100%",fig.height=8}
ggEvenness(Even_out1_est)
```
### Evenness estimates and profiles for incidence data with user's specified coverage value of 0.98.
In the function `Evenness`, users can specify a particular sample coverage value under which all five classes of evenness measures will be computed. For example, instead of using the default standardized coverage value, if users want to compute evenness estimates for 0.98 based on the incidence dataset `Data_woody_plant`, then the argument `SC=0.98` is used instead of `SC=NULL`, as shown below. Here only the evenness estimates are displayed for the first class of evenness measures. The corresponding numerical tables for the other four classes of evenness measures are omitted.
```{r, eval=FALSE}
data(Data_woody_plant)
Even_out2_est <- Evenness(data = Data_woody_plant, datatype = "incidence_raw",
method = "Estimated", SC = 0.98, E.class = 1:5)
Even_out2_est
```
```{r, echo=FALSE}
data(Data_spider)
Even_out2_est <- Evenness(data = Data_woody_plant, datatype = "incidence_raw",
method = "Estimated", SC = 0.98, E.class = 1:5)
Even_out2_est[1]
```
The following commands plot the evenness profiles for five classes, along with their confidence intervals.
```{r out.width="100%",fig.height=8}
ggEvenness(Even_out2_est)
```
```{r, include = FALSE}
options(old)
```
## License
The iNEXT.4steps package is licensed under the GPLv3. To help refine `iNEXT.4steps`, your comments or feedback would be welcome (please send them to Anne Chao or report an issue on the iNEXT.4steps github [iNEXT.4steps_github](https://github.com/AnneChao/iNEXT.4steps).
## References
- Chao, A., Gotelli, N. G., Hsieh, T. C., Sander, E. L., Ma, K. H., Colwell, R. K. and Ellison, A. M. (2014). Rarefaction and extrapolation with Hill numbers: a framework for sampling and
estimation in species biodiversity studies. Ecological Monographs 84, 45-67.
- Chao, A., Henderson, P. A., Chiu, C.-H., Moyes, F., Hu, K.-H., Dornelas, M and Magurran, A. E. (2021). Measuring temporal change in alpha diversity: a framework integrating taxonomic, phylogenetic and functional diversity and the iNEXT.3D standardization. Methods in Ecology and Evolution, 12, 1926-1940.
- Chao, A., Kubota, Y., Zelený, D., Chiu, C.-H., Li, C.-F., Kusumoto, B., Yasuhara, M., Thorn, S., Wei, C.-L., Costello, M. J. and Colwell, R. K. (2020). Quantifying sample completeness and comparing diversities among assemblages. Ecological Research, 35, 292-314.
- Chao, A. and Ricotta, C. (2019). Quantifying evenness and linking it to diversity, beta diversity, and similarity. Ecology, 100(12), e02852.
- Chao, A., Thorn, S., Chiu, C.-H., Moyes, F., Hu, K.-H., Chazdon, R. L., Wu, J., Magnago, L. F. S., Dornelas, M., Zelený, D., Colwell, R. K., and Magurran, A. E. (2023). Rarefaction and extrapolation with beta diversity under a framework of Hill numbers: the iNEXT.beta3D standardization. Ecological Monographs e1588.