Make Training Data for Species Modeling from NEON TOS Vegetation Structure Data

Create a training dataset for tree classification using TOS vegetation structure data.
Author

Bridget Hass

This notebook demonstrates how to generate a training dataset consisting of tree species, family, and location from the NEON Terrestrial Observation System (TOS) Vegetation Structure data product DP1.10098.001. We will use data from the Smithsonian Environmental Research Center (SERC) site in Maryland. In a subsequent tutorial titled Tree Classification with NEON Airborne Imaging Spectrometer Data using Python xarray, we will use this training dataset to train a random forest machine learning model that predicts tree families from the hyperspectral signatures obtained from the airborne remote sensing data. These two tutorials outline a relatively simple modeling example, and represent a starting point for conducting machine learning analyses using NEON data!

Set Up Python Environment

To run this notebook, you will need the following Python packages, which can be installed using !pip install or !conda install from within the notebook. Note that to use the neonutilities package, you will need Python version 3.9 or higher.

  • matplotlib
  • neonutilities
  • numpy
  • pandas
  • requests
  • seaborn
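The packages above can be installed in one line from a terminal, or from a notebook cell prefixed with `!`; substitute `conda install` if you are working in a conda environment:

```shell
pip install matplotlib neonutilities numpy pandas requests seaborn
```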

Create a NEON AOP Token

  • NEON API Token (optional, but strongly recommended). See the NEON API Tokens Tutorial for more details on how to create and set up your token in Python (and R). Once you create your token (on the NEON User Accounts page), this notebook will show you how to set it as an environment variable and use it for downloading NEON data.

Learning Objectives

  • Use the neonutilities load_by_product function to read in NEON vegetation structure data at a given site
  • Use the NEON locations API to determine the geographic position of the vegetation records in UTM x, y coordinates
  • Filter the dataset to include only the latest data and columns of interest
  • Filter the data geospatially to keep data that are within a single AOP 1 km x 1 km tile

Disclaimer: this notebook is intended to provide an example of how to create an initial training data set for pairing with remote sensing data, and to conduct some exploratory analysis of the vegetation structure data. This does not incorporate outlier detection and removal, or comprehensive pre-processing steps. As part of creating a machine learning model, it is important to assess the training data quality and look for outliers or other potential data quality issues which may impact model results. Refer to the Compare tree height measured from the ground to a Lidar-based Canopy Height Model lesson (the first additional resource above) for more details on how you would address geographic mismatch between the AOP and TOS data.

Additional Resources

  • Compare tree height measured from the ground to a Lidar-based Canopy Height Model
  • Quick Start Guide for Vegetation Structure (DP1.10098.001)

1. Download and Explore Vegetation Structure Data (DP1.10098.001)

In this first section we’ll load the vegetation structure data, find the locations of the mapped trees, and join them to the species and family observations.

Let’s get started! First, import the required Python packages.

import matplotlib.pyplot as plt
import matplotlib.ticker as mticker
import neonutilities as nu
import numpy as np
import pandas as pd
import requests
import seaborn as sns

Set up your NEON token. See the setup instructions at the beginning of the tutorial on how to set up a NEON user account and create a token, if you have not already done so.

# copy and paste your NEON token from your NEON user account page here
my_token=""
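One common pattern (a sketch, not a NEON requirement) is to keep the token out of the notebook entirely by storing it in an environment variable; the variable name `NEON_TOKEN` below is an arbitrary choice for this tutorial:

```python
import os

# Assumption: NEON_TOKEN was set in your shell, a .env file, or an earlier cell.
# Fall back to an empty string so later calls still run, just unauthenticated.
my_token = os.environ.get("NEON_TOKEN", "")
```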

We can load the vegetation structure data using the load_by_product function in the neonutilities package (imported as nu). Inputs to the function can be shown by typing help(nu.load_by_product).

Refer to the R neonUtilities cheat sheet or the Python neonutilities documentation for more details on the function inputs and outputs.

2. Find Tree Locations Using the NEON Locations API

The vst_mappingandtagging table records each individual's position as a distance and azimuth from a named reference point, so we first need the UTM coordinates of those reference points. We can look these up, one location name at a time, with the NEON Locations API.
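The loop below expects a list of reference-point location names, veg_points, and a veg_map dataframe with a matching points column. A minimal sketch of how these might be built from the vst_mappingandtagging table follows; namedLocation and pointID are real columns of that table, but the tiny stand-in dataframe and the exact name-joining pattern are illustrative assumptions:

```python
import pandas as pd

# Stand-in rows for illustration; in the notebook, veg_map would be the
# vst_mappingandtagging table returned by nu.load_by_product.
veg_map = pd.DataFrame({
    "namedLocation": ["SERC_052.basePlot.vst", "SERC_052.basePlot.vst"],
    "pointID": ["41", "57"],
})

# Assumption: a reference point's full location name is the plot's
# namedLocation with the pointID appended after a dot.
veg_map["points"] = veg_map["namedLocation"] + "." + veg_map["pointID"]
veg_points = veg_map["points"].unique()
```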

# query the Locations API for each reference point to get its UTM coordinates
easting = []
northing = []
coord_uncertainty = []
elev_uncertainty = []
for i in veg_points:
    vres = requests.get("https://data.neonscience.org/api/v0/locations/" + i)
    vres_json = vres.json()
    easting.append(vres_json["data"]["locationUtmEasting"])
    northing.append(vres_json["data"]["locationUtmNorthing"])
    # pull the coordinate and elevation uncertainties out of the location properties
    props = pd.DataFrame.from_dict(vres_json["data"]["locationProperties"])
    cu = props.loc[props["locationPropertyName"] == "Value for Coordinate uncertainty",
                   "locationPropertyValue"]
    coord_uncertainty.append(cu.iloc[0])
    eu = props.loc[props["locationPropertyName"] == "Value for Elevation uncertainty",
                   "locationPropertyValue"]
    elev_uncertainty.append(eu.iloc[0])

pt_dict = dict(points=veg_points, 
               easting=easting,
               northing=northing,
               coordinateUncertainty=coord_uncertainty,
               elevationUncertainty=elev_uncertainty)

pt_df = pd.DataFrame.from_dict(pt_dict)
pt_df.set_index("points", inplace=True)

veg_map = veg_map.join(pt_df, 
                     on="points", 
                     how="inner")

Next, use the stemDistance and stemAzimuth data to calculate the precise locations of individuals, relative to the reference locations.

  • \(Easting = easting_{pointID} + stemDistance \cdot \sin(\theta)\)
  • \(Northing = northing_{pointID} + stemDistance \cdot \cos(\theta)\)
  • \(\theta = stemAzimuth \cdot \pi/180\)
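As a quick sanity check of the geometry (illustrative numbers, not NEON data): a stem 10 m from its reference point at an azimuth of 90 degrees (due east) should shift the easting by 10 m and leave the northing unchanged, since azimuth is measured clockwise from north:

```python
import numpy as np

# Illustrative values, not real survey measurements
stem_distance = 10.0  # meters from the reference point
stem_azimuth = 90.0   # degrees clockwise from north (due east)

theta = stem_azimuth * np.pi / 180
d_east = stem_distance * np.sin(theta)   # ~10.0 m
d_north = stem_distance * np.cos(theta)  # ~0.0 m
```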

Also adjust the coordinate and elevation uncertainties: mapping a stem by distance and azimuth adds uncertainty beyond that of the reference point itself, so we add 0.6 m to the coordinate uncertainty and 1 m to the elevation uncertainty.

veg_map["adjEasting"] = (veg_map["easting"]
                        + veg_map["stemDistance"]
                        * np.sin(veg_map["stemAzimuth"]
                                   * np.pi / 180))

veg_map["adjNorthing"] = (veg_map["northing"]
                        + veg_map["stemDistance"]
                        * np.cos(veg_map["stemAzimuth"]
                                   * np.pi / 180))

veg_map["adjCoordinateUncertainty"] = veg_map["coordinateUncertainty"] + 0.6

veg_map["adjElevationUncertainty"] = veg_map["elevationUncertainty"] + 1

Look at the columns to see all the information contained in this dataset.

# look at a subset of the columns that may be relevant
veg_map[['date','individualID','scientificName','taxonID','family','plotID','pointID','adjEasting','adjNorthing']].head(5)
len(veg_map)

3. Filter to trees within an AOP tile extent

Now create a new dataframe containing only the veg data that fall within a single AOP tile (tiles are 1 km x 1 km in size). For this, you will need to know the bounds (minimum and maximum UTM easting and northing) of the area you are sampling. For this exercise, we will use the tile with a SW (lower-left) UTM coordinate of 364000, 4305000; this tile encompasses the NEON tower at the SERC site.

veg_tower_tile = veg_map[(veg_map['adjEasting'].between(364000, 365000)) & (veg_map['adjNorthing'].between(4305000, 4306000))]

How many records do we have within this tile?

len(veg_tower_tile)

There are 211 unique vegetation records in this area. We can also look at the unique taxonIDs that are represented.

# look at the unique Taxon IDs
veg_tower_tile.taxonID.unique()

Let’s keep only a subset of the columns that we are interested in, and look at the dataframe:

veg_tower_tile_short = veg_tower_tile[['date','individualID','scientificName','taxonID','family','adjEasting','adjNorthing']]
veg_tower_tile_short.reset_index(drop=True, inplace=True)
veg_tower_tile_short

To get a better sense of the data, we can also look at the number of records per species, to see whether some species are better represented than others.

# display the taxonID counts, sorted descending
veg_tower_tile_taxon_counts = veg_tower_tile[['individualID','taxonID']].groupby(['taxonID']).count()
veg_tower_tile_taxon_counts.sort_values(by='individualID',ascending=False)
# display the family counts, sorted descending
veg_tower_tile_family_counts = veg_tower_tile[['individualID','family']].groupby(['family']).count()
veg_tower_tile_family_counts.sort_values(by='individualID',ascending=False)

It looks like there are a number of different species (and families) mapped in this area. You can use the https://plants.usda.gov website to look up the species information. The top 5 most abundant mapped species are listed below.

  • FAGR: American Beech (Fagus grandifolia Ehrh.)
  • LITU: Tuliptree (Liriodendron tulipifera L.)
  • LIST2: Sweetgum (Liquidambar styraciflua L.)
  • ACRU: Red Maple (Acer rubrum L.)
  • CAGL8: Pignut hickory (Carya glabra (Mill.) Sweet)

When carrying out classification, the species that only have small representation (1-5 samples) may not be modeled accurately due to a lack of sufficient training data. The challenge of mapping rarer species due to insufficient training data is well known. In the next tutorial, we will remove these poorly represented samples before generating a model.
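The removal step can be sketched as a simple count-based filter; the drop_rare_classes helper and the min_samples=5 threshold below are illustrative choices for this sketch, not the next tutorial's exact code:

```python
import pandas as pd

def drop_rare_classes(df, label_col="taxonID", min_samples=5):
    """Keep only rows whose label appears at least min_samples times."""
    counts = df[label_col].value_counts()
    keep = counts[counts >= min_samples].index
    return df[df[label_col].isin(keep)].reset_index(drop=True)

# Tiny illustrative example: "QUAL" appears only once, so it is dropped
demo = pd.DataFrame({"taxonID": ["FAGR"] * 6 + ["LITU"] * 5 + ["QUAL"]})
filtered = drop_rare_classes(demo)
```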

4. Write training dataframe to csv file

Nonetheless, we have a fairly decent training dataset to work with. We can save the dataframe to a csv file called serc_training_data.csv as follows:

import os
os.makedirs('./data', exist_ok=True)  # make sure the output directory exists
veg_tower_tile_short.to_csv(r'./data/serc_training_data.csv', index=False)

5. Plot tree families in map view

Finally, we can make a quick plot using seaborn (imported as sns) to show the spatial distribution of the trees surveyed in this area, colored by family. Most of this code improves the formatting and appearance of the figure; the first sns.scatterplot call is all you really need to plot the essentials.

ax = sns.scatterplot(
    data=veg_tower_tile_short,
    x='adjEasting',
    y='adjNorthing',
    hue='family',
)

# Make the x and y scales equal
ax.set_aspect('equal', adjustable='box')

# Remove scientific notation on the x and y axes labels
ax.xaxis.set_major_formatter(mticker.FuncFormatter(lambda x, _: f'{x:.0f}'))
ax.yaxis.set_major_formatter(mticker.FuncFormatter(lambda y, _: f'{y:.0f}'))

# Place the legend outside the plot at the center right
plt.legend(loc='center left', bbox_to_anchor=(1.05, 0.5))

# Adjust layout to prevent legend overlap
plt.tight_layout()

# Add title and axis labels
ax.set_title("SERC Tree Families", fontsize=14)
ax.set_xlabel("Easting (m)", fontsize=12)
ax.set_ylabel("Northing (m)", fontsize=12)
plt.yticks(fontsize=8)  
plt.xticks(fontsize=8)  

plt.show()

Great! We can see all the trees that were surveyed in this AOP tile. The trees are sampled in discrete plots. For more information about the TOS sampling design, please refer to the Vegetation structure data product page.

Recap

In this lesson, we created a training data set containing tree family and species information as well as geographic locations in UTM x, y coordinates. We can now pair this training data set with the remote sensing data and create a model that predicts a tree’s family based on the airborne spectral data. The next tutorial, Tree Classification with NEON Airborne Imaging Spectrometer Data using Python xarray, will show how to do this!

Note: you may wish to create a training dataframe that contains additional information about the trees. For example, you could also include parameters like the growth form (e.g. whether the vegetation is a shrub, single-bole or multi-bole tree, etc.), the plant status (whether the tree is healthy or standing dead), and measurements such as the stem diameter and tree height. To do this, you would need to join the vst_mappingandtagging table with the vst_apparentindividual table. Refer to the Quick Start Guide for Vegetation Structure for more information about the data tables and joining instructions. You can also refer to the lesson Compare tree height measured from the ground to a Lidar-based Canopy Height Model, which provides an example of how to do this and compare the TOS measured data with the AOP Lidar-derived Canopy Height Model (Ecosystem Structure) data product.