```python
import matplotlib.pyplot as plt
import matplotlib.ticker as mticker
import neonutilities as nu
import numpy as np
import pandas as pd
import requests
import seaborn as sns
```
Make Training Data for Species Modeling from NEON TOS Vegetation Structure Data
This notebook demonstrates how to generate a training dataset consisting of tree species, family, and location from the NEON Terrestrial Observation System (TOS) Vegetation Structure data product DP1.10098.001. We will use data from the Smithsonian Environmental Research Center (SERC) site in Maryland. In a subsequent tutorial titled Tree Classification with NEON Airborne Imaging Spectrometer Data using Python xarray, we will use this training dataset to train a random forest machine learning model that predicts tree families from the hyperspectral signatures obtained from the airborne remote sensing data. These two tutorials outline a relatively simple modeling example, and represent a starting point for conducting machine learning analyses using NEON data!
Set Up Python Environment
To run this notebook, you will need the following Python packages, which can be installed using !pip install or !conda install from within the notebook (an example install command is shown after the list below). Note that to use the neonutilities package, you will need Python version 3.9 or higher.
- matplotlib
- neonutilities
- numpy
- pandas
- requests
- seaborn
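If any of these packages are missing from your environment, you can install them from a notebook cell. This is only a sketch of the pip option mentioned above; adjust it if you prefer conda.

```python
# run once in a notebook cell; neonutilities requires Python 3.9 or higher
!pip install matplotlib neonutilities numpy pandas requests seaborn
```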
Create a NEON API Token
- NEON API Token (optional, but strongly recommended): see the NEON API Tokens Tutorial for more details on how to create and set up your token in Python (and R). Once you create your token (on the NEON User Accounts page), this notebook will show you how to set it as an environment variable and use it for downloading NEON data; a minimal example is sketched below.
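One way to keep the token out of your code is to store it in an environment variable and read it back when needed. The sketch below assumes a variable name of NEON_TOKEN; that name is arbitrary and is not required by the NEON API or by neonutilities.

```python
import os

# Example only: the environment variable name "NEON_TOKEN" is arbitrary.
# Set it once (e.g., in your shell profile or here), then read it back
# whenever a download function asks for a token.
os.environ["NEON_TOKEN"] = "YOUR-TOKEN-STRING-HERE"  # replace with your token
my_token = os.environ.get("NEON_TOKEN")
```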
Learning Objectives
- Use the neonutilities load_by_product function to read in NEON vegetation structure data at a given site
- Use the NEON Locations API to determine the geographic position of the vegetation records in UTM x, y coordinates
- Filter the dataset to include only the latest data and columns of interest
- Filter the data geospatially to keep data that are within a single AOP 1 km x 1 km tile
Disclaimer: this notebook is intended to provide an example of how to create an initial training data set for pairing with remote sensing data, and to conduct some exploratory analysis of the vegetation structure data. It does not incorporate outlier detection and removal, or comprehensive pre-processing steps. As part of creating a machine learning model, it is important to assess the training data quality and look for outliers or other potential data quality issues which may impact model results. Refer to the Compare tree height measured from the ground to a Lidar-based Canopy Height Model lesson (the first resource in the Additional Resources section below) for more details on how you would address geographic mismatch between the AOP and TOS data.
Additional Resources
The lesson Compare tree height measured from the ground to a Lidar-based Canopy Height Model is another example of linking ground to airborne data, and shows similar steps of pre-processing TOS woody vegetation data.
The paper Individual canopy tree species maps for the National Ecological Observatory Network outlines methods for large-scale classification using NEON data. The associated NEON Science Seminar, Harnessing NEON to enable the future of forest remote sensing, may also be a useful resource; the talk provides a high-level overview of modeling approaches for tree crown delineation and tree classification using NEON airborne remote sensing data.
- Refer to the Vegetation Structure User Guide for more details on this data product, and to better understand the data quality flags and the sampling design.
1. Download and Explore Vegetation Structure Data (DP1.10098.001)
In this first section we’ll load the vegetation structure data, find the locations of the mapped trees, and join those locations to the species and family observations.
Let’s get started! First, import the required Python packages (the import statements are shown in the code cell at the top of this notebook).
Set up your NEON token. See the setup instructions at the beginning of the tutorial on how to set up a NEON user account and create a token, if you have not already done so.
```python
# copy and paste your NEON token from your NEON user account page here
my_token = ""
```
We can load the vegetation structure data using the load_by_product function in the neonutilities package (imported as nu). Inputs to the function can be shown by typing help(load_by_product). Refer to the R neonUtilities cheat sheet or the Python neonutilities documentation for more details. The vegetation structure mapping data record each stem's position as a distance and azimuth from a named reference point, so to get tree locations in UTM coordinates we also need the positions of those reference points, which we can look up with the NEON Locations API.
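A minimal sketch of the download and of building the list of reference-point names is shown below. It assumes the load_by_product parameter names shown (check help(load_by_product)), assumes we keep only the most recent record per individual, and assumes the Locations API point names are formed as namedLocation + "." + pointID; adjust as needed for your own analysis.

```python
# download the vegetation structure tables for SERC
# (parameter names follow the neonutilities docs; see help(load_by_product))
veg = nu.load_by_product(dpid="DP1.10098.001",
                         site="SERC",
                         package="basic",
                         check_size=False,
                         token=my_token)

# the mapping and tagging table holds the stem positions and taxonomy
veg_map = veg["vst_mappingandtagging"]

# keep only stems that were mapped to a reference point, and keep only the
# most recent record for each individual (assumed de-duplication approach)
veg_map = veg_map[veg_map["pointID"].notna()]
veg_map = (veg_map.sort_values("date")
                  .drop_duplicates(subset="individualID", keep="last")
                  .copy())

# build the reference-point location names to query from the Locations API
# (assumption: the point name is namedLocation + "." + pointID)
veg_map["points"] = veg_map["namedLocation"] + "." + veg_map["pointID"]
veg_points = veg_map["points"].unique()
```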
```python
# query the NEON Locations API for each reference point to get its UTM
# coordinates and associated uncertainties
easting = []
northing = []
coord_uncertainty = []
elev_uncertainty = []
for i in veg_points:
    vres = requests.get("https://data.neonscience.org/api/v0/locations/" + i)
    vres_json = vres.json()
    easting.append(vres_json["data"]["locationUtmEasting"])
    northing.append(vres_json["data"]["locationUtmNorthing"])
    props = pd.DataFrame.from_dict(vres_json["data"]["locationProperties"])
    cu = props.loc[props["locationPropertyName"] == "Value for Coordinate uncertainty"]["locationPropertyValue"]
    coord_uncertainty.append(cu[cu.index[0]])
    eu = props.loc[props["locationPropertyName"] == "Value for Elevation uncertainty"]["locationPropertyValue"]
    elev_uncertainty.append(eu[eu.index[0]])

# assemble the reference-point coordinates into a dataframe indexed by point name
pt_dict = dict(points=veg_points,
               easting=easting,
               northing=northing,
               coordinateUncertainty=coord_uncertainty,
               elevationUncertainty=elev_uncertainty)

pt_df = pd.DataFrame.from_dict(pt_dict)
pt_df.set_index("points", inplace=True)

# join the reference-point coordinates onto the vegetation mapping records
veg_map = veg_map.join(pt_df,
                       on="points",
                       how="inner")
```
Next, use the stemDistance and stemAzimuth data to calculate the precise locations of individuals, relative to the reference locations.
- \(Easting = easting_{pointID} + stemDistance \cdot \sin(\theta)\)
- \(Northing = northing_{pointID} + stemDistance \cdot \cos(\theta)\)
- \(\theta = stemAzimuth \cdot \pi / 180\)
Also adjust the coordinate and elevation uncertainties to account for the additional uncertainty introduced by estimating the stem positions from distance and azimuth.
"adjEasting"] = (veg_map["easting"]
veg_map[+ veg_map["stemDistance"]
* np.sin(veg_map["stemAzimuth"]
* np.pi / 180))
"adjNorthing"] = (veg_map["northing"]
veg_map[+ veg_map["stemDistance"]
* np.cos(veg_map["stemAzimuth"]
* np.pi / 180))
"adjCoordinateUncertainty"] = veg_map["coordinateUncertainty"] + 0.6
veg_map[
"adjElevationUncertainty"] = veg_map["elevationUncertainty"] + 1 veg_map[
Look at the columns to see all the information contained in this dataset.
```python
# look at a subset of the columns that may be relevant
veg_map[['date','individualID','scientificName','taxonID','family','plotID','pointID','adjEasting','adjNorthing']].head(5)
```

```python
# number of records in the full veg_map dataframe
len(veg_map)
```
3. Filter to trees within an AOP tile extent
Now create a new dataframe containing only the veg data that are within a single AOP tile (which are 1 km x 1 km in size). For this, you will need to know the bounds (minimum and maximum UTM easting and northing) of the area you are sampling. For this exercise, we will choose the AOP data with SW (lower left) UTM coordinates of 364000, 4305000. This tile encompasses the NEON tower at the SERC site.
```python
veg_tower_tile = veg_map[(veg_map['adjEasting'].between(364000, 365000)) & (veg_map['adjNorthing'].between(4305000, 4306000))]
```
How many records do we have within this tile?
```python
len(veg_tower_tile)
```
There are 211 unique vegetation records in this area. We can also look at the unique taxonIDs that are represented.
```python
# look at the unique Taxon IDs
veg_tower_tile.taxonID.unique()
```
Let’s keep only a subset of the columns that we are interested in, and look at the dataframe:
```python
veg_tower_tile_short = veg_tower_tile[['date','individualID','scientificName','taxonID','family','adjEasting','adjNorthing']]
veg_tower_tile_short.reset_index(drop=True, inplace=True)
veg_tower_tile_short
```
To get a better sense of the data, we can also look at the number of individuals of each species, to see whether some species have more representation than others.
```python
# display the taxonID counts, sorted descending
veg_tower_tile_taxon_counts = veg_tower_tile[['individualID','taxonID']].groupby(['taxonID']).count()
veg_tower_tile_taxon_counts.sort_values(by='individualID', ascending=False)
```
```python
# display the family counts, sorted descending
veg_tower_tile_family_counts = veg_tower_tile[['individualID','family']].groupby(['family']).count()
veg_tower_tile_family_counts.sort_values(by='individualID', ascending=False)
```
It looks like there are a number of different species (and families) mapped in this area. You can use the USDA PLANTS website (https://plants.usda.gov) to look up the species information. The five most abundant mapped species are listed below.
- FAGR: American Beech (Fagus grandifolia Ehrh.)
- LITU: Tuliptree (Liriodendron tulipifera L.)
- LIST2: Sweetgum (Liquidambar styraciflua L.)
- ACRU: Red Maple (Acer rubrum L.)
- CAGL8: Pignut hickory (Carya glabra (Mill.) Sweet)
When carrying out classification, species with only a small number of samples (1-5) may not be modeled accurately due to a lack of sufficient training data; the challenge of mapping rarer species with insufficient training data is well known. In the next tutorial, we will remove these poorly represented samples before generating a model (one way to do this is sketched below).
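For illustration, a minimal sketch of removing poorly represented classes might look like the following. The threshold and the choice to filter on family (rather than taxonID) are placeholder assumptions; the next tutorial covers this step properly.

```python
# Illustrative only: drop families with fewer than min_samples records before
# modeling. The threshold and grouping column are placeholder choices.
min_samples = 5
family_counts = veg_tower_tile_short['family'].value_counts()
well_sampled = family_counts[family_counts >= min_samples].index
veg_training = veg_tower_tile_short[veg_tower_tile_short['family'].isin(well_sampled)]
```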
4. Write training dataframe to csv file
Nonetheless, we have a fairly decent training dataset to work with. We can save the dataframe to a csv file called serc_training_data.csv as follows:
```python
# write the training data to a csv file (the ./data directory must already exist)
veg_tower_tile_short.to_csv(r'./data/serc_training_data.csv', index=False)
```
5. Plot tree families in map view
Finally, we can make a quick plot using seaborn (imported as sns) to show the spatial distribution of the trees surveyed in this area, colored by family. Most of this code helps improve the formatting and appearance of the figure; the first sns.scatterplot chunk is all you really need to plot the essentials.
```python
ax = sns.scatterplot(
    data=veg_tower_tile_short,
    x='adjEasting',
    y='adjNorthing',
    hue='family',
)

# Make the x and y dimensions equal
ax.set_aspect('equal', adjustable='box')

# Remove scientific notation on the x and y axis labels
ax.xaxis.set_major_formatter(mticker.FuncFormatter(lambda x, _: f'{x:.0f}'))
ax.yaxis.set_major_formatter(mticker.FuncFormatter(lambda y, _: f'{y:.0f}'))

# Place the legend outside the plot at the center right
plt.legend(loc='center left', bbox_to_anchor=(1.05, 0.5))

# Adjust layout to prevent legend overlap
plt.tight_layout()

# Add title and axis labels
ax.set_title("SERC Tree Families", fontsize=14)
ax.set_xlabel("Easting (m)", fontsize=12)
ax.set_ylabel("Northing (m)", fontsize=12)
plt.xticks(fontsize=8)
plt.yticks(fontsize=8)

plt.show()
```
Great! We can see all the trees that were surveyed in this AOP tile. The trees are sampled in discrete plots. For more information about the TOS sampling design, please refer to the Vegetation structure data product page.
Recap
In this lesson, we have created a training data set containing information about the tree family and species as well as their geographic locations in UTM x, y coordinates. We can now pair this training data set with the remote sensing data and create a model to predict each tree's family based on airborne spectral data. The next tutorial, Tree Classification with NEON Airborne Imaging Spectrometer Data using Python xarray, will show how to do this!
Note: you may wish to create a training dataframe that contains additional information about the trees. For example, you can also include parameters like the growth form (e.g. whether the vegetation is a shrub, single-bole or multi-bole tree, etc.), the plant status (whether the tree is healthy or standing dead), and measurements such as the stem diameter and tree height. To do this, you would need to join the vst_mappingandtagging table with the vst_apparentindividual table; a sketch of this join is shown below. Refer to the Quick Start Guide for Vegetation Structure for more information about the data tables and the joining instructions. You can also refer to the lesson Compare tree height measured from the ground to a Lidar-based Canopy Height Model, which provides an example of how to do this and compare the TOS measured data with the AOP Lidar-derived Canopy Height Model (Ecosystem Structure) data product.
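As a rough illustration, the join might look something like the sketch below. The field names (growthForm, plantStatus, stemDiameter, height) are taken from the vegetation structure data product variables, but the join keys and the handling of repeated measurements should be checked against the Quick Start Guide; this assumes the veg dictionary returned by load_by_product earlier in the tutorial.

```python
# Illustrative sketch: add growth form, status, and size measurements from the
# apparent-individual table. Note that an individual can have multiple records
# (one per sampling bout), so you may want to filter to the latest bout first.
veg_ind = veg['vst_apparentindividual']
veg_joined = veg_tower_tile_short.merge(
    veg_ind[['individualID', 'growthForm', 'plantStatus', 'stemDiameter', 'height']],
    on='individualID',
    how='left'
)
```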