Science

How We Built a Machine Learning Model to Locate and Map Cranberry Bogs for National Geographic

Krishna Karra

Nov 28, 2019 · 4 min read

For a fun piece to chew on during the Thanksgiving holiday, National Geographic asked Descartes Labs if we could figure out a way to map every cranberry bog in the United States. We had less than a week to complete the project end-to-end, but we were fortunate to have some pre-existing algorithmic work and data products to consult, which gave us a huge head start.

In the end, the project proved to be a fun and unique problem to solve from a remote sensing perspective. Here, we’ll unpack how we actually did it!

Why was this project hard?

While it sounds simple, it turns out that identifying cranberry bogs from satellite imagery is actually quite tricky. Depending on the time of year, they might look like ponds, bare earth, or lush, green agriculture (like corn)!

cranbog
The life cycle of a cranberry bog is best observed over time. In the four images above, which depict bogs in Massachusetts, we see snow, healthy green vegetation, flooded bogs, and even flashes of red during harvest!

Unless they’re dry harvested, most cranberry bogs in North America are manually flooded before harvest. Unfortunately, the timing of the flooding has enough variability from region to region that we couldn’t get away with simply selecting a single moment in time that would allow us to identify the unique optical imagery signature that indicates a flooded bog.

However, what we could do was take advantage of the manual, pre-harvest floods that inundate these bogs by analyzing their signature in synthetic aperture radar (SAR) imagery, which is particularly sensitive to water content.

bogs
The difference between vegetated and flooded bogs (top to bottom) is evident in optical imagery, shown on the left. To the right, this difference is also evident in synthetic aperture radar imagery, with a sharp decline in backscatter once the bogs are flooded.

After analyzing the flood signature in radar imagery, we leveraged techniques we’d used in a previous Descartes Labs project that involved using radar data to map rice paddies — which are also periodically flooded — in Southeast Asia, and combined those with information available to us through the Descartes Labs platform that provides an annual composite of Sentinel-1 radar data, which — critically — includes annual statistics computed from raw amplitude backscatter data. These changes in backscatter represent multiple states in the lifecycle of a cranberry bog. Without this groundwork in place, we most likely would not have been able to accommodate the very short turn-around time for this project.

How did we do it?

Since temporally aggregated radar statistics are a somewhat abstract way to represent bogs, we applied machine learning to these data sets to find a decision surface that accurately separates the positive (cranberry bog) and negative (not a cranberry bog) class. For the positive class, we were able to locate several active cranberry bogs by scraping data from the GIS departments of Wisconsin, New Jersey and Massachusetts, which, as the top producers of cranberries in the U.S., thankfully make this information freely available.

cranberrybogs
We sampled locations of cranberry bogs obtained from state GIS data from WI, NJ and MA (left to right). The data, while incomplete, is an excellent proxy for locations that are likely to be active cranberry bogs, and accurate enough to build a machine learning classifier.

Picking samples for the negative class ended up being deceptively challenging because sampling from the most ideal regions also risked confusing the classifier. However, sampling a representative set of locations for the negative class is quite important, particularly when dealing with geospatial data, because the machine learning algorithm will otherwise not generalize very effectively. Put another way, teaching the algorithm what isn’t a cranberry bog is just as important as teaching it what is a cranberry bog.

We ended up sampling seven classes from the National Landcover Database (NLCD), ranging from urban areas, to cropland, to wetlands. Finally, we were careful to also sample this data in states with no cranberry bogs to ensure we didn’t incorrectly sample bogs in the negative class.

NCLD classes
We sampled negative examples from seven NLCD classes across multiple states to create a diverse dataset (from left to right: grasslands, wetlands, dense urban areas, medium-dense urban areas, agriculture, evergreen forest, and open water).

Next, we used our Sentinel-1 annual composite to retrieve the temporal statistics (e.g., min, max, mean, standard deviation) for each location. These statistics were formed into a feature vector on a per-sample basis, and fed into a random forest machine learning classifier. Once trained, given an arbitrary input image of SAR statistics, the model computed the probability of each pixel belonging to a cranberry bog. We leveraged the Descartes Labs platform cloud infrastructure to run this model across all cranberry-producing states and Canadian provinces in a matter of hours.

How well did we do?

rawdetector
This raw detector output is a probability image in which each pixel represents the likelihood that it belongs to a cranberry bog. Since the raw output is so noisy, we’ve animated it in a way that shows a progressive reduction of probability ranges starting with 0–100% (very noisy), and ending with 100% (clean, crisp edges). We then fade the 100% probability map into an aerial image for comparison.

The detector performs quite well! After manually inspecting the probability output, we determined that the detector consistently located every cranberry bog that can be identified by the human eye.

The following set of visuals show a few more interesting discoveries that we made while analyzing the model’s output.

bog correlation
The detector output (shown in white) strongly corresponds to actual cranberry bogs. It doesn’t show bogs that are no longer active and/or incorrectly labeled. It also finds bogs that were missing from the training data. Cranberry bog polygons, shown in yellow, were obtained from GIS data from the state of Massachusetts.
ct bog
We ran the detector over Connecticut and found the last active cranberry bog in the state!
wisco bog
Old, overgrown bogs like this one in Wisconsin do not appear to register in the model’s output.
Wisconsin bogs we detected during the fall harvest (image tiles from Sentinel-2 and Landsat 8).

boggA “peak color” composite of cranberry bogs in New Jersey over a few harvest seasons reveals stunning color visible from space.

We’re grateful to all the hardworking cranberry farmers in North America, to National Geographic for challenging us with this project, and to our colleagues at Descartes Labs who built the platform that allowed us to rise to the occasion. Who knew that rapid access to petabytes of analysis-ready data and integration, coupled with a vast array of open-source Python packages, would allow us to visualize a holiday staple in a whole new way?

If you’d like to learn more about the Descartes Labs platform and how it could help your work, please contact us here.

Happy Thanksgiving!