ENVI Deep Learning Background

Deep learning is a more sophisticated form of machine learning that enables a system to automatically discover representations in data. What differentiates deep learning from machine learning is its ability to continually improve a prediction on its own without external guidance or intervention. Deep learning algorithms learn patterns by progressing through a series of layers in a neural network in order to draw conclusions, similar to how the brain processes information.

For remote sensing, deep learning attempts to discover spatial and spectral representations in imagery. It is often used to find features such as vehicles, utility structures, roads, and others. Deep learning models are trained to look for specific features using a set of labeled pixel data as examples. With ENVI Deep Learning, you can experiment with different parameters to achieve the best possible solution when training models.

This topic describes the overall process of using ENVI Deep Learning to extract features in images.

The process of extracting features from images involves several steps, as shown in the diagram below. The steps are not necessarily performed in a linear order, and more than one iteration may be required to yield the best results.

TensorFlow models are at the core of the overall process. TensorFlow™ is an open-source library that ENVI uses to perform deep-learning tasks. A TensorFlow model is defined by an underlying set of neural network parameters. The model must be trained to look for specific features using a set of input label rasters that indicate known samples of the feature. These can come from regions of interest (ROIs) or classification images. After a TensorFlow model learns what a specific feature looks like from the label rasters, it can look for similar features in other images by classifying them with the trained model. A classification image is produced along with a class activation raster, which is a greyscale image that shows the probability of pixels belonging to the feature of interest.

The initial class activation raster may not be entirely accurate, depending on the quality of the input training samples. As an optional step, you can refine the label rasters by creating ROIs from the highest pixel values, then editing the ROIs to remove false positives. The refined ROIs can be combined with the original ROIs to either train a new model or refine the trained model.

Training a deep learning model involves several stochastic processes, so training with the same parameters will yield different models. This results from the way the algorithm converges toward an answer, and it is a fundamental part of the training process.

The following sections provide more detail on the common steps used in ENVI Deep Learning.

Build Label Rasters

Training a model to identify features requires one or more images containing labeled pixel data. These are called label rasters. To create a label raster, you need an input image from which to collect examples of the feature you are interested in. If you use more than one input image, the images can be different sizes. Features can be two-dimensional, such as outlines of buildings and parking lots; one-dimensional, such as roads, paths, and railroad tracks; or points, such as the locations of trees or stop signs.

You can label features using regions of interest (ROIs) or classification images. These options are described next.

Regions of Interest

To label features, open an image in ENVI and use the Region of Interest (ROI) Tool to draw ROIs on some of the features. The ROIs can be points, pixels, polygons, polylines, or thresholds. Then save the ROIs to an XML file. The Region of Interest (ROI) Tool is described in ENVI Help. The following figure shows an example of polygon ROIs drawn over long, blue shipping containers in an image of a seaport:

If you have a large image, it may not be feasible to collect training samples from the entire image. Instead, you can spatially subset the image, then draw ROIs on the features in the subset. You can also draw ROIs on features in multiple images, which can be different sizes. For example, you may have dozens or hundreds of UAV images that each cover a small area. Defining ROIs from multiple images can provide more general results in a deep learning model, compared to using a single image. In this case, you would create a separate ROI file for each image.

If the feature you want to extract is fairly sparse in an image, draw ROIs on as many instances of the feature as you can. The more ROIs you draw, the more accurate the classification will be, although ENVI Deep Learning can still produce good results with a limited number of training ROIs. Also try to draw ROIs over a variety of shapes, colors, and textures of the feature you are interested in. This will contribute to better accuracy in the final classification.

Finally, use the Build Label Raster From ROI tool to create a label raster. If you are working with multiple images, run this tool on each image.

Classification Images

Another option is to use a classification image to train the model instead of ROIs. For example, you may have previously created a binary classification image that provides accurate locations of the feature of interest. A binary classification image contains only one class, where pixel values of 1 represent the feature of interest and values of 0 represent the background. You can create the classification image with any of ENVI's classification tools that produce single-class output. Then use the Build Label Raster From Classification tool to create the label raster. Consider this option only if the input classification raster provides an accurate representation of the feature you are interested in.

Both Build Label Raster tools create label rasters that can be passed to a deep learning model. Each label raster contains the original bands from the input image, plus an additional mask band that consists of a binary mask. The mask band indicates which pixels are in the ROI (or the class), which identifies the feature of interest. Pixel values of 1 in the mask band indicate the features, while values of 0 indicate the background pixels.
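
The layout of a label raster can be sketched in a few lines of Python. This is an illustration only, not ENVI's implementation or API: the function name build_label_raster, the list-of-rows band layout, and the feature_pixels set are all hypothetical.

```python
def build_label_raster(bands, feature_pixels):
    """Return the input bands plus one binary mask band.

    bands: list of 2-D bands, each a list of rows (hypothetical layout).
    feature_pixels: set of (row, col) positions inside the ROI or class.
    """
    rows, cols = len(bands[0]), len(bands[0][0])
    # 1 marks the feature of interest, 0 marks background.
    mask = [[1 if (r, c) in feature_pixels else 0 for c in range(cols)]
            for r in range(rows)]
    return list(bands) + [mask]


# A 2 x 3 image with three bands and two labeled feature pixels.
image = [[[10, 20, 30], [40, 50, 60]] for _ in range(3)]
label_raster = build_label_raster(image, {(0, 1), (1, 2)})
print(len(label_raster))   # number of bands, including the mask band
print(label_raster[-1])    # the mask band
```

The original bands are left unchanged; only the extra mask band carries the labels.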

Initialize a TensorFlow Model

Before training can begin, you must set up, or initialize, a TensorFlow model. This defines the structure of the model, including the architecture, patch size, and number of bands that will be used for training.

An architecture is a set of parameters that defines the underlying convolutional neural network. The architecture that ENVI Deep Learning uses (called ENVINet5) is based on the U-Net architecture developed by Ronneberger, Fischer, and Brox (2015). Like U-Net, ENVINet5 is a mask-based, encoder-decoder architecture that classifies every pixel in the image.

A patch is a small image given to the model for training. You can specify the patch size and number of bands for the model in the Initialize ENVINet5 TensorFlow Model dialog.

Train a TensorFlow Model

Once a new model has been initialized, it must be trained on one or more label rasters. Training involves repeatedly exposing the label rasters to a model. Over time, the model will learn to translate the spectral and spatial information in the label rasters into a class activation raster that highlights the features it was shown during training. In the first training pass, the model attempts an initial guess and generates a random class activation raster, which is compared to the mask band of the label raster. Through a goodness of fit function, also called the loss function, the model learns where its random guess was wrong. The internal parameters, or weights, of the model are adjusted to make it more correct, and the label rasters are exposed to the model again.

In practice, however, data is not passed through training all at once. Instead, square patches of a given size are extracted from the label rasters and passed to training a few at a time.

The following diagram of the ENVINet5 architecture shows how a model processes a single patch. The architecture has 5 "levels" and 27 convolutional layers. Each level represents a different pixel resolution in the model. This example uses a patch size of 572 x 572 and three bands. The output is a class activation raster, which is converted into a mask and compared to the mask band of the label raster. The final class activation raster has a 93-pixel border of transparent pixels around all sides.


The contextual field of view of an architecture indicates how much of the surrounding area contributes to each pixel during training. For the ENVINet5 architecture, the contextual field of view is 140 x 140 pixels. A patch size larger than the contextual field of view allows more training to occur at once and classifies faster. For the model to learn shapes larger than 140 x 140 pixels, the training rasters must be downsampled.
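
As a rough guide to how much downsampling a large feature needs, the following sketch computes the smallest integer factor that brings a feature within the 140-pixel contextual field of view. The helper name downsample_factor and the choice of integer factors are assumptions made for illustration; ENVI does not necessarily compute the factor this way.

```python
import math

CONTEXT_FOV = 140  # contextual field of view of ENVINet5, in pixels


def downsample_factor(feature_size_px, fov=CONTEXT_FOV):
    """Smallest integer factor that brings a feature within the field of view."""
    return max(1, math.ceil(feature_size_px / fov))


for size in (100, 140, 400):
    factor = downsample_factor(size)
    # Print the original size, the factor, and the size after downsampling.
    print(size, factor, size / factor)
```

For a feature that spans 400 pixels, a factor of 3 reduces it to roughly 133 pixels, which fits inside the field of view.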

You can indicate how much training to perform by specifying the number of epochs, the number of patches per epoch, and the number of patches per batch. These are described next.

Epochs and Batches

In traditional deep learning, an epoch refers to passing an entire dataset through training once. In ENVI Deep Learning, however, patches are extracted from label rasters in an intelligent manner so that, at the beginning of training, areas with a high density of feature pixels are seen more often than areas with a low density. At the end of training, all areas are seen more equally. Because of this biased determination of how the patches are extracted, an epoch in ENVI Deep Learning instead refers to how many patches are trained before the bias is adjusted.

Multiple epochs are needed to adequately train a model. The number of epochs and number of patches per epoch depend on the diversity of the set of features being learned; there is no correct number. In general, there should be enough epochs for adjustment of weighting to occur smoothly; suggested values are 16 to 32. Once you specify the number of epochs, the number of patches per epoch determines how much training occurs. This number should be lower for small datasets and higher for larger datasets. Values are typically between 200 and 1000.

The training process does not usually train a single patch at a time. Instead, multiple patches are usually trained together in one iteration. A batch refers to the set of training patches used in one iteration of training. Batches are run in an epoch until the number of specified patches per epoch is met or exceeded. Typically, you should specify as many patches per batch as will fit into graphics processing unit (GPU) memory. For example, with patches of size 572 x 572 and three bands, and 8 GB of GPU memory, that value is about 4.

In the Train TensorFlow Mask Model tool, you can set the Number of Epochs to run as well as the Number of Patches per Epoch and Number of Patches per Batch.
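
The relationship between these three settings can be checked with simple arithmetic. The sketch below assumes only what is stated above, namely that batches run until the number of patches per epoch is met or exceeded; the function name training_plan and the example values are hypothetical.

```python
import math


def training_plan(epochs, patches_per_epoch, patches_per_batch):
    # Batches run until the requested patches per epoch is met or exceeded.
    batches_per_epoch = math.ceil(patches_per_epoch / patches_per_batch)
    patches_seen_per_epoch = batches_per_epoch * patches_per_batch
    return {
        "batches_per_epoch": batches_per_epoch,
        "patches_seen_per_epoch": patches_seen_per_epoch,
        "total_patches": epochs * patches_seen_per_epoch,
    }


plan = training_plan(epochs=25, patches_per_epoch=302, patches_per_batch=4)
print(plan)
```

With 302 patches per epoch and 4 patches per batch, 76 batches are needed, so each epoch actually trains 304 patches.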

Training Parameters

ENVI uses a proprietary technique for training deep learning models that is based on a biased selection of patches. Normally, training patches are chosen with equal probability when training a TensorFlow model. If the pixels representing the feature of interest are sparse in an image, selecting patches equally throughout the image can cause the model to learn to produce a mask that consists entirely of background pixels.

To avoid this, ENVI introduces a bias so that the model will see patches with a higher density of feature pixels more often. The approach is based on a statistical technique called inverse transform sampling, where the examples shown to the model are in proportion to their contribution to a probability density function. This bias is controlled using the Class Weight parameter in the Train TensorFlow Mask Model tool. You can set minimum and maximum values for Class Weight. The maximum value is used to bias patch selection when training begins. This value decreases to the minimum value by the time training ends.

In most cases, the minimum value should be 0 so that the model finishes the last epoch while seeing the actual ratio of feature to background pixels. To help determine a suitable value for the maximum, keep in mind that machine- and deep-learning applications often yield better results when the ratio of positive to negative examples is approximately 1:100.
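
ENVI's patch selection technique is proprietary, so the following is only a generic sketch of the two ideas described above: a Class Weight that decreases from its maximum to its minimum over training, and inverse transform sampling over per-patch weights. The linear decrease, the weighting formula (1 + class weight x feature fraction), and all function names are assumptions made for illustration.

```python
import bisect
import itertools
import random


def class_weight_for_epoch(epoch, n_epochs, w_min, w_max):
    """Class Weight for a zero-based epoch; a linear decrease is assumed."""
    t = epoch / max(n_epochs - 1, 1)
    return w_max + (w_min - w_max) * t


def sample_patch(feature_fractions, class_weight, rng):
    """Pick a patch index by inverse transform sampling.

    feature_fractions: fraction of feature pixels in each candidate patch.
    The weight formula below is hypothetical, not ENVI's.
    """
    weights = [1.0 + class_weight * f for f in feature_fractions]
    cdf = list(itertools.accumulate(weights))      # cumulative weights
    u = rng.random() * cdf[-1]                     # uniform draw in [0, total)
    return min(bisect.bisect_right(cdf, u), len(cdf) - 1)


rng = random.Random(0)
fractions = [0.0, 0.5]   # one background-only patch, one feature-rich patch
for epoch in (0, 4):
    w = class_weight_for_epoch(epoch, n_epochs=5, w_min=0.0, w_max=2.0)
    picks = [sample_patch(fractions, w, rng) for _ in range(10000)]
    print(epoch, w, sum(picks) / len(picks))  # share of feature-rich picks
```

With a Class Weight of 0, every patch has the same weight and selection is uniform, which matches the recommendation to finish training on the actual ratio of feature to background pixels.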

An additional Loss Weight parameter can be used to bias the loss function to place more emphasis on correctly identifying feature pixels than identifying background pixels. This is useful when features are sparse or if not all of the features are labeled. A value of 0 means the model should treat feature and background pixels equally. Increasing the Loss Weight biases the loss function toward finding feature pixels. The useful range of values is between 0 and 3.0.
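
The effect of biasing a loss function toward feature pixels can be illustrated with a weighted binary cross-entropy. This is a generic sketch, not ENVI's loss function: the weighting scheme (feature pixels count 1 + Loss Weight times, background pixels count once) is an assumption chosen so that a value of 0 treats both kinds of pixel equally.

```python
import math


def weighted_bce(probs, labels, loss_weight=0.0, eps=1e-7):
    """Weighted binary cross-entropy over flat lists of pixels.

    Assumed weighting: feature pixels get weight 1 + loss_weight,
    background pixels get weight 1.
    """
    total, weight_sum = 0.0, 0.0
    for p, y in zip(probs, labels):
        p = min(max(p, eps), 1.0 - eps)            # avoid log(0)
        w = 1.0 + loss_weight if y == 1 else 1.0
        total += -w * (y * math.log(p) + (1 - y) * math.log(1.0 - p))
        weight_sum += w
    return total / weight_sum


# The model misses a feature pixel (0.1) but gets the background right (0.1).
probs, labels = [0.1, 0.1], [1, 0]
print(weighted_bce(probs, labels, loss_weight=0.0))
print(weighted_bce(probs, labels, loss_weight=3.0))
```

The second value is larger because the missed feature pixel dominates the loss when the weight is raised.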

You can also set the Patch Sampling Rate parameter to indicate the density of sampling that should occur. This is the average number of patches that each pixel will belong to in the training and validation rasters. Increasing this value can be helpful when features are sparse, as there is a greater likelihood that enough patches are chosen that include the features. For smaller patch sizes, increasing the sampling value could make the model more general by oversampling feature pixels in slightly different positions. The only reason to decrease the value would be when features are dense and you do not need to cover every pixel multiple times with training patches. This can help to speed up the training time.
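
Because the sampling rate is defined as the average number of patches that each pixel belongs to, it implies an approximate patch count for a raster of a given size. The sketch below works through that arithmetic while ignoring edge effects; the function name and example dimensions are hypothetical, and ENVI may compute the count differently.

```python
import math


def estimated_patch_count(rows, cols, patch_size, sampling_rate):
    # N patches of patch_size**2 pixels cover each pixel
    # N * patch_size**2 / (rows * cols) times on average, so solve for N.
    return math.ceil(sampling_rate * rows * cols / patch_size ** 2)


for rate in (1, 4, 16):
    print(rate, estimated_patch_count(5720, 5720, 572, rate))
```

Doubling the sampling rate doubles the number of patches drawn from the same raster, which is why lower values shorten training when features are dense.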

In addition to weighting feature and background pixels, the training process must also consider the sizes and edges of features. This is described next.

Solid Distance and Blur Distance

When labeling features for training, it can be tedious to draw polygons around the features of interest. If you are more concerned with detection and counting, as opposed to accurately capturing the shapes of features (or masking them), you can label the features with polylines or points. A Solid Distance parameter is provided to expand the size of linear and point features so that they fully represent their associated real-world objects. Take painted road centerlines as an example. When using the Region of Interest (ROI) Tool to collect samples of road centerlines, you would use polyline ROIs. Polylines have a width of one pixel, but in reality, the associated road centerlines have a finite width (approximately 10 inches) that the TensorFlow model needs to learn.

The Solid Distance value is the number of pixels surrounding the labels, in all directions, that are also part of the target feature. You can use Solid Distance to expand the size of polygon features, but its use is limited. It is most commonly used with point and polyline features. Defining a Solid Distance value tends to work well for linear features with a fairly consistent width (like roads, road centerlines, and shipping containers) or compact features with a fairly consistent size (cars and stop signs). For example, if you add a point label to the center of a car that is about 23 pixels wide, a 10-pixel radius around the label would encompass most of the car. So the Solid Distance value would be 10.
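
The car example can be reproduced with a small sketch that expands a single point label by a Solid Distance of 10 pixels. Using a Euclidean radius is an assumption about what "in all directions" means, and the function name and list-of-rows mask layout are hypothetical; this is not ENVI's implementation.

```python
def apply_solid_distance(mask, solid_distance):
    """Mark every pixel within solid_distance of a labeled pixel."""
    rows, cols = len(mask), len(mask[0])
    labeled = [(r, c) for r in range(rows) for c in range(cols) if mask[r][c]]
    limit = solid_distance ** 2
    return [[1 if any((r - lr) ** 2 + (c - lc) ** 2 <= limit
                      for lr, lc in labeled) else 0
             for c in range(cols)]
            for r in range(rows)]


# A 25 x 25 mask with one point label at the center of a car.
mask = [[0] * 25 for _ in range(25)]
mask[12][12] = 1
expanded = apply_solid_distance(mask, solid_distance=10)
print(sum(expanded[12]))  # width of the expanded label along the center row
```

The expanded label is 21 pixels wide along the center row, which covers most of a car that is about 23 pixels wide.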

Deep learning algorithms can have difficulty learning the sharp edges of masks in features such as buildings. Blurring the edges and decreasing the blur during training can help the model gradually focus on the feature. To control this, set the minimum and maximum Blur Distance. At the beginning of training, features are expanded with a decaying gradient from the edge of a feature (including the Solid Distance, if defined) to the maximum Blur Distance. As training progresses, the distance is gradually reduced to the minimum value.

In general, set the maximum Blur Distance so that it blurs adequately within the contextual field of view of the model. In the ENVINet5 architecture, this is 140 x 140 pixels, so a reasonable maximum Blur Distance ranges from a few pixels to no more than 70 pixels. Set the minimum Blur Distance anywhere from 0 for well-defined borders to a few pixels when feature boundaries are indistinct.
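
The idea of a blur that shrinks during training can be sketched as follows. Both the linear reduction of the Blur Distance across epochs and the linear falloff of label values beyond the feature edge are assumptions made for illustration; the source states only that the gradient decays and that the distance is gradually reduced. The function names are hypothetical.

```python
def blur_distance_for_epoch(epoch, n_epochs, blur_min, blur_max):
    """Blur Distance for a zero-based epoch; a linear reduction is assumed."""
    t = epoch / max(n_epochs - 1, 1)
    return blur_max + (blur_min - blur_max) * t


def blurred_label_value(distance_from_edge, blur_distance):
    """Label value for a pixel at a given distance outside the feature edge.

    Pixels on or inside the feature (distance <= 0) keep a value of 1.
    A linear falloff to 0 at blur_distance is assumed.
    """
    if distance_from_edge <= 0:
        return 1.0
    if blur_distance <= 0 or distance_from_edge >= blur_distance:
        return 0.0
    return 1.0 - distance_from_edge / blur_distance


for epoch in (0, 2, 4):
    blur = blur_distance_for_epoch(epoch, n_epochs=5, blur_min=0, blur_max=8)
    # Value of a pixel 2 pixels outside the feature edge at this epoch.
    print(epoch, blur, blurred_label_value(2, blur))
```

By the final epoch the Blur Distance reaches its minimum of 0, so only pixels on the feature itself (including any Solid Distance) remain labeled.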

All of these parameters are used to guide the TensorFlow model to learn the features of interest. Once the TensorFlow model has been trained, it can be used to find the same features in other images.

Perform Classification

The next step is to use the trained model to locate the same features in other images. For example, you may want to find features in a much larger version of the image than the model was trained on, or even different images with similar spatial and spectral properties. Use the TensorFlow Mask Classification tool to classify an image using a trained model. This tool produces a class activation raster, which is a greyscale image whose pixels roughly represent the probability of belonging to the given feature. Bright pixels indicate high matches to the feature. The following figure shows a true-color image of a shipping yard on the left and the resulting class activation raster of blue shipping containers on the right.

The class activation raster has a 93-pixel border on all sides, where the pixels are set to values of NoData.
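
The size of the area that contains valid class activation values follows directly from the 93-pixel border. The sketch below is plain arithmetic based on that figure; the function name valid_extent is hypothetical.

```python
BORDER = 93  # NoData border on each side of the class activation raster


def valid_extent(rows, cols, border=BORDER):
    """Rows and columns that remain after removing the border on all sides."""
    valid_rows, valid_cols = rows - 2 * border, cols - 2 * border
    if valid_rows <= 0 or valid_cols <= 0:
        raise ValueError("raster is too small to contain any valid pixels")
    return valid_rows, valid_cols


print(valid_extent(572, 572))
print(valid_extent(2000, 3000))
```

For a 572 x 572 raster, the valid interior is 386 x 386 pixels.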

A raster color slice helps to visualize the pixels with the highest probability of matching the feature of interest. The TensorFlow Mask Classification topic describes how to do this. The following image shows an example, where the red pixels indicate higher matches to blue-colored containers:

If you are satisfied with the results, you can optionally use the Class Activation to Pixel ROI or Class Activation to Polygon ROI tool to create a pixel or polygon ROI file of the highest pixel values in the class activation raster. Alternatively, use the Class Activation to Classification tool to create a classification image of the highest pixel values in the class activation raster.
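
Conceptually, selecting the highest pixel values amounts to thresholding the class activation raster. The following sketch shows that idea only; it does not reproduce the ENVI tools, and the function name, the threshold value of 0.8, and the use of None for NoData pixels are all hypothetical.

```python
def activation_to_mask(activation, threshold):
    """Binary mask of pixels whose activation meets the threshold.

    None stands in for NoData pixels, which are never selected.
    """
    return [[1 if v is not None and v >= threshold else 0 for v in row]
            for row in activation]


activation = [
    [None, 0.10, 0.95],
    [0.85, 0.40, None],
]
print(activation_to_mask(activation, threshold=0.8))
```

Raising the threshold keeps fewer, more confident pixels, while lowering it admits more candidates and more potential false positives.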

Optional: Refine the Results

Creating a class activation raster is the final step in the process of using ENVI Deep Learning to locate features in imagery. However, it will likely contain false positives. In the above example, the class activation raster identified several blue trucks and other objects that were not shipping containers. In most cases, you will want to refine the results and use them to improve a trained model, which can result in a more accurate classification. To do this, use the ENVI Region of Interest (ROI) Tool to edit the ROIs that were created by the Class Activation to Pixel ROI or Class Activation to Polygon ROI tool. You can use the ROI Tool to delete records of false positives, or to draw new ROIs on more example features. Save the edited ROIs to a new file on disk. The refined ROIs can be combined with the original ROIs to either train a new model or refine the trained model.


Ronneberger, O., P. Fischer, and T. Brox. "U-Net: Convolutional Neural Networks for Biomedical Image Segmentation." In: Navab N., Hornegger J., Wells W., Frangi A. (eds) Medical Image Computing and Computer-Assisted Intervention – MICCAI 2015. Lecture Notes in Computer Science 9351. Springer, Cham.

© 2019 Harris Geospatial Solutions, Inc.