The U-Net: A Complete Guide (2024)

7 min read · Feb 1, 2024


Table of Contents

  • Introduction
  • Contracting Path
  • Expanding Path
  • Up-Convolution and Channels
  • Image Example
  • Summary

Introduction

The U-Net was a groundbreaking development in image segmentation, a field focused on locating objects and boundaries within an image. This novel architecture has proved immensely valuable in the analysis of biomedical images.

The U-Net is a special type of Convolutional Neural Network (CNN), so it is highly recommended that you be familiar with CNNs before delving into this article. If necessary, please learn about CNNs here.

The U-Net is composed of two main components: a contracting path and an expanding path.

  • Contracting path: aims to decrease the spatial dimensions of the image, while also capturing relevant information about the image.
  • Expanding path: aims to upsample the feature map and produce a relevant segmentation map using the patterns learnt in the contracting path.

As you may notice, the U-Net in fact resembles an encoder-decoder architecture, one that traces out a U shape when drawn, hence the name.

Let’s now get into more depth about each component.

Note: While reading further on, you may wonder, “How on earth is it possible to change the number of channels and what on earth is an up-convolution?!” Well, don’t worry, I have covered this at the end. And if you already know this, then feel free to skip that section.

Contracting Path

The contracting path uses a combination of convolution and pooling layers to extract and capture features within an image while, at the same time, reducing its spatial dimensions.

Let’s now explore each of the 5 blocks in the contracting path below.

[Figure: the U-Net architecture, contracting path]

Block 1

  1. An input image of size 572x572 is fed into the U-Net. This input image consists of only 1 channel, likely a grayscale channel.
  2. Two 3x3 convolution layers (unpadded) are then applied to the input image, each followed by a ReLU layer. Since each unpadded 3x3 convolution trims 2 pixels from each spatial dimension, the feature map shrinks from 572² to 570² and then to 568². At the same time, the number of channels is increased to 64 in order to capture higher-level features.
  3. A 2x2 max pooling layer with a stride of 2 is then applied. This downsamples the 568² feature map to half its size, 284², as the sketch below shows.
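To make this concrete, here is a minimal PyTorch sketch of block 1. This is my own illustration based on the description above, not the paper’s reference code:

```python
import torch
import torch.nn as nn

# Block 1 of the contracting path: two unpadded 3x3 convolutions,
# each followed by ReLU, then a 2x2 max pool with stride 2.
block1 = nn.Sequential(
    nn.Conv2d(1, 64, kernel_size=3),   # 1 -> 64 channels, 572 -> 570 per side
    nn.ReLU(inplace=True),
    nn.Conv2d(64, 64, kernel_size=3),  # 570 -> 568 per side
    nn.ReLU(inplace=True),
)
pool = nn.MaxPool2d(kernel_size=2, stride=2)  # 568 -> 284 per side

x = torch.randn(1, 1, 572, 572)  # one single-channel (grayscale) image
features = block1(x)             # kept aside later for the skip connection
print(features.shape)            # torch.Size([1, 64, 568, 568])
print(pool(features).shape)      # torch.Size([1, 64, 284, 284])
```

Note that it is the pre-pooling 568² feature map that the skip connection will later carry across to the expanding path.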

Block 2

  1. Just like in block 1, two 3x3 convolution layers (unpadded) are applied to the output of block 1, each again followed by a ReLU layer, shrinking the feature map from 284² to 280². At each new block the number of feature channels is doubled, now to 128.
  2. Next, a 2x2 max pooling layer is again applied to the resulting feature map, halving the spatial dimensions to 140².

Block 3

Block 3 follows the same procedure as blocks 1 and 2, so it will not be repeated: the feature channels double to 256, and the feature map shrinks from 140² to 136² before being pooled to 68².

Block 4

Same as block 3: the feature channels double to 512, and the feature map shrinks from 68² to 64² before being pooled to 32².

Block 5

  1. In the final block of the contracting path, the number of feature channels reaches 1024 after having been doubled at each block.
  2. This block also contains two 3x3 convolution layers (unpadded), each followed by a ReLU layer. However, for symmetry purposes, I have included only the first one here and placed the second in the expanding path.

After complex features and patterns have been extracted, the feature map moves on to the expanding path.

Expanding Path

The expanding path uses both convolution and up-convolution operations to combine learnt features and upsample the input feature map until it generates a segmentation map.

Much like with the contracting path, each block will be discussed below.

[Figure: the U-Net architecture, expanding path]

Before we read further: Skip connections are used to send feature maps directly from the contracting path to the expanding path without them having to pass through all the intermediate blocks. This allows both high- and low-level features to be preserved and learnt, reducing the information loss that occurs during the contracting path (see the sketch below).
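Because the unpadded convolutions shrink the feature maps, the contracting-path map is larger than its expanding-path counterpart and must be center-cropped before concatenation. Here is a minimal sketch of that crop-and-concatenate step; the 64² and 56² sizes are the ones the shrinkage arithmetic used throughout this article gives for contracting block 4 and the upsampled bottleneck output:

```python
import torch

def center_crop(t, target_h, target_w):
    """Center-crop an (N, C, H, W) tensor to the target spatial size."""
    _, _, h, w = t.shape
    top, left = (h - target_h) // 2, (w - target_w) // 2
    return t[:, :, top:top + target_h, left:left + target_w]

skip = torch.randn(1, 512, 64, 64)  # saved from the contracting path
up = torch.randn(1, 512, 56, 56)    # arriving from the up-convolution
merged = torch.cat([center_crop(skip, 56, 56), up], dim=1)
print(merged.shape)                 # torch.Size([1, 1024, 56, 56])
```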

Block 5

  1. Continuing on from the contracting path, the second 3x3 convolution (unpadded) is applied, followed by a ReLU layer, shrinking the feature map from 30² to 28².
  2. Then a 2x2 convolution (up-convolution) layer is applied, upsampling the spatial dimensions twofold to 56² and halving the number of channels to 512.

Block 4

  1. Using a skip connection, the corresponding feature map from the contracting path is concatenated with the upsampled map, doubling the feature channels to 1024. Note that the contracting-path feature map must be cropped to match the expanding path’s spatial dimensions before concatenation, as shown in the crop-and-concatenate sketch above.
  2. Two 3x3 convolution layers (unpadded) are applied, each followed by a ReLU layer, reducing the channels to 512.
  3. After that, a 2x2 convolution (up-convolution) layer is applied, upsampling the spatial dimensions twofold and halving the number of channels to 256, as the sketch below shows.
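As a sketch of this block in the same style (again my own illustration; the 56² input size carries over from the crop-and-concatenate example above):

```python
import torch
import torch.nn as nn

convs = nn.Sequential(
    nn.Conv2d(1024, 512, kernel_size=3),  # 1024 -> 512 channels, 56 -> 54 per side
    nn.ReLU(inplace=True),
    nn.Conv2d(512, 512, kernel_size=3),   # 54 -> 52 per side
    nn.ReLU(inplace=True),
)
# Up-convolution: doubles the spatial size, halves the channels.
up = nn.ConvTranspose2d(512, 256, kernel_size=2, stride=2)  # 52 -> 104 per side

x = torch.randn(1, 1024, 56, 56)  # the concatenated feature map
print(up(convs(x)).shape)         # torch.Size([1, 256, 104, 104])
```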

Block 3

Block 3 follows the same procedure as blocks 5 and 4, so it will not be repeated.

Block 2

Same as block 3.

Block 1

  1. In the final block of the expanding path, there are 128 channels after concatenating the skip connection.
  2. Next, two 3x3 convolution layers (unpadded) are applied to the feature map, each followed by a ReLU layer, reducing the number of feature channels to 64.
  3. Finally, a 1x1 convolution layer, followed by an activation layer (a sigmoid for a single-channel binary output, or a softmax over class channels), is used to reduce the number of channels to the desired number of classes. In this case there are 2 classes, as binary segmentation is common in medical imaging. A sketch of this head follows after the next paragraph.

After upsampling the feature map in the expanding path, a segmentation map should be generated, with each pixel classified individually.
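To illustrate that final step, here is a hedged sketch of the 1x1 classification head, assuming a single-channel sigmoid output for the binary case (for more classes you would instead use nn.Conv2d(64, n_classes, 1) with a softmax). The 388² size is what the unpadded-convolution arithmetic yields for a 572² input:

```python
import torch
import torch.nn as nn

head = nn.Conv2d(64, 1, kernel_size=1)  # 64 channels -> 1 logit per pixel

x = torch.randn(1, 64, 388, 388)  # output of the last two 3x3 convolutions
probs = torch.sigmoid(head(x))    # per-pixel probability of "tumor"
mask = (probs > 0.5).float()      # the binary segmentation map
print(mask.shape)                 # torch.Size([1, 1, 388, 388])
```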

Up-Convolution and Channels

In this section I would like to discuss what up-convolutions are and how changing the number of feature channels is possible. Convolutions, pooling, strides and padding were discussed in my previous CNN article, and therefore I have chosen not to cover them again. If necessary, please recap these concepts here.

Now let’s get into it.

Up-Convolution

An up-convolution, also known as a deconvolution or transpose convolution, is a method used to upsample images and recover spatial information.

Let’s look at the example below and briefly discuss what’s happening.

[Figure: worked example of an up-convolution]

An intuitive way to picture an up-convolution is to expand each element of the input feature map by duplicating it into a region the same size as the filter. This process up-samples the input. The filter is then applied elementwise over each of these expanded regions.

For example, the expanded green input above is initially just composed of four 1s. Likewise, the expanded red, yellow and grey regions are initially filled with just 2s, 3s and 4s, respectively. Next, the filter is applied over each of these regions and the results are summed to form the output feature map.

In the U-Net described above, the spatial dimensions were doubled, which means that a 2x2 filter was used with a stride of 2.
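This is easy to verify numerically. In the sketch below (my own example), a fixed 2x2 filter is applied with stride 2, so each input element produces its own scaled copy of the filter in the output, and the spatial dimensions double from 2x2 to 4x4:

```python
import torch
import torch.nn as nn

x = torch.tensor([[[[1., 2.],
                    [3., 4.]]]])  # a 2x2 input feature map

up = nn.ConvTranspose2d(1, 1, kernel_size=2, stride=2, bias=False)
with torch.no_grad():
    up.weight.copy_(torch.tensor([[[[10., 20.],
                                    [30., 40.]]]]))  # the 2x2 filter

print(up(x).squeeze())
# Each 2x2 block of the 4x4 output is (input element) x (filter):
# tensor([[ 10.,  20.,  20.,  40.],
#         [ 30.,  40.,  60.,  80.],
#         [ 30.,  60.,  40.,  80.],
#         [ 90., 120., 120., 160.]])
```

With a stride smaller than the filter size, the scaled copies would overlap and the overlapping values would be summed, exactly as described above.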

Changing the Number of Channels

Throughout the U-Net, the number of feature channels is constantly changing. How do convolution operations affect this?

Well, the convolution operation itself does not directly set the number of channels. It is in fact determined by the number of filters used in the convolution layer: if 64 filters are applied over the input, each attempting to extract a different feature, then 64 feature maps will also be generated, as the snippet below confirms.
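A quick sketch of this: the layer’s weight tensor holds one filter per output channel, so the number of filters directly sets the number of output feature maps.

```python
import torch
import torch.nn as nn

conv = nn.Conv2d(in_channels=1, out_channels=64, kernel_size=3)
print(conv.weight.shape)  # torch.Size([64, 1, 3, 3]): 64 filters of size 1x3x3

x = torch.randn(1, 1, 572, 572)
print(conv(x).shape)      # torch.Size([1, 64, 570, 570]): one map per filter
```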

This may seem obvious to some, but it was something that stumped me while learning this.

Image Example

U-Nets are often used in medical imaging, where they play a crucial role in detecting and locating tumors, cysts and other abnormalities.

Below is a possible example of what an input and output of a U-Net may look like.

[Figure: example input and output of a U-Net]

A grayscale medical image of a uterus was used as the input and fed into a U-Net. After being processed by the U-Net, each pixel was classified into one of two classes: tumor or not-tumor. The resulting segmentation map can be seen in the output image.

Summary

To conclude this article, let’s summarise what we have learnt.

The U-Net is an architecture that consists of 23 convolutional layers in total [1]. Using a combination of convolutions, up-convolutions, pooling and skip connections, the U-Net is able to extract and capture complex features while also preserving and reconstructing spatial information. This allows features to be localised within an image, thus producing accurate segmentation maps. This is especially useful in medical image analysis, where accurately locating and detecting abnormalities is vital.

Thank you for getting this far, if you have any questions do not hesitate to ask.

References

[1] Olaf Ronneberger, Philipp Fischer, Thomas Brox. U-Net: Convolutional Networks for Biomedical Image Segmentation, arXiv:1505.04597.
