The U-Net: A Complete Guide (2024)

7 min read · Feb 1, 2024


Table of Contents

  • Introduction
  • Contracting Path
  • Expanding Path
  • Up-Convolution and Channels
  • Image Example
  • Summary

Introduction

The U-Net was a groundbreaking development in image segmentation, a field focused on locating objects and boundaries within an image. This novel architecture has proved immensely valuable in the analysis of biomedical images.

The U-Net is a special type of Convolutional Neural Network (CNN), so it is highly recommended that you be familiar with CNNs before delving into this article. If necessary, please learn about CNNs here.

The U-Net is composed of two main components: a contracting path and an expanding path.

  • Contracting path: aims to decrease the spatial dimensions of the image, while also capturing relevant information about the image.
  • Expanding path: aims to upsample the feature map and produce a relevant segmentation map using the patterns learnt in the contracting path.

As you may notice, the U-Net in fact resembles an encoder-decoder architecture, one that traces out a U shape when drawn, hence the name.

Let’s now get into more depth about each component.

Note: While reading further on, you may wonder, “How on earth is it possible to change the number of channels and what on earth is an up-convolution?!” Well, don’t worry, I have covered this at the end. And if you already know this, then feel free to skip that section.

Contracting Path

The contracting path uses a combination of convolution and pooling layers to extract and capture features within an image while, at the same time, reducing its spatial dimensions.

Let’s now explore each of the 5 blocks in the contracting path below.

[Figure: the U-Net architecture, contracting path]

Block 1

  1. An input image of size 572x572 is fed into the U-Net. This input image consists of only 1 channel, likely a grayscale channel.
  2. Two 3x3 convolution layers (unpadded) are then applied to the input image, each followed by a ReLU layer. Since each unpadded 3x3 convolution trims 2 pixels from each spatial dimension, the feature map shrinks from 572² to 570² and then to 568². At the same time, the number of channels is increased to 64 in order to capture higher-level features.
  3. A 2x2 max pooling layer with a stride of 2 is then applied. This downsamples the 568² feature map to half its size, 284², as the sketch below shows.
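To make this concrete, here is a minimal PyTorch sketch of block 1. This is my own illustration based on the description above, not the paper’s reference code:

```python
import torch
import torch.nn as nn

# Block 1 of the contracting path: two unpadded 3x3 convolutions,
# each followed by ReLU, then a 2x2 max pool with stride 2.
block1 = nn.Sequential(
    nn.Conv2d(1, 64, kernel_size=3),   # 1 -> 64 channels, 572 -> 570 per side
    nn.ReLU(inplace=True),
    nn.Conv2d(64, 64, kernel_size=3),  # 570 -> 568 per side
    nn.ReLU(inplace=True),
)
pool = nn.MaxPool2d(kernel_size=2, stride=2)  # 568 -> 284 per side

x = torch.randn(1, 1, 572, 572)  # one single-channel (grayscale) image
features = block1(x)             # kept aside later for the skip connection
print(features.shape)            # torch.Size([1, 64, 568, 568])
print(pool(features).shape)      # torch.Size([1, 64, 284, 284])
```

Note that it is the pre-pooling 568² feature map that the skip connection will later carry across to the expanding path.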

Block 2

  1. Just like in block 1, two 3x3 convolution layers (unpadded) are applied to the output of block 1, each again followed by a ReLU layer, shrinking the feature map from 284² to 280². At each new block the number of feature channels is doubled, now to 128.
  2. Next, a 2x2 max pooling layer is again applied to the resulting feature map, halving the spatial dimensions to 140².

Block 3

Block 3 follows the same procedure as blocks 1 and 2, so it will not be repeated: the feature channels double to 256, and the feature map shrinks from 140² to 136² before being pooled to 68².

Block 4

Same as block 3: the feature channels double to 512, and the feature map shrinks from 68² to 64² before being pooled to 32².

Block 5

  1. In the final block of the contracting path, the number of feature channels reaches 1024 after having been doubled at each block.
  2. This block also contains two 3x3 convolution layers (unpadded), each followed by a ReLU layer. However, for symmetry purposes, I have included only the first one here and placed the second in the expanding path.

After complex features and patterns have been extracted, the feature map moves on to the expanding path.

Expanding Path

The expanding path uses both convolution and up-convolution operations to combine learnt features and upsample the input feature map until it generates a segmentation map.

Much like with the contracting path, each block will be discussed below.

[Figure: the U-Net architecture, expanding path]

Before we read further: Skip connections are used to send feature maps directly from the contracting path to the expanding path without them having to pass through all the intermediate blocks. This allows both high- and low-level features to be preserved and learnt, reducing the information loss that occurs during the contracting path (see the sketch below).
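Because the unpadded convolutions shrink the feature maps, the contracting-path map is larger than its expanding-path counterpart and must be center-cropped before concatenation. Here is a minimal sketch of that crop-and-concatenate step; the 64² and 56² sizes are the ones the shrinkage arithmetic used throughout this article gives for contracting block 4 and the upsampled bottleneck output:

```python
import torch

def center_crop(t, target_h, target_w):
    """Center-crop an (N, C, H, W) tensor to the target spatial size."""
    _, _, h, w = t.shape
    top, left = (h - target_h) // 2, (w - target_w) // 2
    return t[:, :, top:top + target_h, left:left + target_w]

skip = torch.randn(1, 512, 64, 64)  # saved from the contracting path
up = torch.randn(1, 512, 56, 56)    # arriving from the up-convolution
merged = torch.cat([center_crop(skip, 56, 56), up], dim=1)
print(merged.shape)                 # torch.Size([1, 1024, 56, 56])
```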

Block 5

  1. Continuing on from the contracting path, the second 3x3 convolution (unpadded) is applied, followed by a ReLU layer, shrinking the feature map from 30² to 28².
  2. Then a 2x2 convolution (up-convolution) layer is applied, upsampling the spatial dimensions twofold to 56² and halving the number of channels to 512.

Block 4

  1. Using a skip connection, the corresponding feature map from the contracting path is concatenated with the upsampled map, doubling the feature channels to 1024. Note that the contracting-path feature map must be cropped to match the expanding path’s spatial dimensions before concatenation, as shown in the crop-and-concatenate sketch above.
  2. Two 3x3 convolution layers (unpadded) are applied, each followed by a ReLU layer, reducing the channels to 512.
  3. After that, a 2x2 convolution (up-convolution) layer is applied, upsampling the spatial dimensions twofold and halving the number of channels to 256, as the sketch below shows.
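As a sketch of this block in the same style (again my own illustration; the 56² input size carries over from the crop-and-concatenate example above):

```python
import torch
import torch.nn as nn

convs = nn.Sequential(
    nn.Conv2d(1024, 512, kernel_size=3),  # 1024 -> 512 channels, 56 -> 54 per side
    nn.ReLU(inplace=True),
    nn.Conv2d(512, 512, kernel_size=3),   # 54 -> 52 per side
    nn.ReLU(inplace=True),
)
# Up-convolution: doubles the spatial size, halves the channels.
up = nn.ConvTranspose2d(512, 256, kernel_size=2, stride=2)  # 52 -> 104 per side

x = torch.randn(1, 1024, 56, 56)  # the concatenated feature map
print(up(convs(x)).shape)         # torch.Size([1, 256, 104, 104])
```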

Block 3

Block 3 follows the same procedure as blocks 5 and 4, so it will not be repeated.

Block 2

Same as block 3.

Block 1

  1. In the final block of the expanding path, there are 128 channels after concatenating the skip connection.
  2. Next, two 3x3 convolution layers (unpadded) are applied to the feature map, each followed by a ReLU layer, reducing the number of feature channels to 64.
  3. Finally, a 1x1 convolution layer, followed by an activation layer (a sigmoid for a single-channel binary output, or a softmax over class channels), is used to reduce the number of channels to the desired number of classes. In this case there are 2 classes, as binary segmentation is common in medical imaging. A sketch of this head follows after the next paragraph.

After upsampling the feature map in the expanding path, a segmentation map should be generated, with each pixel classified individually.
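To illustrate that final step, here is a hedged sketch of the 1x1 classification head, assuming a single-channel sigmoid output for the binary case (for more classes you would instead use nn.Conv2d(64, n_classes, 1) with a softmax). The 388² size is what the unpadded-convolution arithmetic yields for a 572² input:

```python
import torch
import torch.nn as nn

head = nn.Conv2d(64, 1, kernel_size=1)  # 64 channels -> 1 logit per pixel

x = torch.randn(1, 64, 388, 388)  # output of the last two 3x3 convolutions
probs = torch.sigmoid(head(x))    # per-pixel probability of "tumor"
mask = (probs > 0.5).float()      # the binary segmentation map
print(mask.shape)                 # torch.Size([1, 1, 388, 388])
```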

Up-Convolution and Channels

In this section I would like to discuss what up-convolutions are and how changing the number of feature channels is possible. Convolutions, pooling, strides and padding were discussed in my previous CNN article, and therefore I have chosen not to cover them again. If necessary, please recap these concepts here.

Now let’s get into it.

Up-Convolution

An up-convolution, also known as a deconvolution or transpose convolution, is a method used to upsample images and recover spatial information.

Let’s look at the example below and briefly discuss what’s happening.

[Figure: worked example of an up-convolution]

An intuitive way to picture an up-convolution is to expand each element of the input feature map by duplicating it into a region the same size as the filter. This process up-samples the input. The filter is then applied elementwise over each of these expanded regions.

For example, the expanded green input above is initially just composed of four 1s. Likewise, the expanded red, yellow and grey regions are initially filled with just 2s, 3s and 4s, respectively. Next, the filter is applied over each of these regions and the results are summed to form the output feature map.

In the U-Net described above, the spatial dimensions were doubled, which means that a 2x2 filter was used with a stride of 2.
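This is easy to verify numerically. In the sketch below (my own example), a fixed 2x2 filter is applied with stride 2, so each input element produces its own scaled copy of the filter in the output, and the spatial dimensions double from 2x2 to 4x4:

```python
import torch
import torch.nn as nn

x = torch.tensor([[[[1., 2.],
                    [3., 4.]]]])  # a 2x2 input feature map

up = nn.ConvTranspose2d(1, 1, kernel_size=2, stride=2, bias=False)
with torch.no_grad():
    up.weight.copy_(torch.tensor([[[[10., 20.],
                                    [30., 40.]]]]))  # the 2x2 filter

print(up(x).squeeze())
# Each 2x2 block of the 4x4 output is (input element) x (filter):
# tensor([[ 10.,  20.,  20.,  40.],
#         [ 30.,  40.,  60.,  80.],
#         [ 30.,  60.,  40.,  80.],
#         [ 90., 120., 120., 160.]])
```

With a stride smaller than the filter size, the scaled copies would overlap and the overlapping values would be summed, exactly as described above.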

Changing the Number of Channels

Throughout the U-Net, the number of feature channels is constantly changing. How do convolution operations affect this?

Well, the convolution operation itself does not directly set the number of channels. It is in fact determined by the number of filters used in the convolution layer: if 64 filters are applied over the input, each attempting to extract a different feature, then 64 feature maps will also be generated, as the snippet below confirms.
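A quick sketch of this: the layer’s weight tensor holds one filter per output channel, so the number of filters directly sets the number of output feature maps.

```python
import torch
import torch.nn as nn

conv = nn.Conv2d(in_channels=1, out_channels=64, kernel_size=3)
print(conv.weight.shape)  # torch.Size([64, 1, 3, 3]): 64 filters of size 1x3x3

x = torch.randn(1, 1, 572, 572)
print(conv(x).shape)      # torch.Size([1, 64, 570, 570]): one map per filter
```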

This may seem obvious to some, but it was something that stumped me while learning this.

Image Example

U-Nets are often used in medical imaging, where they play a crucial role in detecting and locating tumors, cysts and other abnormalities.

Below is a possible example of what an input and output of a U-Net may look like.

[Figure: example input and output of a U-Net]

A grayscale medical image of a uterus was used as the input and fed into a U-Net. After being processed by the U-Net, each pixel was classified into one of two classes: tumor or not-tumor. The resulting segmentation map can be seen in the output image.

Summary

To conclude this article, let’s summarise what we have learnt.

The U-Net is an architecture that consists of 23 convolutional layers in total [1]. Using a combination of convolutions, up-convolutions, pooling and skip connections, the U-Net is able to extract and capture complex features while also preserving and reconstructing spatial information. This allows features to be localised within an image, thus producing accurate segmentation maps. This is especially useful in medical image analysis, where accurately locating and detecting abnormalities is vital.

Thank you for getting this far, if you have any questions do not hesitate to ask.

References

[1] Olaf Ronneberger, Philipp Fischer, Thomas Brox. U-Net: Convolutional Networks for Biomedical Image Segmentation, arXiv:1505.04597.
