Residual Networks (ResNet): The Power of Deep Learning for Computer Vision

Author: Missy Dunagan
Date: August 28, 2023

Abstract

In the rapidly evolving field of deep learning and computer vision, the introduction of Residual Networks (ResNet) by Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun of Microsoft Research marked a significant breakthrough. ResNet emerged as a solution to the problem of vanishing and exploding gradients, which had limited how effectively deep neural networks could be trained. This paper explores the significance of ResNet and its impact on the field of machine learning.

Vanishing and exploding gradients are challenges that hindered the training of deep neural networks, making it difficult for them to learn and generalize effectively. ResNet’s innovation lies in its introduction of residual blocks and skip connections, enabling the network to learn incremental changes and maintain a steady flow of gradients during training. This advancement has led to improved accuracy, faster convergence, and better generalization, enhancing the model’s performance across various computer vision tasks.

This paper delves into the architecture, design principles, and mathematical foundations of ResNet. It explores its practical implementation, advantages, and limitations.

Additionally, the paper discusses real-world use cases, ranging from image classification to object detection, where ResNet has demonstrated its prowess. Ethical considerations, privacy implications, and data requirements are also examined to provide a holistic view of ResNet’s deployment.

As organizations strive to leverage AI and machine learning for enhanced decision-making and predictive capabilities, understanding ResNet’s capabilities and nuances becomes pivotal. By shedding light on ResNet’s impact, architecture, and applications, this paper provides insights into harnessing the potential of deep neural networks in computer vision tasks.

Through an exploration of ResNet’s evolution, spanning from its inception to its practical deployment and ethical implications, this paper seeks to enrich the expanding repository of insights within the realm of deep learning. It sheds light on the impact of ResNet and its applications, particularly in the domain of computer vision, thereby contributing to the collective understanding of the efficiency gains through advancements in AI.

Introduction

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun of Microsoft Research won the ImageNet Large Scale Visual Recognition Challenge of 2015 with Residual Networks (ResNet), an architecture that addresses the vanishing/exploding gradient problem associated with deep neural networks.

Vanishing and exploding gradients are problems that can occur during the training of deep neural networks and degrade performance. When a neural network is very deep, the gradients (the derivatives of the loss with respect to the network parameters, which determine how much each parameter is updated) can become extremely small (vanishing) or extremely large (exploding). This makes it difficult for the network to learn effectively because the changes to the parameters are either too small to have an impact or too large to be controlled. When a network experiences vanishing gradients, its earlier layers do not update properly, which leads to a lack of learning and poor performance. When a network experiences exploding gradients, its weights can become very large, which makes the training process highly unstable. Large gradients can also cause the network to lose important information about the data being used for learning.
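The effect of depth described above can be illustrated with simple arithmetic. The sketch below is not a neural network; it assumes, purely for illustration, that every layer scales the backpropagated gradient by the same constant factor (the function name and the factors 0.5 and 1.5 are hypothetical choices).

```python
def backprop_scale(per_layer_factor, depth):
    """Scale of the gradient reaching the first layer when every
    layer multiplies it by the same factor (a simplifying assumption)."""
    scale = 1.0
    for _ in range(depth):
        scale *= per_layer_factor
    return scale


for depth in (5, 20, 50):
    shrinking = backprop_scale(0.5, depth)  # factor below 1: gradient vanishes
    growing = backprop_scale(1.5, depth)    # factor above 1: gradient explodes
    print(f"depth={depth:2d}  factor 0.5 -> {shrinking:.3e}  factor 1.5 -> {growing:.3e}")
```

Even modest per-layer factors compound quickly: a factor only slightly below or above 1 becomes negligible or enormous once it is multiplied across dozens of layers.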

ResNet (Residual Network) introduced residual blocks whose skip connections allow the signal to “skip” over certain layers. On the backward pass, these connections give gradients a direct path to earlier layers, which makes them less likely to vanish and also helps keep training stable when gradients would otherwise explode. This innovation allowed much deeper networks to be trained effectively and to achieve better results in tasks like image recognition.
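A toy calculation can make the role of the skip connection more concrete. The sketch below assumes a scalar "network" in which every layer is tanh(weight * h) with one shared, positive weight; the function names and the values 0.5 and 30 are illustrative assumptions, not part of the ResNet paper. It compares the derivative of the output with respect to the input for a plain stack and for a stack in which each layer adds its input back.

```python
import math


def plain_chain_gradient(x, weight, depth):
    """d(output)/d(input) for `depth` stacked layers h -> tanh(weight * h)."""
    h, grad = x, 1.0
    for _ in range(depth):
        out = math.tanh(weight * h)
        grad *= weight * (1.0 - out * out)  # chain rule: local derivative of tanh(weight * h)
        h = out
    return grad


def residual_chain_gradient(x, weight, depth):
    """Same layers, but each adds its input back: h -> h + tanh(weight * h)."""
    h, grad = x, 1.0
    for _ in range(depth):
        out = math.tanh(weight * h)
        grad *= 1.0 + weight * (1.0 - out * out)  # the leading 1.0 comes from the skip connection
        h = h + out
    return grad


plain = plain_chain_gradient(x=0.5, weight=0.5, depth=30)
residual = residual_chain_gradient(x=0.5, weight=0.5, depth=30)
print(f"plain stack gradient:    {plain:.3e}")
print(f"residual stack gradient: {residual:.3e}")
```

In this toy setting the plain stack's gradient shrinks toward zero, while the residual stack's gradient stays at or above 1 because every local derivative contains the constant 1 contributed by the skip path. With negative or learned weights there is no such guarantee, so this is an intuition aid rather than a proof.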

In light of these challenges and breakthroughs, Residual Networks (ResNet) has had a transformative impact on the landscape of deep learning and computer vision. Through analysis of ResNet’s architecture, functioning, applications, and ethical considerations, the aim of this paper is to provide an understanding of how ResNet has changed the way we approach complex data processing tasks. The practical implementation of ResNet and its implications for various use cases will also provide insight for making sound decisions and fostering responsible AI innovation.

Residual Network (ResNet) Overview

A Convolutional Neural Network (CNN or ConvNet) is a specialized form of neural network primarily designed for tasks such as image and speech recognition. Its design was inspired by the biological layout of the human visual cortex.

Residual Networks (ResNet) advanced the Convolutional Neural Network (CNN) architecture and addressed the challenge of deep networks struggling with the vanishing gradient dilemma. ResNet enabled the training of deep neural networks without experiencing the adverse effects of diminishing performance.

ResNet’s innovative “residual blocks” allow the network to “skip” certain layers during information propagation. This strategic design reduces the handicap of vanishing gradients during training, ensuring a smoother flow of gradients that contributes to effective learning.

ResNet is commonly used in computer vision tasks like image classification, object detection, and image segmentation. It improves these tasks by increasing accuracy and speeding up the training of deep networks.

Architecture and Model Details

The Microsoft Research team started from a 34-layer plain network architecture inspired by VGG-19 and then added shortcut connections, which convert the plain network into its residual counterpart. ResNets vary in size: common variants have between 18 and 152 weighted layers, and the design has been extended to networks with more than a thousand layers. The number that follows the name ResNet (for example, ResNet-50) denotes the depth of that specific architecture. Later related architectures include ResNeXt and DenseNet.
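The number in each ResNet name can be reproduced with a short calculation. The sketch below uses the per-stage block counts given in He et al. (2015): the two shallower variants use "basic" blocks with two convolutions each, and the deeper ones use "bottleneck" blocks with three. The dictionary and function names are illustrative.

```python
# Blocks per stage for the standard ImageNet variants (He et al., 2015).
RESNET_CONFIGS = {
    "ResNet-18": ("basic", (2, 2, 2, 2)),
    "ResNet-34": ("basic", (3, 4, 6, 3)),
    "ResNet-50": ("bottleneck", (3, 4, 6, 3)),
    "ResNet-101": ("bottleneck", (3, 4, 23, 3)),
    "ResNet-152": ("bottleneck", (3, 8, 36, 3)),
}

CONVS_PER_BLOCK = {"basic": 2, "bottleneck": 3}


def weighted_layer_count(block_type, blocks_per_stage):
    """Initial convolution + convolutions inside all residual blocks
    + final fully connected layer."""
    return 1 + CONVS_PER_BLOCK[block_type] * sum(blocks_per_stage) + 1


for name, (block_type, stages) in RESNET_CONFIGS.items():
    print(name, "->", weighted_layer_count(block_type, stages), "weighted layers")
```

Shortcut connections themselves are not counted in this tally; only the convolutional and fully connected layers contribute to the number in the name.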

The core framework of the ResNet Architecture includes:

Residual Blocks: ResNet relies on its unique residual blocks, marrying identity mappings with learned residual mappings. This design streamlines the optimization and convergence throughout the training process. “The approach behind this network is instead of layers learning the underlying mapping, we allow the network to fit the residual mapping. So, instead of say H(x), initial mapping, let the network fit, F(x) := H(x) − x which gives H(x) := F(x) + x. The advantage of adding this type of skip connection is that if any layer hurt the performance of architecture then it will be skipped by regularization.”

Skip Connections: These connections enable the passage of information between layers in both forward and backward passes, enriching gradient flow. Without a skip connection, the input can reach later layers only after being multiplied by a layer’s weights and offset by a bias term, so the original signal is never passed along unchanged.

Weight Layer: A set of parameters that are applied to the input data to transform it during the learning process. These parameters, or weights, are learned through training and determine how the input data is modified as it passes through the network. These weight layers contribute to the model’s ability to learn residual functions with reference to the layer inputs.

Identity Mapping: The process of allowing data to pass through a layer without changing it. This is done by adding the input data to the layer’s output, like skipping a step. It keeps the original data as it is while helping to prevent vanishing gradients. This enables deep learning models with tens or hundreds of layers to train more easily and to reach better accuracy as they go deeper without losing important data.

Activation: The output of a layer after its weights, and typically a nonlinear function such as ReLU, have been applied to the input. Activations show how strongly certain features or patterns are present in the input data. They are passed on to subsequent layers for further processing and eventually contribute to the network’s final prediction or classification.

See: Residual Learning Framework with Skip Connections

ResNet learns to adjust the input by a small amount in each block and then adds the original input back to it. This makes it easier for the network to learn incremental changes. The Residual Learning Framework with Skip Connections diagram above can be expressed mathematically: “Instead of learning an unreferenced mapping (e.g., H(x)), the network learns a residual mapping (e.g., F(x) + x, with F(x) = H(x) − x).”

Normally, a neural network would try to learn a direct mapping from input “x” to output “H(x)”. In ResNet, the network learns the difference between the desired output and the input, the residual mapping “F(x)”. We get “F(x)” by subtracting the input “x” from the desired output “H(x)”.

H(x) = Neural network’s mapping from input “x” to output
F(x) = Learned residual mapping (H(x) − x)

Skip connections perform identity mapping: they pass the input “x” unchanged to a later layer. This input is then added to the outputs of the convolutional layers to form the final output of the residual block.

x (Identity Mapping) = Output of Skip Connection
Previous layers’ transformations = Convolutional layers’ outputs

When you put all of this together into a single residual block of ResNet, you get:

Output = Convolutional layers’ outputs + Output of Skip Connection
Output = Convolutional layers’ transformations + x (identity mapping)

The learned residual mapping F(x) captures the difference between the desired output H(x) and the input x. By adding this difference back to the input (identity mapping), ResNet keeps both the learning process and the flow of data manageable while training deep neural networks.
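The relationship Output = F(x) + x can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the reference implementation: it uses small fully connected layers in place of convolutions, omits batch normalization, and all function and variable names are hypothetical.

```python
def relu(values):
    return [max(0.0, v) for v in values]


def linear(weights, bias, values):
    """Dense layer: one output per row of `weights`."""
    return [sum(w * v for w, v in zip(row, values)) + b
            for row, b in zip(weights, bias)]


def residual_block(x, w1, b1, w2, b2):
    """Return F(x) + x, where F is two weight layers with a ReLU between them.
    (The original paper also applies a ReLU after the addition; it is left
    out here to match the formula in the text.)"""
    f_x = linear(w2, b2, relu(linear(w1, b1, x)))  # residual mapping F(x)
    return [f + xi for f, xi in zip(f_x, x)]       # skip connection adds x back


x = [1.0, -2.0, 3.0]
zeros = [[0.0] * 3 for _ in range(3)]
no_bias = [0.0] * 3

# With all-zero weights F(x) = 0, so the block reduces to the identity mapping.
print(residual_block(x, zeros, no_bias, zeros, no_bias))
```

The zero-weight case shows why a residual block can fall back to passing its input through unchanged: doing nothing only requires driving F(x) toward zero, rather than learning an identity function through several nonlinear layers.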

Use Cases

ResNet has achieved successful results on various computer vision tasks, including image classification, object localization, object detection, and segmentation.

Computer Vision (CV) focuses on the interpretation and comprehension of images and videos. Its objective is to equip computers with the ability to “see” visual content, decode the visual data by recognizing distinctive features and context extracted through training. These models can analyze images and videos, translating the interpretations into predictive or decision-making functions.

ResNet is being used to enhance computer vision by increasing accuracy and facilitating the use of more complex models. It equips users with this ability by way of object recognition. Object recognition has three parts: image classification, object localization, and object detection.

Image Classification: Image classification involves assigning a label or category to an image using its pixel values. Image classification deals with the actual pixel makeup of an image. This technique finds diverse uses, including identifying post-disaster damage, monitoring crop well-being, and aiding in the examination of medical images for indications of illnesses.

Object Localization: Object localization refers to the task of not only detecting instances of objects in an image but also accurately localizing their positions using bounding boxes. It involves determining the coordinates of the bounding box that tightly encloses each detected object, indicating where the object is located within the image. It focuses on accurate position detection within known classes. This task is most known for use with autonomous vehicles, particularly for pedestrian detection. In the context of self-driving cars, object localization plays a crucial role in ensuring pedestrian safety. By accurately detecting and localizing pedestrians in real-time from camera inputs, the autonomous vehicle can make informed decisions to avoid collisions and ensure safe navigation.

Object Detection: Object detection is geared towards spotting specific instances, like humans, buildings, or cars, within an image. These models take an image as input and generate output that includes the coordinates of bounding boxes encompassing the detected objects, along with their corresponding labels. Images may comprise several objects, each having its bounding box and label (for instance, a car and a building). Additionally, an object can appear in different parts of an image, leading to instances like multiple cars within a single image. This task is frequently applied in self-driving systems to identify pedestrians, road signs, and traffic lights. It also finds uses in tasks such as object counting, image search, and more.

Segmentation: Segmentation goes beyond bounding boxes by classifying every pixel of the image. This method provides more intricate details, such as object boundaries, and thus results in higher-resolution outputs. It is useful for processing medical images and satellite imagery.
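The difference between a bounding box and a pixel-wise mask can be shown with a small example. The sketch below assumes a binary mask stored as a list of rows (1 = object pixel, 0 = background) that contains at least one object pixel; the function names and the (x_min, y_min, x_max, y_max) box convention are illustrative choices.

```python
def bounding_box(mask):
    """Tightest box (x_min, y_min, x_max, y_max) enclosing all object pixels."""
    rows = [r for r, row in enumerate(mask) if any(row)]
    cols = [c for row in mask for c, value in enumerate(row) if value]
    return (min(cols), min(rows), max(cols), max(rows))


def box_area(box):
    x_min, y_min, x_max, y_max = box
    return (x_max - x_min + 1) * (y_max - y_min + 1)


# A small plus-shaped object inside a 4 x 5 image.
mask = [
    [0, 0, 0, 0, 0],
    [0, 0, 1, 0, 0],
    [0, 1, 1, 1, 0],
    [0, 0, 1, 0, 0],
]

box = bounding_box(mask)
object_pixels = sum(sum(row) for row in mask)
print("bounding box:", box)
print("pixels inside box:", box_area(box), "| pixels on the object:", object_pixels)
```

The box can always be derived from the mask, but not the other way around: the box covers background pixels at its corners that the mask correctly excludes.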

Advantages

The ResNet architecture allows the network to skip certain layers, especially if the layer does not provide a better result.

Improved Accuracy
Faster Convergence
Better Generalization
Transfer Learning

Disadvantages

As advantageous as ResNet is, there are still limitations. According to Data Base Camp in March 2023, “it naturally happens that the dimensionality at the beginning of the skip connection does not match that at the end of the skip connection. This is especially the case if several layers are skipped…Thus the skip connection faces the problem of simply adding the inputs of previous layers to the output of later layers.”

Increased Complexity
Overfitting
Interpretability
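The dimensionality mismatch quoted above has a standard remedy in He et al. (2015): when the input and the block output differ in size, the shortcut applies a learned linear projection W_s, giving y = F(x) + W_s x. The sketch below illustrates the idea with plain lists; the function names are hypothetical, and the projection matrix is hand-written here, whereas in a real network it would be learned.

```python
def matvec(matrix, vector):
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def add_shortcut(x, f_x, projection=None):
    """Return F(x) + x, projecting x first when the dimensions differ."""
    if len(x) == len(f_x):
        shortcut = x                      # identity shortcut: shapes already match
    elif projection is not None:
        shortcut = matvec(projection, x)  # projection shortcut: W_s x
        if len(shortcut) != len(f_x):
            raise ValueError("projection output does not match F(x)")
    else:
        raise ValueError("dimension mismatch: a projection matrix is required")
    return [f + s for f, s in zip(f_x, shortcut)]


# Matching dimensions: plain element-wise addition.
print(add_shortcut([1.0, 2.0], [0.5, 0.5]))

# 2-dimensional input, 3-dimensional block output: project x up to 3 dimensions.
w_s = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
print(add_shortcut([1.0, 2.0], [0.0, 0.0, 0.0], projection=w_s))
```

The projection adds parameters and computation, which is one concrete source of the increased complexity listed above.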

Comparisons

There are other options to consider for ImageNet datasets. The comparison chart below was published in Towards Data Science on June 7, 2019 by Aqeel Anwar.

Anwar concluded at the time that although ResNet has a much higher accuracy score than AlexNet, it “requires a lot of computations (about 10 times more than that of AlexNet) which means more training time and energy required.” He also noted that VGGNet not only had a lower accuracy than ResNet, but “It takes more time to train a VGGNet”. We can deduce from the comparison chart that Inception scored slightly lower than ResNet in terms of accuracy, but with far fewer parameters.

We will explore this further by asking three whys: why are accuracy, training time, and energy consumption each an essential consideration when choosing an architecture?

Why is accuracy essential when considering an architecture?

The accuracy of a neural network architecture directly impacts the quality of its predictions. An architecture that consistently achieves higher accuracy ensures that the model’s predictions align closely with the truth of the data, enabling reliable decision-making and enhancing the effectiveness of the deployed solution.

Why is training time essential when considering an architecture?

Training time is a pivotal consideration, particularly in time-sensitive scenarios and resource-constrained environments. The architecture’s complexity, layer depth, and the availability of resources all influence the time required for training. Reduced training time not only shortens the build-and-train cycle but also means quicker deployment, which translates to a faster product delivery timeline.

Why is energy consumption essential when considering an architecture?

Architectures that demand excessive energy during the training and deployment phases can take a toll on resources and increase operational costs. By leveraging energy-efficient architectures, organizations can pursue technological advancement while keeping their operational budgets under control.

Implementation

Implementing ResNet may be facilitated by a variety of libraries and frameworks that offer a range of features and capabilities to suit different project needs. The choice of framework may depend on user preference and/or specific features needed for desired customizations. Examples for some prominent libraries and frameworks that support the implementation of ResNet are below:

TensorFlow: TensorFlow is a widely-used open-source framework developed by Google that offers a comprehensive suite of tools for building and training deep learning models. It provides a flexible environment for constructing ResNet architectures and fine-tuning them according to the specific task.

Keras: Keras is a high-level neural network API that can run on top of various deep learning frameworks, including TensorFlow. Known for its user-friendly interface, Keras simplifies the process of creating ResNet models with its intuitive syntax and pre-built layers.

Caffe: Caffe is a deep learning framework designed for efficiency and speed. It has gained popularity for its ability to process large-scale image datasets efficiently, making it suitable for ResNet implementations in image-related tasks.

CNTK (Microsoft Cognitive Toolkit): Developed by Microsoft, CNTK is an open-source deep learning framework that offers flexibility for designing complex neural network architectures. It supports ResNet implementations and can be beneficial for research-oriented projects.

PyTorch: PyTorch is an open-source machine learning library developed by Facebook’s AI Research lab. It is known for its dynamic computation graph and is particularly favored by researchers for its flexibility in creating and modifying ResNet models.

MXNet: MXNet is a deep learning framework that emphasizes both efficiency and flexibility. It offers support for symbolic and imperative programming, making it suitable for constructing ResNet architectures and other neural network models.

Chainer: Chainer is a framework designed for dynamic neural networks, enabling the creation of models with dynamic computation graphs. This characteristic can be advantageous when implementing ResNet variants that require adaptive adjustments during training.

Fastai: Fastai is a high-level library built on top of PyTorch that aims to simplify the process of training deep learning models. It provides convenient functions for creating ResNet models and handling various data preprocessing tasks.

The choice of implementation framework is dependent on the specific project goals, the team’s familiarity with the tools, and the compatibility with the dataset and computing resources available.

Data Requirements

Key points to consider in terms of data requirements for training a ResNet model:

Image Data: ResNet is primarily used for image-related tasks, so a dataset of images is relevant. The more diverse and representative the dataset is, the better the model’s ability to generalize.

Labeling: Each image in the dataset needs to be labeled correctly for the task of training the ResNet model.

Data Augmentation: Data augmentation techniques like rotation, cropping, flipping, and color adjustments can enhance the model’s ability to generalize by introducing variations of the training data.

Data Preprocessing: Images often need preprocessing before feeding into the network. This includes resizing images to a consistent size, normalizing pixel values, and potentially applying specific preprocessing steps depending on the architecture’s requirements.

Data Split: The dataset should be split into three parts: training data, validation data, and testing data. The training data is used to train the model, the validation data helps tune hyperparameters, and the testing data evaluates the model’s final performance.

Balanced Classes: If the dataset has multiple classes, it’s important to have a balanced distribution of samples across classes. Imbalanced classes can lead to biased model predictions.

Quality Control: Ensure the quality of the dataset by removing duplicates, inaccurately labeled samples, and outliers.

Hardware Resources: Training deep learning models like ResNet can be computationally intensive. Suitable hardware resources like GPUs will accelerate training.

Data Privacy and Ethics: Respect data privacy regulations and ensure that the data is collected ethically and with proper consent.
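Several of the points above (preprocessing, augmentation, and the data split) can be sketched with the Python standard library alone. The example assumes 8-bit grayscale images stored as lists of rows; the function names, the 70/15/15 split, and the fixed seed are illustrative assumptions rather than recommendations from the sources cited in this paper.

```python
import random


def normalize(image):
    """Scale 8-bit pixel values from [0, 255] to [0.0, 1.0]."""
    return [[pixel / 255.0 for pixel in row] for row in image]


def horizontal_flip(image):
    """Simple augmentation: mirror each row left to right."""
    return [row[::-1] for row in image]


def split_dataset(items, train_frac=0.7, val_frac=0.15, seed=0):
    """Shuffle once, then cut into training, validation, and test subsets."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split reproducible
    n_train = round(len(shuffled) * train_frac)
    n_val = round(len(shuffled) * val_frac)
    return (shuffled[:n_train],
            shuffled[n_train:n_train + n_val],
            shuffled[n_train + n_val:])


image = [[0, 128, 255], [64, 64, 64]]
print(normalize(image))
print(horizontal_flip(image))

train, val, test = split_dataset(range(100))
print(len(train), len(val), len(test))
```

Augmentations such as flipping are normally applied to the training subset only, so that the validation and test subsets continue to reflect unmodified data.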

Responsible AI and Human-in-the-Loop (HITL)

Advancements in artificial intelligence, including models like ResNet, come with an inherent responsibility to ensure their ethical and responsible deployment. As AI technologies become increasingly integrated into various domains, it becomes essential to consider the potential societal and ethical implications they may carry. Incorporating Responsible AI practices and the concept of Human-in-the-Loop (HITL) can help mitigate risks and ensure that the benefits of AI are harnessed in a responsible manner.

Responsible AI Principles

Transparency: When deploying ResNet models, it’s crucial to ensure transparency in how the model operates, how it makes decisions, and the underlying processes. Providing explanations for model predictions can enhance user trust and understanding.

Fairness and Bias Mitigation: Addressing biases that may exist in training data or model outputs is paramount. Techniques such as bias-aware regularization, data preprocessing adjustments, and fairness-aware machine learning should be employed to prevent biased outcomes.

Accountability: Defining clear lines of accountability for AI systems is necessary. This involves attributing responsibilities to individuals or teams for the system’s design, monitoring, and addressing any issues that arise.

Data Privacy: ResNet models often require extensive datasets for training. Ensuring compliance with data privacy regulations and using anonymization techniques to protect sensitive user information is vital.

Human-in-the-Loop (HITL) Approach

While ResNet models are powerful, there are complexities and nuances in real-world data that may be challenging for the model to comprehend. Adopting a Human-in-the-Loop approach can enhance the model’s accuracy and ethical performance.

Annotation and Validation: Human annotators can validate model outputs, enhancing the reliability of predictions. This iterative process ensures high-quality data for ongoing model improvement.

Complex Decision-making: For critical decisions, involving human experts can help assess situations that may require contextual understanding or ethical judgment beyond the model’s capabilities.

Feedback Loop: Building a feedback mechanism where users can provide feedback on model predictions can improve the model’s accuracy over time. This iterative process aids in model recalibration and fine-tuning.
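One common way to put the Human-in-the-Loop points above into practice is to route low-confidence predictions to a reviewer. The sketch below is a minimal illustration: the function name, the returned labels, and the 0.9 threshold are hypothetical, and a suitable threshold would depend on the application and its risks.

```python
def route_prediction(label, confidence, threshold=0.9):
    """Accept confident predictions automatically; send the rest to a person."""
    if confidence >= threshold:
        return ("auto", label)
    return ("human_review", label)


# Made-up model outputs: (predicted label, confidence score).
predictions = [("pedestrian", 0.97), ("road sign", 0.62), ("car", 0.91)]

for label, confidence in predictions:
    decision, _ = route_prediction(label, confidence)
    print(f"{label:10s} confidence={confidence:.2f} -> {decision}")
```

Labels corrected during human review can then be added back to the training data, which is the feedback loop described above.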

Ethical Considerations

Bias and Fairness: Continuous monitoring for bias in model predictions and taking steps to address them is essential. Regularly assessing fairness metrics ensures equitable outcomes across different groups.

Interpretability: Ensuring that the decisions made by ResNet models are interpretable to humans is critical. This helps users understand how the model reaches its conclusions and increases their confidence in the model’s predictions.
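As a simple example of the fairness metrics mentioned above, accuracy can be computed separately for each group and the gap between the best and worst group tracked over time. The sketch below uses made-up records of the form (group, true label, predicted label); the group names, the data, and the function name are purely illustrative, and per-group accuracy is only one of many possible fairness measures.

```python
from collections import defaultdict


def accuracy_by_group(records):
    """Map each group to the fraction of its records predicted correctly."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for group, y_true, y_pred in records:
        total[group] += 1
        correct[group] += int(y_true == y_pred)
    return {group: correct[group] / total[group] for group in total}


# Fabricated toy data for illustration only.
records = [
    ("A", 1, 1), ("A", 0, 0), ("A", 1, 0), ("A", 1, 1),
    ("B", 1, 0), ("B", 0, 0), ("B", 1, 0), ("B", 0, 1),
]

per_group = accuracy_by_group(records)
gap = max(per_group.values()) - min(per_group.values())
print(per_group)
print("accuracy gap between groups:", gap)
```

A large gap does not by itself explain the cause, but it signals that the training data or the model deserves closer inspection for the affected group.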

Incorporating Responsible AI practices and embracing the Human-in-the-Loop approach is not only an ethical obligation but also a strategic choice. It helps in building trust, reducing risks, and ensuring that the transformative potential of AI, as demonstrated by ResNet, is harnessed responsibly. By integrating these principles into the deployment of AI systems, we can drive positive societal impact while mitigating potential harms.

Conclusion

In the rapidly evolving landscape of deep learning and computer vision, Residual Networks (ResNet) stand as an incredible breakthrough. This paper presents the significance of ResNet, from its inception as a solution to vanishing and exploding gradients to its impact on various computer vision tasks. By exploring its architecture, applications, advantages, and limitations, a comprehensive guide for understanding ResNet’s significance in the field of artificial intelligence has been provided.

ResNet’s use of residual blocks and skip connections has overcome the challenges associated with training deep neural networks, leading to improved accuracy, faster convergence, and better generalization. Its widespread applications in image classification, object detection, and segmentation have expanded the possibilities of computer vision. Moreover, the examination of ResNet’s ethical considerations and its role in responsible AI practices stress the importance of integrating advanced models with ethical principles.

As organizations increasingly harness the capabilities of AI for decision-making and predictive analytics, ResNet’s architecture and capabilities offer a springboard for pushing the boundaries of deep learning. By adopting the Responsible AI framework and involving the Human-in-the-Loop Approach, we can ensure that the benefits of AI are leveraged responsibly, aligning technological innovation with societal well-being.

In conclusion, ResNet exemplifies the power of creative problem-solving in AI research. Its legacy continues to impact the development of new architectures and applications, shaping the future of computer vision and the broader field of artificial intelligence. By understanding ResNet’s evolution and the principles that underpin its success, we are better equipped to navigate the dynamic landscape of AI and contribute to a more ethically sound and innovative future.

References

He, K., Zhang, X., Ren, S., Sun, J. (2015, December 10). Deep Residual Learning for Image Recognition. Cornell University arXiv. https://arxiv.org/abs/1512.03385
Pawagfg, jatingrg2399, simmytarika5. (2023, January 10). Residual Networks (ResNet) – Deep Learning. Geeksforgeeks. https://www.geeksforgeeks.org/residual-networks-resnet-deep-learning/
Ruiz, Pablo. (2018, Oct 8). Understanding and Visualizing ResNets. Towards Data Science. https://towardsdatascience.com/understanding-and-visualizing-resnets-442284831be8
Wikipedia contributors. (2023, July 11). Residual neural network. In Wikipedia, The Free Encyclopedia. Retrieved August 28 2023 from https://en.wikipedia.org/wiki/Residual_neural_network
Great Learning Team. (2022, March 22). Introduction to Resnet or Residual Network. Great Learning. https://www.mygreatlearning.com/blog/resnet/
Giannopoulos, M., Aidini, A., Pentari, A., Fotiadou, K. (2020, April). Classification of Compressed Remote Sensing Multispectral Images via Convolutional Neural Networks. ResearchGate. https://www.researchgate.net/publication/340835169_Classification_of_Compressed_Remote_Sensing_Multispectral_Images_via_Convolutional_Neural_Networks
Databasecamp contributors. (2023, March 18). ResNet: Residual Neural Networks – easily explained!. Data Base Camp. https://databasecamp.de/en/ml/resnet-en
Brownlee, Jason. (2019, July 5). Best Practices for Preparing and Augmenting Image Data for CNNs. Machine Learning Mastery. https://machinelearningmastery.com/best-practices-for-preparing-and-augmenting-image-data-for-convolutional-neural-networks/
Anwar, A. (2019, June 7). Difference between AlexNet, VGGNet, ResNet, and Inception. Towards Data Science. https://towardsdatascience.com/the-w3h-of-alexnet-vggnet-resnet-and-inception-7baaaecccc96
Run.ai contributors. Deep Learning for Computer Vision The Abridged Guide. Run.ai. Retrieved August 28, 2023. https://www.run.ai/guides/deep-learning-for-computer-vision
Rodriquez, M.L. (2021, August 21). Object Detection using a Deep Neural Network. Medium. https://medium.com/@yrodriguezmd/object-detection-using-a-deep-neural-network-213ec8ac2da8
Hugging Face contributors. Object Detection. Hugging Face. Retrieved August 28, 2023. https://huggingface.co/docs/transformers/tasks/object_detection
Hugging Face contributors. Image Classification. Hugging Face. Retrieved August 28, 2023. https://huggingface.co/docs/transformers/tasks/image_classification
Borad, Anand. (2021, May 12). Understanding Object Localization with Deep Learning. eInfochips. https://www.einfochips.com/blog/understanding-object-localization-with-deep-learning/
Gupta, A., Anpalagan, A., Guan, L., Khwaja, A.S. (2021, July 10). Deep learning for object detection and scene perception in self-driving cars: Survey, challenges, and open issues. ScienceDirect. https://www.sciencedirect.com/science/article/pii/S2590005621000059
Erickson, B.J., Korfiatis, P., Akkus, Z., Kline, T., Philbrick, K. (2017, March 17). Toolkits and Libraries for Deep Learning. Springer Link. https://link.springer.com/article/10.1007/s10278-017-9965-6
Wikipedia contributors. (2023, March 21). Human-in-the-loop. In Wikipedia, The Free Encyclopedia. Retrieved August 29 2023 from https://en.wikipedia.org/wiki/Human-in-the-loop
National Institute of Standards and Technology (NIST). (2023, March 22). Trustworthy & Responsible AI Resource Center. Retrieved August 29, 2023 from https://airc.nist.gov/Home
Wang, Z., Qinami, K., Karakozis, I.C., Genova, K., Nair, P., Hata, K., Russakovsky, O. (2019, November 26). Towards Fairness in Visual Recognition: Effective Strategies for Bias Mitigation. Cornell University arXiv. https://arxiv.org/abs/1911.11834