Merge pull request #441 from harvard-edge/440-first-part-of-robust_aiqmd
440 First part of "robust_ai.qmd"
profvjreddi authored Sep 9, 2024
2 parents 56b2b70 + bd0a9de commit 7584c5f
Showing 2 changed files with 14 additions and 12 deletions.
24 changes: 13 additions & 11 deletions contents/robust_ai/robust_ai.qmd
@@ -34,7 +34,7 @@ This chapter explores the fundamental concepts, techniques, and tools for buildi

## Introduction

Robust AI refers to a system's ability to maintain its performance and reliability in the presence of hardware, software, and errors. A robust machine learning system is designed to be fault-tolerant and error-resilient, capable of operating effectively even under adverse conditions.
Robust AI refers to a system's ability to maintain its performance and reliability in the presence of errors. A robust machine learning system is designed to be fault-tolerant and error-resilient, capable of operating effectively even under adverse conditions.

As ML systems become increasingly integrated into various aspects of our lives, from cloud-based services to edge devices and embedded systems, the impact of hardware and software faults on their performance and reliability becomes more significant. In the future, as ML systems become more complex and are deployed in even more critical applications, the need for robust and fault-tolerant designs will be paramount.

@@ -74,7 +74,7 @@ However, in one instance, when the file size was being computed for a valid non-

The impact of this silent data corruption was significant, leading to missing files and incorrect data in the output database. The application relying on the decompressed files failed due to the data inconsistencies. In the case study presented in the paper, Facebook's infrastructure, which consists of hundreds of thousands of servers handling billions of requests per day from their massive user base, encountered a silent data corruption issue. The affected system processed user queries, image uploads, and media content, which required fast, reliable, and secure execution.

This case study illustrates how silent data corruption can propagate through multiple layers of an application stack, leading to data loss and application failures in a large-scale distributed system. The intermittent nature of the issue and the lack of explicit error messages made it particularly challenging to diagnose and resolve. But this is not restricted to just Meta, even other companies such as Google that operate AI hypercomputers face this challenge. @fig-sdc-jeffdean [Jeff Dean](https://en.wikipedia.org/wiki/Jeff_Dean), Chief Scientist at Google DeepMind and Google Research, discusses SDCS and their impact on ML systems.
This case study illustrates how silent data corruption can propagate through multiple layers of an application stack, leading to data loss and application failures in a large-scale distributed system. The intermittent nature of the issue and the lack of explicit error messages made it particularly challenging to diagnose and resolve. This challenge is not restricted to Meta; other companies that operate AI hypercomputers, such as Google, face it as well. In @fig-sdc-jeffdean, [Jeff Dean](https://en.wikipedia.org/wiki/Jeff_Dean), Chief Scientist at Google DeepMind and Google Research, discusses SDCs and their impact on ML systems.

![Silent data corruption (SDC) errors are a major issue for AI hypercomputers. Source: [Jeff Dean](https://en.wikipedia.org/wiki/Jeff_Dean) at [MLSys 2024](https://mlsys.org/), Keynote (Google)](./images/jpg/sdc-google-jeff-dean.jpeg){#fig-sdc-jeffdean}

@@ -98,7 +98,9 @@ Let's consider a few examples, starting with outer space exploration. NASA's Mar

![NASA's failed Mars Polar Lander mission in 1999 cost over \$200M. Source: [SlashGear](https://www.slashgear.com/1094840/nasas-failed-mars-missions-that-cost-over-200-million/)](./images/png/nasa_example.png){#fig-nasa-example}

Back on earth, in 2015, a Boeing 787 Dreamliner experienced a complete electrical shutdown during a flight due to a software bug in its generator control units. The bug caused the generator control units to enter a failsafe mode, cutting power to the aircraft's electrical systems and forcing an emergency landing. [This incident](https://www.engineering.com/story/vzrxw) underscores the potential for software faults to have severe consequences in complex embedded systems like aircraft. As AI technologies are increasingly applied in aviation, such as in autonomous flight systems and predictive maintenance, ensuring the robustness and reliability of these systems will be critical to passenger safety.
Back on earth, in 2015, a Boeing 787 Dreamliner experienced a complete electrical shutdown during a flight due to a software bug in its generator control units. This incident underscores the potential for software faults to have severe consequences in complex embedded systems like aircraft. As AI technologies are increasingly applied in aviation, such as in autonomous flight systems and predictive maintenance, ensuring the robustness and reliability of these systems will be critical to passenger safety.

>_“If the four main generator control units (associated with the engine-mounted generators) were powered up at the same time, after 248 days of continuous power, all four GCUs will go into failsafe mode at the same time, resulting in a loss of all AC electrical power regardless of flight phase.” -- [Federal Aviation Administration directive](https://s3.amazonaws.com/public-inspection.federalregister.gov/2015-10066.pdf) (2015)_
As AI capabilities increasingly integrate into embedded systems, the potential for faults and errors becomes more complex and severe. Imagine a smart [pacemaker](https://www.bbc.com/future/article/20221011-how-space-weather-causes-computer-errors) that experiences a sudden glitch; a patient could die as a result. AI algorithms, such as those used for perception, decision-making, and control, introduce new sources of potential faults, such as data-related issues, model uncertainties, and unexpected behaviors in edge cases. Moreover, the opaque nature of some AI models can make it challenging to identify and diagnose faults when they occur.

@@ -146,7 +148,7 @@ Transient faults can manifest through different mechanisms depending on the affe

A common example of a transient fault is a bit flip in main memory. If an important data structure or critical instruction is stored in the affected memory location, it can lead to incorrect computations or program misbehavior. For instance, a bit flip in the memory storing a loop counter can cause the loop to execute indefinitely or terminate prematurely. Transient faults in control registers or flag bits can alter the flow of program execution, leading to unexpected jumps or incorrect branch decisions. In communication systems, transient faults can corrupt transmitted data packets, resulting in retransmissions or data loss.

In ML systems, transient faults can have significant implications during the training phase [@he2023understanding]. ML training involves iterative computations and updates to model parameters based on large datasets. If a transient fault occurs in the memory storing the model weights or gradients, it can lead to incorrect updates and compromise the convergence and accuracy of the training process. @fig-sdc-training-fault Show a real-world example from Google's production fleet where an SDC anomaly caused a significant difference in the gradient norm.
In ML systems, transient faults can have significant implications during the training phase [@he2023understanding]. ML training involves iterative computations and updates to model parameters based on large datasets. If a transient fault occurs in the memory storing the model weights or gradients, it can lead to incorrect updates and compromise the convergence and accuracy of the training process. @fig-sdc-training-fault shows a real-world example from Google's production fleet where an SDC anomaly caused a significant difference in the gradient norm.

![SDC in ML training phase results in anomalies in the gradient norm. Source: Jeff Dean, MLSys 2024 Keynote (Google)](./images/jpg/google_sdc_jeff_dean_anomaly.jpg){#fig-sdc-training-fault}
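
To make the effect of a single bit flip concrete, the following minimal sketch (illustrative only, not taken from the chapter or any specific system) flips one bit in the IEEE-754 representation of a float32 weight; a low-order mantissa flip barely changes the value, while an exponent-bit flip changes it by dozens of orders of magnitude:

```python
import struct

def flip_bit(value: float, bit: int) -> float:
    """Flip one bit in the IEEE-754 float32 representation of `value`."""
    # Pack the float into its 32-bit integer representation, flip the bit, unpack.
    as_int = struct.unpack("<I", struct.pack("<f", value))[0]
    corrupted = as_int ^ (1 << bit)
    return struct.unpack("<f", struct.pack("<I", corrupted))[0]

weight = 0.15  # a hypothetical model weight
print(flip_bit(weight, 1))   # low-order mantissa bit: negligible change
print(flip_bit(weight, 30))  # high-order exponent bit: the value explodes
```

A corruption of the latter kind in a weight or gradient buffer is precisely the sort of silent error that can produce gradient-norm anomalies like the one shown above.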

@@ -378,15 +380,15 @@ Adversarial attacks aim to trick models into making incorrect predictions by pro

One can generate prompts that lead to unsafe images in text-to-image models like DALL·E [@ramesh2021zero] or Stable Diffusion [@rombach2022highresolution]. Similarly, by altering the pixel values of an image, attackers can deceive a facial recognition system into identifying a face as a different person.

Adversarial attacks exploit the way ML models learn and make decisions during inference. These models work on the principle of recognizing patterns in data. An adversary crafts special inputs with perturbations to mislead the model's pattern recognition\-\--essentially 'hacking' the model's perceptions.
Adversarial attacks exploit the way ML models learn and make decisions during inference. These models work on the principle of recognizing patterns in data. An adversary crafts special inputs with perturbations to mislead the model's pattern recognition---essentially 'hacking' the model's perceptions.
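
As a concrete sketch of how such a perturbation can be crafted (an illustrative example that assumes a PyTorch classifier `model`, a normalized input `image`, and its ground-truth `label` are already defined), the fast gradient sign method nudges each pixel in the direction that increases the model's loss:

```python
import torch
import torch.nn.functional as F

def fgsm_attack(model, image, label, epsilon=0.03):
    """Craft an adversarial example with the fast gradient sign method (FGSM)."""
    image = image.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(image), label)
    loss.backward()
    # Step each pixel in the direction that increases the loss, then keep it in range.
    perturbed = image + epsilon * image.grad.sign()
    return perturbed.clamp(0.0, 1.0).detach()
```

Because this relies on access to the target model's gradients, it corresponds to the white-box setting described below.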

Adversarial attacks fall under different scenarios:

* **Whitebox Attacks:** The attacker fully knows the target model's internal workings, including the training data, parameters, and architecture [@ye2021thundernna]. This comprehensive access creates favorable conditions for attackers to exploit the model's vulnerabilities. The attacker can use specific and subtle weaknesses to craft effective adversarial examples.

* **Blackbox Attacks:** In contrast to white-box attacks, black-box attacks involve the attacker having little to no knowledge of the target model [@guo2019simple]. To carry out the attack, the adversarial actor must carefully observe the model's output behavior.

* **Greybox Attacks:** These fall between blackbox and whitebox attacks. The attacker has only partial knowledge about the target model's internal design [@xu2021grey]. For example, the attacker could have knowledge about training data but not the architecture or parameters. In the real world, practical attacks fall under black black-box box grey-boxes.
* **Greybox Attacks:** These fall between blackbox and whitebox attacks. The attacker has only partial knowledge about the target model's internal design [@xu2021grey]. For example, the attacker could have knowledge about training data but not the architecture or parameters. In the real world, practical attacks typically fall under black-box or grey-box categories.

The landscape of machine learning models is complex and broad, especially given their relatively recent integration into commercial applications. This rapid adoption, while transformative, has brought to light numerous vulnerabilities within these models. Consequently, various adversarial attack methods have emerged, each strategically exploiting different aspects of different models. Below, we highlight a subset of these methods, showcasing the multifaceted nature of adversarial attacks on machine learning models:

@@ -796,7 +798,7 @@ Detecting and mitigating distribution shifts is an ongoing process that requires

## Software Faults

#### Definition and Characteristics
### Definition and Characteristics

Software faults refer to defects, errors, or bugs in the runtime software frameworks and components that support the execution and deployment of ML models [@myllyaho2022misbehaviour]. These faults can arise from various sources, such as programming mistakes, design flaws, or compatibility issues [@zhang2008distribution], and can have significant implications for ML systems' performance, reliability, and security. Software faults in ML frameworks exhibit several key characteristics:

@@ -814,7 +816,7 @@ Software faults refer to defects, errors, or bugs in the runtime software framew

Understanding the characteristics of software faults in ML frameworks is crucial for developing effective fault prevention, detection, and mitigation strategies. By recognizing the diversity, propagation, intermittency, and impact of software faults, ML practitioners can design more robust and reliable systems resilient to these issues.

#### Mechanisms of Software Faults in ML Frameworks
### Mechanisms of Software Faults in ML Frameworks

Machine learning frameworks, such as TensorFlow, PyTorch, and scikit-learn, provide powerful tools and abstractions for building and deploying ML models. However, these frameworks are not immune to software faults that can impact ML systems' performance, reliability, and correctness. Let's explore some of the common software faults that can occur in ML frameworks:

@@ -830,7 +832,7 @@ Machine learning frameworks, such as TensorFlow, PyTorch, and sci-kit-learn, pro

**Inadequate Error Handling and Exception Management:** Without proper error handling and exception management, ML systems can crash or behave unexpectedly when encountering exceptional conditions or invalid inputs. Failing to catch and handle specific exceptions or relying on generic exception handling can make it difficult to diagnose and recover from errors gracefully, leading to system instability and reduced reliability. Furthermore, incomplete or misleading error messages can hinder the ability to effectively debug and resolve software faults in ML frameworks, prolonging the time required to identify and fix issues.
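
A minimal sketch of the difference (the file path, logger setup, and recovery behavior are illustrative assumptions): catching specific exceptions with informative messages versus a bare `except` that silently swallows the fault.

```python
import logging
import pickle

logger = logging.getLogger(__name__)

def load_model(path: str):
    """Load a serialized model, surfacing actionable errors instead of hiding them."""
    try:
        with open(path, "rb") as f:
            return pickle.load(f)
    except FileNotFoundError:
        logger.error("Model file %s is missing; check the deployment artifact.", path)
        raise
    except pickle.UnpicklingError as err:
        logger.error("Model file %s appears corrupted: %s", path, err)
        raise

# By contrast, a bare `except Exception: pass` around the load would mask both
# faults and leave the system running without a model, the kind of silent
# failure described in the paragraph above.
```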

#### Impact on ML Systems
### Impact on ML Systems

Software faults in machine learning frameworks can have significant and far-reaching impacts on ML systems' performance, reliability, and security. Let's explore the various ways in which software faults can affect ML systems:

@@ -848,7 +850,7 @@ Software faults in machine learning frameworks can have significant and far-reac

Understanding the potential impact of software faults on ML systems is crucial for prioritizing testing efforts, implementing fault-tolerant designs, and establishing effective monitoring and debugging practices. By proactively addressing software faults and their consequences, ML practitioners can build more robust, reliable, and secure ML systems that deliver accurate and trustworthy results.

#### Detection and Mitigation
### Detection and Mitigation

Detecting and mitigating software faults in machine learning frameworks is essential to ensure ML systems' reliability, performance, and security. Let's explore various techniques and approaches that can be employed to identify and address software faults effectively:

@@ -883,7 +885,7 @@ Get ready to become an AI fault-fighting superhero! Software glitches can derail

## Tools and Frameworks

Given the significance or importance of developing robust AI systems, in recent years, researchers and practitioners have developed a wide range of tools and frameworks to understand how hardware faults manifest and propagate to impact ML systems. These tools and frameworks play a crucial role in evaluating the resilience of ML systems to hardware faults by simulating various fault scenarios and analyzing their impact on the system's performance. This enables designers to identify potential vulnerabilities and develop effective mitigation strategies, ultimately creating more robust and reliable ML systems that can operate safely despite hardware faults. This section provides an overview of widely used fault models in the literature and the tools and frameworks developed to evaluate the impact of such faults on ML systems.
Given the importance of developing robust AI systems, in recent years, researchers and practitioners have developed a wide range of tools and frameworks to understand how hardware faults manifest and propagate to impact ML systems. These tools and frameworks play a crucial role in evaluating the resilience of ML systems to hardware faults by simulating various fault scenarios and analyzing their impact on the system's performance. This enables designers to identify potential vulnerabilities and develop effective mitigation strategies, ultimately creating more robust and reliable ML systems that can operate safely despite hardware faults. This section provides an overview of widely used fault models in the literature and the tools and frameworks developed to evaluate the impact of such faults on ML systems.
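
As a simple illustration of how such fault-injection tools work in practice (a minimal sketch rather than any particular framework's API; the layer name in the usage comment is hypothetical), a PyTorch forward hook can corrupt a small fraction of a layer's activations so that accuracy under faults can be compared against a fault-free run:

```python
import torch

def make_fault_hook(prob=1e-3, scale=100.0):
    """Return a forward hook that randomly corrupts a fraction of activations."""
    def hook(module, inputs, output):
        mask = torch.rand_like(output) < prob
        # Emulate a high-order bit flip by drastically scaling the affected values.
        return torch.where(mask, output * scale, output)
    return hook

# Hypothetical usage: attach the hook to one layer of an existing model, run the
# evaluation set, and compare accuracy against the fault-free baseline.
# handle = model.layer3.register_forward_hook(make_fault_hook())
# ... evaluate ...
# handle.remove()
```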

### Fault Models and Error Models

2 changes: 1 addition & 1 deletion contents/sustainable_ai/sustainable_ai.qmd
@@ -650,7 +650,7 @@ Despite these promising directions, several challenges need to be addressed. One

1. Maximize the utilization of accelerator and system resources.
2. Prolong the lifetime of AI infrastructure.
3. design systems hardware with environmental impact in mind.
3. Design systems hardware with environmental impact in mind.

On the software side, we should weigh the cost of experimentation against the subsequent training cost. Techniques such as neural architecture search and hyperparameter optimization can be used for design space exploration. However, these are often very resource-intensive. Efficient experimentation can significantly reduce the environmental footprint. Next, methods to reduce wasted training efforts should be explored.
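
One simple way to reduce wasted training effort is to stop runs that have stalled; the sketch below (with illustrative thresholds and a hypothetical training-loop function) implements a patience-based early-stopping check that can wrap any training loop:

```python
class EarlyStopper:
    """Stop training when the validation loss has not improved for `patience` epochs."""

    def __init__(self, patience: int = 5, min_delta: float = 1e-4):
        self.patience = patience
        self.min_delta = min_delta
        self.best_loss = float("inf")
        self.bad_epochs = 0

    def should_stop(self, val_loss: float) -> bool:
        # Reset the counter on meaningful improvement; otherwise count a bad epoch.
        if val_loss < self.best_loss - self.min_delta:
            self.best_loss = val_loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience

# Hypothetical usage inside a training loop:
# stopper = EarlyStopper(patience=3)
# for epoch in range(max_epochs):
#     val_loss = train_one_epoch(...)
#     if stopper.should_stop(val_loss):
#         break  # avoid spending compute (and energy) on a stalled run
```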

