
Enhancing Availability in a Huge Network Infrastructure: The Google Experience



Google's Zero Touch Network (ZTN)

Bikash Koley is Director of Network Architecture, Engineering and Planning at Google

Professors Mohamed Cheriet and Kim Khoa Nguyen, of the Synchromedia laboratory, organised a mini-conference as part of IEEE’s 12th International Conference on Network and Service Management (CNSM), hosted from October 31 to November 4, 2016, at the École de technologie supérieure (ÉTS).

This article is an overview of the research axis on Zero Touch Network (ZTN), presented by Distinguished Engineer and Director of Network Architecture, Engineering and Planning at Google, Bikash Koley. It was written by a student researcher of the Synchromedia laboratory who attended the conference.

 

Huge network infrastructure

The network built by Google has a huge infrastructure: it contains two backbones, B4 and B2. B4 is the private WAN (Wide Area Network) that interconnects Google’s clusters across the world, and B2 is the second backbone that interconnects Google data centers to the Internet. B4 is a centrally controlled global SDN (Software Defined Networking) backbone, and B2 is the largest IP (Internet Protocol) backbone in the world. This infrastructure accommodates real-time applications with critical bandwidth requirements and high delay sensitivity. The following video presents the infrastructure of just one Google cluster, so imagine the huge infrastructure that interconnects Google’s data centers across the planet (B4 and B2)!

The main objective of cloud infrastructure providers such as Google is to offer the highest level of availability in their infrastructure, in order to maximize resource utilization and minimize the total cost of maintenance and configuration. But with the exponential growth of application requirements in terms of bandwidth and QoS (Quality of Service), and the expansion of the infrastructure itself, it is not easy to guarantee the required level of availability while maintaining scalability, efficiency and reliability in the network. In this context, we will share Google’s experience in this area.

Why is high network availability such a challenge?

In this huge infrastructure, even when the probability of failure is only 0.1%, the number of failures per day will be huge; in fact, the number of failures per day increases with the size of the network. That is why it is very important to model the number of failures that can happen as the network scales, and to build an efficient and sufficient model for redundancy. Building a massive amount of redundancy to avoid failures is a “bad” solution because it wastes both resources and effort. So Google’s main objective is to build a system within the failure budget. It is not just about minimizing the number of failures, but about balancing the frequency of failures with the existing outage budget, while taking into account the velocity of evolution, the need for scalability, and management complexity. To achieve this objective, Google proposes to focus on minimizing the Mean Time To Repair (MTTR) instead of maximizing the Mean Time Between Failures (MTBF).
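To make this tradeoff concrete, the following Python sketch (illustrative only, not from the talk) shows the standard relationship between MTBF, MTTR and availability: halving MTTR helps exactly as much as doubling MTBF, and a 99.99% availability target translates into an outage budget of roughly four minutes per month.

# Illustrative sketch only: back-of-the-envelope availability arithmetic,
# not Google's tooling.

def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability = MTBF / (MTBF + MTTR)."""
    return mtbf_hours / (mtbf_hours + mttr_hours)

def monthly_outage_budget_minutes(target: float) -> float:
    """Allowed downtime per 30-day month for a given availability target."""
    return (1.0 - target) * 30 * 24 * 60

# A 99.99% target leaves roughly 4 minutes of outage per month.
print(f"budget at 99.99%: {monthly_outage_budget_minutes(0.9999):.1f} min/month")

# Halving MTTR improves availability exactly as much as doubling MTBF.
print(f"MTBF 1000 h, MTTR 1.0 h -> {availability(1000, 1.0):.5f}")
print(f"MTBF 1000 h, MTTR 0.5 h -> {availability(1000, 0.5):.5f}")
print(f"MTBF 2000 h, MTTR 1.0 h -> {availability(2000, 1.0):.5f}")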

Tradeoff of network operations and failure analysis

A reliable, scalable and efficient infrastructure

“Google wants to ensure scale, reliability and efficiency in their infrastructure, and traditional approaches are unable to guarantee these three specifications at the same time.”

Bikash Koley

To ensure scale, reliability and efficiency, Google proposes to analyze failures in order to understand their nature and their causes. To achieve this goal, Google uses a post-mortem approach based on deep failure analysis. Unlike a trouble ticket or an event-log collection, this technique produces a curated description of the failure. The process is blame-free: the point is to ensure that everyone is completely forthcoming about the failure. A post-mortem is only written for previously unseen failures that have a significant impact. It contains the following details for each failure: the duration, a description of the user impact (followed by a detailed timeline of events leading up to the outage), what happened during the outage, what happened after the failure was resolved and, finally, the root causes, established by sifting through logs, reviewing code and reproducing the failure in the lab. This is followed by a discussion of what worked and what did not work while trying to solve the problem causing the failure. Finally, action items record the decisions taken to address the studied failure; these can lead to code changes or process changes.
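As an illustration of how such a curated report might be captured in a structured form, here is a minimal Python sketch. The schema and field names are assumptions made for this example, following the description above; they are not Google’s actual post-mortem template.

# Illustrative sketch of a structured post-mortem record, following the fields
# described in the article. The field names are assumptions for this example.
from __future__ import annotations
from dataclasses import dataclass, field
from datetime import datetime

@dataclass
class ActionItem:
    description: str   # the concrete follow-up decided after the analysis
    kind: str          # "code change" or "process change"
    owner: str

@dataclass
class PostMortem:
    title: str
    start: datetime
    end: datetime
    user_impact: str                                          # what users experienced
    timeline: list[str] = field(default_factory=list)         # events leading to the outage
    during_outage: list[str] = field(default_factory=list)    # what happened during the outage
    after_resolution: list[str] = field(default_factory=list) # what happened after it was resolved
    root_causes: list[str] = field(default_factory=list)      # from logs, code review, lab repro
    what_worked: list[str] = field(default_factory=list)
    what_did_not_work: list[str] = field(default_factory=list)
    action_items: list[ActionItem] = field(default_factory=list)

    @property
    def duration_minutes(self) -> float:
        return (self.end - self.start).total_seconds() / 60

# Hypothetical entry, for illustration only.
pm = PostMortem(
    title="Example: unseen failure during a planned change",
    start=datetime(2016, 1, 1, 10, 0),
    end=datetime(2016, 1, 1, 10, 42),
    user_impact="elevated latency for a subset of traffic",
)
print(f"duration: {pm.duration_minutes:.0f} minutes")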

An analysis of 100 post-mortem reports written over a 2-year period led to three main observations:

  • Contrary to team expectations, there is a huge similarity in the types of failures in the Google backbones B4 and B2, and no single network (B2, B4, cluster) dominates in failure occurrences.

So the main lesson here is: “There isn’t one network or one plane to focus on if you want to strengthen the network.”

  • 80% of failures last between 10 and 100 minutes, greatly exceeding the outage budget (4 minutes per month for 99.99% availability).
  • 70% of failures happen when a management operation is in progress.

Proposed solution: Zero Touch Network

Network operation must be fully intent driven

“{reliability, efficiency, scale} are NOT tradeoffs… if network operation is fully intent driven.”

Bikash Koley

ZTN (Zero Touch Network) architecture has the following characteristics:

  • Automation: all network operations are automated, requiring no operator steps beyond the instantiation of intent. Humans can have bad and long days and make mistakes, so the ZTN approach minimizes manual operations.
  • Auto-configuration: changes applied to individual network elements are fully declarative, vendor-neutral and derived by the network infrastructure from the high-level, network-wide intent (a minimal sketch follows this list). The truth about the configuration and structure of the network is driven from the model and from the centralized software designed to control and monitor the network; since no changes are made directly to subsystems or elements, there is no need for complex studies on the impact of each change.
  • Safety: any network change is automatically halted and rolled back if the network displays unintended behavior, and the infrastructure does not allow operations that violate network policies.
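To give a concrete feel for what “declarative, vendor-neutral intent” can mean, here is a minimal Python sketch. It is purely illustrative: the names intent and derive_element_configs are assumptions made for this example, not Google’s configuration system. The operator only edits the network-wide intent; software derives the per-element configurations from it.

# Illustrative sketch only: a toy network-wide intent and a function that
# derives per-element, vendor-neutral configurations from it.

# High-level, network-wide intent: which sites exist, how they are linked,
# and what policy applies everywhere.
intent = {
    "sites": ["dc1", "dc2", "dc3"],
    "links": [("dc1", "dc2"), ("dc2", "dc3")],
    "policy": {"min_redundant_paths": 2},
}

def derive_element_configs(intent: dict) -> dict:
    """Expand the network-wide intent into one config per network element."""
    configs = {}
    for site in intent["sites"]:
        neighbors = [b if a == site else a
                     for a, b in intent["links"]
                     if site in (a, b)]
        configs[site] = {"neighbors": neighbors, "policy": intent["policy"]}
    return configs

# Operators edit only `intent`; every element config is regenerated from it.
for element, config in derive_element_configs(intent).items():
    print(element, config)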

Zero Touch Network (ZTN) architecture

In the ZTN architecture, operators never interact with network elements directly. Instead, the workflow agent, a software system, interacts with the infrastructure described by the configuration model and reports the network state presented by the topology model, through transactional APIs.

The final part of this architecture is the Network Management Layer, which has the following missions: detect changes in the configuration model, build a new full-configuration model for the network elements based on these changes, and push it to the network model.

Knowing the state of the network instantly is a key part of the proposed architecture; this is where Streaming Telemetry comes in, used to estimate network behavior after the configuration change has been pushed to the network model. Streaming Telemetry serves as a tool to check that an operation is safe and that the changes do not violate network policies.
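The safety loop described above (push a change, observe the resulting state through streaming telemetry, and roll back automatically if the observed behavior does not match the intent) can be sketched as follows. This is a simplified illustration using assumed function names (push_change, observed_state, expected_state, apply_safely); it is not Google’s management plane.

# Illustrative sketch only: push a configuration change, compare the state
# reported by (stubbed) streaming telemetry with the intended state, and roll
# back automatically on mismatch.

def push_change(element: str, config: dict) -> None:
    """Stand-in for the transactional API that applies a config to an element."""
    print(f"pushing to {element}: {config}")

def observed_state(element: str) -> dict:
    """Placeholder for streaming telemetry; returns a fixed value so the
    example runs. A real system would read counters and state from devices."""
    return {"interfaces_up": 2}

def expected_state(config: dict) -> dict:
    """Derive the state we expect to observe if the change behaves as intended."""
    return {"interfaces_up": len(config["neighbors"])}

def apply_safely(element: str, new_config: dict, old_config: dict) -> bool:
    """Apply a change; halt and roll back if behavior does not match intent."""
    push_change(element, new_config)
    if observed_state(element) != expected_state(new_config):
        push_change(element, old_config)  # automatic rollback
        return False
    return True

ok = apply_safely("dc2",
                  {"neighbors": ["dc1", "dc3"]},
                  {"neighbors": ["dc1"]})
print("change kept" if ok else "change rolled back")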

Lessons learned from the Google experience

Finally, the main lesson learned from the ZTN design is: “Do not treat a change to the network as an exceptional event.” To ensure this, it is essential to:

  • Accept that changes are common
  • Make it safe to improve the network on a daily basis
  • Consider network scaling often and just in time
  • Evolve into a “Zero Touch Network”

 

Aifa Sassi

Author's profile

Haifa SASSI is a Master’s student in the Synchromedia Lab. She received her B.Eng. degree in Telecommunications Engineering from the “École supérieure des communications de Tunis (Sup’com)” in 2015, and started her Master’s at the “École de technologie supérieure (ÉTS).” She is working on virtualization, cloud computing and software defined-WAN.

Research laboratories : SYNCHROMEDIA – Multimedia Communication in Telepresence 


Tara Nath Subedi

Author's profile

Tara Nath Subedi is a Ph.D. student in the Automated Manufacturing Engineering Department. He worked as a software developer, network engineer, system administrator and course instructor.

Research chair : Canada Research Chair in Smart Sustainable Eco-Cloud 

Research laboratories : SYNCHROMEDIA – Multimedia Communication in Telepresence 


