31 August 2016

Robotron: towards network and DC management at large scale

Recently I stumbled upon this paper, in which FB presents Robotron, a system for managing a massive production network in a top-down way. Robotron's design goal is to reduce the effort and errors involved in management tasks by minimizing direct human intervention.


As reported in the paper, Robotron is used to express high-level design intent, which is translated into low-level device configurations to be deployed safely. Robotron also monitors devices' operational state to ensure it does not deviate from the desired state. Since 2008, Robotron has been used to manage tens of thousands of network devices connecting hundreds of thousands of servers globally at FB.
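To make the intent-to-configuration idea concrete, here is a minimal Python sketch of the two steps described above: rendering per-device settings from a high-level intent, and comparing desired state against observed state. Everything in it (LinkIntent, render_configs, find_drift, the device names and the /31 subnet) is a hypothetical illustration of mine, not Robotron's actual data model or API.

```python
import ipaddress
from dataclasses import dataclass


@dataclass(frozen=True)
class LinkIntent:
    """Hypothetical high-level intent: a point-to-point link between two devices."""
    device_a: str
    device_z: str
    subnet: str  # assumed to be a /31 shared by both ends


def render_configs(intent: LinkIntent) -> dict[str, dict[str, str]]:
    """Translate the intent into low-level, per-device settings."""
    net = ipaddress.ip_network(intent.subnet)
    addr_a, addr_z = list(net)  # a /31 contains exactly two addresses
    return {
        intent.device_a: {"peer": intent.device_z, "address": f"{addr_a}/{net.prefixlen}"},
        intent.device_z: {"peer": intent.device_a, "address": f"{addr_z}/{net.prefixlen}"},
    }


def find_drift(desired, observed):
    """List (device, key, desired value, observed value) for every mismatch."""
    drift = []
    for device, settings in desired.items():
        actual = observed.get(device, {})
        for key, want in settings.items():
            have = actual.get(key)
            if have != want:
                drift.append((device, key, want, have))
    return drift
```

Calling render_configs(LinkIntent("rtr1", "rtr2", "10.0.0.0/31")) yields one settings dictionary per device, and find_drift reports any device whose observed settings no longer match them.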

FB's infrastructure is a "network of networks" containing multiple domains: edge point-of-presence (POP) clusters, a global backbone, and several large data centers (DCs). Interestingly, the paper describes a network-wide abstraction layer that models and stores device-level attributes as well as network-level attributes and topology descriptions, e.g., routers, switches, optical devices, protocol parameters, topologies, etc. Physical and logical components are modeled as typed objects, value fields, and relationship fields, and APIs provide operations to retrieve objects and their attributes.
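As a rough illustration of what "typed objects, value fields, and relationship fields" plus a retrieval API could look like, here is a small Python sketch. The class names (Device, Interface, ModelStore) and the get() signature are my own assumptions for the sake of the example; they are not taken from the paper.

```python
from dataclasses import dataclass


@dataclass
class Device:
    """Typed object with two value fields."""
    name: str
    role: str  # e.g. "router" or "switch"


@dataclass
class Interface:
    """Typed object with a value field and a relationship field."""
    name: str
    device: Device  # relationship: the device this interface belongs to


class ModelStore:
    """Hypothetical in-memory store with a simple read API."""

    def __init__(self):
        self._objects = []

    def add(self, obj):
        self._objects.append(obj)
        return obj

    def get(self, obj_type, **filters):
        """Return objects of the given type whose fields equal the filters."""
        return [
            obj for obj in self._objects
            if isinstance(obj, obj_type)
            and all(getattr(obj, key) == value for key, value in filters.items())
        ]
```

With a store populated this way, store.get(Device, role="router") returns every router, and store.get(Interface, device=some_device) follows the relationship field to list that device's interfaces.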

Robotron looks like a very promising approach, one that researchers should look at in order to improve and extend management practice for convergent SDN-NFV infrastructures.