Aleksandar Kuzmanovic, Associate Professor
Tech L457, 847-467-5519, akuzma@northwestern.edu
Office Hours: By appointment
Marc Warrior, Teaching Assistant
Ford 2.206, warrior@u.northwestern.edu
Office Hours: By appointment
Time/Place:
Lectures: MW 12:30-1:50, Tech Lecture Room 5
This course will cover a broad range of topics related to networking problems in cloud computing. In particular:

Datacenter architectures: coping with heterogeneity
Today’s data centers may contain tens of thousands of computers with significant aggregate bandwidth requirements. The network architecture typically consists of a tree of routing and switching elements, with progressively more specialized and expensive equipment moving up the network hierarchy. Non-uniform bandwidth among data center nodes complicates application design and limits overall system performance. We will study a range of datacenter network architectures that aim to resolve these heterogeneity-induced problems. In the context of heterogeneity, we will further analyze architectures that address the problems experienced by small jobs, which are typically run for interactive data analyses in datacenters and which continue to be plagued by disproportionately long-running tasks called stragglers. We will analyze mitigation techniques based on speculation and job cloning.
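
To make the cloning idea concrete: if each task copy independently has a small chance of becoming a straggler, launching a few copies per task and keeping the earliest finisher sharply cuts the completion time of a stage that must wait for all of its tasks. The following toy simulation (the durations and straggler probability are invented, not taken from any paper on the reading list) illustrates the effect:

    import random

    def run_stage(num_tasks, copies_per_task, straggler_prob=0.05):
        """Finish time of a stage that must wait for ALL of its tasks.

        Each task copy normally takes about 1 time unit; with probability
        straggler_prob a copy becomes a straggler and takes 10x longer.
        Cloning runs several copies of each task and keeps the earliest
        finisher, so one slow copy no longer stalls the whole stage.
        """
        def copy_duration():
            base = random.uniform(0.8, 1.2)
            return base * 10 if random.random() < straggler_prob else base

        # A task finishes when its fastest copy finishes; the stage
        # finishes when its slowest task finishes.
        return max(min(copy_duration() for _ in range(copies_per_task))
                   for _ in range(num_tasks))

    random.seed(0)
    for copies in (1, 2, 3):
        avg = sum(run_stage(50, copies) for _ in range(200)) / 200
        print(f"{copies} copies per task: average stage time {avg:.2f}")

With a single copy, some straggler among the 50 tasks usually stretches the stage toward the 10x duration; with two or three copies per task, the stage typically finishes near the no-straggler time, at the cost of extra work.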
Cloud applications and usage

Reliability at massive scale is one of the biggest challenges for cloud applications. Even the slightest outage has significant financial consequences and impacts customer trust. Large-scale platforms are implemented on top of an infrastructure of tens of thousands of servers and network components located in many datacenters around the world. At this scale, small and large components fail continuously, and the way persistent state is managed in the face of these failures drives the reliability and scalability of the software systems. We will analyze methods that make it possible to provide an “always-on” experience despite component failures. We will further explain the underlying tradeoffs between service availability and data consistency.
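
A standard mechanism behind these tradeoffs is quorum replication. Below is a minimal sketch; the replica count and quorum sizes are illustrative choices, and the class and its methods are hypothetical rather than the API of any system we will read about:

    class QuorumStore:
        """Toy quorum-replicated store: N replicas, write quorum W, read quorum R.

        With R + W > N, every read quorum overlaps every write quorum, so a
        read sees the latest acknowledged write (stronger consistency, less
        tolerance of slow or failed replicas). Smaller R and W favor
        availability and latency but can return stale data.
        """
        def __init__(self, n=3, r=2, w=2):
            self.n, self.r, self.w = n, r, w
            self.clock = 0
            self.replicas = [{} for _ in range(n)]  # replica: key -> (version, value)

        def put(self, key, value, up):
            self.clock += 1
            acks = 0
            for i in range(self.n):
                if up[i]:                      # only live replicas accept the write
                    self.replicas[i][key] = (self.clock, value)
                    acks += 1
            return acks >= self.w              # success only with W acknowledgements

        def get(self, key, up):
            versions = [self.replicas[i][key]
                        for i in range(self.n) if up[i] and key in self.replicas[i]]
            if len(versions) < self.r:
                return None                    # fail (unavailable) rather than lie
            return max(versions)[1]            # newest version wins

    store = QuorumStore(n=3, r=2, w=2)
    print(store.put("cart", ["book"], up=[True, True, False]))  # True: 2 acks meet W=2
    print(store.get("cart", up=[True, True, True]))             # ['book']: R overlaps the write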
Cloud application migration: hybrid architectures

We will learn about challenges in migrating enterprise services into hybrid cloud-based deployments, where enterprise operations are partly hosted on-premises and partly in the cloud. Such hybrid architectures enable enterprises to benefit from cloud-based architectures while honoring application performance requirements and privacy restrictions on what services may be migrated to the cloud. We will analyze the complexity inherent in enterprise applications today in terms of their multi-tiered nature, large number of application components, and interdependencies. We will shed light on the security policies associated with enterprise applications in data centers, and we will articulate the importance of reconfiguring security policies as enterprise applications are migrated to the cloud.
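
As a toy version of the planning problem, the sketch below enumerates which movable components to migrate so that WAN traffic crossing the on-premises/cloud boundary, minus hypothetical hosting savings, is minimized. All component names, traffic volumes, and savings figures are invented for illustration; a real planner would take them from dependency discovery and traffic traces:

    from itertools import chain, combinations

    components = {"frontend", "app", "db", "auth"}
    must_stay = {"db", "auth"}             # e.g., privacy policy forbids migrating these
    traffic = {                            # GB/day exchanged between component pairs
        ("frontend", "app"): 50,
        ("app", "db"): 120,
        ("app", "auth"): 5,
    }
    savings = {"frontend": 60, "app": 40}  # hypothetical benefit of migrating each tier

    def plan_cost(migrated):
        # Traffic crossing the on-premises/cloud boundary is the penalty;
        # hosting savings for migrated components are the reward.
        wan = sum(gb for (a, b), gb in traffic.items()
                  if (a in migrated) != (b in migrated))
        return wan - sum(savings.get(c, 0) for c in migrated)

    movable = components - must_stay
    plans = chain.from_iterable(combinations(movable, k)
                                for k in range(len(movable) + 1))
    best = min(plans, key=lambda p: plan_cost(set(p)))
    print("migrate:", set(best), "net cost:", plan_cost(set(best)))

Even this tiny instance shows the tension the lecture describes: migrating the chatty "app" tier drags 125 GB/day over the WAN because its database must stay on-premises, so the cheapest plan migrates only the frontend.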
Network resource sharing

The network, like CPU and memory, is a critical and shared resource in the cloud. However, unlike other resources, it is neither shared proportionally to payment, nor do cloud providers offer minimum guarantees on network bandwidth. We will analyze the fundamental tradeoffs when sharing cloud networks by studying different allocation policies that allow users to navigate the tradeoff space.
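
One simple point in this tradeoff space combines per-tenant minimum guarantees with payment-proportional sharing of the remainder. A minimal sketch, with invented tenant names, weights, and capacities:

    def allocate(capacity_mbps, weights, min_guarantee=0.0):
        """Give every tenant a minimum bandwidth guarantee, then divide the
        remaining capacity in proportion to weights (e.g., payment levels).
        Setting min_guarantee=0 yields pure payment-proportional sharing."""
        assert capacity_mbps >= min_guarantee * len(weights), "guarantees oversubscribed"
        remaining = capacity_mbps - min_guarantee * len(weights)
        total_w = sum(weights.values())
        return {t: min_guarantee + remaining * w / total_w
                for t, w in weights.items()}

    print(allocate(10_000, {"A": 1, "B": 1, "C": 2}))                    # proportional only
    print(allocate(10_000, {"A": 1, "B": 1, "C": 2}, min_guarantee=1_000))

Raising the guarantee makes each tenant's worst case predictable but shrinks the pool that payment-proportionality can act on, which is exactly the tension the allocation-policy papers explore.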
Cloud-specific congestion control schemes

TCP incast is a network transport pathology that affects many-to-one communication patterns in datacenters. It is caused by a complex interplay between datacenter applications, the underlying switches, network topology, and TCP, which was originally designed for wide area networks. Incast increases the queuing delay of flows and decreases application-level throughput to far below the link bandwidth. The problem especially affects computing paradigms in which distributed processing cannot progress until all parallel threads in a stage complete. Examples of such paradigms include distributed file systems, web search, advertisement selection, and other applications with partition/aggregate semantics. We will analyze effective solutions to the TCP incast problem. Furthermore, we will study other congestion control mechanisms and protocols proposed for datacenters.
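
As one example of a datacenter-specific mechanism, DCTCP reacts to the fraction of ECN-marked packets rather than to packet loss, cutting the window in proportion to the measured extent of congestion instead of halving it. The sketch below implements that per-window update rule; the traffic numbers fed to it are invented:

    def dctcp_update(cwnd, alpha, marked, total, g=1/16):
        """One DCTCP adjustment per window of acknowledged packets.

        alpha is a running estimate of the fraction of ECN-marked packets.
        DCTCP cuts the window by cwnd * alpha/2 instead of halving it as
        standard TCP does, which keeps switch queues short without
        collapsing throughput.
        """
        frac = marked / total if total else 0.0
        alpha = (1 - g) * alpha + g * frac       # EWMA of the marking fraction
        if marked:
            cwnd = max(1.0, cwnd * (1 - alpha / 2))
        else:
            cwnd += 1.0                          # additive increase, as in TCP
        return cwnd, alpha

    cwnd, alpha = 40.0, 0.0
    for marked in (0, 0, 8, 2, 0):               # marks seen in successive windows
        cwnd, alpha = dctcp_update(cwnd, alpha, marked, total=int(cwnd))
        print(f"cwnd={cwnd:5.1f}  alpha={alpha:.3f}")

Note how a burst of marks trims the window only slightly when alpha is small; a standard TCP sender in the same situation would cut its window in half, which is one reason it underutilizes datacenter links.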
Data placement

By offering storage services in several geographically distributed data centers, cloud computing platforms enable applications to offer low-latency access to user data. However, application developers are left to deal with the complexities associated with choosing the storage services at which any object is replicated and maintaining consistency across these replicas. We will analyze a key-value store that exports a unified view of storage services in geographically distributed data centers. We will further show that cloud tenants can do a better job placing applications by understanding the underlying cloud network as well as the demands of the applications. To do so, tenants must be able to quickly and accurately measure the cloud network and profile their applications, and then use a network-aware placement method to place applications.
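
A minimal sketch of the network-aware step: given measured datacenter-to-region latencies and per-region demand, pick the site that minimizes demand-weighted latency. All names and numbers are hypothetical, and real placement would also weigh capacity, replication, and consistency:

    # Hypothetical measured latencies (ms) from candidate datacenters to user
    # regions, and per-region request rates; a real tenant would measure both.
    latency_ms = {
        "us-east":  {"NA": 40,  "EU": 90,  "ASIA": 180},
        "eu-west":  {"NA": 90,  "EU": 30,  "ASIA": 160},
        "ap-south": {"NA": 170, "EU": 140, "ASIA": 50},
    }
    demand = {"NA": 1000, "EU": 800, "ASIA": 600}   # requests/sec

    def avg_latency(dc):
        """Demand-weighted mean latency if all requests were served from dc."""
        total = sum(demand.values())
        return sum(latency_ms[dc][r] * rps for r, rps in demand.items()) / total

    best = min(latency_ms, key=avg_latency)
    print(best, f"-> {avg_latency(best):.1f} ms demand-weighted average")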
Security and privacy issues in cloud services

We will analyze new application platforms that prevent apps from misusing information about their users. To strike a useful balance between users’ privacy and apps’ functional needs, such platforms shift much of the responsibility for protecting privacy from the app and its users to the platform itself. To achieve this, the platforms deploy a sandbox that spans the user’s device and the cloud, along with specialized storage and communication channels that enable common app functionality.
Datacenter performance characterization

Although there is tremendous interest in designing improved networks for data centers, very little is known about the network-level traffic characteristics of current data centers. We will analyze results from empirical studies of the network traffic in data centers belonging to different types of organizations. We will analyze SNMP statistics, topology, and packet-level traces. We will examine the range of applications deployed in these data centers and their placement; the flow-level and packet-level transmission properties of these applications; and their impact on network utilization, link utilization, congestion, and packet drops. We will describe the implications of the observed traffic patterns for data center internal traffic engineering as well as for architectures for data center networks. We will further analyze upgrades made to the HTTP/1.1 protocol to improve user-perceived performance.
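
Much of this characterization work boils down to aggregating packet observations into flow-level statistics. A toy sketch of that step follows; the packet records are made up, and a real study would parse SNMP counters or packet traces instead:

    from collections import defaultdict

    # Toy packet records (timestamp_s, src, dst, bytes).
    packets = [
        (0.00, "10.0.0.1", "10.0.1.5", 1500),
        (0.01, "10.0.0.1", "10.0.1.5", 1500),
        (0.02, "10.0.0.2", "10.0.1.5", 64),
        (0.50, "10.0.0.1", "10.0.1.5", 1500),
    ]

    flows = defaultdict(lambda: {"bytes": 0, "pkts": 0, "first": None, "last": None})
    for ts, src, dst, size in packets:
        f = flows[(src, dst)]                # aggregate packets into src/dst flows
        f["bytes"] += size
        f["pkts"] += 1
        f["first"] = ts if f["first"] is None else f["first"]
        f["last"] = ts

    for (src, dst), f in flows.items():
        dur = f["last"] - f["first"]
        print(f"{src} -> {dst}: {f['bytes']} B, {f['pkts']} pkts, {dur:.2f} s")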
Software defined networking

We will analyze a new network architecture for the enterprise. The architecture allows managers to define a single network-wide fine-grain policy and then enforces it directly. It couples extremely simple flow-based Ethernet switches with a centralized controller that manages the admittance and routing of flows. While radical, this design is backwards-compatible with existing hosts and switches. We will further analyze the design, implementation, and evaluation of an API for applications to control a software-defined network (SDN). The API addresses two key challenges: how to safely decompose control and visibility of the network, and how to resolve conflicts between untrusted users and across requests, while maintaining baseline levels of fairness and security.
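
The following sketch caricatures the switch/controller split: switches keep a cached flow table and consult the controller only on a table miss, and the controller answers from a single network-wide policy with default-deny semantics. The class names and policy format are hypothetical, not the API of any real controller:

    class Controller:
        """Toy centralized controller holding one network-wide policy.
        Switches consult it on a flow-table miss."""
        def __init__(self, policy):
            self.policy = policy                        # (src, dst) -> action

        def decide(self, src, dst):
            return self.policy.get((src, dst), "drop")  # default-deny

    class Switch:
        """Simple flow-based switch: forwards per cached rules and asks
        the controller about flows it has never seen."""
        def __init__(self, controller):
            self.controller = controller
            self.flow_table = {}

        def handle(self, src, dst):
            if (src, dst) not in self.flow_table:           # table miss
                action = self.controller.decide(src, dst)   # one controller round trip
                self.flow_table[(src, dst)] = action        # install the rule
            return self.flow_table[(src, dst)]

    sw = Switch(Controller({("h1", "h2"): "allow"}))
    print(sw.handle("h1", "h2"))   # allow (rule installed by the controller)
    print(sw.handle("h3", "h2"))   # drop  (no policy entry: default-deny)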
Energy usage

Large-scale Internet applications, such as content distribution networks, are deployed across multiple datacenters and consume massive amounts of electricity. To provide uniformly low access latencies, these datacenters are geographically distributed, and the deployment size at each location reflects the regional demand for the application. Consequently, an application’s environmental impact can vary significantly depending on the geographical distribution of end-users, as electricity cost and carbon footprint per watt are location specific. We will analyze a flow-optimization-based framework for request-routing and traffic engineering. It dynamically controls the fraction of user traffic directed to each datacenter in response to changes in both request workload and carbon footprint. It allows an operator to navigate the three-way tradeoff between access latency, carbon footprint, and electricity costs, and to determine an optimal datacenter upgrade plan in response to increases in traffic load.
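
The framework we will study casts this as a flow optimization; the sketch below substitutes a much simpler greedy allocation just to show how operator-chosen weights steer traffic along the latency/carbon/price tradeoff. All datacenter figures are invented:

    def route(demand, dcs, w_lat=1.0, w_carbon=1.0, w_price=1.0):
        """Greedily fill the datacenter with the lowest weighted cost, then
        spill the rest to the next one. The weights are the knobs an operator
        turns to trade latency against carbon footprint and electricity price."""
        cost = lambda d: (w_lat * d["latency_ms"] + w_carbon * d["carbon"]
                          + w_price * d["price"])
        plan = {}
        for d in sorted(dcs, key=cost):
            take = min(demand, d["capacity"])
            if take > 0:
                plan[d["name"]] = take
                demand -= take
        return plan

    dcs = [
        {"name": "hydro-dc", "latency_ms": 60, "carbon": 0.1, "price": 0.8, "capacity": 500},
        {"name": "coal-dc",  "latency_ms": 30, "carbon": 1.0, "price": 0.5, "capacity": 2000},
    ]
    print(route(1200, dcs, w_carbon=50.0))  # carbon-heavy weighting fills hydro-dc first
    print(route(1200, dcs, w_carbon=0.1))   # latency-dominated: coal-dc takes everything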
Students will form teams of two or three; each team will tackle a well-defined research project during the quarter. A list of suggested project topics will be provided. All projects are subject to approval by the instructor. The project component will include a short written project proposal, a short mid-term project report, a final project presentation, and a final project report. Each component adds some significant element to the paper, and the overall project grade will be based on the quality of each component of your work. The project components below are due by email to the instructor by the end of the given day of the respective week.
1. Week 1 (Tuesday 3/29): Project presentations by group leaders.
2. Week 2 (Monday 4/4): Form groups of 2 or 3, choose a topic for your project, and meet with the project leader.
3. Week 3 (Monday 4/11): Write an introduction describing the problem and how you plan to approach it (what will you actually do?). Include motivation (why does the problem matter?) and related work (what have others already done about it?). 2 pages total.
4. Week 6 (Monday 5/4): Midterm presentation. Update your paper to include your preliminary results. 5 pages total.
5. Week 10 (Wednesday 6/1): Presentations by all groups.
6. Week 11 (Friday 6/10): Turn in your completed paper. 10 pages total. You should incorporate the comments received during the presentation.
Each team will have a weekly meeting with project leaders.
1. Paper reviews (15%), presentations (20%), and in-class debating (15%): 50%
2. Project: 50% (project proposal: 5%; midterm report: 5%; weekly report and meeting: 10%; project presentation: 10%; final project report: 20%)
3. Research idea report (optional, 3 pages): 10%
Prerequisites: EECS 340 or an equivalent networking course
There will be no textbook for this class. Classes will be a combination of lectures conducted by the professor and discussions led by students. When students are assigned to discuss a topic in class, they will review and discuss research papers related to networking problems in cloud computing. Students must read the assigned papers and submit paper reviews before each lecture. Two teams of students will be chosen to debate and lead the discussion. One team will be designated the offense and the other the defense. In class, the defense team will present first. For 30 minutes, the team will discuss the work as if it were their own.
1. The team should present the work and make a compelling case for why the contribution is significant. This will include the context of the contribution, prior work, and, for previously published papers, how the work has influenced the research community's or industry's directions (impact). If the paper is very recent, the defense should present arguments for its potential impact. Coming up with potential future work can show how the paper opens doors to new research.
2. The presentation should go well beyond a paper "summary". The defense should not critique the work other than to pre-empt attacks from the offense (e.g., by explicitly limiting the scope of the contribution).
3. The defense should also try to look up related work to support their case (CiteSeer is a good place to start looking).
After the defense presentation, the offense team will state their case for 20 minutes.
1. This team should critique the work and make a case for missing links, unaddressed issues, lack of impact, inappropriateness of the problem formulation, etc.
2. The more insightful and less obvious the criticisms, the better.
3. While the offense should prepare remarks in advance, they should also react to the points made by the defense.
4. The offense should also try to look up related work to support their case.
Next, the defense and offense will be allowed follow-up arguments. Finally, the class will question either side, either for clarification or to add to the discussion and controversy and make their own points on either side. The presentations should be written in PowerPoint format and will be posted on the course web page after each class.
All students must read the assigned papers and write reviews for the papers before each lecture. Email the reviews to the instructor and TA (akuzma@cs.northwestern.edu and warrior@u.northwestern.edu) prior to each lecture. Periodically, a random subset of the reviews will be evaluated, and feedback will be provided directly to students.
Please send one review per email, in plain text, in the body of the email message.
A review should summarize the paper sufficiently to demonstrate your understanding, and should point out the paper's contributions and its strengths as well as its weaknesses. Think in terms of: What makes good research? What qualities make a good paper? What are the potential future impacts of the work? Note that there is no right or wrong answer to these questions. A review's quality will mainly depend on its thoughtfulness. Restating the abstract/conclusion of the paper will not earn a top grade. Reviews should cover all of the following aspects:
1. What is the main result of the paper? (One- or two-sentence summary.)
2. What strengths do you see in this paper? (Your review needs to have at least one or two positive things to say.)
3. What are some key limitations, unproven assumptions, or methodological problems with the work?
4. How could the work be improved?
5. What is its relevance today, or what future work does it suggest?
6. Overall score? 1 (worst) to 5 (best); no 3s.
Course web site: http://networks.cs.northwestern.edu/EECS495-s16/
Check it regularly for schedule changes and other course-related announcements.
Group Email: https://groups.google.com/forum/#!forum/eecs495-s16
March 2016, Aleksandar Kuzmanovic