With Pith

Ethan Petuchowski

First Look at MongoDB

The scenario

My (augmented todo-list) website has Categories, and each Category has a set of Tasks (aka “one-to-many”).

Using a relational database, one does not simply store references to all the Tasks directly in each Category. With Mongo, one may do exactly that.

After looking at various “embedded vs references” documents, I have decided to go with the hybrid model (the second of the three on that linked page), and embed references to Tasks in an array in each Category.

The console shall reveal the way to accomplish this

My (not very sustained) googling didn't turn up any online resources about how to do this, so here you go.

Install MongoDB

After installing MongoDB with Homebrew, it became clear that the configuration was still incomplete (perhaps because brew lacks root access to the root directory). StackOverflow told me how to finish. I show you.

brew install mongodb
sudo mkdir -p /data/db
sudo chmod 755 /data/db
sudo chown -R `id -u` /data/db

# now run
mongod # in one window (server daemon)
mongo  # in another window (interactive console)

Write the code

Now let's do an example of having a one-to-many relationship between Categories and Tasks, where, in particular, we store an array of Task _ids in each Category. In the end we will write a query to get all Tasks belonging to a Category.

/* create the collections */
> db.createCollection("categories")
> db.categories.find().count()
> c = db.categories
> db.createCollection("tasks")
> t = db.tasks

/* insert some elements into them */
> c.insert({name: "first category", tasks: []})
> t.insert({name: "first task"})
> t.insert({name: "second task"})

/* add the references from the coll to the tasks */
> c.update({name: "first category"}, {$addToSet:{"tasks":t.findOne({name: "first task"})._id}})
> c.update({name: "first category"}, {$addToSet:{"tasks":t.findOne({name: "second task"})._id}})
> c.find().pretty()
{
    "_id" : ObjectId("5584fe86e2c6cdd3813233ed"),
    "name" : "first category",
    "tasks" : [
        ObjectId("5584fec9e2c6cdd3813233ee"),
        ObjectId("5584fed0e2c6cdd3813233ef")
    ]
}

/* get all task documents referenced by the "first category" */
> t.find({_id: {$in: c.findOne().tasks}})
{ "_id" : ObjectId("5584fec9e2c6cdd3813233ee"), "name" : "first task" }
{ "_id" : ObjectId("5584fed0e2c6cdd3813233ef"), "name" : "second task" }

What Akamai Does

I met someone at a hack night who worked for Akamai. He was a cool dude, so I was curious what Akamai does. For whatever reason I ended up reading a paper cited as follows:

Nygren, Erik, Ramesh K. Sitaraman, and Jennifer Sun. "The Akamai Network: A Platform for High-Performance Internet Applications." ACM SIGOPS Operating Systems Review 44.3 (2010): 2–19.

For whatever reason, everything I know about Akamai is from this paper. I'm not in any way affiliated with Akamai, etc., though the paper is basically a badly designed advertising pamphlet for their services. A small number of sentences in this summary may have been lifted directly from that paper. You can consider this whole post a paraphrasing of the paper. The intent of this post is to tell you everything important that I learned from the paper in a form that is way easier to digest than reading the paper itself. To that end, this document is maybe 2 pages long, and the original is 17. Clearly much of the detail has been removed, but the gist remains. If that appeals to you, welcome aboard! Note that the paper is from 2010, so it is probably already out of date.

Introduction

Akamai invented the Content Delivery Network (CDN) concept in the 1990s, because the raw naive Internet implementation is too slow to conduct a global business. As of 2010, Akamai delivers 15-20% of global Web traffic.

The Interwebs is a dangerous place

Brief Overview of Garbage Collection in the HotSpot JVM

Disclaimer: This post only claims to represent my current understanding of the workings of the HotSpot VM’s garbage collector. This understanding comes from reading the references listed below, as well as making some assumptions. It is not necessarily fully correct.

Brief Overview

The garbage collector separates the heap into 5 sections of 4 types. I don’t know whether these spaces are “virtual” (i.e. remapped to discontiguous “pages” like Linux’s virtual memory addresses) or not. I suspect they are not.

  1. Eden — where objects go when they’re first allocated in the running program
  2. Survivors 1 & 2 a.k.a. “young space” — objects in Eden are moved here if they survive a minor (“young”) GC
  3. Tenured (a.k.a. Old) — long-lived objects (that have survived [a configurable number of] minor GCs) are moved and then live in here
    • We can tell the JVM to allocate all objects larger than n bytes directly into the old space (the relevant flag appears in the sketch after this list).
  4. Permanent — this is where the JVM’s own objects live (e.g. classes and JITed code). It behaves just like the tenured space.
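
For concreteness, here is a minimal sketch of how one might size these spaces when launching a JVM. The values are arbitrary, and the flags are standard HotSpot options rather than anything taken from the references; this is just an illustration of where each space gets configured.

# -Xmn                           size of the young area (Eden plus both survivor spaces)
# -XX:SurvivorRatio=8            Eden is 8x the size of each survivor space
# -XX:MaxTenuringThreshold=15    minor GCs an object must survive before promotion to tenured
# -XX:PretenureSizeThreshold     objects larger than this many bytes go straight to the old space
java -Xms512m -Xmx512m -Xmn128m \
     -XX:SurvivorRatio=8 \
     -XX:MaxTenuringThreshold=15 \
     -XX:PretenureSizeThreshold=1048576 \
     -jar myapp.jar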

Summary of ‘the Log’

I read a great article by Jay Kreps, one of the dudes who brought you Apache Kafka. The article is called The Log: What every software engineer should know about real-time data’s unifying abstraction. The article took me a few hours of mildly challenging reading, so I figure you can spend a few minutes reading my summary and decide more clearly whether you want to put in the investment of his explanation.

First, he describes the following concept: let's define a log as a total order of functions called and the parameters passed to each of those function calls. Let's define a "commit log" as a log of edits to the contents of a database. Kreps notes that we can use a commit log to rebuild the state of a database at any point in history. This is like git: as we patch in each commit, we obtain the state of the repo at each point in time. If we feed the same log to the same program on multiple machines in a cluster (assuming none of the functions called are non-deterministic, and the machines behave as we expect), we will certainly have the same state on each machine after they have each executed the log. This would be desirable, for example, to provide a "reliable" service for which there are more reads than writes, and more reads than can be handled by one node; then we can ensure any node is OK to read from by having all nodes replay the functions as prescribed by the log.
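
Here is a tiny sketch of that idea (my own illustration in Scala; the names Entry and replay are made up): every replica that applies the same log entries in the same order ends up with exactly the same state, and any prefix of the log gives the state at that point in history.

case class Entry(key: String, value: String)             // one logged write

val log = Vector(Entry("x", "1"), Entry("y", "2"), Entry("x", "3"))

def replay(entries: Seq[Entry]): Map[String, String] =
  entries.foldLeft(Map.empty[String, String]) { (state, e) =>
    state + (e.key -> e.value)                            // apply the write to the state
  }

val stateAfterTwo = replay(log.take(2))                   // Map(x -> 1, y -> 2)
val finalState    = replay(log)                           // Map(x -> 3, y -> 2)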

Then, he describes the following situation: getting every part of a tech company's data to every service that needs it is very complicated. In the worst and most naive case it would be N² connections, because each of N places would be sending data to each of N places. That's a lot of network bandwidth, and complexity, and data formats to understand, and places for things to screw up. So Kreps suggests

Paxos as a Classroom

multi-paxos as a classroom:
teacher := proposer
students := acceptors
all := learners

teacher to students:
    may I have your attention for the next 1:15 hrs?
    I'd like to teach youz guys all about "the paxos protocol"
    here is what I'm assuming you already know

each student to teacher:
    if not already busy for that time:
        yes, here's how much I actually already know about paxos
    else:
        my apologies, but I will be busy skimming facebook for the next 1:15
        so if there's something you want to tell me, post it on there

teacher to each student:
    given how far behind you are, these are the things you need to catch up on
    but also this is what I'm teaching now

each (alive & correct) student to teacher:
    oh very cool!
    this is how much I now understand of all the stuff you've said

teacher:
    if the majority understands more than before:
        set new start point for next time's lesson

Simulating a Distributed System

For my Distributed Computing class at school, there were 3 projects that each involved implementing a distributed protocol over a simulated distributed cluster of computers that are only ‘connected’ to one another in that they know each other’s IP address and must communicate over TCP sockets. For each of the 3 projects, I had a completely different method of implementing the system.

Implementation One: A True Distributed System in Vanilla Java

The first project (Three Phase Commit) was the only one for which I worked with a partner. We were unclear how to attack the problem of building a distributed system in a simple way, so we went ahead and implemented a truly distributable Java application, in which you could spin up processes on any computer and each would connect to the master server, which would tell the new node the listening ports of the other servers in the cluster, and so on. This turned out to be perhaps overly ambitious. When a node was expecting a message from someone else, it would fire a timer thread, which on completion would report to the master to restart the failed process. This system did work properly (somehow), but it seemed to be living on the edge. It also required two weeks of heavy work and of learning various aspects of Java: unit testing, mocks, which Exception gets thrown by which network event and when, etc.

Implementation Two: Lightweight System Simulation

For the second project (Paxos), I said "Hooey" to all the Java hubbub: while I enjoyed its low-level-ness, which forced me to peek further under the hood, it required a larger time commitment than I was prepared to make. I realized that my class's requirements only specified that system "nodes" run concurrently, share no memory, and talk over TCP. So, in Scala, I made a Node class that is a state machine, and ran different instances of it on separate threads. This was too easy though, and I was disappointed to find my implementation complete more than two weeks before the due date.

Implementation Three: Akka Cluster

Akka Cluster is a Scala-based framework for building a distributed system, in which you define the addresses of a set of "seed" nodes, at least one of which is supposed to be available at all times. When other nodes start up, they become a "member" of the cluster by contacting any seed node. So for me (the "client" of Akka Cluster), all that is left to do is define the protocol to run on top of the cluster. (In this case the protocol was the simple eventually-consistent database protocol "Bayou".)

Akka nodes follow the "Actor Model", meaning they share no state with any other part of your program, making them "simple" to run on remote machines (I didn't try that, though the docs say it is easy). The only way actors can communicate is via message passing, for which there is a dedicated syntax: !. For example:

case class MyMessage(data: DataElem)
anotherActor ! MyMessage(theData)

The code above has the actor running it send anotherActor an instance of the case class MyMessage wrapping the DataElem that was passed in.
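
On the receiving side, a minimal sketch (assuming the classic Akka actor API; the names Receiver, DataElem, and "example" are made up for illustration) looks something like this:

import akka.actor.{Actor, ActorSystem, Props}

case class DataElem(payload: String)
case class MyMessage(data: DataElem)

class Receiver extends Actor {
  def receive = {
    case MyMessage(data) => println(s"got ${data.payload}")   // handle the message
  }
}

val system = ActorSystem("example")
val anotherActor = system.actorOf(Props[Receiver], "receiver")
anotherActor ! MyMessage(DataElem("hello"))                    // the send shown above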

Inter-actor communication is guaranteed FIFO by Akka, meaning that for any pair of actors A and B, if A sends n messages to B, B will receive those messages in the same order that A sent them. NB: this says nothing about the order with respect to other actors in the system.

Using Akka trivialized all sorts of implementation details I’d had to worry about in the first project, and then ignored for the second project.

If there had been a 4th project for this class, it's clear to me that I would use Akka again. For some reason my code ran very slowly, and it probably had to do with some configuration setting that I did not set in the Akka config file. That was really the only problem I had that I did not fix. One problem I did fix, which took a while, was figuring out a good way to shut the system down. There were various options discussed online, and eventually I found one that worked for me, which was to call system.shutdown() on every ActorSystem I instantiate. Another option was to send self ! PoisonPill, but for some reason that didn't work. I guess different ones work for different situations.

Asynchronous System

Before taking a Distributed Computing course at school, I had written, in various programming environments:

  1. a script that trains a neural network to recognize whether an English sentence is grammatical
  2. an iOS app that involves a fair bit of multi-threading
  3. a command line interpreter that forks and runs commands as background processes
  4. the Raft distributed consensus protocol inside a professor’s multi-threaded distributed system simulator
  5. some asynchronous features for a simplified database system where threads communicate solely via Java’s PipedInputStreams instead of using shared variables

Perhaps you are familiar with these situations, and like me thought they might be “distributed”. Reality check: they’re not. Similarly, in a distributed system, one has multiple “threads” processing data at once. However, these “threads” may be each on different machines; this adds some complexity. None of the situations outlined above have the sort of issues one can expect fairly regularly when talking amongst machines.

Different machines don’t share

Intro to Distributed Computing

This semester I have been taking Distributed Computing with Professor Lorenzo Alvisi. Distributed computing has been something I have been curious about for a long time, but aside from writing some MapReduce jobs during an internship two summers ago, I had no real experience thinking about distributed systems. The class was very enjoyable.

The main lesson was that distributed systems are very flaky places, but you can arrive at some surprisingly simple solutions to most of the problems once you find a better way of expressing what you really want. The other main lesson was that finding a better way of expressing what you really want can be a confusing thing to do.

Scenario 1: “Two Generals’ Problem”

Persons 1 and 2 would like to be sure that both agreed on a time to meet, but neither can tell whether the other is listening to what is being said.

Person 1: Let’s meet up at 2:00pm

Now person 1 doesn’t know whether person 2 was listening though. So person 2 feels compelled to respond (while still multi-tasking).

Person 2: Yeah

Python Class Equivalence

vars(self) == vars(other), it seems too easy

I implemented a class

class A(object):
    def __init__(self):
        self.a = 123
        self.b = 436

Then when I created two equivalent instances and compared them, I got False.
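
That is because by default == on instances compares identity. A minimal sketch of the fix hinted at in the title, comparing the instances' attribute dictionaries (the __ne__ override is just for completeness):

class A(object):
    def __init__(self):
        self.a = 123
        self.b = 436

    def __eq__(self, other):
        # equal iff same class and same instance attributes
        return isinstance(other, self.__class__) and vars(self) == vars(other)

    def __ne__(self, other):
        return not self == other

print(A() == A())   # True, instead of the default identity comparison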

Java Inheritance

I spent quite a while struggling with the Java class declaration syntax, and now I am starting to get it.

Problem simplification

I started out writing this post because I was totally right and the Java compiler was wrong, and so on, but now I have cleared it up: I was wrong and the compiler was right after all (surprise!). I was initially intending to use this post as a way of getting satisfaction out of how much smarter I am than the compiler; instead I will use this space to explain to the next person how to answer similar questions. I started by heavily simplifying my issue down to its core, and attacking the simplified problem instead of the original one with its extraneous variables and interconnections. I've never approached a programming problem like that before, and let me tell you: it seems to have worked. By a twist of fate, the new version looks like a LinkedList. That's not what I started out doing, but who knows what'll happen when you keep smushing a problem from a larger context into a little jar.

First, we have a Base, which holds an element elem of any type, and a reference to another Base, quite like a linked list.

Base.java
class Base<T> {
    T elem;
    Base<T> next;

    Base(T elem) {
        this.elem = elem;
    }
}

Next, we have an Ord, which is a Base that is Comparable. We require that the element contained by an Ord is itself comparable, and furthermore that the method of comparing Ords is to compare their respective elems:

Ord.java
class Ord<U extends Comparable<U>> extends Base<U> implements Comparable<Ord<U>>
{
    Ord(U elem) {
        super(elem);
    }

    @Override
    public int compareTo(Ord<U> o) {
        return this.elem.compareTo(o.elem);
    }
}

This works fine, but note the following limitation:

main.java
public static void main(String[] args) {
    Ord<Integer> a = new Ord<Integer>(3);
    Ord<Integer> b = new Ord<Integer>(4);

    a.compareTo(b);         // this works just fine

    a.next = b;

    // this does not work because a.next is of type Base<Integer>
    a.compareTo(a.next);

    // this DOES work though because a.elem is of type Integer which is Comparable
    a.elem.compareTo(a.next.elem);
}

After writing and re-writing this blog post over and over, this makes sense now. I used to be confused: "shouldn't the compiler know that a.next is an Ord?" And the answer is (now obvious) "Of course not." The field next is declared in Base<T> with type Base<T>, so the static type of a.next is Base<Integer>, which declares no compareTo, even when the object it happens to reference at runtime is an Ord<Integer>.
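
To make that concrete, here is a small sketch (just an illustration, not something from the original code): telling the compiler the more specific type with a downcast makes the call compile.

// a.next's declared type is Base<Integer>, but the object it refers to here is an Ord<Integer>
Ord<Integer> next = (Ord<Integer>) a.next;  // downcast to the runtime type
a.compareTo(next);                          // now the compiler is satisfied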