My (augmented todo-list) website has Categories, and each Category has a set
of Tasks (aka “one-to-many”).
Using a relational database, one does not simply store references to all the
Tasks directly in each Category. With Mongo, one may do exactly that.
After looking at various “embedded vs references”documents, I have decided to go with the hybrid
model (the second of the three on that linked page), and embed references to
Tasks in an array in each Category.
The console shall reveal the way to accomplish this
My lack of sustained googling didn’t turn up any online resources about how to
do it, so here you go.
Install MongoDB
After installing MongoDB with Homebrew, it became clear that the configuration
was still incomplete (perhaps because brew lacks root access to the root
directory). StackOverflow told me how to finish. I show you.
12345678
brew install mongodb
sudo mkdir -p /data/db
sudo chmod 755 /data/db
sudo chown -R `id -u` /data/db
# now runmongod # in one window (server daemon)mongo # in another window (interactive console)
Write the code
Now let’s do an example of having a one-to-many relationship between Categories
and Tasks, where in particular, we are storing an array of Task id’s in each
Category. In the end we will write a query to get all Tasks belonging to a
Category.
1234567891011121314151617181920212223242526272829
/* create the collections */>db.createCollection("categories")>db.categories.find().count()>c=db.categories>db.createCollection("tasks")>t=db.tasks/* insert some elements into them */>c.insert({name:"first category",tasks:[]})>t.insert({name:"first task"})>t.insert({name:"second task"})/* add the references from the coll to the tasks */>c.update({name:"first category"},{$addToSet:{"tasks":t.findOne({name:"first task"})._id}})>c.update({name:"first category"},{$addToSet:{"tasks":t.findOne({name:"second task"})._id}})>c.find().pretty(){"_id":ObjectId("5584fe86e2c6cdd3813233ed"),"name":"first category","tasks":[ObjectId("5584fec9e2c6cdd3813233ee"),ObjectId("5584fed0e2c6cdd3813233ef")]}/*getalltaskdocumentsreferencedbythe"first category">t.find({_id:{$in:c.findOne().tasks}}){"_id":ObjectId("5584fec9e2c6cdd3813233ee"),"name":"first task"}{"_id":ObjectId("5584fed0e2c6cdd3813233ef"),"name":"second task"}
I met someone at a hack night worked for Akamai. He was a cool dude, so I was
curious what Akamai does. For whatever reason I ended up reading a paper cited
as so:
Nygren, Erik, Ramesh K. Sitaraman, and Jennifer Sun. “The akamai network: a
platform for high-performance internet applications.” ACM SIGOPS Operating
Systems Review 44.3 (2010): 2–19.
For whatever reason, everything I know about Akamai is from this paper. I’m not
in any way afiliated with Akamai, etc., though the paper is basically a badly
designed advertising pamphlet for their services. A small number of sentences
in this summary may have been lifted directly from that paper. You can consider
this whole post a paraphrasing of the paper. The intent of this post is to
tell you everything important that I learned from the paper in a form that is
way easier to digest than you reading the paper itself. To that end, this
document is maybe 2 pages long, and the original is 17. Clearly much of the
detail has been removed, but the gist remains. If that appeals to you, welcome
aboard! Note that the paper is from 2010, so is probably already out of date.
Introduction
Akamai invented the Content Delivery Network (CDN) concept in the 1990s,
because the raw naive Internet implementation is too slow to conduct a global
business. As of 2010, Akamai delivers 15-20% of global Web traffic.
Disclaimer: This post only claims to represent my current understanding
of the workings of the HotSpot VM’s garbage collector. This understanding comes
from reading the references listed below, as well as making some assumptions.
It is not necessarily fully correct.
Brief Overview
The garbage collector separates the heap into 5 sections of 4 types. I
don’t know whether these spaces are “virtual” (i.e. remapped to discontiguous
“pages” like Linux’s virtual memory addresses) or not. I suspect they are
not.
Eden — where objects go when they’re first allocated in the running
program
Survivors 1 & 2 a.k.a. “young space” — objects in Eden are moved here
if they survive a minor (“young”) GC
Tenured (a.k.a. Old) — long-lived objects (that have survived [a
configurable number of] minor GCs) are moved and then live in here
We can tell the JVM to allocate all objects larger than n bytes
directly into the old space.
Permanent — this is where the JVM’s own objects live (e.g. classes and
JITed code). It behaves just like the tenured space.
I read a great article by Jay Kreps, one of the dudes who brought you Apache
Kafka. The article is called The Log: What every software engineer should know about real-time data’s unifying abstraction. The
article took me a few hours of mildly challenging reading, so I figure you can
spend a few minutes reading my summary and decide more clearly whether you want
to put in the investment of his explanation.
First, he describes the following concept: Let’s define a log as a total
order of functions called and the parameters passed to each of those function-
calls. Let’s define a “commit log” as a log of edits of the contents of a
database. Kreps notes that we can use a commmit log to build the state of a
database at any point in history. This is like git; as we patch in each
commit, we obtain the state of the repo at each time. If we feed the same log
to the same program on multiple machines in a cluster, (assuming none of the
functions called are non- deterministic and machines behave as we expect), we
will certainly have the same state on each machine after they have each
executed the log. This would be desireable perhaps to provide a “reliable”
service for which there are more reads than writes, and more reads than can
be handled by one node; then we can ensure any node is OK to read from by
having all nodes play functions as prescribed by the log.
Then, he describes the following situation: getting every part of a tech-
company’s data to every service that needs it is very complicated. In the worst
and most naive case it would be N2 because each of N places would be sending
data to each of N places. That’s a lot of network bandwidth, and complexity,
and data formats to understand, and places for things to screw up. So Kreps
suggests
multi-paxos as a classroom:
teacher := proposer
students := acceptors
all := learners
teacher to students:
may I have your attention for the next 1:15 hrs?
I'd like to teach youz guys all about "the paxos protocol"
here is what I'm assuming you already know
each student to teacher:
if not already busy for that time:
yes, here's how much I actually already know about paxos
else:
my apologies, but I will be busy skimming facebook for the next 1:15
so if there's something you want to tell me, post it on there
teacher to each student:
given how far behind you are, these are the things you need to catch up on
but also this is what I'm teaching now
each (alive & correct) student to teacher:
oh very cool!
this is how much I now understand of all the stuff you've said
teacher:
if the majority understands more than before:
set new start point for next time's lesson
For my Distributed Computing class at school, there were 3 projects that each
involved implementing a distributed protocol over a simulated distributed
cluster of computers that are only ‘connected’ to one another in that they know
each other’s IP address and must communicate over TCP sockets. For each of the
3 projects, I had a completely different method of implementing the system.
Implementation One: A True Distributed System in Vanilla Java
The first project (Three Phase Commit) was the only one for which I worked with
a partner. We were unclear how to attack the problem of building a distributed
system in a simple way, so we went ahead and implemented a truly distributable
Java application, in which you could spin up processes on any computer and it
would connect in to the master server, who would tell the new node the
listening ports of other servers in the cluster manually, etc. This turned out
to be perhaps overly ambitious. When a node was expecting a message from
someone else, it would fire a timer thread, which on completion would report to
the master to restart the failed process. This system did work properly
(somehow) but it seemed to be living-on-the-edge. It also required two weeks of
a lot of working on it and learning aspects of Java, unit testing, mocks, which
Exception would get thrown by which network event & when, etc.
Implementation Two: Lightweight System Simulation
For the second project (Paxos), I said “Hooey” to all the Java hubbub, which
while I enjoyed its low-level-ness, forcing me to peak further under the hood,
it required a larger time commitment than I was prepared to make. I realized
that my class’s requirements only specified that system “nodes” run
concurrently, share no memory, and talk over TCP. So in Scala, I made a Node
class state machine, and ran different instances of it on separate threads.
This was too easy though, and I was disappointed to find my implementation
complete more than two weeks before the due date.
Implementation Three: Akka Cluster
Akka Cluster is a Scala-based framework for building a distributed system, in
which you define the address of a set of “seed” nodes, of whom at least one is
supposed be available at all times. Then when other nodes start up, they become
a “member” of the cluster by contacting any seed node. So for me (the “client”
of Akka Cluster), all there is left to do is define the protocol to run on top
of the cluster. (In this case the protocol was the simple eventually-consistent
database protocol “Bayou”.)
Akka nodes follow the “Actor Model”, meaning they share no state with any other
part of your program, making them “simple” to run on remote machines (I didn’t
try that, the docs say it is easy though). The only way actors can communicate
is via message passing, for which there is a dedicated syntax, !. For
example:
The code above would have the actor running this code, send anotherActor an
instance of the case class MyMessage with the DataElem that was passed in.
Inter-actor communcation is guaranteed FIFO by Akka, meaning that for any
pair of actors A, B, if A sends n messages to B, B will receive those
messages in the same order that A sent them. NB this says nothing about the
order with respect to other actors in the system.
Using Akka trivialized all sorts of implementation details I’d had to worry
about in the first project, and then ignored for the second project.
If there had been a 4th project for this class, it’s clear to me that I would
use Akka again. For some reason my code ran very slowly, and it probably had
to do with some configuration-setting that I did not set in the Akka config
file. That was really the only problem I had that I did not fix. One problem I
did fix that took a while, was figuring out a good way to shut the system down.
There were various options discussed online, and eventually I found one that
worked for me, which was to call system.shutdown() on every instance System I
instantiate. Another option was to send self ! PoisonPill but for some reason
that didn’t work. I guess different ones work for different situations.
Before taking a Distributed Computing course at school, I had written for
various program environments:
a script that trains a neural network to recognize whether an English
sentence is grammatical
an iOS app that involves a fair bit of multi-threading
a command line interpreter that forks and runs commands as background
processes
the Raft distributed consensus protocol inside a professor’s multi-threaded
distributed system simulator
some asynchronous features for a simplified database system where threads
communicate solely via Java’s PipedInputStreams instead of using shared
variables
Perhaps you are familiar with these situations, and like me thought they might
be “distributed”. Reality check: they’re not. Similarly, in a distributed
system, one has multiple “threads” processing data at once. However,
these “threads” may be each on different machines; this adds some complexity.
None of the situations outlined above have the sort of issues one can expect
fairly regularly when talking amongst machines.
This semester I have been taking Distributed Computing with Professor Lorenzo
Alvisi. Distributed computing has been something I have been curious about for
a long time, and aside from writing some MapReduce jobs during a summer
internship 2 summers ago, I had no real experience thinking about distributed
systems. The class was very enjoyable.
The main lesson was that distributed systems are very flaky places, but you
can arrive at some surpringly simple solutions to most of the problems once
you find a better way of expressing what you really want. The other main lesson
was that finding a better way of expressing what you really want can be a
confusing thing to do.
Scenario 1: “Two Generals’ Problem”
Persons 1 and 2 would like to be sure that both agreed on a time to
meet, but neither can tell whether the other is listening to what is being
said.
Person 1: Let’s meet up at 2:00pm
Now person 1 doesn’t know whether person 2 was listening though. So person 2
feels compelled to respond (while still multi-tasking).
I spent quite a while struggling with the Java class declaration syntax, and
now I am starting to get it.
Problem simplification
I started out writing this post because I was totally right and the Java
compiler was wrong, and so on, but now I have cleared it up and I was wrong
and the compiler was right after all (surprise!). I was initially intending to
use this post as a way of getting satisfaction out of how much smarter I am
than the compiler, and instead I will use this space to explain to the next
how to answer similar questions. I started by heavily simplifying my issue
down to its core, and attacking the simplified problem instead of the original
one with extraneous variables and interconnections. I’ve never approached a
programming problem like that, and let me tell you: seems like it worked. By a
twist of fate, the new version looks like a LinkedList, that’s not what I
started out doing, but who knows what’ll happen when you keep smushing the
problem from a larger context into a little jar.
First, we have a Base, which holds an element elem of any type, and a
reference to another Base, quite like a linked list.
Next, we have an Ord, which is a Base that is Comparable. We require
that the element contained by an Ord is itself comparable, and furthermore
that the method of comparing Ords is to compare their respective elems:
This works fine, but note the following limitation:
main.java
1234567891011121314
publicstaticvoidmain(String[]args){Ord<Integer>a=newOrd<Integer>(3);Ord<Integer>b=newOrd<Integer>(4);a.compareTo(b);// this works just finea.next=b;// this does not work because a.next is of type Base<Integer>a.compareTo(a.next);// this DOES work though because a.elem is of type Integer which is Comparablea.elem.compareTo(a.next.elem);}
After writing and re-writing this blog post over and over, this makes sense
now. I used to be confused: “shouldn’t the compiler know that a.next is an
Ord?” And the answer is (now obvious) “Of course not.”