With Pith

Ethan Petuchowski

On Learning

I like to learn things, but I often feel like time I spend learning things is wasted.

It’s hard to tell what info is going to be useful and what is fluff

Classes in school (at all levels) lead to lots of wasted effort, because a lot of the stuff you’re supposed to learn is simply not useful knowledge. For example, for one college exam, I had to name about 30 different rocks and minerals by sight, and the names are very complicated. People say: “well, just going through the process is teaching you how to think.” There is something to that, but it presents a false choice: learning to think and acquiring useful knowledge are not mutually exclusive. Useless knowledge includes specific dates, long names, irrelevant historical events, and other things we might in everyday life dismiss as “trivia”.

In my experience, all academic disciplines have room to reduce the “trivia” overhead in their curricula.

In college computer science classes, there is not a whole lot of trivia. However, there is a “jump into abstraction” that often left me confused. I quite often didn’t understand why what I was learning was important. Edsger Dijkstra said we should teach the abstractions before rotting the students’ brains with modern programming realities and deficiencies. That’s a laudable mission, but problems still need to be better motivated.

Difficult material with no motivation is not rewarding

Take, for example, my first Operating Systems course. Everything seemed way more complicated than anything I should ever be expected to know in real life. So I figured I would never end up needing to understand virtual memory, processes and threads, scheduling, networking, etc. In reality, I just had no idea what I was talking about, and that’s why it should have been motivated. For example, I have since learned that the Linux kernel has a very interesting open source development ecosystem. They are making upgrades to the thing all the time that affect everyone developing most kinds of modern software. In addition, a whole lot of the different features of computers are motivated by internal business tool use cases, rather than home consumer use cases. Being a college student without a computer nerd background, I had no idea about any of that. Being taught to appreciate how important this stuff would end up being for me would have made me learn drastically more during the course of the class.

What I wanted at that time was for the material to be motivated with an example, like, “let’s build a program for running a medical radiation machine.” Now we’re talking; we’re going to need to get all the different low-level components right and make them fit together so that we can save lives!

What actually happened was similar in content, but not in objective. We were expected to read a very long and dry paper on the Therac-25, a radiation machine with concurrency issues that killed a few people in the 1980s. I spent a few hours trying to read the paper. One could say I was spending that time “learning to think”, but I think I spent that time “getting nowhere”. The whole time I was reading the paper, I was thinking: it would take me so long to get to the point of this paper, and I’m not even going to get a whole lot out of it. No good. In the end I learned about the Therac-25 from Wikipedia, a source of information whose expected readership has a level of background knowledge more commensurate with my own. Then, a few years later, I had to read that paper again for a graduate operating systems class, and at that point I had sufficient background knowledge to simply read the paper and feel like I understood its content and learned important lessons about software engineering.

This example demonstrates the fundamental principle that the same content can lead to completely different learning outcomes for different people, and even for the same person at different points of time and (as we shall return to later) in different emotional moods.

Some keys to not wasting effort

  • Have something in mind that you want to accomplish with your newly-obtained knowledge
    • E.g. “I want to build the software for a medical radiation device”
  • Have someone (colleague) or some place (e.g. stack overflow) where you can ask questions when you get confused
    • Sometimes having another person explain the whole thing in one shot, face to face, can lead you to simply get it
  • Learn the relevant vocabulary for the field on Wikipedia
    • For example, spend at least half an hour digging through linked topics until you have a general grasp of the various concerns and their names, and ideally how they relate to one another
  • Skim liberally but don’t expect to understand what you skim

First Glance at Genomics With ADAM and Spark

At work, we have a Spark cluster. One of my first responsibilities was to make it more reliable and efficient. So I looked on GitHub to see how people actually use Spark, and what magic they use to get their clusters not to crash in the process. This is how I found ADAM, “a genomics analysis platform built using Avro, Spark, and Parquet”. Then I looked at the repo’s contributors list, and watched a few lectures by Frank Nothaft and Matt Massie, two of the project’s main contributors. What I heard there was pretty cool.

In short, they’re looking to build systems that will “one day” recommend more effective treatments for diseases including cancer and Alzheimer’s within an hour of receiving a patient’s DNA sample. They describe several components of what needs to be done to make [research toward] this possible.

HDFS Output Stream API Semantics

Writing to files can get tricky. You have to think about the semantics you want, versus any performance imperatives, etc. Here, we look briefly at the Linux file system API, and then contrast it with a brief look at the Hadoop Distributed File System (HDFS) Java API.

Linux file API

In the normal Linux file system API, there are various ways to “flush” a file. Here are a few of the ones I have seen.

We have fflush(3), which flushes all user-space buffers via the stream’s underlying write function. This data may still exist in the kernel (e.g. buffer cache and page cache [since 2.4, Linux buffer cache usually just points to an entry in the page cache see quora]).

We have fsync(2), which flushes modified pages of data from the operating system’s buffer cache to the actual disk device, and blocks until this has completed. Modified metadata (e.g. file size) is also written out to the file’s inode’s metadata section.

We have close(2), which closes a file descriptor, but does not cause flushing of any kernel buffers.

We have fclose(3), which flushes the stream’s user-space buffers (like fflush(3)) and then closes the stream along with its underlying file descriptor.
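The same distinction shows up in the JVM’s I/O layers, which is where a Scala program would meet it. Here is a minimal sketch of the analogy (not an exact equivalence): OutputStream.flush() plays the role of fflush(3), and FileDescriptor.sync() plays the role of fsync(2).

```scala
import java.io.{BufferedOutputStream, FileOutputStream}
import java.nio.file.Files

val path = Files.createTempFile("flush-demo", ".txt")
val fos = new FileOutputStream(path.toFile)
val out = new BufferedOutputStream(fos)

out.write("hello".getBytes("UTF-8"))

// Like fflush(3): drain the user-space buffer into the kernel via the
// underlying write. The bytes may still sit in the kernel's page cache.
out.flush()

// Like fsync(2): block until the kernel has pushed this file's dirty
// pages out to the disk device.
fos.getFD.sync()

// Like fclose(3): flush any remaining user-space buffers and close the
// underlying descriptor.
out.close()
```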



Hadoop Distributed File System (HDFS) Java API

In this API the names of the functions are similar, but the semantics are quite different.

In HDFS, a “file” is stored as a sequence of “blocks”, and each block is globally configured to be e.g. either 64 MB or 128 MB in size. Each block is separately stored on the configured number of machines, according to the chosen HDFS “replication factor”. For the instance of Linux running on a particular node in the HDFS cluster, a block is a file that Linux must track just like it would any other file: with a page/buffer cache (see above), inode, etc. Tracking and deciding which blocks belong to each HDFS file, and on which nodes each of those blocks is stored, is the responsibility of the HDFS NameNode (i.e. the single master node).

But the whole block-level view of HDFS is not (directly) visible to the HDFS client API. Instead, a client simply opens an OutputStream to a file by telling the NameNode that it either wants to create a new file, or append to an existing file. The NameNode responds with the DataNodes that should accept the first block of data. The client starts writing to the first DataNode willing to take its data. That DataNode pipelines this incoming data to the other DataNodes responsible for replicating this block.

Similar to the Linux file system API above, just because bytes have been “written” by the client does not mean they will necessarily:

  1. be visible to someone who now tries to read the file
  2. be reflected in the current metadata available about the file (which lives in the NameNode)
  3. survive a crash of the client or of the DataNode(s)

Similar to the Linux file system API above, we have a few methods we can use to decide the buffering semantics we want of our pending writes.

We have hflush(), which flushes data in the client’s user buffer all the way out to the nodes which are responsible for storing the relevant “blocks” of this file. The metadata in the NameNode is not updated. Data is not necessarily flushed from the DataNodes’ buffer caches to the actual disk device.

We have hsync(), which is documented to additionally flush the data all the way to the disk devices on the DataNodes, though in early versions of the API it was implemented as a simple alias for hflush().

We have close(), which closes the stream, makes sure all the data has arrived at all the relevant nodes, and updates the metadata in the NameNode (e.g. updates the file-length as seen from the hadoop fs -ls myFile.txt command line interface).
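To make these visibility rules concrete, here is a toy in-memory model of the semantics described above. This is not the real Hadoop API (the real calls live on org.apache.hadoop.fs.FSDataOutputStream); the class and field names here are invented for illustration.

```scala
import scala.collection.mutable

// Toy model of HDFS write visibility. NOT the real Hadoop API.
class ToyHdfsFile {
  private val clientBuffer  = mutable.ArrayBuffer[Byte]() // client user buffer
  private val dataNodeBytes = mutable.ArrayBuffer[Byte]() // replicated on DataNodes
  var nameNodeLength: Long = 0                            // metadata in the NameNode

  def write(bytes: Array[Byte]): Unit = clientBuffer ++= bytes

  // hflush(): push buffered bytes out to the DataNodes. Readers can now
  // see them, but the NameNode's metadata is NOT updated.
  def hflush(): Unit = { dataNodeBytes ++= clientBuffer; clientBuffer.clear() }

  // close(): flush everything, then update the NameNode's view of the file.
  def close(): Unit = { hflush(); nameNodeLength = dataNodeBytes.length }

  def bytesVisibleToReaders: Int = dataNodeBytes.length
}

val f = new ToyHdfsFile
f.write("abc".getBytes("UTF-8")) // buffered: invisible to readers, metadata stale
f.hflush()                       // visible to readers, metadata still stale
val lengthAfterHflush = f.nameNodeLength
f.close()                        // now the NameNode agrees with the DataNodes
```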

In my experience, it is not safe to open and close the same file from the same instance of the Hadoop client on different threads. Maybe I was naive to think this would be OK, as the Linux man pages cited above seem to suggest that this would be problematic even with the direct Linux file system API.


Ramblings on Insight

I’m re-designing a program I’ve been working on for a few months. I’ve prototyped the new design and started implementing it in earnest. Because I haven’t thought this through completely or written much of the code yet, there is still room for fundamental issues to come up when plugging this model into my codebase, issues that could keep this major refactoring from ever completing. But what I really expect is that this model lets us express what is going on in such a way that, by looking at the names of the components, we know where to start looking for the code describing what is really happening. If that is true, and we write function names from the top down (i.e. start with “main” and descend, as a breadth-first search), then whole bunches of code will be isolated, in that they serve to create the capability of some unique higher-level function (or a small set of functions).

Even though I have a clearer conception of how to go forward than I did when first writing the program, I still find it hard to think through the whole problem without just trying it. But if I just try it, then I don’t know what I’m doing. Only after I learn what pieces I need can I start from the top-level design and work down.

I don’t know if there’s a particular moment when I am ready to see the problem more fully. In this case, it seemed to happen because I had to find a better way. The program was going to be a lot of trouble to keep dealing with if I didn’t figure out a better way to name and organize the different concerns it actually addresses. By naming and organizing things better, we get a little “registry” of package, class, method, and value names that reveal the solution’s structure.

Refactoring can get a bit tedious. But maybe the most tedious bits are often not really worth doing. I hope to build up a better intuition for it. But for now, it’s one of those things where I don’t know the words to describe what I’m really making. I can see how knowing a bunch of “patterns” would make this easier. Especially considering that the crux of the pipeline I implemented is the “observer” pattern, one of the only patterns I know. But I think the main thing is to just try things out, see what works, and what doesn’t, then go back and learn the “patterns” after I’ve read and written a bunch of different designs in the code.

Form in ‘Main’ Follows Program Function

My program is a pipeline that takes multiple data sources, transforms them, mashes them together, and writes them to multiple locations. It does this in a somewhat resilient way by using Kafka as an internal buffer and data bus. However, you would have no idea from the structure of the program that this is what is going on. In the “main” method, all that happens is that a few configuration settings are overridden, and a server is started. That doesn’t tell the reader anything about what’s happening.

Since I’m using Scala, the new design makes the “main” function look more like a Unix program:

val src1: DataSource[Type1] = Type1Source()
val src2: DataSource[Type2] = Type2Source()
val merger: Merger[Type1, Type2, Type3] = OneAndTwoMerger()
val output1: DataSink[Type3] = Dest1Sink()
val output2: DataSink[Type3] = Dest2Sink()

Merge(src1, src2) | merger tee (output1, output2)

// desugared to show there is no magic
Merge(src1, src2).|(merger).tee(output1, output2)

Ok, so data streams emanating from source-1 and source-2 are merged together by a type-compatible “merger” class, which writes its output stream into both output-1 and output-2.

There’s not a lot of code required to create those interfaces and methods. Basically, any number of DataSink[T]s can be “observers” of a DataSource[T]. Whenever a DataSource finds itself with data to publish, it calls the receiveMsgs(msgs: Seq[T]) method of all the observing DataSinks. So now we have a “reactive” (sources produce data whenever it is available to them), typesafe pipeline where components can be swapped in and out. Communication between sources and sinks by default is just function calls (i.e. synchronous), but their calls could be wrapped with Futures or Akka actors. Using function calls makes coding, testing, and debugging easier, has better type-checking, and doesn’t need backpressure. Increased asynchrony would allow for higher speeds, but is not needed yet, and will be hooked in as needed.
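A minimal sketch of what those interfaces could look like. DataSource, DataSink, and receiveMsgs(msgs: Seq[T]) are named in the text above; MapStage, ListSink, ManualSource, subscribe, and publish are names I’m assuming for illustration.

```scala
import scala.collection.mutable

trait DataSink[T] { def receiveMsgs(msgs: Seq[T]): Unit }

trait DataSource[T] {
  private val observers = mutable.ArrayBuffer[DataSink[T]]()
  def subscribe(sink: DataSink[T]): Unit = observers += sink
  // A source calls this whenever it has data: a synchronous fan-out to
  // all observing sinks (plain function calls, no Futures or actors).
  protected def publish(msgs: Seq[T]): Unit = observers.foreach(_.receiveMsgs(msgs))
}

// A transformer stage is a sink of T and a source of U at the same time.
class MapStage[T, U](f: T => U) extends DataSource[U] with DataSink[T] {
  def receiveMsgs(msgs: Seq[T]): Unit = publish(msgs.map(f))
}

class ListSink[T] extends DataSink[T] {
  val received = mutable.ArrayBuffer[T]()
  def receiveMsgs(msgs: Seq[T]): Unit = received ++= msgs
}

class ManualSource[T] extends DataSource[T] {
  def emit(msgs: Seq[T]): Unit = publish(msgs)
}

val src   = new ManualSource[Int]
val stage = new MapStage[Int, String](i => s"n=$i")
val out1  = new ListSink[String]
val out2  = new ListSink[String]
src.subscribe(stage)   // src | stage
stage.subscribe(out1)  // ... tee (out1, out2)
stage.subscribe(out2)
src.emit(Seq(1, 2))    // reactive: data flows as soon as it is available
```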

The biggest influences on this design are the Unix shell, and the Akka Streaming library, which I saw some presentations about. I think both were inspired by electrical engineering (e.g. circuits and signal processing).

With this approach, each component has a single responsibility: to ingest, filter, transform, aggregate, or output streams of data. Then in the “main” function we just assemble the data flow of the program by hooking components together. This means that to test the program, we just need to test that each component properly produces or consumes the data it says it does.

Before, almost all of my tests involved at least three separate major program components. I think I will start by re-writing those, and wherever things don’t work, write lower-level tests of one thing, and keep zooming in like that. That way, testing effort is spent on the parts that are hard to get right. I’m not writing the tests first because most of the code for the program is simply hooking things into each other. Testing that would be an unnecessary duplication of effort: if the main logic is plain to see and understand and will not undergo heavy modification, it does not need to be written twice. Then there are a few bits that use some pretty difficult external APIs that can be used well or used badly. I want to make sure that I’m using those at least as well as is necessary for the program to function properly. Most of the issues I’ve had in the past are with the HDFS API. With HDFS, it sometimes takes a little while for opens, writes, and closes to propagate properly to all the replicas. Before I knew that, I was using the API sub-optimally, and the program would crash every twelve hours or so. That problem itself would not be simple to test against, but it gives the impression that interaction with these external APIs is where the main complexity in my program lives.

In this new “source-to-sink” program model, a single Kafka “topic” can be implemented as an object (i.e. Singleton) that has two ends (fields): a Producer (which is a DataSink[T], since it writes data out of the program), and a corresponding Consumer (a DataSource[T] for the program).

So if the program has two “main” functions, one connects to the Producer side of a Kafka topic, and the other connects to the Consumer side, all using this Unix-like Scala DSL, then we have integrated Kafka as a resilient buffer connecting two stages of the pipeline. This means the computation subgraph connected within-JVM to the Consumer side can be taken offline for fixing or augmentation without losing ephemeral data being collected by the Producer side.
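Here is a toy sketch of that topic-as-object idea, with an in-memory queue standing in for Kafka. A real implementation would wrap KafkaProducer and KafkaConsumer from kafka-clients behind these two fields; Topic, poll, and the field names are my assumptions.

```scala
import scala.collection.mutable

trait DataSink[T]   { def receiveMsgs(msgs: Seq[T]): Unit }
trait DataSource[T] { def poll(): Seq[T] }

// A topic is an object with two ends: a Producer (a DataSink, since data
// leaves the upstream pipeline here) and a Consumer (a DataSource for the
// downstream pipeline).
class Topic[T] {
  private val buffer = mutable.Queue[T]()
  val producer: DataSink[T] = (msgs: Seq[T]) => buffer ++= msgs
  // The downstream stage drains the buffer, possibly much later,
  // possibly from a different "main".
  val consumer: DataSource[T] = () => buffer.dequeueAll(_ => true)
}

val topic = new Topic[String]
topic.producer.receiveMsgs(Seq("a", "b")) // upstream keeps collecting
val drained = topic.consumer.poll()       // downstream picks up when ready
```

Because the buffer sits between the two ends, the consumer side can be taken down and brought back without the producer side losing data, which is the resilience property described above.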

Name According to Function

Over the past few months, I wrote a kind of crappy program. Now I need to make additions to that program, and there are a lot of internals about the program that I need to recall in order to implement the additions efficiently and robustly. This is not a world I want to continue to live in, so my crappy program requires some sort of improvement. I looked to Amazon for a book with the answers on “what specifically to do”. Based on its cover, title, and reviews, I ended up with the book Clean Code: A Handbook of Agile Software Craftsmanship, put together from multiple authors by a guy who refers to himself as “Uncle Bob” (real name Robert C. Martin). The book is frankly a big part of the answer I was looking for.

I don’t know a whole lot about Uncle Bob, but he seems to be very experienced with designing, writing, maintaining, and refactoring large Java projects. He is also very well-read on modern software engineering, but for the first quarter of the book he manages to stay away from over-use of terminology I’m not familiar with. After that it goes into things like agile, test-driven development, behavior-driven development, cohesion, object-oriented programming patterns (e.g. Gang of Four), plain old Java objects, data access objects, etc., which I don’t have much familiarity with. He also talks about his conversations with the (only) guys in this arena that I have heard of, including Fred Brooks and Martin Fowler.

His writing tone can be characterized as follows

  • I have read and written sooo much more code than you
  • Over time, I have taken the time to think about the best way to make the code ‘clean’
  • Herein, I shall share with you what I have learned
  • I’ll use the best didactic methods I can think of
  • I hope you benefit as much as possible from my wisdom

His main didactic method can be characterized as

  • Here is an example of some bad code
  • This is how it goes wrong
  • This is why that is bad
  • Let’s improve it in these areas
  • Here is the improved code
  • Note how this new code doesn’t have the flaws of the previous

So far I have read perhaps half of the book, and I have already come up with, and partially implemented, a design for my problematic program that is much better than the old one; a rough prototype is already running, and I have much more “direction/vision” for how it is going to progress. That shows me that the Clean Code book has produced an incredible return on investment.

The main thing I have understood from the book is that parts of programs should do what they say they do. Put most briefly, there are two parts to making that possible:

  1. Components should be simple enough to be described by just a few words
  2. The name of a component should be the few words that describe it

It is embarrassing to say that I never thought of that myself, but the truth is I didn’t, and that is reflected all-too-obviously in the program that I wrote. I can come up with countless excuses for why I wrote the program that way, but that doesn’t fix the problem. The only way to fix the problem is to reorganize the program to conform to the above two rules.
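As a toy illustration of the two rules (hypothetical code, not from my actual program):

```scala
// Before: a vague name for a method doing two describable things at once.
def handle(lines: Seq[String]): Int =
  lines.map(_.trim).filter(_.nonEmpty).count(_.startsWith("ERROR"))

// After: each component is simple enough to describe in a few words,
// and its name IS those few words.
def cleanedLines(lines: Seq[String]): Seq[String] =
  lines.map(_.trim).filter(_.nonEmpty)

def countErrorLines(lines: Seq[String]): Int =
  cleanedLines(lines).count(_.startsWith("ERROR"))
```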

Bigtable Paper Summary

When looking into what Cassandra and HBase are, and their relative strengths and weaknesses, people often seem to think they can get away with the following very succinct characterizations: “Cassandra is like Dynamo plus Bigtable, and HBase is just Bigtable”. I don’t know much about Dynamo or Bigtable because we skipped those papers in my systems courses. So to get started understanding what’s going on with all this mess, I decided to read the Bigtable paper. What follows is a brief summarization/retelling of the Bigtable paper. It follows roughly the form of the paper, especially in that it starts high level, and then digs slightly lower down into the implementation. It contains basically only and all of the parts of the paper that I found illuminating, but broken down into sentences that are hopefully easier to understand.

What problem are we solving?

Bigtable provides an API for storing and retrieving data.

It is most useful if

  • there is a lot of data coming in at a high rate over time
  • there is no need to join each data table with another
  • data might need to be updated
  • range queries are common

Bigtable is a distributed database. It is a database management system which allows you to define tables, write and update data, and run queries against the stored data. It is similar to a relational database, except that it is not “relational” in the sense denoted by the term “relational database”. Instead, it brings its own type of data model, where instead of storing data in normal two-dimensional table cells, you store it according to a new set of rules that allows for flexibility in the shape of each record, while still enabling overall efficiency.
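The paper describes this data model as a sparse, distributed, persistent, multi-dimensional sorted map keyed by (row, column, timestamp). Ignoring the “distributed” and “persistent” parts, the shape can be sketched with nested sorted maps; the rows below are the webtable example from the paper.

```scala
import scala.collection.immutable.SortedMap

// (row: string, column: "family:qualifier", time: int64) -> value,
// with rows kept in sorted order (which is what makes range scans cheap).
type Row = String
type Column = String
type Timestamp = Long
type Value = String
type Bigtable = SortedMap[Row, Map[Column, SortedMap[Timestamp, Value]]]

// Timestamps sorted descending, so the first entry is the newest version.
val newestFirst = Ordering[Timestamp].reverse

// Row keys in the webtable are reversed URLs, e.g. "com.cnn.www".
val table: Bigtable = SortedMap(
  "com.cnn.www" -> Map(
    "contents:"        -> SortedMap(3L -> "<html>v3</html>", 2L -> "<html>v2</html>")(newestFirst),
    "anchor:cnnsi.com" -> SortedMap(9L -> "CNN")(newestFirst)
  )
)

// Reading a cell gives you the most recent version first.
val newest = table("com.cnn.www")("contents:").head._2
```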

What Is a Rails Application

What is a server and how does it relate to my Rails app?

I learned the basics of Ruby on Rails web application development a few years ago. At that time, my understanding of how that system worked was as follows:

  1. Find articles describing things Rails allows you to do easily
  2. Decide on an app that requires only those things
  3. Follow the instructions in those articles to implement your app idea
  4. Push the app to Heroku
  5. Now the app is accessible to anyone on the World Wide Web

The fact that it is so easy to do such a thing is nothing short of magical. Especially while you have no idea how any of it works. Now that I am slightly more knowledgeable about how software systems are put together and deployed, I’d like to take a slightly more nuanced stab at what really was going on when the above 5 steps were executed.

Simple First Deployment

I just had my first experience of “deploying my system into production”. I have been learning about software engineering for a few years now, and I have seen this term “deploy into production” many times, but never experienced it myself. The “software system” that was deployed is an internal tool, almost 3 months in development by yours truly. This article is a retrospective of the development of this system so far.

What Actually Is SSH

SSH Tunneling

“Tunnelling”, with two ells, is the British spelling.

A few months ago, I downloaded a tool (an ELK stack) that didn’t work right off the bat due to some sort of misconfiguration. It was running in a Vagrant-made virtual machine (VM) on my laptop. The Vagrant setup script had forwarded a local port on my laptop into the VM. So in order to debug it, my coworker configured a chain of tunnels that enabled him to SSH into the VM on my laptop.

In the back of my mind, I spent the next few weeks trying to figure out what SSH port-forwarding is and how its syntax works, then another few weeks to figure out what reverse port-forwarding is, and another few weeks to find practical use-cases for each.

Here is my executive summary

ssh -L [<localhost>:]<localport>:<remotehost>:<remoteport> <gateway>

By default, <localhost> will be localhost.

What this does is start a server socket listening at local address <localhost>:<localport> inside the “SSH client”. When a client establishes a connection to that address, traffic received from that client is encrypted and forwarded to the sshd[aemon] listening on port 22 of <gateway>. (Only) after the gateway receives this traffic does the sshd establish a (normal, unencrypted) connection to remote address <remotehost>:<remoteport>, and forward the data originally received by the SSH client there. Response traffic originating from <remotehost>:<remoteport> goes back to the sshd, back through the encrypted tunnel to the SSH client, and back to the originating client.

Reverse tunneling by contrast, means that traffic originating on the remote end will be forwarded to the local end.

What’s SSH

One thing that confused me about SSH forwarding is that if I don’t use some extra flags to disable it, when I set up port forwarding to another machine, I also end up with a shell ready to execute commands on the remote machine. What is going on here? It turns out SSH is doing what is called “remote command execution”.