Technical Journal

An OSI layer model for the 21st century

2014-04-24T17:48:03-04:00

The Internet protocol suite is wonderful, but it was designed before the advent of modern cryptography and without the benefit of hindsight. On the modern Internet, cryptography is typically squeezed into a single, incredibly complex layer, Transport Layer Security (TLS; formerly known as Secure Sockets Layer, or SSL). Over the last few months, 3 entirely unrelated (but equally catastrophic) bugs have been uncovered in 3 independent TLS implementations (Apple SSL/TLS, GnuTLS, and most recently OpenSSL, which powers most “secure” servers on the Internet), making the TLS system difficult to trust in practice.

What if cryptographic functions were spread out into more layers? Would the stack of layers become too tall, inefficient, and hard to debug, making the problem worse instead of better? On the contrary, I propose that appropriate cryptographic protocols could replace most existing layers, improving security as well as other functions generally not thought of as cryptographic, such as concurrency control of complex data structures, lookup or discovery of services and data, and decentralized passwordless login. Perhaps most importantly, the new architecture would enable individuals to internetwork as peers rather than as tenants of the telecommunications oligopoly, putting net neutrality directly in the hands of citizens and potentially enabling a drastically more competitive bandwidth market.

	Current OSI model	In practice	Proposed update
8	(none)	Application	Application
7	“Application”	HTTP	Transactions
6	Presentation	SSL/TLS	(Non-)Repudiation
5	Session	TCP	Confidentiality
4	Transport	TCP	Availability
3	Network	IP	Integrity
2	Data Link	e-UTRA (LTE), 802.11 (WiFi), 802.3 (Ethernet), etc.	Data Link
1	Physical	e-UTRA (LTE), 802.11 (WiFi), 802.3 (Ethernet), etc.	Physical

Of course, the layers I propose will doubtless introduce new problems of their own, but I’d like to start this conversation with some concrete ideas, even if I don’t have a final answer. (Please feel free to email me your comments or tweet @davidad.)

Descriptions follow for each of the five new layers I suggest, four of which are named after common information security requirements, and one of which (Transactions) is borrowed from database requirements (and also vaguely suggestive of cryptocurrency).

General disclaimer for InfoSec articles: Reading this article does not qualify you to design secure systems. Writing this article does not qualify me to design secure systems. In fact, nobody is qualified to design secure systems. A system should not be considered secure unless it has been reviewed by multiple security experts and resisted multiple serious attempts to violate its security claims in practice. The information contained in this article is offered “as is” and without warranties of any kind (express, implied, and statutory), all of which the author expressly disclaims to the fullest extent permitted by law.

Data Link and Physical layers

For our purposes today, the Data Link and Physical layers are a black box (perhaps literally), to which we have an interface (the “network interface”) which looks like a transmit queue and a receive queue. These queues can store “payloads” of anywhere from 1 to 1280¹ octets (bytes). The next layer in the stack can push a payload onto the Data Link transmit queue (and possibly get an error if it’s full) and can pop a payload from the Data Link receive queue (and possibly get an error if it’s empty). The Data Link layer is responsible for (eventually) flushing the transmit queue, and any payload which leaves the transmit queue must appear on the receive queues of all other devices connected to the same channel (a technical term, which may refer to a radio channel in the case of cellular devices, or simply to a particular length of cable in a point-to-point wired connection).
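As a rough illustration of that interface, here is a minimal sketch in C of one such queue. The names (payload_queue_t, queue_push, queue_pop) and the fixed depth of 8 slots are assumptions made up for this example; only the 1 to 1280 octet payload range and the "error if full / error if empty" behaviour come from the description above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define MAX_PAYLOAD 1280   /* largest payload in octets, per the text */
#define QUEUE_SLOTS 8      /* assumed queue depth, chosen arbitrarily */

typedef struct {
    uint16_t len;                 /* 1..MAX_PAYLOAD */
    uint8_t  data[MAX_PAYLOAD];
} payload_t;

typedef struct {
    payload_t slot[QUEUE_SLOTS];
    size_t head;                  /* index of the oldest queued payload */
    size_t count;                 /* number of payloads currently queued */
} payload_queue_t;

/* Push a payload. Returns 0 on success, -1 if the queue is full
   or the length is outside 1..MAX_PAYLOAD. */
int queue_push(payload_queue_t *q, const uint8_t *data, size_t len) {
    assert(q != NULL && data != NULL);
    if (len < 1 || len > MAX_PAYLOAD || q->count == QUEUE_SLOTS) return -1;
    payload_t *p = &q->slot[(q->head + q->count) % QUEUE_SLOTS];
    p->len = (uint16_t)len;
    memcpy(p->data, data, len);
    q->count++;
    return 0;
}

/* Pop the oldest payload into out (which must hold MAX_PAYLOAD octets).
   Returns the payload length, or -1 if the queue is empty. */
int queue_pop(payload_queue_t *q, uint8_t *out) {
    assert(q != NULL && out != NULL);
    if (q->count == 0) return -1;
    payload_t *p = &q->slot[q->head];
    int len = p->len;
    memcpy(out, p->data, p->len);
    q->head = (q->head + 1) % QUEUE_SLOTS;
    q->count--;
    return len;
}
```

A zero-initialised payload_queue_t is an empty queue. Every layer described below could expose the same two calls, just with a smaller maximum length.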

Integrity layer

We would like a received payload to self-evidently be the same payload which was sent. Although the Data Link layer is supposed to provide such an assurance, various kinds of attacks on the system might invalidate this assumption. Integrity protocols mitigate these attacks:

Paranoia Level	Attacks	Mitigation	Common Implementation	My Preferred Implementation
1	Thermal noise, cosmic rays	checksum hash	TCP Checksum	CRC-32C
2	Deliberate corruption	cryptographic hash	SHA-1	BLAKE2b
3	Spoofing of trusted contacts	keyed hash	HMAC-SHA1	SipHash
4	Spoofing of strangers	public-key signature of cryptographic hash	SHA-1 + RSA	BLAKE2b + Ed25519

Integrity protocols are fairly simple: the appropriate verification material is placed at the beginning of every Data Link payload. The Integrity layer exposes the same kind of “transmit queue and receive queue” interface as the Data Link layer, but the payload which can be passed to the Integrity layer must be somewhat smaller, so that there is room for the verification material and the Integrity payload together to fit into 1280 octets. Overhead ranges from 4 octets for a CRC-32C checksum to 96 octets for an Ed25519 signature (64 octets) accompanied by the signer’s public key (32 octets).
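For the lowest paranoia level, the framing might look like the following sketch in C. The CRC-32C routine is the standard bitwise form (reflected polynomial 0x82F63B78); the function names integrity_seal and integrity_open, and the choice to store the checksum big-endian, are assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define MAX_PAYLOAD 1280  /* Data Link payload limit from the text */
#define CRC_LEN 4         /* CRC-32C overhead in octets */

/* Bitwise CRC-32C (Castagnoli): slow but short and table-free. */
uint32_t crc32c(const uint8_t *data, size_t len) {
    uint32_t crc = 0xFFFFFFFFu;
    for (size_t i = 0; i < len; i++) {
        crc ^= data[i];
        for (int bit = 0; bit < 8; bit++)
            crc = (crc >> 1) ^ (0x82F63B78u & (0u - (crc & 1u)));
    }
    return crc ^ 0xFFFFFFFFu;
}

/* Write checksum (big-endian, an assumption) then the message into frame.
   Returns the frame length, or 0 if the message does not fit. */
size_t integrity_seal(uint8_t *frame, const uint8_t *msg, size_t len) {
    assert(frame != NULL && msg != NULL);
    if (len < 1 || len > MAX_PAYLOAD - CRC_LEN) return 0;
    uint32_t crc = crc32c(msg, len);
    frame[0] = (uint8_t)(crc >> 24);
    frame[1] = (uint8_t)(crc >> 16);
    frame[2] = (uint8_t)(crc >> 8);
    frame[3] = (uint8_t)crc;
    memcpy(frame + CRC_LEN, msg, len);
    return len + CRC_LEN;
}

/* Verify a received frame. Returns a pointer to the message inside the
   frame and stores its length in *len, or returns NULL on a mismatch. */
const uint8_t *integrity_open(const uint8_t *frame, size_t frame_len, size_t *len) {
    assert(frame != NULL && len != NULL);
    if (frame_len <= CRC_LEN || frame_len > MAX_PAYLOAD) return NULL;
    uint32_t want = ((uint32_t)frame[0] << 24) | ((uint32_t)frame[1] << 16) |
                    ((uint32_t)frame[2] << 8)  |  (uint32_t)frame[3];
    if (crc32c(frame + CRC_LEN, frame_len - CRC_LEN) != want) return NULL;
    *len = frame_len - CRC_LEN;
    return frame + CRC_LEN;
}
```

The higher paranoia levels would keep the same seal/open shape and swap the checksum for a cryptographic hash, keyed hash, or signature; this sketch only guards against accidental corruption.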

In the keyed hash case, some state is necessary at the Integrity protocol level: each API customer must be able to add “trusted contacts” to its “address book” by specifying a symmetric key corresponding to a given endpoint name (which may have been negotiated at a higher protocol level, or simply out-of-band entirely). Since some advanced higher-level protocols may define symmetric authentication keys that are only good for a single use (e.g. Axolotl ratcheting after the handshake phase), “address book entries” should be single-use by default, with renewal explicitly required after each payload received from a given contact.

Availability layer

We would like networked endpoints to be available to receive packets from other endpoints in a way that is robust to unannounced changes in network topology. This layer conceptually takes the place of the Network layer in the original model, as it will be responsible for routing packets. Significantly, in this proposal, there are no “hosts” or “ports”: only “endpoints”, identified by public keys. This is simply taking the end-to-end principle one step further, by considering the “host” merely part of the network infrastructure which makes applications available.

A fully implemented Availability layer should provide unicast (deliver to a unique endpoint authenticated by a given public key, wherever it may be), anycast (deliver to nearest endpoint authenticated by a given public key), and multicast (a.k.a. pub/sub: route to all endpoints who have asked to subscribe to a given ID, and provide a subscription method).

Routing Semantics	Current Reliability	New Implementation (overlay on existing Internet)	New Implementation (native mesh)
Multicast	awful	S/Kademlia message broker	Straightforward extension of unicast
Anycast	decent	No advantage over load balancers	Possible extension of unicast
Unicast	excellent	Special case of multicast	Electric Routing

I believe the Electric Routing algorithm² is up to the challenge of replacing unicast³, and that it could be extended to provide multicast and even anycast, but other algorithms could be developed at this protocol layer as well. The first real-world implementation of the system I’m describing will very likely be developed as an overlay network on top of IP, in which case multicast can be implemented simply atop S/Kademlia, with unicast as a special case, and anycast can be emulated with standard load-balancing techniques.

The tradeoff here is that routers have a lot more work to do, since there are no “addresses” corresponding directly to geographic location. But, it means that every node on the network can participate as a router, so there is a lot more capacity to do that work. In addition, the endpoints-only scheme has many potentially desirable properties with respect to features like pseudonymity, NAT transparency, redundancy, and decentralization of the telecommunications market (especially in densely settled areas).

Confidentiality layer

Ideally, we would like to not transmit any information to anything other than the destination endpoint(s). This ideal is not in general achievable on a public network, but some types of mitigation are possible:

Paranoia Level	Attacks	Mitigation	Common Implementation	My Preferred Implementation
1	Sniffing payloads to trusted contacts	symmetric encryption	AES	ChaCha
2	Sniffing payloads to strangers	public-key encryption	RSA	RSA
3	Chosen plaintext attacks	key agreement + symmetric encryption	ECDH + AES	Curve25519 + ChaCha
4	Key compromise	ephemeral key agreement + symmetric encryption	ECDHE + AES	Axolotl ratchet with Curve25519, SipHash, PBKDF2, ChaCha

In cases 3 and 4, this layer has to maintain some state, holding session keys or message keys, and the Axolotl ratchet is a little complicated; but this layer does not have to worry about the verification of identity (which will be provided on a higher layer, by services such as keybase.io or using pronounceable hash fingerprints) or integrity (which will be provided by a lower layer).

Non-Repudiation and/or Repudiation layer

We would like for a receiver to be sure that a message they receive was sent by a given sender, and we would like for a sender to be sure that a given message was successfully received. Sometimes, we would also like for a receiver to be unaware of the location a message was sent from. The result is three related but orthogonal protocol types, which may be nested:

Repudiation Property	Meaning	Protocol
Non-Repudiation of Sending	Recipient knows immediate sender	Sender includes a hash of their public key in the message. To understand why this is necessary given the Integrity layer, read this excellent article
Non-Repudiation of Receipt	Sender knows message was received	Recipient must send a signed acknowledgement for every message. This also implements “reliable delivery”
Repudiation of Origin	Message is difficult to trace	Onion Routing

Transactions layer

We would like for sets of nodes which wish to maintain common mutable state variables to be able to do so, even in the presence of various types of adversaries. This is a common abstraction for the requirements of git, cryptocurrencies, and distributed databases (i.e. ACID MVCC). I propose that (borrowing most directly from git, but also from Clojure’s concurrent data structures) changes in large or complex mutable states be represented as changes to the root of a Merkle tree, thus reducing the state subject to transactional semantics to single-packet size⁴.
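To make the Merkle-root idea concrete, here is a small sketch in C. It is only an illustration under loud assumptions: FNV-1a is used as a dependency-free stand-in and is NOT a cryptographic hash (a real implementation would want something like BLAKE2b), and the names merkle_leaf and merkle_root, the domain-separation bytes, and the rule of promoting an unpaired node unchanged are all choices made for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define FNV_OFFSET 0xCBF29CE484222325ull
#define FNV_PRIME  0x100000001B3ull

/* FNV-1a, 64-bit. NOT cryptographic: a placeholder hash for the sketch. */
static uint64_t fnv1a(const uint8_t *data, size_t len, uint64_t h) {
    for (size_t i = 0; i < len; i++) {
        h ^= data[i];
        h *= FNV_PRIME;
    }
    return h;
}

/* Hash one chunk of state, prefixed with 0x00 to mark it as a leaf. */
uint64_t merkle_leaf(const uint8_t *data, size_t len) {
    const uint8_t tag = 0x00;
    return fnv1a(data, len, fnv1a(&tag, 1, FNV_OFFSET));
}

/* Hash two child hashes, prefixed with 0x01 to mark an interior node. */
static uint64_t hash_pair(uint64_t left, uint64_t right) {
    uint8_t buf[17];
    buf[0] = 0x01;
    for (int i = 0; i < 8; i++) {
        buf[1 + i] = (uint8_t)(left  >> (8 * i));
        buf[9 + i] = (uint8_t)(right >> (8 * i));
    }
    return fnv1a(buf, sizeof buf, FNV_OFFSET);
}

/* Reduce n >= 1 leaf hashes to a single root, overwriting the array.
   An unpaired node at the end of a level is promoted unchanged. */
uint64_t merkle_root(uint64_t *level, size_t n) {
    assert(level != NULL && n >= 1);
    while (n > 1) {
        size_t out = 0;
        for (size_t i = 0; i + 1 < n; i += 2)
            level[out++] = hash_pair(level[i], level[i + 1]);
        if (n % 2 == 1)
            level[out++] = level[n - 1];
        n = out;
    }
    return level[0];
}
```

However large the underlying state, the value that has to be agreed on transactionally is the single root, and changing any one chunk changes only the hashes on the path from that leaf to the root.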

To make it obvious what I’m intending to refer to, the owner of a particular “domain name” or a particular “coin” (or, generally, any cryptographically controlled resource) is an example of a mutable state. But so are, for instance, the contents of any social media profile, email inbox, hypertext page, or source code repository. These things could all be managed without reference to central authorities or single points of failure.

Paranoia Level	Attacks	Mitigation
1	Asynchrony; node failure/disconnection	D1HT tracker
2	Sybil attacks; eclipse attacks; churn attacks	S/Kademlia tracker
3	Malicious trackers	Leaderless Byzantine Paxos or Byzantine gossip
4	Any attack that Bitcoin can survive	Block-chain protocol

Many (including myself) have claimed that the core contribution of Bitcoin, the block-chain protocol, is a novel solution to the Byzantine Generals Problem, but it turns out this is somewhat misleading. Although the block-chain protocol is Byzantine-fault-tolerant in a novel way, there has been plenty of research on Byzantine protocols over the years, and it seems probably unnecessary to constantly “mine,” i.e. solve cryptopuzzles, to achieve Byzantine fault tolerance. The main reason to introduce cryptopuzzles is to reduce the efficacy of Sybil attacks, in which one malicious actor fabricates arbitrarily many identities in order to exceed the Byzantine fault tolerance threshold and control the system. However, these attacks can also be mitigated by requiring crypto-puzzles only for joining the network (as in S/Kademlia), and by blacklisting nodes which behave suspiciously (the latter being how most attacks on Bitcoin are stopped in practice).

Application layer

In such an environment, applications (or application components!) are essentially just maps from one mutable state to another, in functional reactive programming style. In the same way that you might encode packet filters into a kernel’s TCP/IP stack today, you might encode entire applications into a kernel’s “mesh” stack in the future. Various search functions, including full-text search, could be provided using the OneSwarm approach or potentially by distributed Bloom filters implemented atop this platform (an idea due to Andrée Monette). Resource control and access control can be provided by means of cryptographic capabilities.

But, in general, this layer is completely open for all sorts of applications. Essentially, any end-user service that runs on a network (and what doesn’t, these days?) would fit here.

Conclusion

I’ve outlined some radical ideas for how to re-build the Internet protocol stack in a way that is ultimately more coherent with Internet cultural values (freedom of expression, pseudonymity, reduced potential for abuses of power). This outline still needs quite a bit of work and thought before being turned into implementations, but I feel like I’ve reached a turning point in making my ideas about next-generation architectures concrete, and at a timely moment with respect to conversations about TLS and net neutrality. If you would like to see these concepts made into working code, please reach out and let me know.

¹ This number is cribbed from the IPv6 RFC.↩
² Coauthored by Petar Maymounkov, who also coauthored Kademlia, the DHT powering BitTorrent.↩
³ Electric routing does need some extensions to mitigate various attacks, but I believe the countermeasures from S/Kademlia are readily adapted to meet these needs.↩
⁴ This is similar in principle to the trick used by most practical public-key cryptosystems, which use the actual public-key algorithm only to encrypt a key from some symmetric cryptosystem, and then encrypt arbitrarily large content using a stream cipher. The common principle is that you can do the hard security algorithm on a small piece of data, and use easier security algorithms to apply those hard security properties to large chunks of data.↩

All Boolean functions are polynomials

2014-04-14T10:47:37-04:00

…in the integers mod 2 (a.k.a. the finite field of order 2). Multiplication mod 2 is AND:

A	B	(AB)	A B `AND`
0	0	0	0
0	1	0	0
1	0	0	0
1	1	1	1

Adding one mod 2 is NOT:

A	(A+1)	A `NOT`
0	1	1
1	0	0

So, multiplication plus one is NAND:

A	B	(AB+1)	A B `NAND`
0	0	1	1
0	1	1	1
1	0	1	1
1	1	0	0

Since NAND is universal, and any finite composition of polynomials is a polynomial, any finite Boolean circuit is a polynomial. Here are all 16 two-input functions:

Lookup table	Boolean function (RPN)	Polynomial	Polynomial bitmap
0000	0	0	0000
0001	A B `AND`	AB	0001
0010	A (B `NOT`) `AND`	AB+A	0101
0011	A	A	0100
0100	(A `NOT`) B `AND`	AB+B	0011
0101	B	B	0010
0110	A B `XOR`	A+B	0110
0111	A B `OR`	AB+A+B	0111
1000	A B `OR` `NOT`	AB+A+B+1	1111
1001	A B `XOR` `NOT`	A+B+1	1110
1010	B `NOT`	B+1	1010
1011	A (B `NOT`) `OR`	AB+B+1	1011
1100	A `NOT`	A+1	1100
1101	(A `NOT`) B `OR`	AB+A+1	1101
1110	A B `AND` `NOT`	AB+1	1001
1111	1	1	1000

It’s interesting that in many cases, including those corresponding to the “basic” functions of AND, OR, XOR and NOT, the polynomial bitmap is identical to the lookup table.

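It’s also interesting that every one of these polynomials is multilinear (of degree at most one in each variable), possibly plus a constant term of 1. That is no accident: since x·x = x mod 2, a higher power of a variable can always be replaced by the variable itself.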
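The table of 16 functions can be reproduced mechanically. The sketch below, in C, uses the binary Möbius transform (XOR each entry into the entries that have one more variable set) to turn a lookup table into polynomial coefficients. It assumes the bit orders the table appears to use: the lookup table lists f(0,0), f(0,1), f(1,0), f(1,1), and the bitmap lists the coefficients of 1, A, B, AB. The function names are made up for this example.

```c
#include <assert.h>
#include <string.h>

/* Convert a 4-character lookup table such as "0110" into the
   4-character polynomial bitmap, written as a C string. */
void zhegalkin(const char *lookup, char bitmap[5]) {
    int t[4];
    for (int x = 0; x < 4; x++) {
        assert(lookup[x] == '0' || lookup[x] == '1');
        t[x] = lookup[x] - '0';          /* index x = 2*A + B */
    }
    /* Binary Moebius transform over the two variables. */
    for (int j = 0; j < 2; j++)
        for (int x = 0; x < 4; x++)
            if (x & (1 << j))
                t[x] ^= t[x ^ (1 << j)];
    /* Now t[m] is the coefficient of the monomial whose variables are
       the set bits of m: bit 1 means A is present, bit 0 means B is. */
    bitmap[0] = (char)('0' + t[0]);      /* coefficient of 1  */
    bitmap[1] = (char)('0' + t[2]);      /* coefficient of A  */
    bitmap[2] = (char)('0' + t[1]);      /* coefficient of B  */
    bitmap[3] = (char)('0' + t[3]);      /* coefficient of AB */
    bitmap[4] = '\0';
}

/* Count the two-input functions whose bitmap equals their lookup table. */
int count_matches(void) {
    int matches = 0;
    for (int n = 0; n < 16; n++) {
        char lookup[4], bitmap[5];
        for (int i = 0; i < 4; i++)
            lookup[i] = (char)('0' + ((n >> (3 - i)) & 1));
        zhegalkin(lookup, bitmap);
        if (memcmp(lookup, bitmap, 4) == 0) matches++;
    }
    return matches;
}
```

For instance, zhegalkin("1110", out) leaves "1001" in out, which is AB+1, the NAND row. count_matches() returns 8, so exactly half of the two-input functions have a bitmap identical to their lookup table.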

Naturally, I’m not the first person to notice this. It was first noticed by I. I. Zhegalkin in 1927. And I haven’t yet found any especially compelling uses of the representation. (If you actually want to represent boolean functions, you’re probably better served by ZDDs.) But I found it an interesting discovery which might just come in handy someday.

Getting started with nginx configuration

2014-04-06T20:05:53-04:00

Thanks to fellow Hacker Schooler Leah Steinberg for inspiring this post!

Having intermittently struggled with apache2 configuration files for the majority of my adult life, I find nginx an absolute joy to set up. I’m completely sincere about that. But, for those who are just getting into Web development, nginx is just about as much of a struggle as Apache used to be—in fact, probably more so, because there’s less abundant learning material out there on the Internet.

So, here’s an attempt to make that situation just the slightest bit better.

If you don’t already have nginx installed, I encourage you to follow these directions for building OpenResty, an enhanced version of nginx that enables building entire Web apps within the nginx process using the beautiful programming language Lua.

But, from here on, I’m going to assume that you already have a stock version of nginx installed. Verify that if you run

$ nginx -v

you get some kind of reasonable response, like

nginx version: nginx/1.2.3

Success!

Now, make a file called hi.conf:

hi.conf

error_log stderr;
pid nginx.pid;
http {
    access_log off;
    server {
        listen 4945;
        location / {
            return 200;
        }
    }
}
events {}

I’ve chosen the number 4945 so as to hopefully not conflict with any services that may already be running on your machine for one reason or another. Now, let’s launch nginx using this configuration file and test it:

$ nginx -p `pwd`/ -c hi.conf
nginx: [alert] could not open error log file: open() "/var/log/nginx/error.log" failed (13: Permission denied)
$ telnet localhost 4945
Trying 127.0.0.1...
Connected to localhost.
Escape character is '^]'.
GET / HTTP/1.0

HTTP/1.1 200 OK
Server: nginx/1.2.3
Date: Mon, 07 Apr 2014 01:50:28 GMT
Content-Type: text/plain
Content-Length: 0
Connection: close

Connection closed by foreign host.
$ kill -QUIT `cat nginx.pid`

You’ll have to actually enter the line GET / HTTP/1.0. HTTP is a protocol intended for humans to be able to read and write, and you may as well take advantage of it! Of course, you could also navigate to http://localhost:4945/ in a browser, but then all you see is a blank page, which is not quite as satisfying (to me, at least) as a 200 OK on the terminal¹.

What’s that? You want to actually serve data, and not just a blank page?

hi2.conf

error_log stderr;
pid nginx.pid;
http {
    access_log off;
    root .;
    server {
        listen 4945;
        location / {
            try_files /index.html =404;
        }
    }
}
events {}

Then just drop an index.html into the same folder as hi2.conf and run

$ nginx -p `pwd`/ -c hi2.conf

Now you should be able to load http://localhost:4945/ and see what you wrote in index.html. Exciting!

Next Steps

If you installed OpenResty, continue with their Getting Started. Otherwise, I’ll leave you to other tutorials, or to the actual nginx documentation – this was really just an exercise in getting something to work. But, I will offer this advice: I recommend against using any of your OS’s magic, like special files and folders where things are supposed to be put, or special incantations for invoking nginx. Just run nginx on the command line. It’s a smart enough program to stay running once you’ve started it, without the help of external infrastructure, and I think you’ll be much less frustrated working with it directly, having all the relevant files in one project directory, than struggling to configure both nginx itself and your OS’s favorite mechanism for managing server processes. Once you’ve figured out how to disable the OS’s auto-server-starting mechanisms, you can modify the listen line to listen 80 so you can stop typing that pesky :4945 in the browser.

Reloading

Oh, and one last trick: if you want to ask nginx to reload its configuration file without actually bringing down the server, just

$ kill -HUP `cat nginx.pid`

Happy hacking!

¹ 200 is the HTTP status code meaning “OK”, the status that accompanies most successful HTTP replies on the Web. As you might guess, that’s the same 200 referred to by the line return 200 in hi.conf.↩

VNC as a graphical interface medium

2014-03-30T19:21:34-04:00

The Virtual Network Computing (VNC) system for accessing the GUI environments of remote computers uses a protocol called Remote Frame Buffer (RFB) to exchange data about graphics output as well as keyboard and mouse input. RFB turns out to be a very sane protocol (specification PDF here) compared with X11, and infinitely more sane than Cocoa (which requires the ObjC runtime) or Win32 (no explanation needed). So, I thought, why not just expose a program’s graphical interface as a VNC server? Then we can let a VNC client deal with the vagaries of the host windowing environment, and we only need to speak a well-specified protocol on a socket.

So far, this is what I have to show (code on github):

This also turned out to be a good exercise in both raw socket programming and the use of zlib (the DEFLATE compression library), both of which I’ve skirted around before but never actually done directly in C¹. Check out my open_port function:

color_rotate_zrle.c

int open_port(uint16_t port) {
  int connfd, sockfd, y[1]={1};
  struct sockaddr_in addr = {.sin_family=AF_INET,.sin_port=htons(port),.sin_addr={.s_addr=htonl(INADDR_ANY)}};
  if( ( sockfd = socket(PF_INET, SOCK_STREAM, 0)                         ) < 0)  perror(  "socket"  );
  if( (      setsockopt(sockfd, SOL_SOCKET, SO_REUSEADDR, y, sizeof(int))) < 0)  perror("setsockopt");
  if( (            bind(sockfd, (struct sockaddr*)&addr, sizeof(addr))   ) < 0)  perror(   "bind"   );
  if( (          listen(sockfd, 1)                                       ) < 0)  perror(  "listen"  );
  if( ( connfd = accept(sockfd, NULL, 0)                                 ) < 0)  perror(  "accept"  );
  return connfd;
}

Once the socket connection is established, there’s some handshaking to do (as you can see, this is pretty stubby — it doesn’t wait for any messages from the client):

color_rotate_zrle.c

  int   connfd = open_port(PORT);
  write(connfd, protover,          sizeof(protover)-1);
  write(connfd, securitytype,      sizeof(securitytype));
  write(connfd, securitychallenge, sizeof(securitychallenge));
  write(connfd, securityresult,    sizeof(securityresult));
  write(connfd, serverInit,        sizeof(serverInit));
  write(connfd, name,              sizeof(name)-1);

Then, we can get down to business:

color_rotate_zrle.c

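  z_streamp z = calloc(1, sizeof(z_stream)); // zeroed: zlib needs zalloc, zfree and opaque set (Z_NULL = defaults) before deflateInit
  deflateInit(z,6);                          // compression level 6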
  uint8_t* buf=malloc(FBUFZ);
  uint8_t tile[] = {0x01, 0, 0, 255}; //solid blue
  const int frame_size=sizeof(tile)*(width/64)*(height/64);
  uint8_t* frame=malloc(frame_size);
  int t;
  double h=0, c=1, l=0.5;
  while(1) {
    hcl2pix(&tile[1],h,c,l);
    h+=0.01;
    for(t=0;t<(width/64)*(height/64);t++)
      memcpy(&frame[t*sizeof(tile)],tile,sizeof(tile));
    z->next_in=frame;
    z->avail_in=frame_size;
    z->next_out=buf;
    z->avail_out=FBUFZ;
    z->total_out=0;
    deflate(z,Z_SYNC_FLUSH);
    uint32_t length = htonl(z->total_out); // compressed byte count, big-endian on the wire
    write(connfd,fbuf_refresh,sizeof(fbuf_refresh));
    write(connfd,&length,4);
    write(connfd,buf,z->total_out);
    usleep(1e6/30);
  }

I’ve chosen to implement the encoding scheme ZRLE here, but most VNC clients will also support streaming raw pixel data, which would remove the dependency on zlib and simplify the logic somewhat². In the ZRLE encoding, the display area is split into 64x64-pixel “tiles”, each of which can be described in a variety of palettized and non-palettized encodings. The simplest (the one we’re using here) is the one-color palette, introduced by 0x01, and containing simply the one color (no further data is needed, since it’s implied that every pixel in the tile is that color). So, in our main display loop, we first update the tile (the hcl2pix function is one of my own devising, which you can find in colorspaces.c), then copy the (64x64) tile as many times as necessary to make a complete frame, then deflate it, and finally write it out to the socket and wait until it’s time for the next frame. That’s the essence of the program right there.
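For comparison, the raw-pixel alternative mentioned above might look like this sketch. It is not part of color_rotate_zrle.c: rfb_raw_solid and put_u16 are names invented here, and the pixel byte order assumes the pixel format this program advertises (32 bits per pixel, little-endian, red/green/blue shifts of 0/8/16). The message layout is the same FramebufferUpdate header used for ZRLE, with encoding-type 0 (Raw) and uncompressed pixels following it directly.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Store a 16-bit value in network (big-endian) byte order. */
static void put_u16(uint8_t *p, uint16_t v) {
    p[0] = (uint8_t)(v >> 8);
    p[1] = (uint8_t)(v & 0xff);
}

/* Fill out with a FramebufferUpdate holding one Raw-encoded rectangle of
   a single colour. Returns the number of bytes written: a 16-byte header
   plus w*h*4 pixel bytes. The caller must provide a large enough buffer. */
size_t rfb_raw_solid(uint8_t *out, uint16_t x, uint16_t y,
                     uint16_t w, uint16_t h,
                     uint8_t r, uint8_t g, uint8_t b) {
    assert(out != NULL);
    out[0] = 0;                 /* message-type: FramebufferUpdate */
    out[1] = 0;                 /* padding */
    put_u16(out + 2, 1);        /* number of rectangles */
    put_u16(out + 4, x);
    put_u16(out + 6, y);
    put_u16(out + 8, w);
    put_u16(out + 10, h);
    out[12] = out[13] = out[14] = out[15] = 0;  /* encoding-type 0: Raw */
    uint8_t *px = out + 16;
    for (size_t i = 0; i < (size_t)w * h; i++, px += 4) {
        px[0] = r;              /* red shift 0: lowest byte, sent first */
        px[1] = g;              /* green shift 8 */
        px[2] = b;              /* blue shift 16 */
        px[3] = 0;              /* unused top byte of the 32-bit pixel */
    }
    return 16 + (size_t)w * h * 4;
}
```

For the 1024x1024 framebuffer used here, each raw frame is a little over 4 MiB, whereas the solid-tile ZRLE frame is 1 KiB before compression. That is the price of dropping zlib.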

You may also be interested in the details of the RFB message formats:

color_rotate_zrle.c

const char protover[] = "RFB 003.003\n";
const char securitytype[] = {0x00, 0x00, 0x00, 0x02};
const char securitychallenge[16] = {0xaa};
const char securityresult[4] = {0};
const char name[] = "hello!";
const uint16_t width=1024, height=1024;
 
int main() {
  const char serverInit[] = {
    /*frame size*/   width>>8, width&0xff, height>>8, height&0xff,
    /*bpp*/ 32, /*depth*/ 24, /*big-endian*/ 0, /*true-colour*/ 1,
    /*red max*/      0, 0xff,
    /*green max*/    0, 0xff,
    /*blue max*/     0, 0xff,
    /*red shift*/    0,
    /*green shift*/  8,
    /*blue shift*/  16, /*padding*/ 0,0,0,
    /*name length*/  0, 0, 0, sizeof(name)-1 };
  const char fbuf_refresh[] = {
    /*message-type*/ 0,
    /*padding*/      0,
    /*nrects*/       0, 1,
    /*xpos*/         0, 0,
    /*ypos*/         0, 0,
    /*width*/        width>>8, width&0xff,
    /*height*/       height>>8, height&0xff,
    /*encoding-type*/0, 0, 0, 16 };

Future work includes:

Splitting out the frame encoding process to a send_rect function
Actually parsing messages from the VNC client
Providing user input handlers
Comparing to an SDL backend: same send_rect and register_handler abstractions might be nearly as easy to implement
Implementing a box model to route user input to interface elements
Implementing font rendering with FreeType
Implementing TeX+TikZ style graphics (big job)
Creating useful interface elements for this platform

¹ Yes, I did this in C. Almost every operation in the program is a function call, following the C calling convention, so it really wouldn’t be fun to do in assembly.↩
² Why did I choose ZRLE, then? Well, partly because I thought it was cool, and partly because I wanted to get some practice using zlib. But mostly because Apple’s “Screen Sharing” VNC client advertises ZRLE as one of few standard RFB encodings it accepts. Yet, this code as it is still doesn’t work with Screen Sharing. I wound up testing it with Chicken instead.↩

Concurrency Primitives in Intel 64 Assembly

2014-03-23T20:36:47-04:00

Now that nearly every computer has some form of multi-processing (that is, multiple CPUs sharing a single address space), some high-level languages are starting to get attention for their concurrency features. Many languages refer to such features as “concurrency primitives.” But since these are high-level languages, we know that these “primitives” must ultimately be implemented with hardware operations. Older high-level languages, like C, don’t have baked-in support for such operations – not because such languages are lower-level, but simply because the operations in question weren’t a thing when C was invented. Assembly language, being up to date with the latest CPU capabilities by definition¹, should provide the best window into the true nature of today’s concurrency operations.

In this post I’m going to walk you through a (relatively) simple concurrent assembly program which runs on OSX or Linux. Here’s the demo (github):

bash-3.2$ time ./concurrency-noprint-x1 foo    # single-worker version

real  0m1.458s
user  0m1.445s
sys   0m0.010s
bash-3.2$ # now run two at once
bash-3.2$ time ./concurrency-noprint-x1 foo-2 & ./concurrency-noprint-x1 foo-2
[1] 71366

real  0m0.785s
user  0m0.780s
sys   0m0.001s
[1]+  Done                    time ./concurrency-noprint-x1 foo-2
bash-3.2$ time ./concurrency-noprint-x4 foo-3  # four-worker version

real  0m0.417s
user  0m0.413s
sys   0m0.003s
bash-3.2$ time ./concurrency-noprint-x7 foo-4  # seven-worker version

real  0m0.295s
user  0m0.283s
sys   0m0.001s
bash-3.2$ diff -s --from-file=foo foo-*
Files foo and foo-2 are identical
Files foo and foo-3 are identical
Files foo and foo-4 are identical

What the program actually does is a pretty useless but computationally nontrivial and easily parallelizable task: taking the offset of each byte from the start of the buffer to the 65537th power mod 235, and storing that value back to each byte. Since it’s mod-235, the output should repeat itself every 235 bytes:

bash-3.2$ hexdump -e '235/1 "%4u" "\n"' -s8 foo
   1  37 158 194  35 206  32 128  54 120 146 102 233   9 125  36 122 118 184 210 121 232  33  14  50 161  72  98 164 160  11 157  38  49 180 136 162 228 154  15  76  12  88 124  10  46  47  48  84 205   6  82  18  79 175 101 167 193 149  45  56 172  83 169 165 231  22 168  44  80  61  97 208 119 145 211 207  58 204  85  96 227 183 209  40 201  62 123  59 135 171  57  93  94  95 131  17  53 129  65 126 222 148 214   5 196  92 103 219 130 216 212  43  69 215  91 127 108 144  20 166 192  23  19 105  16 132 143  39 230  21  87  13 109 170 106 182 218 104 140 141 142 178  64 100 176 112 173  34 195  26  52   8 139 150  31 177  28  24  90 116  27 138 174 155 191  67 213   4  70  66 152  63 179 190  86  42  68 134  60 156 217 153 229  30 151 187 188 189 225 111 147 223 159 220  81   7  73  99  55 186 197  78 224  75  71 137 163  74 185 221 202   3 114  25  51 117 113 199 110 226   2 133  89 115 181 107 203  29 200  41  77 198 234   0
*

Here, I’m asking hexdump to display this binary file in lines of 235 bytes each, one byte at a time, giving each byte 4 characters field-width and printing it as an unsigned integer (in decimal), with a newline at the end of the line, starting from offset 8 (as the first 8 bytes of the file are used by the concurrency mechanism for bookkeeping purposes²). The * on the second line of hexdump’s output means “every line after this matches it,” so the file must repeat itself every 235 bytes until the end. We can suppress the * with -v and examine the last 4 lines, just to be sure we understand it correctly:

bash-3.2$ hexdump -e '235/1 "%4u" "\n"' -s8 -v foo | tail -n4
   1  37 158 194  35 206  32 128  54 120 146 102 233   9 125  36 122 118 184 210 121 232  33  14  50 161  72  98 164 160  11 157  38  49 180 136 162 228 154  15  76  12  88 124  10  46  47  48  84 205   6  82  18  79 175 101 167 193 149  45  56 172  83 169 165 231  22 168  44  80  61  97 208 119 145 211 207  58 204  85  96 227 183 209  40 201  62 123  59 135 171  57  93  94  95 131  17  53 129  65 126 222 148 214   5 196  92 103 219 130 216 212  43  69 215  91 127 108 144  20 166 192  23  19 105  16 132 143  39 230  21  87  13 109 170 106 182 218 104 140 141 142 178  64 100 176 112 173  34 195  26  52   8 139 150  31 177  28  24  90 116  27 138 174 155 191  67 213   4  70  66 152  63 179 190  86  42  68 134  60 156 217 153 229  30 151 187 188 189 225 111 147 223 159 220  81   7  73  99  55 186 197  78 224  75  71 137 163  74 185 221 202   3 114  25  51 117 113 199 110 226   2 133  89 115 181 107 203  29 200  41  77 198 234   0
   1  37 158 194  35 206  32 128  54 120 146 102 233   9 125  36 122 118 184 210 121 232  33  14  50 161  72  98 164 160  11 157  38  49 180 136 162 228 154  15  76  12  88 124  10  46  47  48  84 205   6  82  18  79 175 101 167 193 149  45  56 172  83 169 165 231  22 168  44  80  61  97 208 119 145 211 207  58 204  85  96 227 183 209  40 201  62 123  59 135 171  57  93  94  95 131  17  53 129  65 126 222 148 214   5 196  92 103 219 130 216 212  43  69 215  91 127 108 144  20 166 192  23  19 105  16 132 143  39 230  21  87  13 109 170 106 182 218 104 140 141 142 178  64 100 176 112 173  34 195  26  52   8 139 150  31 177  28  24  90 116  27 138 174 155 191  67 213   4  70  66 152  63 179 190  86  42  68 134  60 156 217 153 229  30 151 187 188 189 225 111 147 223 159 220  81   7  73  99  55 186 197  78 224  75  71 137 163  74 185 221 202   3 114  25  51 117 113 199 110 226   2 133  89 115 181 107 203  29 200  41  77 198 234   0
   1  37 158 194  35 206  32 128  54 120 146 102 233   9 125  36 122 118 184 210 121 232  33  14  50 161  72  98 164 160  11 157  38  49 180 136 162 228 154  15  76  12  88 124  10  46  47  48  84 205   6  82  18  79 175 101 167 193 149  45  56 172  83 169 165 231  22 168  44  80  61  97 208 119 145 211 207  58 204  85  96 227 183 209  40 201  62 123  59 135 171  57  93  94  95 131  17  53 129  65 126 222 148 214   5 196  92 103 219 130 216 212  43  69 215  91 127 108 144  20 166 192  23  19 105  16 132 143  39 230  21  87  13 109 170 106 182 218 104 140 141 142 178  64 100 176 112 173  34 195  26  52   8 139 150  31 177  28  24  90 116  27 138 174 155 191  67 213   4  70  66 152  63 179 190  86  42  68 134  60 156 217 153 229  30 151 187 188 189 225 111 147 223 159 220  81   7  73  99  55 186 197  78 224  75  71 137 163  74 185 221 202   3 114  25  51 117 113 199 110 226   2 133  89 115 181 107 203  29 200  41  77 198 234   0
   1  37 158 194  35 206  32 128  54 120 146 102 233   9 125  36 122 118 184 210 121 232  33  14  50 161  72  98 164 160  11 157  38  49 180 136 162 228 154  15  76  12  88 124  10  46  47  48  84 205   6  82  18  79 175 101 167 193 149  45  56 172  83 169 165 231  22 168  44  80  61  97 208 119 145 211 207  58 204  85  96 227 183 209  40 201  62 123  59 135 171  57  93  94  95 131  17  53 129  65 126 222 148 214   5 196  92 103 219 130 216 212  43  69 215  91 127 108 144  20 166 192  23  19 105  16 132 143  39 230

Notice that the file’s length isn’t an exact multiple of 235 bytes: if you scroll all the way over, you’ll see that the very last line ends in the middle. That’s because this file isn’t generated by printing a particular 235-byte sequence in a loop. Rather, every 8-byte machine word is computed separately; the 235-byte repeating structure is built into the nature of the problem the program solves (which I chose, in part, so that it’s easy to check whether the results are sensible).
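As a sanity check on the repeating structure, the pattern can be reproduced outside the program. This is a minimal Python sketch, not part of the original code. It assumes that the byte at file offset 8 + k holds (k+1)^65537 mod 235, which is what the constants and the `sub eax, 0x7` in the assembly further down suggest. The names `expected_byte` and `row` are made up for illustration.

```python
EXPONENT = 65537   # same exponent as the assembly program
MODULUS = 235      # same modulus as the assembly program

def expected_byte(k):
    """Byte expected at file offset 8 + k, counting k from 0 (an assumption)."""
    return pow(k + 1, EXPONENT, MODULUS)

# One full period of the output.
row = [expected_byte(k) for k in range(MODULUS)]

# Print the first few values in hexdump's "%4u" layout for comparison.
print("".join(f"{b:4d}" for b in row[:8]))
```

The period is 235 because n and n + 235 are congruent modulo 235, so their modular powers agree. No loop over a stored pattern is needed for the repetition to appear.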

Critical Sections #

Let’s begin with a conceptual overview of the problem concurrency primitives are supposed to address³. When we have a single process operating in an address space (that is, a non-concurrent process), we can reason about the state of the entire address space at a particular point during the execution of the process. We can make statements like “this variable must be positive because we checked that it was positive four lines of code ago and we haven’t changed it since then.” In a concurrent process, a lot of this reasoning goes out the window, because our process’s sibling might have set the variable to -1 between here and there. We just have no way of knowing – the possible state transitions are too various to justify strong claims about. Claims like “the program produces correct output” tend to be very strong indeed in this context, and we often want to make such claims (at least to ourselves).

So, much like security is, in a sense, all about limiting features, concurrency primitives are means to restrict the possible state transitions of memory shared by multiple processors. The most common bugaboo is that a shared-memory state could have been changed by a sibling between the time that we measure it and the time that we take action based upon that measurement. So, as a general rule, the most basic concurrency operations measure a shared state and then use the data to change that shared state, while excluding siblings from accessing it throughout the whole operation. A region of code like this – where siblings are not allowed, during its execution, to access a particular memory location – is called a critical section.
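To make the measure-then-act pattern concrete, here is a small Python sketch of a critical section. It is an illustration only: the `Account` class and the numbers are hypothetical, and a `threading.Lock` stands in for whatever exclusion mechanism the hardware or OS provides.

```python
import threading

class Account:
    def __init__(self, balance):
        self.balance = balance
        self._lock = threading.Lock()

    def withdraw(self, amount):
        """Measure, then act, with siblings excluded throughout."""
        with self._lock:                  # critical section begins
            if self.balance >= amount:    # measure the shared state
                self.balance -= amount    # act on that measurement
                return True
            return False                  # critical section ends on exit

account = Account(100)
results = []

def attempt():
    results.append(account.withdraw(60))

threads = [threading.Thread(target=attempt) for _ in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Because the check and the update happen inside one critical section, exactly one of the two withdrawals can succeed. Without the lock, both threads could pass the check before either subtracts.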

On all Intel CPUs prior to Haswell (which started shipping last year; I don’t have one yet), “actual” critical sections are limited to single machine instructions with a lock prefix; larger critical sections can be emulated based on these single-instruction primitives. We’ll be doing a variation of this today.

Tasks and Workers #

I don’t know of any particular argument to justify the tasks-and-workers perspective on parallel computing, but in practice, it’s the one I’ve found most useful for organizing my code, and it seems to be fairly common. The idea is this: we divide our program’s workload into tasks of some granularity, and each task is picked up and operated on by exactly one of some number of interchangeable workers, which each run concurrently. The tasks should not be too small, so that the amount of work involved in choosing a task is not too great⁴, but they should also not be too large, so that if any worker finishes a task earlier than the others, there will likely be another task ready for it to do.

In this context, a task is a type of critical section, because once a task has been picked up by any single worker, the other workers are supposed to leave it alone. But critical sections carry the connotation of being very small, so they can execute and get out of the way quickly. I suppose I’ve been fortunate enough to work in domains where I’ve had the luxury to split things up into independent tasks most of the time. (These domains are sometimes referred to as embarrassingly parallel.) But in the code below, we’ll also see one example of a state variable which is operated on by short critical sections instead of tasks.
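The tasks-and-workers arrangement can be sketched in a few lines of Python. This is an illustration under assumed names (`NUM_WORKERS`, `NUM_TASKS`, `done_by`), using a thread-safe queue as the claiming mechanism rather than the scheme the assembly program uses.

```python
import queue
import threading

NUM_WORKERS = 4    # hypothetical number of interchangeable workers
NUM_TASKS = 32     # hypothetical number of tasks

tasks = queue.Queue()
for task_id in range(NUM_TASKS):
    tasks.put(task_id)

done_by = {}       # task_id -> name of the worker that picked it up

def worker(name):
    while True:
        try:
            task_id = tasks.get_nowait()   # each task goes to exactly one worker
        except queue.Empty:
            return                         # no tasks left: the worker retires
        done_by[task_id] = name            # stand-in for doing the real work

threads = [threading.Thread(target=worker, args=(f"w{i}",))
           for i in range(NUM_WORKERS)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

A worker that finishes early simply asks for another task, which is why moderately small tasks balance the load better than a few large ones.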

Show me the code already! #

concurrency.asm

%include "os_dependent_stuff.asm"
 
  ; Initialize constants.
  mov r12, 65537                 ; Exponent to modular-exponentiate with
  mov rbx, 235                   ; Modulus to modular-exponentiate with
  mov r15, NPROCS                ; Number of worker processes to fork.
  mov r14, (SIZE+1)*8            ; Size of shared memory; reserving first
                                 ; 64 bits for bookkeeping

The first two constants are pretty straightforward – just the parameters of the task to compute. NPROCS and SIZE need a bit of explanation. These are constants which are actually defined in the Makefile and passed in to nasm using the -D option (as in -DNPROCS=7)⁵. SIZE is actually measured in 8-byte machine words; it’s the number of tasks we want to perform. (As I briefly mentioned earlier, each task in this program is an 8-byte machine word of the output file to be computed.)

concurrency.asm

  ; Check for command-line argument.
  cmp qword [rsp], 1
  je map_anon

When our program is entered by the OS, the command-line is on the stack; [rsp] is the number of command-line tokens (including the name of the program itself), [rsp+8] will be a pointer to the name of the program, [rsp+2*8] a pointer to the first command-line argument (if there is one), and so on. If we don’t have any command-line arguments, then the number of tokens will be 1 (just the name of the program). In this case, we’re going to request an anonymous region of memory; otherwise, we’re going to open the file specified on the command line and map that. Note: if you haven’t seen mmap before, check out its man page. In my opinion, it’s the “right” way to do either memory allocation (“anonymous” mappings) or file I/O.

`open()`, `ftruncate()`, and `mmap()` #

concurrency.asm

open_file:
  ; We have a file specified on the command line, so open() it.
  mov rax, SYSCALL_OPEN          ; set up open()
  mov rdi, [rsp+2*8]               ; filename from command line
  mov rsi, O_RDWR|O_CREAT          ; read/write mode; create if necessary
  mov rdx, 660o                    ; `chmod`-mode of file to create (octal)
  syscall                        ; do open() system call
  mov r13, rax                   ; preserve file descriptor in r13
  mov rax, SYSCALL_FTRUNCATE     ; set up ftruncate() to adjust file size
  mov rdi, r13                     ; file descriptor
  mov rsi, r14                     ; desired file size
  syscall                        ; do ftruncate() system call
  mov r8,  r13
  mov r10, MAP_SHARED
  jmp mmap
 
  ; Ask the kernel for a shared memory mapping.
map_anon:
  mov r10, MAP_SHARED|MAP_ANON     ; MAP_ANON means not backed by a file
  mov r8,  -1                      ; thus our file descriptor is -1
mmap:
  mov r9,   0                      ; and there's no file offset in either case.
  mov rax, SYSCALL_MMAP          ; set up mmap()
  mov rdx, PROT_READ|PROT_WRITE    ; We'd like a read/write mapping
  mov rdi,  0                      ; at no pre-specified memory location.
  mov rsi, r14                     ; Length of the mapping in bytes.
  syscall                        ; do mmap() system call.
  test rax, rax                  ; Return value will be in rax.
  js error                       ; If it's negative, that's trouble.
  mov rbp, rax                   ; Otherwise, we have our memory region [rbp].

concurrency.asm

error:
  mov rdi, rax                   ; In case of error, return code is -errno...
  mov rax, SYSCALL_EXIT
  neg rdi                        ; ...so negate to get actual errno
  syscall

We actually have to make three system calls to get this set up: one to open the file (SYSCALL_OPEN), one to extend it to the appropriate size (SYSCALL_FTRUNCATE), and finally one to make the memory mapping (SYSCALL_MMAP).
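For readers who prefer to see the same three-step setup in a higher-level language, here is a rough Python equivalent. It is a sketch, not a translation of the assembly: `map_region` is a made-up helper, the size of 64 bytes is arbitrary, and Python's `mmap` module chooses a shared mapping by default on Unix.

```python
import mmap
import os
import tempfile

def map_region(size, path=None):
    """Return a shared read/write mapping of `size` bytes."""
    if path is None:
        return mmap.mmap(-1, size)        # anonymous mapping, fd is -1
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o660)   # open()
    try:
        os.ftruncate(fd, size)            # ftruncate() sets the file size
        return mmap.mmap(fd, size)        # mmap() the file itself
    finally:
        os.close(fd)                      # the mapping keeps its own handle

# Anonymous case: no command-line argument.
anon = map_region(64)

# File-backed case: a hypothetical file named "foo" in a scratch directory.
scratch = tempfile.mkdtemp()
path = os.path.join(scratch, "foo")
region = map_region(64, path)
region[8] = 1          # a store to memory...
region.flush()         # ...that ends up in the file
```

Writes through the mapping become file contents without any explicit `write()` call, which is the property the rest of the program relies on.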

`lock add` #

concurrency.asm

  lock add [rbp], r15            ; Add NPROCS to the file's first machine word.
                                 ; We'll use it to track the # of still-running
                                 ; worker processes.

Here’s our first concurrency primitive! We’re going to add NPROCS to the first machine word of this file. We’re counting on the fact that when the file is first created, all bytes will appear to be zero (a fact which is actually true on most Unix implementations). Why aren’t we just setting the word to zero? Well, a neat feature of this program is that we can run multiple copies of it on the same file, and they’ll share the work as if by magic. So, if we’re running as the second copy of the program, we don’t want to clobber this piece of bookkeeping state – we just want to contribute NPROCS workers to the worker pool.
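The bookkeeping idea can be mimicked in Python, with one loud caveat: Python has no equivalent of `lock add`, so in this sketch a `threading.Lock` stands in for the hardware lock, and it only protects threads inside one process. The names `add_to_worker_count` and `word_lock` are hypothetical.

```python
import mmap
import struct
import threading

NPROCS = 7                         # workers contributed per invocation
region = mmap.mmap(-1, 64)         # fresh anonymous mapping: all zero bytes
word_lock = threading.Lock()       # stand-in for the lock prefix (assumption)

def add_to_worker_count(delta):
    """Add `delta` to the first 8-byte word instead of overwriting it."""
    with word_lock:
        (count,) = struct.unpack_from("<Q", region, 0)
        struct.pack_into("<Q", region, 0, count + delta)
        return count + delta

after_first = add_to_worker_count(NPROCS)    # first copy of the program
after_second = add_to_worker_count(NPROCS)   # second copy joins the pool
```

Adding rather than storing is what lets a second invocation join without erasing the first invocation's workers from the count.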

`fork()` #

concurrency.asm

  ; Next, fork NPROCS processes.
fork:
  mov eax, SYSCALL_FORK
  syscall
%ifidn __OUTPUT_FORMAT__,elf64     ; (This means we're running on Linux)
  test rax, rax                  ; We're a child iff return value of fork()==0.
  jz child
%elifidn __OUTPUT_FORMAT__,macho64 ; (This means we're running on OSX)
  test rdx, rdx                  ; Apple...you're not supposed to touch rdx here
  jnz child                      ; Apple, what
%endif
  dec r15
  jnz fork

Apple’s implementation of fork() is a little messed up, so unfortunately we’re forced to put some OS-dependent logic in here. But the basic idea is, we’re going to keep calling fork() only if (a) we’re the parent process, and not a newly fork()ed worker process, and (b) the number of processes we were supposed to spawn hasn’t decremented to zero yet.
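The fork loop has the same shape in Python, where `os.fork()` returns 0 in the child on every Unix platform. This sketch assumes a Unix system and uses a hypothetical `NPROCS` of 4. Unlike the assembly program, the parent here waits with `os.waitpid()` instead of spinning on a shared counter.

```python
import mmap
import os

NPROCS = 4                          # hypothetical worker count
shared = mmap.mmap(-1, NPROCS)      # anonymous shared mapping, inherited by children

children = []
for slot in range(NPROCS):
    pid = os.fork()
    if pid == 0:                    # child: fork() returned 0
        try:
            shared[slot] = slot + 1 # each child writes only its own byte
        finally:
            os._exit(0)             # never fall through into the parent's loop
    children.append(pid)            # parent: remember the child and keep forking

for pid in children:
    os.waitpid(pid, 0)              # parent waits for every worker to finish
```

Each child writes to a distinct byte, so this example needs no atomic operations; the mapping is shared, so the parent sees all four writes once the children have exited.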

The parent process #

concurrency.asm

parent:
  pause
  cmp qword [rbp], 0
  jnz parent                     ; Wait for [rbp], the worker count, to be zero

Now, the parent process simply waits until there aren’t any active workers/child processes (they’ll gracefully disappear once there’s no more work for them to do). The pause instruction is a hint to the system that it shouldn’t actually spend a lot of energy spinning in this loop.

concurrency.asm

exit_success:
  mov eax, SYSCALL_EXIT          ; Normal exit
  mov edi, 0
  syscall

Once the number of active workers is zero, the parent bails out, returning the success code, 0.

Dividing up work #

Here’s where our workers divide up their tasks – the most important concurrency-related operation in the program:

concurrency.asm

child:
  mov rsi, r14                   ; Restore rsi from r14 (saved earlier)
  mov cl, 0xff                   ; Set rcx to be nonzero
  mov rdi, 8                     ; Start from index 8 (past the bookkeeping)
find_work:                       ; and try to find a piece of work to claim
  xor rax, rax
  cmp qword [rbp+rdi], 0         ; Check if qword [rbp+rdi] is unclaimed.
  jnz .moveon                    ; If not, move on - no use trying to lock.
  lock cmpxchg [rbp+rdi], rcx    ; Try to "claim" qword [rbp+rdi] if it is still
                                 ; unclaimed.
  jz found_work                  ; If successful, zero flag is set
.moveon:
  add rdi, 8                     ; Otherwise, try a different piece.
find_work.next:
  cmp rdi, rsi                   ; Make sure we haven't hit the end.
  jne find_work

The worker is linearly scanning each 8-byte word, starting with the second one in the [rbp] region (since the word right at [rbp] just represents how many workers there are), looking for one that is zero. As we covered earlier, the file is going to start out being all zeroes. The way I imagine this setup in my head, the file starts out as a barren desert of zeroes, like the old American West, and the workers are searching for a plot of land to homestead on. The first empty plot of land they find, they put up a sign that says “RESERVED” and they start building their homestead (that’s the task). In this case, the RESERVED sign is 0xff. Now, other workers will keep on movin’ until they find their own plot of land. The key here is to prevent two workers from putting up a RESERVED sign at the same location. That’s where lock cmpxchg comes in.

`lock cmpxchg` (compare-and-swap) #

This is a slightly complex but beautiful operation. It takes three parameters:

a memory location ([rbp+rdi] in this case), which has operand size dependent on the operand size of the next parameter (in this case, it’s an 8-byte machine word, because the next parameter is an 8-byte register)⁶,
an update value to store (rcx in this case, whose low byte was set to 0xff by mov cl, 0xff; this nonzero value is our sentinel for RESERVED), and
an expected value to compare against (always rax, an implicit parameter, and in this case zeroed out by xor rax, rax; zero is the value of unreserved words because it is the value freshly allocated files are filled with).

The first thing lock cmpxchg will do is lock the memory location and compare it to the expected value. Then, depending on the result, one of two things will happen:

If the comparison fails, that means the state of memory isn’t what we expected—it must have changed since we last looked. This is bad news; the update is aborted. To inform us exactly what went wrong, cmpxchg will overwrite rax with whatever is actually in memory now (instead of what we expected). The zero-flag ZF will be cleared to signal non-equality, and the memory location will be unlocked.
If the value in memory does match what we expected, then our update value replaces it in that memory location before any other CPU/core has a chance to either read or write there. That’s a “successful” compare-and-swap. The zero-flag ZF will be set to signal success, and the memory location will be unlocked as soon as it is updated.

The upshot in our application is that it’s impossible for more than one worker to reserve the same task, because reservation always happens in an atomic (locked) operation, which:

will be aborted if another reservation happened before it, and
will prevent any other atomic operation from starting until this one is done.
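The semantics described above can be modeled in Python. This is a simulation, not the real instruction: the `Word` class is hypothetical, a `threading.Lock` plays the role of the hardware lock, and the returned pair mirrors the zero flag and the value left in rax.

```python
import threading

class Word:
    """One shared machine word; the Lock stands in for the hardware lock."""

    def __init__(self, value=0):
        self.value = value
        self._lock = threading.Lock()

    def compare_and_swap(self, expected, update):
        """Return (success, observed), like ZF and rax after lock cmpxchg."""
        with self._lock:                   # lock the memory location
            observed = self.value
            if observed == expected:
                self.value = update        # success: store the update value
                return True, expected
            return False, observed         # failure: report what is there now

RESERVED = 0xFF
word = Word(0)                             # an unclaimed task
first = word.compare_and_swap(0, RESERVED)   # first worker to arrive
second = word.compare_and_swap(0, RESERVED)  # a later worker, same task
```

The second attempt fails and learns that the word already holds the RESERVED marker, so two workers can never both believe they claimed the same task.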

“test-and-test-and-set” #

You may notice that we do an ordinary cmp in advance of the lock cmpxchg. That’s not strictly necessary, but it speeds up this part of the program quite a bit; if a location was already claimed some time ago, we may as well notice that before putting a lock on it (which is a moderately expensive operation) and simply move on until we find something that looks empty (then lock cmpxchg to be sure it’s empty).
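Here is a Python sketch of the scan with the cheap test in front of the expensive one. It is a model under stated assumptions: `words` is an ordinary list standing in for the shared region, a `threading.Lock` stands in for the lock prefix, and the `stats` counter exists only to show how many locked operations were attempted.

```python
import threading

RESERVED = 0xFF
words = [0] * 16                  # 0 means unclaimed
cas_lock = threading.Lock()       # stand-in for the lock prefix (assumption)
stats = {"cas_attempts": 0}       # counts the expensive locked operations

def compare_and_swap(index, expected, update):
    with cas_lock:
        stats["cas_attempts"] += 1
        if words[index] == expected:
            words[index] = update
            return True
        return False

def find_work(start=0):
    """Return the index of a newly claimed word, or None if none is free."""
    for index in range(start, len(words)):
        if words[index] != 0:                      # cheap, unlocked test
            continue                               # already claimed: move on
        if compare_and_swap(index, 0, RESERVED):   # locked test-and-set
            return index
    return None

# Pretend three tasks were claimed some time ago.
words[0] = words[1] = words[2] = RESERVED
claimed = find_work()
```

The three stale entries are skipped by the plain read, so only one locked operation is issued for the whole scan.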

Doing the task #

concurrency.asm

found_work:
  mov r8, 8                      ; There are 8 pieces per task.
do_piece:                       ; This part does the actual work of mod-exp.
  mov r13, r12                   ; Copy exponent to r13.
  mov rax, rdi                   ; The actual value to mod-exp should start
  sub eax, 0x7                   ; at 1 for the first byte after the bookkeeping
  xor rdx, rdx                   ; word. This value is now in rax.
  div rbx                        ; Do modulo with modulus.
  mov r11, rdx                   ; Save remainder -- "modded" base -- to r11.
  mov rax, 1                     ; Initialize "result" to 1.
.modexploop:
  test r13, 1                    ; Check low bit of exponent
  jz .shift
  mul r11                        ; If set, multiply result by base
  div rbx                        ; Modulo by modulus
  mov rax, rdx                   ; result <- remainder
.shift:
  mov r14, rax                   ; Save result to r14
  mov rax, r11                   ; and work with the base instead.
  mul rax                        ; Square the base.
  div rbx                        ; Modulo by modulus
  mov r11, rdx                   ; base <- remainder
  mov rax, r14                   ; Restore result from r14
  shr r13, 1                     ; Shift exponent right by one bit
  jnz .modexploop                ; If the exponent isn't zero, keep working
  mov byte [rbp+rdi], al         ; Else, store result byte.
  inc rdi                        ; Move forward
  dec r8                         ; Decrement piece counter
  jnz do_piece                   ; Do the next piece if there is one.
  jmp find_work.next             ; Else, find the next task.

This article is long enough without a detailed prose explanation of binary exponentiation (which isn’t what it’s about, anyway). Suffice it to say that given an offset rdi into the [rbp] region, this chunk of code will replace each byte from [rbp+rdi] to [rbp+rdi+7] with the appropriate mod-exps of the values rdi-7 through rdi (so the first byte after the bookkeeping word, at offset 8, holds the mod-exp of 1). The code is somewhat deliberately inefficient (lots of divs, which consume dozens of clock cycles each) for realism’s sake: we want tasks to take a nontrivial length of time.
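For readers who want the algorithm without the registers, the loop can be restated in Python. This sketch follows the structure of the assembly (reduce the base, then test the low bit, square, and shift until the exponent is zero). The helper `do_task` and the use of a `bytearray` as the region are illustrative assumptions.

```python
EXPONENT = 65537
MODULUS = 235

def mod_exp(base, exponent, modulus):
    """Square-and-multiply, scanning the exponent from its low bit."""
    base %= modulus                       # the "modded" base
    result = 1
    while True:
        if exponent & 1:                  # low bit set: multiply result by base
            result = (result * base) % modulus
        base = (base * base) % modulus    # square the base
        exponent >>= 1                    # shift exponent right by one bit
        if exponent == 0:
            return result

def do_task(region, offset):
    """Fill the 8 bytes at `offset` with the mod-exps of offset-7 .. offset."""
    for piece in range(8):
        region[offset + piece] = mod_exp(offset + piece - 7, EXPONENT, MODULUS)

region = bytearray(24)                    # bookkeeping word plus two tasks
do_task(region, 8)
do_task(region, 16)
```

Python's built-in three-argument `pow` computes the same values, which makes it a convenient cross-check for the hand-written loop.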

Being done #

Note: the block of code below is out-of-order and overlaps both of the previous two.

concurrency.asm

find_work.next:
  cmp rdi, rsi                   ; Make sure we haven't hit the end.
  jne find_work
 
child_exit:                      ; If we have hit the end, we're done.
  lock dec qword [rbp]           ; Atomic-decrement the # of active processes.
  jmp exit_success
 
found_work:

Once there are no more unclaimed tasks to claim, we’re going to successfully terminate the worker. But first, we need to decrement the number of active workers.

`lock dec` #

By now you can probably guess that lock dec is a version of the dec (decrement) operation which will ensure that no other worker can decrement the number-of-active-workers variable at the same time (e.g. the last two workers reading the value 2, decrementing it, and both writing back 1 and exiting, with nobody left to decrease it to 0).
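The lost update in that parenthetical can be replayed deterministically. The following Python sketch does not use real concurrency; it writes out the unlucky interleaving by hand so the arithmetic is visible. Both function names are hypothetical.

```python
def plain_dec_interleaved(count):
    """Two workers each do read / subtract / write, in the worst order."""
    a = count          # worker A reads the count
    b = count          # worker B reads the same value before A writes
    count = a - 1      # A writes back its result
    count = b - 1      # B overwrites it with a result based on a stale read
    return count

def locked_dec_twice(count):
    """With lock dec, each read-modify-write finishes before the next begins."""
    count -= 1         # worker A's atomic decrement
    count -= 1         # worker B's atomic decrement
    return count
```

Starting from 2, the interleaved version ends at 1, so the parent would wait forever; the atomic version ends at 0.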

What should have been done differently
(if this weren’t just an example) #

It’s worth pointing out that for this particular problem, I did a lot of things here that don’t actually make so much sense.

There’s no particular reason to allow multiple simultaneous invocations of the whole program on the same file. If that requirement is relaxed, then it makes a lot more sense to divide up the work by starting each worker at a different offset and having them all skip $n$ tasks ahead when they finish (e.g. with $n=7$ workers, the seventh worker would take the $(7k+6)$th task for every integer $k$).
Even with the requirement in question, there would be more efficient ways to divide up tasks—for instance, instead of trying to claim every task in order, workers could maintain a second bookkeeping word which would track the address of the current next-unclaimed-task.
Tasks should have been rather larger than single 8-byte machine words; the coordination overhead for tasks at this fine granularity is unlikely to pay off.
The modular exponentiation could have been implemented more efficiently.
In fact, since the result is just a single 235-byte pattern that repeats over and over, I could have just computed it once and repeatedly written it into the file. (Since this would be a primarily storage-bound operation, there wouldn’t even be much sense in parallelizing it.)
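The first alternative in the list, a static strided division of tasks, is simple enough to sketch in Python. The function name and the task count of 100 are hypothetical, and workers and tasks are numbered from zero.

```python
def strided_tasks(worker, num_workers, num_tasks):
    """Tasks for `worker`: worker, worker + n, worker + 2n, ... below num_tasks."""
    return list(range(worker, num_tasks, num_workers))

NUM_WORKERS = 7
NUM_TASKS = 100
assignment = [strided_tasks(w, NUM_WORKERS, NUM_TASKS)
              for w in range(NUM_WORKERS)]
```

Every task lands with exactly one worker and no shared state is touched while dividing the work, which is why this scheme needs no locked instructions at all. The price is that it cannot accommodate a second invocation joining halfway through.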

But hey, now we know how to write concurrent x64 programs using memory-mapped files.

Conclusion #

In whichever abstraction we’re working, if we’re doing concurrent processing on an Intel platform, it may be worth considering how the abstraction resolves down to concepts like these. See if your platform exposes a thing like mmap(), for instance, and consider how your “concurrency primitives” might translate into individual locked operations. This will assist in reasoning about performance issues, as well as providing a deeper understanding of your concurrency primitives’ guarantees.

And, of course, make sure that you’ve given assembly a big check-mark under “Has Concurrency Primitives?” on your personal programming-environment scorecard.

This may be counterintuitive if you think of assembly as a conspicuously old-school way to program. I won’t deny that it is, but nasm, DynASM, r2, and the other tools I use for assembly hacking are relentlessly kept in sync with Intel’s assembly-language specification, which is updated in advance of every new CPU release. Other tools take much longer to adapt because, well, Intel doesn’t specify exactly how they should make use of new features. So, in fact, the latest hardware is supported in assembly before it’s supported anywhere else.↩
If I had been doing serious work, instead of using a flat binary file, I would be using Cap’n Proto, so the bookkeeping field(s) would be well-delineated. Perhaps in a future article, I’ll show how to do that from assembly. Then, instead of hexdump, I’d be using capnp to explore the data. But hexdump is a quite versatile tool and nice to know anyway.↩
Pun not intended.↩
The code displayed here violates this rule pretty badly, which is probably why the speedup from running in parallel is noticeably worse than ideal, but I think to do better would overcomplicate the presentation.↩
I could (should?) have used command-line arguments for these values, but let’s face it, parsing command-line arguments is annoying in any language, let alone assembly.↩
There are some applications for which you might wish to compare-and-swap two machine words (often, two pointers) in a single atomic operation. This can be done using the lock cmpxchg16b instruction (note: the 16 bytes still have to be contiguous in memory, and in fact must be 16-byte-aligned).↩

The Security/Product Design Correspondence

2014-03-16T09:55:30-04:00

If programming is the art of adding functionality to computers, security is the art of removing it.

This maxim is a bit unfair to the deep and wonderful world of information security (InfoSec), but it has a point. A lot of essential concepts in InfoSec have natural opposites in software product design.

Let’s start at the top. Every professional software project begins with specifications. In product design, the specifications are called use cases: stories about an external agent who wants to perform some function, and how they would go about performing the function using your software. In InfoSec, the specifications are called threats. These are also stories about an external agent who wants to perform some function, and how they would go about performing the function using your software. The difference is, in product design, you want to make the agent’s job as easy as possible, while in InfoSec, you want to make it as hard as possible. We also have these related correspondences:

Use case model ⇔ Threat model
User ⇔ Attacker
User interface ⇔ Attack surface
Interaction ⇔ Protocol
Affordance ⇔ Vulnerability

In product design, the goal is to address all use cases with a set of features. The correspondence between a use case model and a feature set is nontrivial, and translating use cases into features is arguably the core of the product designer’s job. Meanwhile in InfoSec, the next step is to address all threats with a set of claims; the correspondence between a threat model and a set of security claims is nontrivial in the same sense. Both involve many assumptions about what the user/attacker is willing and able to do, and guesses about the best way to enable/prevent them from achieving their objectives, drawing on a lot of experience and patterns observed in the field with both well-designed and badly-designed products/security systems.

The most common features and most common security claims are also related:

A view/display/read feature, enabling a user to access a record of information, is the opposite of a confidentiality claim, guaranteeing that an attacker cannot access information.
A modify/update feature, enabling a user to edit a record of information, is the opposite of an integrity claim, guaranteeing that an attacker cannot modify information without detection.
A create feature, enabling a user to add a new record, is the opposite of an authenticity claim, guaranteeing that an attacker cannot create a new record.
A delete/remove feature, enabling a user to destroy a record, is the opposite of a non-repudiation claim, guaranteeing that an attacker cannot credibly deny the existence of information once it is entered into the system.

This correspondence is essentially perfect for confidentiality and integrity; authenticity and non-repudiation are a little more subtle. Just as any system which supports both creation and deletion technically supports modification (since a user can delete a record and then add back a modified version), any system which provides authenticity and non-repudiation also provides integrity.
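To make one of these claims tangible, here is a minimal Python sketch of an integrity check built from a standard primitive, HMAC-SHA256. It is an illustration under assumptions: the key and the record contents are invented, and key management is ignored entirely. Note that a shared-key MAC supports integrity (and authenticity among key holders) but not non-repudiation, since anyone holding the key could have produced the tag.

```python
import hashlib
import hmac

KEY = b"hypothetical shared secret"    # assumption: known only to legitimate parties

def tag(record: bytes) -> bytes:
    """Compute an authentication tag to store alongside the record."""
    return hmac.new(KEY, record, hashlib.sha256).digest()

def verify(record: bytes, mac: bytes) -> bool:
    """Return True only if the record still matches its tag."""
    return hmac.compare_digest(tag(record), mac)

record = b"balance=100"
mac = tag(record)
tampered = b"balance=900"              # an attacker's "modify" operation
```

The modify feature still exists at the byte level, but an attacker without the key cannot produce a matching tag, so the modification is detected.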

One place where InfoSec and product design overlap is availability. The product design version of availability is that a user wishes to access our system through some communications channel, and it must be able to respond. The InfoSec version is that an attacker wishes to cause our system to stop responding to legitimate users (usually, though not always, via denial of service techniques), and the attacker must be unable to do this.

Availability is commonly listed beside confidentiality and integrity as one of the “three core goals” of information security, but it is really a different kind of thing. It’s sometimes possible to get all four of the other security claims listed above simply by careful application of off-the-shelf cryptographic primitives, but there are no such cryptographic solutions for availability. The closest thing to a magic availability solution is massive scale, with redundant nodes all over the planet ready to take up the slack if other nodes stop responding. (BitTorrent and Bitcoin both fall into this category.) However, truly high availability requires a dedicated 24x7 staff equipped to respond to emerging threats. It is probably best to let someone else handle that.

You may also come across the words authorization and authentication connected with some of the above. These are issues without clear product design correspondences (except insofar as products are designed to provide them in their InfoSec senses). Like trust and risk, they also tend to be intricately tied up in human affairs. These terms, along with the basic categories of cryptographic primitives, will be treated in future InfoSec articles.

Systems Past: the only 8 software innovations we actually use

2014-03-12T19:21:49-04:00

Note: This is a position piece, not a technical article. Hat tip to Jake Skelcy for requesting such a piece.

Computers didn’t always have operating systems. The earliest machines, like the Harvard Mark I and the EDVAC, performed one “computation” at a time. Whenever a computation finished, with its output printed by a teletypewriter or recorded on a magnetic tape, the machine would shut down. A person would then have to notice the machine stopped, unload the output, set up a new computation by manually loading the input and program instructions, and finally, press the start button to get the machine cranking again. On the Harvard Mark I, for instance, restarting would involve separately turning on multiple electric motors and then pressing a button marked MAIN SEQUENCE.

This is the context in which the programming language (PL) and the operating system (OS) were invented. The year was 1955. Almost everything since then has been window dressing (so to speak). In this essay, I’m going to tell you my perspective on the PL and the OS, and the six other things since then which I consider significant improvements, which have made it into software practice, and which are neither algorithms nor data structures (but rather system concepts). Despite those and other incremental changes, to this day¹, we work exclusively² within software environments which can definitely be considered programming languages and operating systems, in exactly the same sense as those phrases were used almost 60 years ago. My position is:

Frankly, this is backward, and we ought to admit it.
Most of this stuff was invented by people who had a lot less knowledge and experience with computing than we have accumulated today. All of it was invented by people: mortal, fallible humans like you and me who were just trying to make something work. With a solid historical perspective we can dare to do better.

1. The Programming Language #

Year: 1955

Archetype

Every programming language used today is descended from FORTRAN³. FORTRAN is an abbreviation of FORmula TRANslator, and its mission was to translate typewritten algebraic formulae into executable code.

Motivation

Most uses of computers involved numerical calculations, which would be translated from equation form into machine code by hand (naturally, a time-consuming process). Multiple people (including Grace Hopper, John Backus, and Alick Glennie) realized that the computer could be used to automate such translations, and the result was the programming language.

Concept

A programming language is a piece of software that automatically translates a specially formatted block of linear text into executable code.

It is bizarre that we’re still expressing programs entirely with text 59 years later, given that the first interactive graphical display appeared only 4 years after the programming language itself⁴.

Benefits

The existence of programming languages enabled the use of concise notation for complex ideas, also known as abstraction. This not only saves time, but also makes programs easier to understand and maintain.

Exemplars

Drawbacks

FORTRAN’s conflation of functions (an algebraic concept) and subroutines (a programming construct) persists to this day in nearly every piece of software, and causes no end of problems. Tracing compilers scratch the surface of reversing this mistake, but so far I know of no programming languages that are specifically designed around such a mechanism.
The fact that inputs had to be loaded into computers as stacks of punched cards limited the possible means of expressing computations – lines of text.

2. The Operating System #

Year: 1955

Archetype

The General Motors/North American Aviation Monitor was arguably the “original” OS.

Motivation

The typical mode of operation was programmer present and at the operating console. When a programmer got ready for a test, he or she signed up on a first-in, first-out list, much like the list at a crowded restaurant. The programmer then checked progress frequently to estimate when he would reach the top. When his time got close, he stood by with card deck in hand. When the previous person finished or ran out of allotted time or abruptly crashed, the next programmer rushed in, checked the proper board was installed in the card reader, checked that the proper board was installed in the printer, checked that the proper board was installed on the punch, hung a magnetic tape, punched in on a mechanical time clock, addressed the console, set the appropriate switches, loaded his punched card deck in the card reader, prayed the first card would not jam, and pressed the LOAD button to invoke the bootstrap sequence.

If all went well, you could load a typical deck of about 300 cards and begin the execution of your first instruction about 5 minutes after entering the room. If only one person did all this set up and got going in 5 minutes, he bustled around the machine like a whirling dervish [sic]. Not always did things go so smoothly. If a programmer was fumble-fingered, cards jammed, magnetic tapes would not read due to defective splices, printer boards or switches were incorrectly set up, and it took 10 minutes to get going; or worse – you lost your opportunity and the next person took the machine when your time ran out. Usually the machine spent more time idle than computing. We programmers weren’t paid very much and although the machine was fairly costly, its capacity was even a more precious commodity since there were only 17 in the whole world.

(source)

Concept

An operating system is a piece of software that facilitates the execution of multiple independent programs on one computer, using standard input and output routines.

There’s a deep connection between the OS concept and the PL concept: the OS facilitates the execution of independent programs, while the PL facilitates the execution of independent modules or subroutines. In fact, GM/NAA OS was literally a modification of the octal code of the FORTRAN compiler tape.

The bizarre thing about operating systems is that we still accept unquestioningly that it’s a good idea to run multiple programs on a single computer with the conceit that they’re totally independent. Well-specified interfaces are great semantically for maintainability. But when it comes to what the machine is actually doing, why not just run one ordinary program and teach it new functions over time? Why persist for 50 years the fiction that every distinct function performed by a computer executes independently in its own little barren environment?

Benefits

Multiple programs could be run in a “batch,” thus keeping the machine from ever being idle (except in case of hardware failure or an empty job queue).
Programmers could now use standard input and output routines. (Depending on the formatting requirements and particular peripherals in use, properly handling input and output could previously have consumed most of the programming effort for simple jobs.)
Bare-hands reconfiguration of hardware (e.g. plugboards) finally disappeared from the work of programming.
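
To make the batch idea concrete, here is a toy sketch in Python. It is not a model of the GM/NAA Monitor: the names run_batch, read_card, and print_line are hypothetical, and each "job" is just a Python function that is handed the two standard I/O routines instead of driving the card reader and printer itself.

```python
from collections import deque

def run_batch(jobs, cards):
    """Run each job to completion, one after another, with shared I/O routines.

    `jobs` is a list of callables taking (read_card, print_line);
    `cards` is the input deck shared by the whole batch.
    """
    deck = deque(cards)
    printout = []

    def read_card():
        # Standard input routine: the next card, or None once the deck is empty.
        return deck.popleft() if deck else None

    def print_line(text):
        # Standard output routine: every job prints through the same code path.
        printout.append(str(text))

    for job in jobs:
        try:
            job(read_card, print_line)
        except Exception as err:
            # A crashing job ends early; the monitor simply moves on to the next one.
            print_line(f"JOB ABORTED: {err}")
    return printout

def add_two_cards(read_card, print_line):
    print_line(int(read_card()) + int(read_card()))

def bad_job(read_card, print_line):
    raise RuntimeError("card jam")

def echo_card(read_card, print_line):
    print_line(read_card())
```

Calling run_batch([add_two_cards, bad_job, echo_card], ["2", "3", "hello"]) returns ['5', 'JOB ABORTED: card jam', 'hello']: the machine never sits idle between jobs, and nobody has to rewire anything.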

Exemplars⁵

CP/M
ProDOS

Drawbacks

Programs expect to use the entire machine, because that’s how programs were run previously and that’s what the programmers were used to. The operating system must therefore isolate programs from each other (in the simplest/earliest cases, by running each job to completion or termination before loading the next).

3. Interactivity #

Year: 1958

Archetype

The TX-0 machine, one of the first transistorized computers, was installed at MIT in summer of 1958. The TX-0 had a monitor (a 512x512 CRT display), a keyboard, and a pointing device (a light pen), making it probably the first computer with a physical interface that we might recognize today. It also happens to be the machine which spawned hacker culture.

Motivation

The TX-0 was a scaled-down (transistorized) offshoot of an Air Force project called SAGE, with the ambitious goal of an electronic, automated, networked missile defense and early warning radar system. The development of interactive display computing had three main causes in this context:

it was a natural successor to the analog radar display
the on-line nature of the task demanded real-time human interaction
the importance of the task meant that funding was no object, so an entire computer (in fact, the largest and most expensive computer system ever made) could be “wasted” on providing such interactivity

Because of its transistorized circuitry, the TX-0 needed very little maintenance or oversight, and for years was left unattended at MIT for pretty much anybody to use at any time, resulting in a great flourishing of interactive programs (many of whose names began with the word “Expensive,” in an acknowledgment of the absurdity of a $3M machine being available for such experimentation).

Concept

An interactive program is one which consumes input after producing output. Prior to SAGE, once a program produced its output, it was done, and the machine would halt or move on to the next job. What distinguishes an interactive system is that it will produce some output and then wait until more input is available.

Benefits

It became possible to do creative work at a computer.

Exemplars

Drawbacks

“Waiting” is poorly specified. If a program is waiting for one kind of input, what if a different kind of input arrives instead? It will fail to respond until the kind of input it was expecting appears. This problem continues to crop up in graphics programming, network programming, and other areas.
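
A small Python sketch of the difference, under heavy simplification: a deque of (kind, payload) pairs stands in for input arriving over time, and the function names wait_for and dispatch_all are made up for illustration. The first function waits for one kind of input and passes over everything else without a response; the second responds to whatever arrives.

```python
from collections import deque

def wait_for(kind, events):
    """Naive waiting: nothing happens until an event of `kind` shows up."""
    skipped = []
    while events:
        event = events.popleft()
        if event[0] == kind:
            return event, skipped
        skipped.append(event)   # this input got no response at all
    return None, skipped

def dispatch_all(events, handlers):
    """Event-loop style: respond to each input as it arrives, whatever its kind."""
    responses = []
    while events:
        kind, payload = events.popleft()
        handler = handlers.get(kind)
        if handler is not None:
            responses.append(handler(payload))
    return responses
```

With the input [("mouse", (3, 4)), ("key", "q")], wait_for("key", ...) only ever reacts to the keystroke, while dispatch_all with a handler per kind reacts to both.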

4. Transactions #

Year: 1959

Archetype

Before computerization, American Airlines’ booking process was labor-intensive and slow. IBM realized that the basic idea behind SAGE could be applied to solve the airline reservation problem, resulting in SABRE. The core of the SABRE operating system later became known as TPF (Transaction Processing Facility).

Motivation

American wanted a system with 1,500 booking terminals across the US and Canada all linked by modem to a central reservations computer. But what if two terminals try to book the last seat on a flight at the same moment? A system like this needs strong guarantees on consistency.

Concept

Transactions are operations each guaranteed either to fail without any effect, or to run in a definite, strict order. Lots of terminals may attempt to input transactions, but every terminal must observe the same consistent state of the system, including a global transaction log listing each transaction in the precise order in which it was applied.
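
Here is a minimal sketch of the last-seat problem in Python. It says nothing about how SABRE was actually built; SeatLedger and race_for_last_seat are hypothetical names, and a single in-process lock stands in for whatever mechanism imposes the strict order.

```python
import threading

class SeatLedger:
    def __init__(self, seats_available):
        self._lock = threading.Lock()
        self._seats = seats_available
        self.log = []   # global transaction log, in the order applied

    def book(self, terminal_id):
        """Either book one seat and record it in the log, or fail with no effect."""
        with self._lock:   # one transaction at a time gives a definite order
            if self._seats == 0:
                return False
            self._seats -= 1
            self.log.append(terminal_id)
            return True

def race_for_last_seat(n_terminals=8):
    """Let several terminals try to book the single remaining seat at once."""
    ledger = SeatLedger(seats_available=1)
    results = {}

    def terminal(tid):
        results[tid] = ledger.book(tid)

    threads = [threading.Thread(target=terminal, args=(t,)) for t in range(n_terminals)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return ledger, results
```

However the threads interleave, exactly one terminal gets True, and the log contains exactly that terminal.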

Benefits

This one core idea enabled the development of systems called databases, which can reliably maintain the state of complex data structures across incessant read and write operations as well as some level of hardware failures.
Modern filesystems are “journaled”, which means that they implement transactions.
Transactions are also the key idea behind version control systems, which are increasingly adopted in all corners of the software world. In that context, they are called “commits”.
Most recently, the core of crypto-currencies is a crude but clever solution to a distributed transaction processing problem. (In this context, transactions are in fact called transactions.)

Exemplars

Drawbacks

Trades performance for correctness. In some contexts, an occasional incorrect result is not as much of a problem as overall throughput.

5. Garbage Collection #

Year: 1960

Archetype

All garbage-collected environments owe a debt to Lisp, the first to provide such a facility.

Motivation

Previously, programs required the manual management of the memory resource; the programmer had to anticipate when the program would need access to more memory, and ensure that the program wouldn’t consume all the memory on the machine by not re-using memory locations that hold no-longer-needed data.

Concept

A garbage collector (GC) is a piece of software which maintains a data structure representing available memory, and marks a given memory location as available whenever it is no longer being referred to.
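
As a rough illustration, here is a toy mark-and-sweep collector in Python. Everything about it is an assumption made for the example (the ToyHeap class, fixed-size cells that hold only references, an explicit roots set); real collectors are far more involved.

```python
class ToyHeap:
    def __init__(self, size):
        self.cells = [None] * size      # a live cell holds the list of cell indices it refers to
        self.free = list(range(size))   # the data structure representing available memory
        self.roots = set()              # cells the program can reach directly

    def allocate(self, refs=()):
        """Take a cell off the free list; the caller runs collect() when this fails."""
        if not self.free:
            raise MemoryError("toy heap exhausted; try collect()")
        index = self.free.pop()
        self.cells[index] = list(refs)
        return index

    def collect(self):
        """Mark everything reachable from the roots, then free the rest."""
        marked = set()
        stack = list(self.roots)
        while stack:
            index = stack.pop()
            if index in marked:
                continue
            marked.add(index)
            stack.extend(self.cells[index])
        # Sweep: any allocated cell that was not marked becomes available again.
        for index, cell in enumerate(self.cells):
            if cell is not None and index not in marked:
                self.cells[index] = None
                self.free.append(index)
        return len(self.free)
```

If a, b, and c are allocated in a 4-cell heap, with b referring to a and only b in roots, then collect() frees c (nothing refers to it) and keeps a and b.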

Benefits

The programmer doesn’t have to think about allocating and deallocating memory in order to make a working program.

Exemplars

Drawbacks

Performance becomes unpredictable due to variable GC pause times⁶.
Memory usage becomes unpredictable due to variable GC effectiveness and potential reference leaks.

6. Virtualization #

Year: 1961

Archetype

The Atlas Supervisor, developed at the University of Manchester in 1961, has been called “the first recognizable modern operating system” and “the most significant breakthrough in the history of operating systems”⁷.

Motivation

System builders wanted the capability to run multiple programs at once, mostly for the following reason:

Whilst one program is halted, awaiting completion of a magnetic tape transfer for instance, the coordinator routine switches control to the next program in the object program list which is free to proceed.

However, as mentioned earlier, programs were (and still are!) written in such a way as to assume they have a machine all to themselves. Thus, to bridge the gap, we need to provide such programs with a “virtual” environment which they do have all to themselves.

Concept

Virtualization is a general term for software facilities (possibly supported by hardware acceleration) to run programs as if they each have a computer all to themselves. Common forms include:

Virtual memory is a mechanism to translate “virtual” addresses into fetch commands against physical data stores, in such a way that each program has a whole “virtual” computer to itself, despite sharing physical memory.
A virtual machine (VM) is a relatively fast bytecode interpreter which does not enable programs to directly execute instructions on the physical machine.
In full virtualization, a virtual machine exposes the entire host machine instruction set, thus enabling native programs to run within a VM.
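
The address-translation part of virtual memory can be sketched in Python as follows. This is a deliberately tiny model with assumed names (AddressSpace, PAGE_SIZE) and no eviction: each program has its own page table and backing store, but all programs share one list of physical frames.

```python
PAGE_SIZE = 4

class AddressSpace:
    def __init__(self, physical_frames, backing_store):
        self.frames = physical_frames   # shared list of frames (a list of values, or None if unused)
        self.backing = backing_store    # this program's pages in slow storage: {page_number: [values]}
        self.page_table = {}            # virtual page number -> physical frame number
        self.faults = 0

    def read(self, virtual_address):
        page, offset = divmod(virtual_address, PAGE_SIZE)
        if page not in self.page_table:
            self._page_in(page)         # page fault: fetch the page only now that it is needed
        return self.frames[self.page_table[page]][offset]

    def _page_in(self, page):
        self.faults += 1
        # First unused frame; raises ValueError when memory is full (no eviction in this sketch).
        frame = self.frames.index(None)
        self.frames[frame] = list(self.backing[page])
        self.page_table[page] = frame
```

Two AddressSpace objects built over the same frames list can both read virtual address 0 and get different answers, because each one translates the address through its own page table.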

Benefits

Virtual memory makes it possible to only copy data from slow tiers of storage into fast tiers of storage if and when that “page” of data is needed.
Virtual memory makes it possible to persist data directly from volatile storage into nonvolatile storage “in the background,” without special handling.
Virtual memory makes it possible for processes to “share” memory without out-of-band communication.
VMs have relatively strong security guarantees; because all programs become paths through an interpreter, one need only show that the interpreter is safe to confirm that running arbitrary code within the VM is safe.
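
The security argument in the last point is easier to see with a toy interpreter in front of you. The sketch below is a made-up three-instruction stack machine (run_bytecode, PUSH, ADD, and MUL are all invented for the example); the point is only that a guest program can do nothing the interpreter loop does not explicitly implement.

```python
def run_bytecode(program):
    """Interpret a list of (opcode, argument) pairs on a private stack."""
    stack = []
    for op, arg in program:
        if op == "PUSH":
            if not isinstance(arg, int):
                raise ValueError("PUSH needs an integer")
            stack.append(arg)
        elif op in ("ADD", "MUL"):
            if len(stack) < 2:
                raise ValueError(f"{op} needs two operands")
            right, left = stack.pop(), stack.pop()
            stack.append(left + right if op == "ADD" else left * right)
        else:
            # Anything outside the instruction set is refused, not executed.
            raise ValueError(f"unknown opcode: {op!r}")
    return stack
```

For example, run_bytecode([("PUSH", 2), ("PUSH", 3), ("ADD", None), ("PUSH", 4), ("MUL", None)]) returns [20], while a program containing ("OPEN_FILE", "/etc/passwd") raises ValueError, because no such instruction exists in this machine.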

Exemplars

Multics (virtual memory)
Plan 9 (unparalleled uniformity between volatile, nonvolatile, and network storage)
Xen (full virtualization)
LuaJIT (VM)

Drawbacks

Virtual memory tries so hard to stay out of the programmer’s way that most programmers don’t even have a clear idea of what it is. As a result, its capabilities tend to be underused.
Virtual memory should have been extended to network resources, but this has not really happened.
As usually implemented, virtual memory subtly encourages the development of programs that do not talk to each other, because they are all pretending to exist in an isolated virtual memory space.

7. Hypermedia #

Year: 1968

Archetype

Doug Engelbart’s NLS introduced implementations of:

hypertext links
markup language
document version control
videoconferencing
email with hypermedia
hypermedia publishing
flexible windowing modes

Motivation

Augmenting Human Intellect

By “augmenting human intellect” we mean increasing the capability of a man to approach a complex problem situation, to gain comprehension to suit his particular needs, and to derive solutions to problems. Increased capability in this respect is taken to mean a mixture of the following: more-rapid comprehension, better comprehension, the possibility of gaining a useful degree of comprehension in a situation that previously was too complex, speedier solutions, better solutions, and the possibility of finding solutions to problems that before seemed insoluble. And by “complex situations” we include the professional problems of diplomats, executives, social scientists, life scientists, physical scientists, attorneys, designers–whether the problem situation exists for twenty minutes or twenty years. We do not speak of isolated clever tricks that help in particular situations. We refer to a way of life in an integrated domain where hunches, cut-and-try, intangibles, and the human “feel for a situation” usefully co-exist with powerful concepts, streamlined terminology and notation, sophisticated methods, and high-powered electronic aids.

Existing, or near-future, technology could certainly provide our professional problem-solvers with the artifacts they need to have for duplicating and rearranging text before their eyes, quickly and with a minimum of human effort. Even so apparently minor an advance could yield total changes in an individual’s repertoire hierarchy that would represent a great increase in over-all effectiveness. Normally the necessary equipment would enter the market slowly; changes from the expected would be small, people would change their ways of doing things a little at a time, and only gradually would their accumulated changes create markets for more radical versions of the equipment. Such an evolutionary process has been typical of the way our repertoire hierarchies have grown and formed.

But an active research effort, aimed at exploring and evaluating possible integrated changes throughout the repertoire hierarchy, could greatly accelerate this evolutionary process.

Concept

Hypermedia refers to any communications medium which comprises interactive systems. The most popular forms of hypermedia are those employing hyperlinks: certain elements of a viewed object which can be activated through interaction and whose activation triggers the display of a different object, which is determined by the hyperlink and possibly also by the interaction. For example, the World Wide Web is a form of hypermedia (hypertext), though even HTML5 is not nearly as capable as hypermedia pioneers like Ted Nelson and Doug Engelbart had probably hoped.

Benefits

Makes nonlinear communication/expression much easier
A continuum between hypermedia authoring and program authoring eases more people into being able to craft programs to solve their own problems, which is good for freedom
Could enable people to organize their own thoughts and lives more elegantly and smoothly

Exemplars

Drawbacks

It’s easy to implement bad hypermedia, like HTML.
If a software company makes good enough hypermedia, like HyperCard, it will be quickly discontinued since it will threaten the rest of the company’s product line.

8. Internetworking #

Year: 1969

Archetype

ARPAnet is the quintessential computer network. It was originally called “the Intergalactic Computer Network” and ultimately became known as simply “the Internet”.

Motivation

We had in my office three terminals to three different programs that ARPA was supporting. One was to the Systems Development Corporation in Santa Monica. There was another terminal to the Genie Project at U.C. Berkeley. The third terminal was to the C.T.S.S. project that later became the Multics project at M.I.T.

The thing that really struck me about this evolution was how these three systems caused communities to get built. People who didn’t know one another previously would now find themselves using the same system. Because the systems allowed you to share files, you could find that so-and-so was interested in such-and-such and he had some data about it. You could contact him by e-mail and, lo and behold, you would have a whole new relationship.

It wasn’t a static medium. It was a dynamic medium. And that gave it a lot of power.

There was one other trigger that turned me to the ARPAnet. For each of these three terminals, I had three different sets of user commands. So if I was talking online with someone at S.D.C. and I wanted to talk to someone I knew at Berkeley or M.I.T. about this, I had to get up from the S.D.C. terminal, go over and log into the other terminal and get in touch with them.

I said, oh, man, it’s obvious what to do: If you have these three terminals, there ought to be one terminal that goes anywhere you want to go where you have interactive computing. That idea is the ARPAnet.

–Bob Taylor (source), ARPA IPTO director

Concept

An internetwork is a set of communications channels between computers, where each computer is running a service that routes incoming messages to some other communications channel, so that each message eventually reaches its addressee. “Messages,” in this context, are generally termed “packets” (and they generally reach their destination within less than a hundred “hops”).
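
Hop-by-hop forwarding can be sketched in Python like this. The routing tables here are plain dictionaries written by hand, and deliver is a hypothetical helper; nothing in it corresponds to a real protocol implementation.

```python
def deliver(routing_tables, source, destination, max_hops=100):
    """Forward a packet hop by hop; return the list of nodes it visited.

    `routing_tables[node][destination]` names the neighbour that `node`
    forwards to for packets addressed to `destination`.
    """
    path = [source]
    node = source
    while node != destination:
        if len(path) - 1 >= max_hops:
            raise RuntimeError("hop limit exceeded")
        next_hop = routing_tables[node].get(destination)
        if next_hop is None:
            raise LookupError(f"{node} has no route to {destination}")
        node = next_hop
        path.append(node)
    return path
```

With tables {"A": {"D": "B"}, "B": {"D": "C"}, "C": {"D": "D"}, "D": {}}, deliver(tables, "A", "D") returns ["A", "B", "C", "D"]. No single node knows the whole path; each only knows the next hop. The hop limit is there so that a routing loop fails loudly instead of circulating forever.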

Benefits

Global instant email
Global instant hypertext
Global database-backed applications
Global file sharing

Exemplars

Internet Protocol

Drawbacks

Classical internetworking has no built-in economic component; arrangements between large networks must be negotiated “out of band” and encoded in a rather nasty form called BGP. As a result of this, individual people or even moderately large corporations usually cannot internetwork, but must instead purchase access to the Internet. As a result of this, most communications systems around the world are controlled by unjust oligopolies, with high barriers to competition and low barriers to various abuses of power.

Conclusion

I find that all the significant concepts in software systems were invented/discovered in the 15 years between 1955 and 1970. What have we been doing since then? Mostly making things faster, cheaper, more memory-consuming, smaller, cheaper, dramatically less efficient, more secure⁸, and worryingly glitchy. And we’ve been rehashing the same ideas over and over again. Interactivity is now “event-driven programming”. Transactions are now “concurrency primitives”. Internetworking is now “mesh networking”. Also, we have tabbed browsing now, because overlapping windows were a bad skeuomorphism from the start, and desktop notifications, because whatever is all the way in the corner of your screen is probably not very important. “Flexible view control” is relegated to the few and the proud who run something like xmonad or herbstluftwm on their custom-compiled GNU/Linux.

Many good programs have been written. Lots of really important algorithms and data structures have been invented (though usually not implemented in practice). Hardware has made so much progress. In the 1960s, a lot of good ideas were tossed out because they ran too slow, but here in 2014 everything is written in Python anyway, so let’s bring back the good old days, but now with Retina screens and multi-core gigahertz processors and tens of gigabytes of core memory. Let’s take that 20% performance hit over hand-coded assembler that was unacceptable in the 1960s, because it’s a 10x improvement over what we’re doing now.

Most of all, let’s rethink the received wisdom that you should teach your computer to do things in a programming language and run the resulting program on an operating system. A righteous operating system should be a programming language. And for goodness’ sake, let’s not use the entire network stack just to talk to another process on the same machine which is responsible for managing a database using the filesystem stack. At least let’s use shared memory (with transactional semantics, naturally – which Intel’s latest CPUs support in hardware). But if we believe in the future – if we believe in ourselves – let’s dare to ask why, anyway, does the operating system give you this “filesystem” thing that’s no good as a database and expect you to just accept that “stuff on computers goes in folders, lah”? Any decent software environment ought to have a fully featured database, built in, and no need for a “filesystem”.

Reject the notion that one program talking to another should have to invoke some “input/output” API. You’re the human, and you own this machine. You get to say who talks to what when, why, and how if you please. All this software stuff we’re expected to deal with – files, sockets, function calls – was just invented by other mortal people, like you and me, without using any tools we don’t have the equivalent of fifty thousand of. Let’s do some old-school hacking on our new-school hardware – like the original TX-0 hackers, in assembly, from the ground up – and work towards a harmonious world where there is something new in software systems for the first time since 1969.

To be continued…

Since then, Smalltalk (SqueakNOS), Forth (colorForth), and Lisp (Genera) have all flirted with becoming operating systems, and Oberon was designed to be one from the start. But none achieved economic success, for the simple reason that none of the projects involved attempted to provide value to people. They solved technical problems to validate that their concepts can work in the real world, but did not pursue the delivery of better solutions to real-world problems than would otherwise be possible.↩
Serious embedded systems people who write machine code from scratch, this is your time to gloat. You truly deserve the title of engineer. In fact, chances are good that you hold the title “electrical engineer”. Chances are also good that whatever you engineer isn’t computers, so hear me out. On the off-chance that you are an embedded systems person who writes machine code from scratch and you do make computers or computer parts, chances are good that you are (a) the bane of some free software driver author’s existence, and/or (b) providing an incredibly hard-to-detect hideout for really clever malware. Please compel your employers to publish technical documentation freely and to use ROMs in place of FLASH so that malware can’t take over your lovingly crafted code. Now, back to our regularly scheduled tirade.↩
Yes, there are exceptions, but they’re not the ones you think. The exceptions are those derived from the work of Cliff Shaw (e.g. PLANNER, Prolog, M), those derived from APL (e.g. J, K, and arguably the UNIX shell/pipeline environment), the COMIT family (e.g. SNOBOL), and the curious corner case of Inform 7. Lisp was inspired by FORTRAN (source). ISWIM (which some programming language histories identify as the “root” of the ML family) is based on ALGOL 60 (source), which of course is based on FORTRAN. The Forth family (e.g. PostScript, Factor, Tcl via NeWS) was rooted in Lisp (source). Even COMIT was loosely inspired by FORTRAN (source). Some esolangs (“esoteric languages”, viz. languages not intended for serious use) like Befunge and Wierd are very much non-FORTRAN, but they are not seriously used by anyone. Machine code could simply be disqualified on the basis that it is not software (the subject of this article), but even all current machine languages feature stack operators, which derive from ALGOL via Burroughs Large Systems.↩
Yes, I’m aware of all this $#!*. If you want to point out that graphical programming languages exist, and they aren’t based on FORTRAN, well, they fall outside my definition of “programming language”, so there. Riddle me this: why does nobody who knows how to program in text ever want to use them? Why do they break down for anything that isn’t basically a signal processing task? Why don’t they have lambdas, zooming, or style? You know, style. Like Edward Tufte has. Style. Nobody wants to use an ugly visual programming language.↩
These are examples of non-multitasking OSes. Multitasking (as practiced today) requires a separate idea, which I cover in the section marked Virtualization.↩
This disadvantage can be mitigated significantly (or, with great effort, completely eliminated) by the careful use of incremental or concurrent garbage collectors.↩
(source)↩
I considered including cryptography as another major bullet point, but if I selected a particular algorithm (e.g. Diffie-Hellman-Merkle key exchange or the Merkle–Damgård hash function construction), then I’d have to include other important algorithms (I’ve left algorithms out of my list here; they’re not “software innovations” in the sense I mean), and if I selected “encrypted communications”, well, that surely predates computers. The fact that people started writing programs which encrypt communications is great, but it doesn’t change the software environment on the level I’m talking about. That said, I do consider the invention of the practical cryptographic hash a contender for most important innovation in computer science in the last 25 years; asymmetric cryptography is about equally important. Ralph Merkle really deserves more credit for having essentially conceived of both pillars of modern cryptography.↩

How I Think About Math,
Lecture 1: Relations

2014-03-10T11:05:40-04:00

See the slides (PDF). (You may want to use your PDF viewer’s presentation mode; there are a lot of pseudo-animations that could get annoying to scroll through.)

Update: Today, I drew up the field axioms in this notation. I’m almost to the point where I can define linearity!

Last week at Hacker School, I floated the idea of giving a presentation about linear algebra. Over a decade after taking it in college, I finally feel like I understand linear algebra well enough to express clearly, to an audience of programmers, most of the concepts from linear algebra that they might find useful.

I figured the very first thing to present would be the concept of linearity itself. After all, a linear operator is just any operator that commutes with addition and scalar multiplication. But wait– what is “commuting”? Well, no problem, “A and B commute” just means that composing A with B yields the same operator as composing B with A. But wait– what is “composing”? I could start my presentation by defining a category, but that would be unnecessarily scary given category theory’s fearsome reputation. Besides, John Baez showed me last week that categorical diagram notation has its boxes and arrows counterintuitively swapped. But wait– I could just use Baez’s new notation, instead! Then my entire discussion of linear algebra will be based on concrete, non-fearsome relations, instead of “morphisms.”

So…I got about as far as defining “commuting.” (Linear algebra will have to wait.)

* * *

Note: I’m skirting the edge of what Baez’s formalism actually allows; in his work so far, diagrams always depict morphisms, rather than logical assertions. I’m still working on the semantics of quantifiers in this notation, so it’s conceivable some of the examples in these slides will change as I learn more.

Python to Scheme to Assembly,
Part 1: Recursion and Named Let

2014-02-28T14:43:58-05:00

In 2001, my favorite programming language was Python. In 2008, my favorite programming language was Scheme. In 2014, my favorite programming language is x64 assembly. For some reason, that progression tends to surprise people. Come on a journey with me.

Python

In this article, we’re going to consider a very simple toy problem: recursively summing up a list of numbers¹.

def sum_list(list):
  if len(list) == 0:
    return 0
  else:
    return list[0]+sum_list(list[1:])

 >>> sum_list(range(101))
 5050

Young Carl Gauss would be proud.

 >>> sum_list(range(1001))
 RuntimeError: maximum recursion depth exceeded

Oops.

Young programmers often learn from this type of experience that recursion sucks. (Or, as a modern young programmer might say, it doesn’t scale.) If they Google around a bit, they might find the following “solution”:

 >>> import sys
 >>> sys.setrecursionlimit(1500)
 >>> sum_list(range(1001))
 500500

If they have a good computer science teacher, though, they’ll learn that the real solution is to use something called tail recursion. This is a somewhat mysterious, seemingly arbitrary concept. If the result of your recursive call gets returned immediately, without any intervening expressions, then somehow it “doesn’t count” toward the equally arbitrary recursion depth limit. Our example above isn’t tail-recursive because we add list[0] to sum_list(list[1:]) before returning the result. In order to make sum_list tail-recursive, we have to add an accumulator variable, which represents the sum of those numbers we’ve looked at already. We’ll call this version sum_sublist, and wrap it in a new sum_list function which calls sum_sublist with the initial accumulator 0 (initially, we haven’t looked at any numbers yet, so the sum of them is 0).

def sum_list(list):
  def sum_sublist(accum,sublist):
    if len(sublist) == 0:
      return accum
    else:
      return sum_sublist(accum+sublist[0],sublist[1:])
  return sum_sublist(0,list)

 >>> sum_list(range(101))
 5050

So far, so good.

 >>> sum_list(range(1001))
 RuntimeError: maximum recursion depth exceeded

Wait, what?

On Wednesday, April 22, 2009, Guido van Rossum wrote:

A side remark about not supporting tail recursion elimination (TRE) immediately sparked several comments about what a pity it is that Python doesn’t do this, including links to recent blog entries by others trying to “prove” that TRE can be added to Python easily. So let me defend my position (which is that I don’t want TRE in the language). If you want a short answer, it’s simply unpythonic. Here’s the long answer:

[snipped]

Third, I don’t believe in recursion as the basis of all programming. This is a fundamental belief of certain computer scientists, especially those who love Scheme…

[snipped]

Still, if someone was determined to add TRE to CPython, they could modify the compiler roughly as follows…

In other words, the only reason this doesn’t work is that Guido van Rossum² prefers it that way. Guido, I respect your right to your opinion, but the reader and I are switching to Scheme.

Scheme

Here’s a line-by-line translation:

(define (sum_list list)
  (define (sum_sublist accum sublist)
    (cond ((null? sublist)                 ; tests if sublist has length 0
           accum )                         ; don't need return statement in Scheme
          (else
           (sum_sublist (+ accum (car sublist)) (cdr sublist)) )))
  (sum_sublist 0 list) )

 guile> (sum_list (iota 1001))
 500500

Phew! Let’s make sure that we aren’t just getting lucky with a bigger recursion limit:

 guile> (sum_list (iota 10000001))
 50000005000000

Well, isn’t that neat? If we go much bigger, it’ll take a long time, but as long as the output fits into memory, we’ll get the right answer³.

Named Let

In our last two versions of sum_list, we defined a helper function (sum_sublist), and the rest of the body of sum_list was just a single invocation of that helper function. This is an inelegant pattern⁴, which Scheme has a construct to address.

(define (sum_list list)
  (let sum_sublist ((accum 0) (sublist list))  ; the named let!
    (cond ((null? sublist)
           accum )
          (else
           (sum_sublist (+ accum (car sublist)) (cdr sublist)) ))))

Named let creates a function and invokes it (with the provided initial values) in one step. It is decidedly my favorite control structure of all time. You can have your while loops and your for loops, and your do…until loops too⁵. I’ll take named let any day, because it provides the abstraction barrier of recursion without compromising the conciseness and efficiency of iteration. In case you’re not sufficiently impressed, I discuss the delightful properties of using recursion instead of non-recursive loops below.
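
For readers still thinking in Python, here is a sketch of what the named let's tail call amounts to once the function call is taken out of the picture: rebind the loop variables and jump back to the top. One deliberate deviation from the Scheme, and an assumption of mine rather than a translation: it walks an index instead of taking the "cdr", because sublist[1:] copies the rest of the list in Python.

```python
def sum_list(numbers):
    # The named let's bindings, (accum 0) and (sublist list);
    # `index` plays the role of sublist.
    accum, index = 0, 0
    while True:                       # the label sum_sublist
        if index == len(numbers):     # (null? sublist)
            return accum
        # The tail call (sum_sublist (+ accum (car sublist)) (cdr sublist))
        # becomes a simultaneous rebinding of the loop variables.
        accum, index = accum + numbers[index], index + 1
```

This version gets through sum_list(range(1001)) without touching the recursion limit, since no call frames pile up. What it loses relative to named let is the name: there is no sum_sublist to call from anywhere but the bottom of the loop.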

Assembly

Named let style translates amazingly naturally into assembly.

bits 64
; macros for readability
%define list rdi             ; by calling convention, argument shows up here
%define accum rax            ; accumulator (literally!)
%define sublist rdx
 
global sum_list
sum_list:
  mov accum, 0               ; these are the let-bindings!
  mov sublist, list
.sum_sublist:
  test sublist, sublist      ; is it NULL?
  jnz .else                  ; if not, goto else
  ret; accum                (because return value is rax by calling convention)
.else:
  add accum, [sublist]       ; ~ accum=accum+car(sublist);
  mov sublist, [sublist+8]   ; ~ sublist=cdr(sublist);
  jmp .sum_sublist           ; tail-recurse

> sum_list(from(1,100))
5050
> sum_list(from(1,10000000))
50000005000000

(Sadly, my assembler doesn’t come with its own REPL; we’re borrowing the LuaJIT REPL instead⁶.)

In fact, if I weren’t so comfortable with named let, I doubt I’d be an effective assembly coder, because assembly doesn’t really have any other iteration constructs⁷. But I don’t miss them. What would they look like, anyway?

In the next installment of Python to Scheme to Assembly, we will look at call-with-current-continuation.

Addendum: C

In this addendum, we’re going to look at the assembly for iteration, non-tail recursion, and tail recursion, as emitted by gcc, and get to the bottom of what the difference is anyway.

At the top of each C file here, we have the following:

#include <stdint.h>
struct number_list {
  uint64_t number;
  struct number_list *next;
};

Iteration

If I were solving this problem in the context of a C program, this is how I would do it.

uint64_t sum_list(struct number_list* list) {
  uint64_t accum = 0;
  while(list) {
    accum+=list->number;
    list=list->next;
  }
  return accum;
}

Here’s the generated assembly, translated to nasm syntax and commented.

global sum_list
sum_list:
  xor eax, eax     ; equivalent to "mov rax, 0" but faster
                   ; in C it's fine to clobber rdi instead of copying it first
  test rdi, rdi        ; <- same as ours
  jz .done         ; here the "if NULL" case is at the bottom
.else:
  add rax, [rdi]       ; <- same as ours
  mov rdi, [rdi+8]     ; <- same as ours
  test rdi, rdi        ; <- same as ours, but duplicated
  jnz .else            ; <- same as ours
.done:
  rep ret          ; equivalent to "ret", but faster on old AMD chips for no good reason

This is almost identical to the assembly that I wrote, except that it clobbers one of its inputs (which is perfectly allowed by the C calling convention⁸), it uses xor instead of mov to load 0 (a solid optimization⁹), it uses rep ret (less compact and no benefit on Intel chips), and it shuffles the instructions around such that two tests are needed (almost certainly not helpful with modern branch prediction and loop detection). I haven’t run benchmarks on this, but my guess is that it would come out about even. (Both versions are eight instructions long.) I also think the shuffling makes this “iterative” version more opaque and difficult to reason about (not least because of the duplicated test) than my “named let”-style code.

Non-Tail Recursion

uint64_t sum_list(struct number_list* list) {
  if(!list) {
    return 0;
  } else {
    return list->number+sum_list(list->next);
  }
}

gcc -O3 can almost completely convert this version to iteration, so let’s look at the generated assembly from gcc -O1 to get a better sense of what it might look like in a language implementation for which the necessary optimizations are too complex to be made automatically.

global sum_list
sum_list:
  push rbx          ; preserve the current value of rbx on the stack
  mov rbx, rdi      ; replace rbx by the argument to the function, list
  mov eax, 0        ; set up 0 in the result register
  test rdi, rdi     ; check if rdi is NULL
  jz .else          ; if so go to else
  mov rdi, [rdi+8]  ; ~ list=list->next;
  call sum_list     ; sum_list(list) -> result register (rax)
  add rax, [rbx]    ; add list->number (preserved across function call) to rax
.else:
  pop rbx           ; restore the state of rbx
  ret               ; return rax

We can see immediately that some new instructions (push, pop, and call) have been introduced. These are all stack manipulation instructions¹⁰. If we carefully pretend to be the CPU running this program, we can see that it pushes the address of every number in the linked list, and then dereferences and adds them up as it pops them from the stack. This is not good; if we wanted our entire data structure to be replicated on the stack, we would have passed it by value¹¹! When we hit a “recursion depth exceeded” error, it’s generally the memory set aside for the stack that we’ve actually run out of.
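
To make the stack growth concrete, here is a small Python model of what the -O1 code does, with the hardware stack replaced by an explicit list. It is an approximation under stated assumptions: list cells are (number, next) tuples, only the saved cell pointer is modelled (not the return addresses that call also pushes), and the function name is made up for illustration.

```python
def sum_list_with_explicit_stack(lst):
    """Return (sum, deepest stack size) for a list of (number, next) tuples."""
    stack = []
    cell = lst
    # Descent: every recursive call keeps its own cell alive for later use.
    while cell is not None:
        stack.append(cell)      # roughly: push rbx / mov rbx, rdi
        cell = cell[1]          # mov rdi, [rdi+8] / call sum_list
    deepest = len(stack)
    result = 0                  # mov eax, 0 in the base case
    # Ascent: each return adds the number belonging to the frame being resumed.
    while stack:
        cell = stack.pop()
        result += cell[0]       # add rax, [rbx]
    return result, deepest
```

For a three-element list the second component is 3: the stack holds one entry per element before a single addition has happened.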

Tail Recursion

What about translating the tail-recursive version into C? Like Scheme and Python, gcc supports nested function definitions (as a GNU extension to C), so this is no problem:

uint64_t sum_list(struct number_list* list) {
  uint64_t sum_sublist(uint64_t accum, struct number_list* sublist) {
    if(!sublist) {
      return accum;
    } else {
      return sum_sublist(accum+sublist->number,sublist->next);
    }
  }
  return sum_sublist(0,list);
}

Here’s what gcc -O1 gives us (translated and commented as before):

sum_sublist.1867:       ; A random constant has been added to avoid polluting the namespace. Not the best solution, but okay.
  sub rsp, 8            ; Decrement the stack by one 8-byte machine word. Seems unnecessary...
  mov rax, rdi          ; Copy first argument (rdi/"accum") into result register (rax).
  test rsi, rsi         ; Test second argument (rsi/"sublist") for nullity.
  jz .else              ; If null, goto else.
  add rdi, [rsi]        ; ~ accum = accum + sublist->number;
  mov rsi, [rsi+8]      ; sublist = sublist->next;
  call sum_sublist.1867 ; recurse. result appears in rax, ready to pass along (as the return value) to the next caller in the stack.
.else:
  add rsp, 8            ; seems unnecessary
  ret                   ; return rax
 
sum_list:
  sub rsp, 8
  mov rsi, rdi          ; first argument (rdi/"list") of sum_list becomes 2nd argument (rsi/"sublist") of sum_sublist
  mov rdi, 0            ; first argument (rdi/"accum") of sum_sublist is 0
  call sum_sublist.1867 ; call sum_sublist!
  add rsp, 8
  ret

In this mode, the tail call is not being eliminated – although we’re no longer pushing rbx, we’re still pushing rip onto the stack with every call, and eventually we’ll run out of stack that way. The only way to get around this is to replace each call with jmp: since we’re just going to take the return value of the next recursive invocation and then immediately ret back to the previous caller on the stack, there’s no point in even inserting our own address on the stack (as call does); we can just set up the next guy to pass the return value straight back to the previous guy, and quietly disappear.

gcc -O3 does this. In fact, somewhat surprisingly, it generates exactly the same assembly, line for line, for this version as for the purely iterative version above. That’s “tail call optimization” (TCO) or “tail recursion elimination” (TRE) in its most aggressive form: it literally just gets rid of all calls and recursions and replaces them with an equivalent iteration (complete with duplicate test).
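
Python never performs this optimization, but the “set up the next guy and quietly disappear” idea can be imitated by hand with a trampoline. This is a swapped-in technique for illustration, not what gcc emits: instead of making the tail call, the function returns a zero-argument closure describing it, and a driver loop keeps jumping until a plain value comes back. The names and the (number, next) tuple representation are assumptions.

```python
def sum_sublist(accum, sublist):
    if sublist is None:
        return accum
    number, rest = sublist
    # Describe the tail call instead of performing it, so this frame can exit now.
    return lambda: sum_sublist(accum + number, rest)


def trampoline(step):
    """Keep invoking returned closures until a non-callable result appears."""
    while callable(step):
        step = step()
    return step


def sum_list(lst):
    return trampoline(sum_sublist(0, lst))
```

The stack depth stays constant regardless of list length, at the price of allocating one closure per element.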

The upshot of all this is that not only does Scheme’s “named let” recursion form translate neatly into assembly, it provides – penalty-free – a better abstraction than either iteration (while-loop imitation) or stack-driven recursion, the two options gcc appears to pick from when dealing with various ways to code a list traversal.

Actually, the real point I’m trying to make here is that, unlike in C, I can naturally do named let directly in assembly, and that’s one of the many reasons working in assembly makes me happy.

Appendix: What’s so great about recursion, anyway?

For me, the most important point in favor of a recursive representation of loops is that I find it easier to reason about correctness that way.

Any function we define ought to implement some ideal mathematical function that maps inputs to outputs¹². If our code truly does implement that ideal function, we say that the code is correct. Generally, we can break down the body of a function as a composition of smaller functions; even in imperative languages, we can think of every statement as pulling in a state of the world, making well-defined changes, and passing the new state of the world into the next statement¹³. At each step, we ask ourselves, “are the outputs of this function going to be what I want them to be?” For loops, though, this gets tricky.

What recursion does for us as aspiring writers of correct functions is automatic translation of the loop verification problem into the much nicer problem of function verification. Intuitively, you can simply assume that all invocations of a recursive function within its own body are going to Do The Right Thing, ensure that the function as a whole Does The Right Thing under that assumption, and then conclude that the function Does The Right Thing in general. If this sounds like circular reasoning, it does¹⁴; but it turns out to be valid anyway.

There are many ways to justify this procedure formally, all of which are truly mind-bending¹⁵. But once you’ve justified this procedure once, you never have to do it again (unlike ad-hoc reasoning about loops). I’ve determined that the most elegant way to explain it is by expanding our named let example into a non-recursive function, which just happens to accept as a parameter a correct¹⁶ version of itself.

(define (sum_list list)
  (define (sum_sublist_nonrec f_correct accum sublist)
    (cond ((null? sublist)
           accum )
          (else
           (f_correct f_correct (+ accum (car sublist)) (cdr sublist)) )))
  (sum_sublist_nonrec sum_sublist_nonrec 0 list) )

Now, sum_sublist_nonrec is an honest-to-goodness non-recursive function, and we can check that it is correct. Given a correct function f_correct (which takes as inputs a correct version of itself, a number, and a list, and correctly returns the sum of all the elements in the list plus the number), a number, and a list, does sum_sublist_nonrec correctly return the sum of all elements in the list plus the number? Why yes, it does. (Constructing a formal proof tree for this claim is left as an exercise for the self-punishing reader.) Note that since f_correct is assumed to already be correct, the correct version of it is still just f_correct, so we can safely pass it to itself without violating our assumptions or introducing new ones. So, sum_sublist_nonrec is correct.

Now let’s consider the correctness of sum_list. It’s supposed to add up all the numbers in list. What it actually does is to apply the (correct) function sum_sublist_nonrec, passing in a correct version of itself (check! it’s already correct), a number to add the sum of the list to (check! adding zero to the sum of the list won’t change it), and the list (check! that’s what we’re supposed to sum up).

We’ve just proved our program correct! The magic of named let is that it generates this clumsy form with a bunch of f_corrects from a compact and elegant form. In so doing, it lets us get away with much less formal reasoning while still having the confidence that it can be converted into something like what we just slogged through. Rest assured that no matter what you do with named let, no matter how complicated the construct you create, this “assume it does the right thing” technique still applies!
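
The same construction can be written in Python, which makes it easy to poke at. In this sketch the cells are assumed to be (number, next) tuples, and reference_sum is a hypothetical, independently written function that satisfies the same specification as f_correct.

```python
def sum_sublist_nonrec(f_correct, accum, sublist):
    """Non-recursive: its body never mentions its own name, only f_correct."""
    if sublist is None:
        return accum
    return f_correct(f_correct, accum + sublist[0], sublist[1])


def sum_list(lst):
    return sum_sublist_nonrec(sum_sublist_nonrec, 0, lst)


def reference_sum(_self, accum, sublist):
    """A plain loop that meets the f_correct specification (ignores its first argument)."""
    while sublist is not None:
        accum += sublist[0]
        sublist = sublist[1]
    return accum
```

Handing sum_sublist_nonrec any function that meets the specification, such as reference_sum, yields the right answer; handing it itself is just the special case the named let expands to.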

With one tiny caveat. We haven’t proved that the program terminates. If this technique proved termination, then we could just write

(define (do-the-right-thing x)
  (let does-the-right-thing ((x x))
    (does-the-right-thing x)))

and it would be totally correct, no matter what thing we want it to do.

Technically, everywhere I’ve said “correct”, what I mean is partially correct: if it terminates, then the output is correct. (Equivalently, it definitely won’t return something incorrect.) do-the-right-thing is, in fact, partially correct: it never returns at all, so it won’t give you any incorrect outputs!

Termination proofs of recursive functions can usually be handled by structural induction on possible inputs: you establish that it terminates for minimal elements (e.g. the empty list) and that termination for any non-minimal element is dependent only on termination for some set of smaller elements (e.g. the tail of the list). The structure that you need in order to think about termination this way is also much clearer with recursion than with iteration constructs.
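
As a small illustration of the termination argument, the sketch below attaches an explicit measure to the recursion and checks at run time that every recursive call receives a strictly smaller input. A run-time check is not a proof, and the names here are invented for the example; cells are assumed to be (number, next) tuples.

```python
def measure(sublist):
    """Number of cells remaining: the quantity that must shrink on every call."""
    count = 0
    while sublist is not None:
        count += 1
        sublist = sublist[1]
    return count


def sum_sublist(accum, sublist):
    if sublist is None:                       # minimal element: returns immediately
        return accum
    rest = sublist[1]
    assert measure(rest) < measure(sublist)   # the recursive call gets a smaller input
    return sum_sublist(accum + sublist[0], rest)
```

Because the measure is a non-negative integer that drops by one per call, the recursion can only continue for as many steps as the list has cells.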

If you doubt my ability to productively use assembly for more complicated toy problems, I direct you to my previous blog post.↩
Guido van Rossum is the author of Python, and the “Benevolent Dictator for Life” of its development process.↩
Unlike most language implementations, guile natively supports arbitrarily large integers.↩
Although at least it’s not as inelegant as defining the helper function outside the body of the actual function, thereby polluting the global namespace. Take advantage of nested functions!↩
You can even keep your for-each loops, which are no substitute for map and filter.↩
If you’re curious how this works, click here. But I haven’t settled on an ASM REPL solution I’m happy with – this is just a one-off hack. A more legitimate ASM REPL may be the subject of a future blog post.↩
Except for rep prefixes, which can iterate certain single instructions. I think it’s fair to say those don’t really count.↩
I find calling conventions distasteful in general. The calling convention is like a shadow API (in fact, it’s often referred to as the ABI, for application binary interface) that nobody has any control over (except the people at AMD, Intel, and Microsoft who are in a position to decide on such things) and that applies to every function, every component on every computer everywhere. What if we let people define their ABI as part of their API? Would the world come crashing down? I doubt it. You can already cause quite a bit of trouble by misusing APIs; really, both API and ABI usage ought to be formally verified, and as such ought to have much more room for flexibility than they do now. ↩
I would have applied this xor optimization too if I weren’t trying to literally translate Scheme code as an illustration.↩
“The stack” is not merely a region of memory managed by the OS (like “the heap”, its common counterpart). The stack is a hardware-accelerated mechanism deeply embedded in the CPU. There is a hardware register rsp (a.k.a. the stack pointer). A push instruction decrements rsp (usually by 8 at a time, in 64-bit mode, since pointers are expressed as numbers of 8-bit bytes, and 64/8=8) and then stores a value to [rsp]. A pop instruction retrieves a value from [rsp] and then increments rsp. A call instruction pushes the current value of rip (a.k.a. the instruction pointer, or the program counter), and then executes an unconditional jump (jmp). Finally, a ret instruction pops from the stack into rip, returning to wherever the matching call left off.↩
You may point out here that C doesn’t actually let you pass entire linked lists by value. Maybe that’s because it’s a bad idea.↩
If your function cannot be fully specified by an abstract mapping from inputs to outputs, then it is nondeterministic, which is a fancy word for “unpredictable”: there must exist some circumstances under which you cannot predict the behavior of the function, even knowing every input. Intuitively, I’m sure you can see how unpredictable software is a nightmare to debug. Controlling nondeterminism is an active field of computer science research, which is not the subject of this article. However, I hope you are at least convinced that nondeterminism is something you should avoid if possible, and that therefore you should try to design every function in your program as a proper mathematical function.

Note that I’m not talking about “purity” here – it’s fine for “outputs” to include side effects as of function exit, and for “inputs” to include states of the external world as of function entry. What’s important is that the state at function exit of anything the function modifies be uniquely determined by the state at function entry of anything that can affect its execution.↩
Unless we’re dealing with hairy scope issues like hoisting, in which case you should get rid of those first.↩
Pun intended. The sentence within which this footnote is referenced isn’t circular reasoning; it’s a tautology. Therefore, it’s an example of something that sounds like circular reasoning but is valid anyway. Of course, you shouldn’t take the existence of this cute example as evidence that the circular-sounding reasoning preceding it is not, in fact, circular. (That would be a fallacy of inappropriate generalization, which neither is nor sounds like circular reasoning.)↩
Trying to explain it for the purposes of this blog post – while making sure that I’m not missing something – took me over four hours.↩
Technically, I mean “partially correct”. This will be addressed in due time. Be patient, pedantic reader. This argument is hard enough to understand already.↩

Overkilling the 8-queens problem

2014-02-25T10:52:57-05:00

Last night, a fellow Hacker Schooler challenged me to a running-time contest on the classic eight queens puzzle. Naturally, I pulled up my trusty Intel® 64 manual and got to work. It turned out to be even faster than I expected, churning out pretty-printed output in 15ms, which is totally dominated by the time it takes the terminal to display it (it takes only 2ms if redirected to a file).

Update: Very slightly more scientific testing, spurred by curious Hacker News commenters, indicates that, without pretty-printing and other overhead, the solving time is actually closer to 11.2µs – about a factor of 7 speedup over commenter bluecalm’s C implementation.

pretty-printed output

(Click here to see the full output.)

The Approach

My solution method is heavily inspired by this paper (which, appropriately enough, concerns a beautifully insane programming language called MCPL, combining features from ML, C, and Prolog). This paper contributes two key insights about solving the 8-queens problem:

Conceptually, we can model the solution space as the leaves of a tree, where each internal node of the tree corresponds to a partial board (with the number of queens equal to the tree depth), and each parent-child link represents adding another queen at the row number corresponding to the depth of the child. Since there can only be one queen per row in a correct solution, this tree is a superset of the actual solution set.
Instead of actually constructing the tree, we can simply keep track of the current traversal state. In particular, this means we keep track of the currently occupied columns, the occupied leftward going diagonals, and the occupied rightward going diagonals, as they intersect the current row. (Each of these three state variables is eight bits of information.) In addition, we can keep track of the past traversal history of each level using the stack.

If any of this is unclear, check out the paper, which has a beautiful diagram that there is no need for me to attempt replicating.
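
The two insights can be sketched in a few lines of Python before any assembly enters the picture. This is a generic rendition of the bitmask technique, not the paper's MCPL code: the function name and parameter names are assumptions, and Python's call stack stands in for the explicit history.

```python
def count_queens(n=8, cols=0, left=0, right=0, row=0):
    """Count n-queens solutions; cols/left/right are bitmasks for the current row."""
    if row == n:
        return 1                                   # reached a leaf: one full board
    full = (1 << n) - 1
    available = ~(cols | left | right) & full      # squares not under attack
    total = 0
    while available:
        bit = available & -available               # lowest free square
        available ^= bit                           # never revisit it on this row
        total += count_queens(
            n,
            cols | bit,
            ((left | bit) << 1) & full,            # left diagonals drift left per row
            (right | bit) >> 1,                    # right diagonals drift right per row
            row + 1,
        )
    return total
```

Calling count_queens(8) yields 92, the well-known number of solutions on a standard board.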

The Code

I’m going to go through the first¹ version of the code, which doesn’t produce the pretty boards but has most of the clever tricks. (Ironically, adding “pretty printing” made my code uglier. Maybe it’s just that I was up too late working on it.)

The heart of this algorithm is the sequence that updates the state variables as we move from one layer into the next. This whole program is small enough that it’s still practical to just set aside registers to represent most variables; in particular, rdx represents where it’s okay to place a queen at the current layer (e.g. it starts out as 0b11111111), and xmm1 (one of those fancy 128-bit registers that supports fancy new operations) stores the “occupied left diagonals”, “occupied right diagonals”, and “occupied columns” states, in that order (with “occupied columns” being the least significant word²). xmm2, xmm3, and xmm4 are just being used as scratch space. Finally, xmm7 is a constant 0xff.

Instruction Dictionary

To spare you the effort of searching through the Intel® 64 manual yourself, here are brief descriptions of all the fancy instructions I’m about to use.

vpsllw: Vector/Packed Shift Left (Logical) Words. Separately shifts left every word of the second argument by the number of bits represented as the third argument, and stores the result to the first argument.
vpsrlw: Vector/Packed Shift Right (Logical) Words. Separately shifts right every word of the second argument by the number of bits represented as the third argument, and stores the result to the first argument.
pblendw: Packed Blend Words. Using the third argument as a mask, selectively copies words from the second argument to the first argument.
vpsrldq: Vector/Packed Shift Right (Logical) Double Quadword. Shifts the entire second argument by the number of bytes specified in the third argument, and stores the result to the first argument.
por: Parallel OR. Bitwise ORs the first and second argument and assigns the result to the first argument.
vpandn: Vector/Parallel AND NOT. Inverts the second argument, ANDs the result with the third argument, and assigns the result of that to the first argument.
movq: Move Quadword. The standard way to move data between xmm registers and normal registers.

Now, let’s take this a few lines at a time.

8queens.asm (github)

  vpsllw xmm2, xmm1, 1      ; shift entire state to left, place in xmm2
  vpsrlw xmm3, xmm1, 1      ; shift entire state to right, place in xmm3
  pblendw xmm1, xmm2, 0b100 ; only copy "left-attacking" word back from xmm2
  pblendw xmm1, xmm3, 0b010 ; only copy "right-attacking" word back from xmm3

If you’re accustomed to C, you might think of this as functionally equivalent to something like xmm1[2] <<= 1; xmm1[1] >>=1³. We want the word in position 1 to shift right and the word in position 2 to shift left, while the word in position 0 (occupied columns) stays put.
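
Here is a Python model of those four instructions, treating the low 48 bits of xmm1 as a single integer. The helper names are invented for this sketch; the word layout (left diagonals in word 2, right diagonals in word 1, columns in word 0) follows the description above.

```python
WORD = 0xFFFF


def pack(left, right, cols):
    """Combine three 16-bit words into one integer, columns least significant."""
    return (left << 32) | (right << 16) | cols


def unpack(state):
    """Split the low 48 bits into (left_diagonals, right_diagonals, columns)."""
    return (state >> 32) & WORD, (state >> 16) & WORD, state & WORD


def advance_row(state):
    left, right, cols = unpack(state)
    left = (left << 1) & WORD    # vpsllw, then pblendw 0b100 keeps only word 2
    right = right >> 1           # vpsrlw, then pblendw 0b010 keeps only word 1
    return pack(left, right, cols)
```

Because each shift is confined to its own 16-bit word, a bit shifted off the end of one word is dropped and never leaks into a neighbouring word.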

8queens.asm (github)

  vpsrldq xmm2, xmm1, 4     ; shift state right 4 *bytes*, place in xmm2
  vpsrldq xmm3, xmm1, 2     ; shift state right 2 bytes, place in xmm3
  por xmm2, xmm3            ; collect bitwise ors in xmm2
  por xmm2, xmm1

Now, we want to combine the information about which squares in the next layer are under attack. It doesn’t matter from which direction – we want to make sure not to put a queen there. So, we shift right 2 words (= 4 bytes) and right 1 word (= 2 bytes) and OR them all together (accumulating into a scratch register so we don’t clobber our state).

8queens.asm (github)

  vpandn xmm4, xmm2, xmm7   ; invert and select low byte
  movq rdx, xmm4            ; place in rdx
  jmp next_state           ; now we're set up to iterate

But that still contains some stuff in the upper bytes. We only want the lower byte. And we also want 1 bits where queens should be allowed, rather than where they’re under attack. We can solve both problems with one vpandn instruction, which will flip all the bits, but mask out everything except the first byte (since xmm7=0xff).
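
In Python terms, the combine-and-mask step looks roughly like this. The state is again modelled as one integer holding three 16-bit words, and the function name is an assumption made for the sketch.

```python
def available_squares(state):
    """Return an 8-bit mask with 1 bits where a queen may be placed on this row."""
    # vpsrldq by 4 bytes and by 2 bytes line all three words up with word 0,
    # and the two por instructions OR them together.
    combined = (state >> 32) | (state >> 16) | state
    # vpandn with xmm7 = 0xff: invert, then keep only the low byte.
    return ~combined & 0xFF
```

Anything above the low byte, including left-diagonal bits that have already drifted off the board, is discarded by the final mask.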

So, now that we’re iterating, what happens next?

Instruction Dictionary

bsf: Bit Scan Forward. Finds the least significant 1 bit in the second argument and stores the index of that bit into the first argument. If there is no 1 bit in the second argument, the value of the first argument is undefined, and the zero flag (ZF) is set.
btc: Bit Test and Complement. Copies the bit in the first argument with index given by the second argument into the carry flag, then flips that bit. Here the bit is always 1 (bsf just found it), so the effect is to clear it.
je: Jump If Equal. Pretty self-explanatory, when used in conjunction with cmp (Compare).
jz: Jump If Zero. Jumps to the specified address/label if the zero flag (ZF) is set.
push: Push To Stack. Decrements rsp (usually by eight at a time, i.e., rsp <- rsp-8), then stores its single argument to the memory location that rsp now points to.
shl: Logical Shift Left for non-xmm registers.

8queens.asm (github)

next_state:
  bsf rcx, rdx             ; find next available position in current level
  jz backtrack             ; if there is no available position, we must go back
  btc rdx, rcx             ; mark position as unavailable
  cmp rsp, r14             ; check if we've done 7 levels already
  je win                   ; if so, we have a win state. otherwise continue
  movq r10, xmm1           ; save current state ...
  push rdx
  push r10                 ;   ... to stack
  mov rax, r15             ; set up attack mask
  shl rax, cl              ; shift into position
  movq xmm2, rax
  por xmm1, xmm2           ; mark as attacking in all directions

First we try scanning for an available position on this row – one that isn’t under attack from already-placed queens, and that also hasn’t already been visited. If there is none, then we have no choice but to backtrack (a little piece of code which is coming up soon). Assuming we find an available position, we first mark it as visited/unavailable. We then check if this is the last level that needs to be taken care of, by looking at the stack pointer. Since the stack gets deeper by 16 bytes with every level, this test⁴ is easily set up at program initialization. If the test is true, then we’ve discovered a solution, or “win state” – so we go ahead to the “win” code.

If we’ve neither succeeded nor failed, it means we just have to go another level down in the tree. In order to have an efficient backtracking capability, we store our state variables on the stack, so they can be restored when everything fails deeper down in the tree. Finally, we update our model of which squares are in danger by adding the queen we’re currently placing as a column-occupier and diagonal-occupier (modifying all three state variables at once with the magic of por). Note here that cl is just a name for the least significant byte of the rcx register, which houses the horizontal position of the new queen.
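
The bit tricks in this step can be modelled in Python as follows. The sketch leaves out the win check and the pushes and only covers choosing a square and recording its attacks; the function names are invented, while the mask value is the one loaded into r15.

```python
ATTACK_MASK = 0x000100010001   # bit 0 set in each of the three 16-bit words (r15)


def bsf(value):
    """Index of the lowest set bit, or None when value is 0 (where hardware sets ZF)."""
    if value == 0:
        return None
    return (value & -value).bit_length() - 1


def place_queen(available, state):
    """Pick the lowest free square; return (position, new_available, new_state) or None."""
    position = bsf(available)               # bsf rcx, rdx
    if position is None:
        return None                         # jz backtrack
    available ^= 1 << position              # btc rdx, rcx (the bit was 1, so this clears it)
    state |= ATTACK_MASK << position        # shl rax, cl / por xmm1, xmm2
    return position, available, state
```

A single shift and OR marks the new queen as occupying its column and both diagonals, because the mask carries one copy of the bit per word.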

What if we have to backtrack?

8queens.asm (github)

backtrack:
  cmp rsp, r13             ; are we done?
  je done
  pop r10                  ; restore last state
  pop rdx
  movq xmm1, r10
  jmp next_state           ; try again

First, we have another stack-pointer test - if we’ve tried to backtrack past the start of the program, then we know we’ve exhausted all possibilities and just go to done. Assuming that’s not at issue, we simply restore the rdx and xmm1 variables (using r10 as scratch storage since one can’t directly pop xmm registers). Then we just jump back into our loop, with a new state ready to go!

Now we’re ready to look at the whole solution in context:

8queens.asm (github)

%include "os_dependent_stuff.asm"
  mov rdx, 0b11111111      ; all eight possibilities available
  mov r8, 0x000000000000   ; no squares under attack from anywhere
  movq xmm1, r8            ; maintain this state in xmm1
  mov r15, 0x000100010001  ; attack mask for one queen (left, right, and center)
  mov r14, 0xff            ; mask for low byte
  movq xmm7, r14           ; stored in xmm register
  mov r13, rsp             ; current stack pointer (if we backtrack here, then
  mov r14, rsp             ;   the entire solution space has been explored)
  sub r14, 2*8*7           ; this is where the stack pointer would be when we've
                           ;   completed a winning state
next_state:
  bsf rcx, rdx             ; find next available position in current level
  jz backtrack             ; if there is no available position, we must go back
  btc rdx, rcx             ; mark position as unavailable
  cmp rsp, r14             ; check if we've done 7 levels already
  je win                   ; if so, we have a win state. otherwise continue
  movq r10, xmm1           ; save current state ...
  push rdx
  push r10                 ;   ... to stack
  mov rax, r15             ; set up attack mask
  shl rax, cl              ; shift into position
  movq xmm2, rax
  por xmm1, xmm2           ; mark as attacking in all directions
  vpsllw xmm2, xmm1, 1      ; shift entire state to left, place in xmm2
  vpsrlw xmm3, xmm1, 1      ; shift entire state to right, place in xmm3
  pblendw xmm1, xmm2, 0b100 ; only copy "left-attacking" word back from xmm2
  pblendw xmm1, xmm3, 0b010 ; only copy "right-attacking" word back from xmm3
  vpsrldq xmm2, xmm1, 4     ; shift state right 4 *bytes*, place in xmm2
  vpsrldq xmm3, xmm1, 2     ; shift state right 2 bytes, place in xmm3
  por xmm2, xmm3            ; collect bitwise ors in xmm2
  por xmm2, xmm1
  vpandn xmm4, xmm2, xmm7   ; invert and select low byte
  movq rdx, xmm4            ; place in rdx
  jmp next_state           ; now we're set up to iterate
 
backtrack:
  cmp rsp, r13             ; are we done?
  je done
  pop r10                  ; restore last state
  pop rdx
  movq xmm1, r10
  jmp next_state           ; try again
 
win:
  inc r8                   ; increment solution counter
  jmp next_state           ; keep going
 
done:
  mov rdi, r8              ; set system call argument to solution count
  mov rax, SYSCALL_EXIT    ; set system call to exit
  syscall                  ; this will exit with our solution count as status
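
As a cross-check on the control flow, here is a Python emulation of the listing above, with a list standing in for the hardware stack and plain integers for the registers. It is a model for reasoning about the algorithm, not a faithful simulation of the CPU, and the function name is made up.

```python
def solve_like_the_assembly():
    """Count 8-queens solutions by following the assembly's loop structure."""
    attack_mask = 0x000100010001
    word = 0xFFFF
    available = 0b11111111                        # rdx
    state = 0                                     # low 48 bits of xmm1
    stack = []                                    # (rdx, xmm1) pairs
    solutions = 0                                 # r8
    while True:                                   # next_state:
        if available == 0:                        # bsf found nothing -> backtrack
            if not stack:                         # cmp rsp, r13 / je done
                return solutions
            available, state = stack.pop()        # pop r10 / pop rdx
            continue
        position = (available & -available).bit_length() - 1   # bsf rcx, rdx
        available ^= 1 << position                # btc rdx, rcx
        if len(stack) == 7:                       # cmp rsp, r14 / je win
            solutions += 1                        # inc r8
            continue
        stack.append((available, state))          # push rdx / push r10
        state |= attack_mask << position          # por xmm1, xmm2
        left = ((state >> 32) << 1) & word        # word 2 shifts left
        right = ((state >> 16) & word) >> 1       # word 1 shifts right
        cols = state & word                       # word 0 stays put
        state = (left << 32) | (right << 16) | cols
        available = ~(left | right | cols) & 0xFF # vpandn with xmm7
```

The emulation returns 92, which also fits comfortably in the 8-bit exit status the program uses to report its answer.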

If you’re curious to investigate further, run the code yourself ⁵ and/or check out the more complicated, pretty-printing version.

Somewhat surprisingly, the first version actually worked.↩
A word is two bytes. Why did I use words and not just bytes? The answer is that some of the fancy instructions we want to use don’t allow us to work with data elements any smaller than words.↩
But it’s all taking place in the register file – no memory accesses here!↩
That is to say, the value of r14.↩
Requires a recent (Sandy Bridge or later) Intel CPU.↩

Relocatable vs. Position-Independent Code (or, Virtual Memory isn't Just For Swap)

2014-02-19T17:12:50-05:00

Myth: “Virtual memory” is the mechanism that a kernel uses to make more memory available than is actually physically installed, by setting aside a disk partition for the overflow and copying pages between memory and disk as needed.

I acquired this belief very early in my programming career, but it turns out that swapping pages to disk is merely one of the many things that “virtual memory” makes possible.

Fact: “Virtual memory” is a hardware (CPU) mechanism, which, every single time memory is accessed, references a kernel-specified data structure called a “page table” to arbitrarily frobnicate the high bits of the address, which is called “translating” from a “linear address” to a “physical address”. (The page table gets cached by a translation lookaside buffer, so the lookup is usually quite efficient!)

This fact became very real to me this week as I made a kernel from scratch: I was moderately surprised that I needed to set up a page table, when I had always thought of virtual memory as a somewhat advanced kernel feature. Today, I learned how “relocatable” and “PIC” – terms I’d encountered in the past and never really understood – suddenly make sense in this context.

Here’s another fact that surprised me: in conventional operating systems, every process has its own page table. The pointer 0x7fff8000 does not necessarily translate to the same physical address in one process as it does in another¹.
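
A toy model may help here. The sketch below assumes 4 KiB pages and flattens the page table into a single Python dictionary from page number to frame number; real x86_64 page tables are multi-level trees, and the frame numbers are invented for the example.

```python
PAGE_SHIFT = 12                        # assumed page size: 4 KiB
PAGE_MASK = (1 << PAGE_SHIFT) - 1


def translate(page_table, linear_address):
    """Swap the high bits of an address according to the table; keep the offset."""
    page_number = linear_address >> PAGE_SHIFT
    offset = linear_address & PAGE_MASK
    frame_number = page_table[page_number]   # a KeyError here plays the role of a page fault
    return (frame_number << PAGE_SHIFT) | offset


# Two processes, each with its own (hypothetical) page table.
process_a = {0x7FFF8: 0x00123}
process_b = {0x7FFF8: 0x00456}
```

With these tables, translate(process_a, 0x7fff8000) gives 0x123000 while translate(process_b, 0x7fff8000) gives 0x456000: the same pointer, two different physical locations.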

Now, let’s talk about libraries. Libraries are code, but they don’t run as processes of their own. They’re going to wind up under someone else’s page table. There are two ways that can happen: static linking and dynamic linking².

If a library is statically linked, the linker finds some place in a code segment of the executable to situate the library. The loader will then place this segment in virtual memory (wherever it’s explicitly specified to go) when the executable is run.
If a library is dynamically linked, then when the loader sets up the executable, it will invoke the dynamic linker to make sure that the required library shows up some place in the process’s virtual memory³.

Whether static or dynamic, a linked library is going to be situated in virtual memory somewhere that the library can’t predict⁴, which is problematic for accessing its own memory. Fortunately, the linker (whether static or dynamic) can help us out by relocating the library’s code, so that it knows where it is. Unfortunately, library writers have to help the linker out by specifying, in the object file, the set of instructions or initialized data that need to be modified to properly relocate it. As long as all that “relocation information” is present, the object file is said to be relocatable.

On the other hand, position-independent code (PIC), as the name suggests, doesn’t even need to be relocated. None of its instructions or initialized data encode any assumptions about the region of virtual memory the program will be loaded into; it figures out where it is (usually by referencing the instruction pointer) and makes all memory accesses based on what it finds out.

So why do all that work when the linker can relocate for us?

Here’s the kicker. The whole motivation for dynamic linking was shared libraries. Shared doesn’t just mean that multiple programs reference the same library file on disk. It means those processes share that library in physical memory⁵. Since every process has its own page table, the exact same library code winds up executing as if it were loaded into multiple, inconsistent virtual memory locations. If we relocated it for one process, it wouldn’t necessarily be valid for another. This is why weird things sometimes happen where the solution is “recompile blah with -fPIC”.
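A toy calculation shows the conflict. The offsets and load addresses below are invented, and the three helper functions are just illustrations: imagine a library instruction that ends at offset 0x18 and needs to reach data at offset 0x40. Relocated code stores an absolute address, while position-independent code stores a displacement from the instruction pointer.

```shell
#!/bin/bash
# Hypothetical layout of a tiny library: an instruction that ends at offset
# 0x18 needs to read data that lives at offset 0x40.
NEXT_INSN_OFFSET=0x18
DATA_OFFSET=0x40

# Word stored in the code if the linker relocates it for a given load address.
absolute_word() { printf '0x%x\n' $(( $1 + DATA_OFFSET )); }

# Word stored in position-independent code: a displacement from the next
# instruction, which never mentions the load address at all.
pic_word() { printf '0x%x\n' $(( DATA_OFFSET - NEXT_INSN_OFFSET )); }

# What the PIC instruction computes at run time: instruction pointer + word.
pic_target() { printf '0x%x\n' $(( $1 + NEXT_INSN_OFFSET + $(pic_word) )); }

for base in 0x7f0000000000 0x7f1234560000; do
    echo "loaded at $base:"
    echo "  relocated code stores $(absolute_word $base)"
    echo "  PIC code stores       $(pic_word), resolving to $(pic_target $base)"
done
```

The relocated word differs between the two load addresses, so one physical copy of the code can't serve both processes. The PIC word is the same in both, and still resolves to the right place in each.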

Perhaps the most interesting thing about all this is that in today’s 64-bit age, position-independent code may not even be necessary. The available virtual memory address space with 64 bits is so large that an OS may be able to afford blocking off a region of every process’s virtual memory space to host every shared library on the system, so that their linear locations are guaranteed to be consistent from process to process. That means shared libraries would still have to be relocatable, but they wouldn’t have to be PIC.

On the other hand, x86_64 makes it significantly easier to write position-independent code, by letting instructions address memory relative to the current program counter (RIP-relative addressing), so no matter what virtual memory offset the code is at, it’s internally consistent. If we adopt a policy that all libraries (static and dynamic) are PIC, then libraries don’t ever have to worry about being relocated and the linker gets a lot simpler.

¹ This is one of the things that differentiates a “process” from a “thread”: threads don’t have their own page tables.
² Just as with static typechecking and dynamic typechecking, “static” means that it happens before the program is invoked, and “dynamic” means that it happens after the program is invoked.
³ The loader also needs to populate a series of “slots” at fixed addresses with instructions that jump into where the library is (since the executable won’t know in advance where the library will show up, unlike with static linking). But that part of dynamic linking is a distraction for the discussion of relocatable vs. PIC.
⁴ Unlike a stand-alone executable, which can request (almost) any virtual memory address that it wants (since it has the whole page table to itself).
⁵ In fact, in most operating systems, if multiple processes map the same file into their virtual memory, and none of them write to it, those processes’ page tables will translate each of their process-specific addresses for that file to the same pages of physical memory.

Kernel from Scratch

2014-02-18T02:58:08-05:00

One of my 3 major goals for Hacker School was to create a bootable, 64-bit kernel image from scratch, using only nasm and my text editor. Well, folks, one down, two to go.

The NASM/x64 assembly code is listed below, with copious comments for your pleasure. It comprises 136 lines including comments; 75 with comments removed. You may wish to refer to the Intel® 64 Software Developers’ Manual (16.5MB PDF), especially if you’re interested in doing something similar yourself. Building and running is as simple as

$ nasm boot.asm -o bootable.bin
$ qemu-system-x86_64 bootable.bin

That is, assuming that you have recent versions of nasm and qemu installed.
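If the image doesn't boot, one cheap thing to check is the shape of the file itself. The listing ends by padding the sector to 512 bytes and writing the boot signature 0xaa55, which lands on disk as the bytes 55 aa. The helper below is a sketch of that check; the name `check_boot_sector` is made up, and it says nothing about whether the code inside the sector is correct.

```shell
#!/bin/bash
# check_boot_sector <file>: succeeds if the file is exactly 512 bytes long
# and its last two bytes are 0x55 0xaa (the little-endian form of 0xaa55).
check_boot_sector() {
    local size sig
    size=$(wc -c < "$1" | tr -d ' ')
    sig=$(tail -c 2 "$1" | od -An -tx1 | tr -d ' \n')
    [[ $size -eq 512 && $sig == 55aa ]]
}

# Typical use after assembling:
#   check_boot_sector bootable.bin && echo "looks like a boot sector"
```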

Let’s get to the code!

boot.asm

bits 16
org 0x7c00
k_boot_start:
 
  ; The cli instruction disables maskable external interrupts.
  cli
 
  ; Fetch Control Register 0, set bit 0 to 1 (Protection Enable bit)
  ; This enables protected mode, the prerequisite for running 32-bit code
  mov eax, cr0
  or al, 1
  mov cr0, eax
 
  ; Now we have to jump into the 32-bit zone. The 0x08 is a 386-style segment
  ; selector, which theoretically references entry 1 of the Global Descriptor
  ; Table, though in this bare-bones bootloader we haven't even bothered to set
  ; that up yet and it works anyway.
  jmp 0x08:k_32_bits
 
bits 32
k_32_bits:
 
  ; Now we're going to set up the page tables for 64-bit mode.
  ; Since this is a minimal example, we're just going to set up a single page.
  ; The 64-bit page table uses four levels of paging,
  ;    PML4E table => PDPTE table => PDE table => PTE table => physical addr
  ; You don't have to use all of them, but you have to use at least the first
  ; three. So we're going to set up PML4E, PDPTE, and PDE tables here, each
  ; with a single entry.
%define PML4E_ADDR 0x8000
%define PDPTE_ADDR 0x9000
%define PDE_ADDR 0xa000
  ; Set up PML4 entry, which will point to PDPT entry.
  mov dword eax, PDPTE_ADDR
  ; The low 12 bits of the PML4E entry are zeroed out when it's dereferenced,
  ; and used to encode metadata instead. Here we're setting the Present and
  ; Read/Write bits. You might also want to set the User bit, if you want a
  ; page to remain accessible in user-mode code.
  or dword eax, 0b011  ; Would be 0b111 to set User bit also
  mov dword [PML4E_ADDR], eax
  ; Although we're in 32-bit mode, the table entry is 64 bits. We can just zero
  ; out the upper bits in this case.
  mov dword [PML4E_ADDR+4], 0
  ; Set up PDPT entry, which will point to PD entry.
  mov dword eax, PDE_ADDR
  or dword eax, 0b011
  mov dword [PDPTE_ADDR], eax
  mov dword [PDPTE_ADDR+4], 0
  ; Set up PD entry, which will point to the first 2MB page (0).  But we
  ; need to set three bits this time, Present, Read/Write and Page Size (to
  ; indicate that this is the last level of paging in use).
  mov dword [PDE_ADDR], 0b10000011
  mov dword [PDE_ADDR+4], 0
 
  ; Enable PGE and PAE bits of CR4 to get 64-bit paging available.
  mov eax, 0b10100000
  mov cr4, eax
 
  ; Set master (PML4) page table in CR3.
  mov eax, PML4E_ADDR
  mov cr3, eax
 
  ; Set IA-32e Mode Enable (read: 64-bit mode enable) in the "model-specific
  ; register" (MSR) called Extended Features Enable (EFER).
  mov ecx, 0xc0000080
  rdmsr ; takes ecx as argument, deposits contents of MSR into edx:eax
  or eax, 0b100000000
  wrmsr ; exactly the reverse of rdmsr
 
  ; Enable PG flag of CR0 to actually turn on paging.
  mov eax, cr0
  or eax, 0x80000000
  mov cr0, eax
 
  ; Load Global Descriptor Table (outdated access control, but needs to be set)
  lgdt [gdt_hdr]
 
  ; Jump into 64-bit zone.
  jmp 0x08:k_64_bits
 
bits 64
k_64_bits:
  mov rdi, 0xb8000 ; This is the beginning of "video memory."
  mov rdx, rdi     ; We'll save that value for later, too.
  mov rcx, 80*25   ; This is how many characters are on the screen.
  mov ax, 0x7400   ; Video memory uses 2 bytes per character. The high byte
                   ; determines foreground and background colors. See also
; http://en.wikipedia.org/wiki/List_of_8-bit_computer_hardware_palettes#CGA
                   ; In this case, we're setting red-on-gray (MIT colors!)
  rep stosw        ; Copies whatever is in ax to [rdi], rcx times.
 
  mov rdi, rdx       ; Restore rdi to the beginning of video memory.
  mov rsi, hello     ; Point rsi ("source" of string instructions) at string.
  mov rbx, hello_end ; Put end of string in rbx for comparison purposes.
hello_loop:
  movsb              ; Moves a byte from [rsi] to [rdi], increments rsi and rdi.
  inc rdi            ; Increment rdi again to skip over the color-control byte.
  cmp rsi, rbx       ; Check if we've reached the end of the string.
  jne hello_loop     ; If not, loop.
  hlt                ; If so, halt.
 
hello:
  db "Hello, kernel!"
hello_end:
 
; Global descriptor table entry format
; See Intel 64 Software Developers' Manual, Vol. 3A, Figure 3-8
; or http://en.wikipedia.org/wiki/Global_Descriptor_Table
%macro GDT_ENTRY 4
  ; %1 is base address, %2 is segment limit, %3 is flags, %4 is type.
  dw %2 & 0xffff
  dw %1 & 0xffff
  db (%1 >> 16) & 0xff
  db %4 | ((%3 << 4) & 0xf0)
  db (%3 & 0xf0) | ((%2 >> 16) & 0x0f)
  db %1 >> 24
%endmacro
%define EXECUTE_READ 0b1010
%define READ_WRITE 0b0010
%define RING0 0b10101001 ; Flags set: Granularity, 64-bit, Present, S; Ring=00
                   ; Note: Ring is determined by bits 1 and 2 (the only "00")
 
; Global descriptor table (loaded by lgdt instruction)
gdt_hdr:
  dw gdt_end - gdt - 1
  dd gdt
gdt:
  GDT_ENTRY 0, 0, 0, 0
  GDT_ENTRY 0, 0xffffff, RING0, EXECUTE_READ
  GDT_ENTRY 0, 0xffffff, RING0, READ_WRITE
  ; You'd want to have entries for other rings here, if you were using them.
gdt_end:
 
; Very important - mark the sector as bootable. 
times 512 - 2 - ($ - $$) db 0 ; zero-pad the 512-byte sector to the last 2 bytes
dw 0xaa55 ; Magic "boot signature"
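The GDT_ENTRY macro scatters its arguments across eight bytes in a way that is easy to get wrong, so it can help to compute the bytes by hand once. The bash function below is a sketch that mirrors the macro's arithmetic line for line. The name `gdt_entry` is made up, and bash spells binary literals as `2#1010` where NASM uses `0b1010`.

```shell
#!/bin/bash
# gdt_entry <base> <limit> <flags> <type>: print the eight bytes that the
# GDT_ENTRY macro would emit, in memory order.
gdt_entry() {
    local base=$(( $1 )) limit=$(( $2 )) flags=$(( $3 )) seg_type=$(( $4 ))
    printf '%02x %02x %02x %02x %02x %02x %02x %02x\n' \
        $(( limit & 0xff )) $(( (limit >> 8) & 0xff )) \
        $(( base & 0xff ))  $(( (base >> 8) & 0xff )) \
        $(( (base >> 16) & 0xff )) \
        $(( seg_type | ((flags << 4) & 0xf0) )) \
        $(( (flags & 0xf0) | ((limit >> 16) & 0x0f) )) \
        $(( (base >> 24) & 0xff ))
}

RING0=2#10101001
gdt_entry 0 0        0      0          # null descriptor
gdt_entry 0 0xffffff $RING0 2#1010     # EXECUTE_READ code segment
gdt_entry 0 0xffffff $RING0 2#0010     # READ_WRITE data segment
```

The code segment comes out as `ff ff 00 00 00 9a af 00`: an access byte of 0x9a (Present, ring 0, code, readable) and 0xaf for the Granularity and 64-bit flags plus the top nibble of the limit.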

Octopress workflow

2014-02-11T18:10:58-05:00

Today is my second day at Hacker School, and I decided to set up a little bit of tooling for blogging about what I do here. The first tool I set up (following the recommendations of many Hacker Schoolers and alums) was Octopress, a static site generator designed for GitHub Pages and implemented atop Jekyll. (The page you’re reading right now is Octopress-generated.) I followed the admirably thorough Octopress documentation for installation, initial configuration, deployment with GitHub Pages, and theme customization¹. But I wanted even more convenience. So, I’m here to introduce you to the blog command (the same one I used to write this very post).

davidad@zayin ~/octopress> blog
Enter a title for your post:

blog is a bash script, pretty specific to my own setup (vim, Chrome, OS X), but it could be adapted to other environments. blog can create a post using Octopress’ new_post[] Rake target (and you can specify a title on the command line if you want), then it opens vim in sort of git commit-ish fashion, with your cursor on the last line ready to press o and start typing your post, and with magical deployment when you :wq². It also implements blog deploy (runs both generate and deploy), blog delete, and editing existing posts. Most importantly, whenever you’re editing, the script sets up a keybinding for C-g that saves your draft post and refreshes the local preview in a Chrome window. It does this even if you don’t have a tab open to refresh, but it also won’t open a new one if you do. And it keeps your vim window in the foreground.

How does this work? You might expect that Chrome has a nice command-line remote interface for exactly this sort of thing. Sadly, that is not the case. However, Apple has had the foresight to allow command-driven automation of actions which can typically only be carried out graphically. Sadly again, that mechanism is AppleScript, a historical relic of a programming language.

Reloading a website in Chrome from AppleScript

tell application "Google Chrome"
    if (count every window) = 0 then
        make new window
    end if
 
    set found to false
    set theTabIndex to -1
    repeat with theWindow in every window
        set theTabIndex to 0
        repeat with theTab in every tab of theWindow
            set theTabIndex to theTabIndex + 1
            if theTab's URL contains "$1" then
                set found to true
                exit
            end if
        end repeat
 
        if found then
            exit repeat
        end if
    end repeat
 
    if found then
        tell theTab to reload
        $L1
    else
        $L2
    end if
end tell

In this snippet, $1 is going to get replaced with the site’s top-level URL (like http://localhost:4000/ for the local preview server, or http://davidad.github.io/ for the deployment). $L1 and $L2 are placeholders for two actions that we might not always want³: changing the current tab to the tab we just refreshed, and opening up a new tab if there wasn’t already one for this site. It’s also worth noting that this script will reload the first tab that contains the URL – so if you have an open tab pointed at a particular page on the site, you won’t lose your place⁴.

The interface to AppleScript is the osascript command, which accepts an AppleScript file as its argument⁵. So, the first big chunk of the blog script is dedicated to producing script files. It’s implemented as a function which fills in the “holes” in the script described above.

function wrs() {
    if [[ $2 = "y" ]]; then
        L1="set theWindow's active tab index to theTabIndex"
        L2="tell window 1 to make new tab with properties {URL:\"$1\"}"
    else
        L1=""
        L2=""
    fi
    cat >.reload.scpt <<EOF
delay 1.5
tell application "Google Chrome"
    
    if (count every window) = 0 then
        make new window
    end if
    
    set found to false
    set theTabIndex to -1
    repeat with theWindow in every window
        set theTabIndex to 0
        repeat with theTab in every tab of theWindow
            set theTabIndex to theTabIndex + 1
            if theTab's URL contains "$1" then
                set found to true
                exit
            end if
        end repeat
        
        if found then
            exit repeat
        end if
    end repeat
    
    if found then
        tell theTab to reload
        $L1
    else
        $L2
    end if
end tell
EOF
}
wrs 'http://localhost:4000/' y

The delay 1.5 line exists to give Octopress enough time to do its thing before trying to reload Chrome. Octopress is pretty slow.

In the next chunk, we handle the delete and deploy actions:

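# Escape spaces in the starting directory so vim's :cd can take it later.
# $(...) is used because backticks would swallow one level of backslashes.
ORIGDIR=$(pwd | sed 's/ /\\ /g')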
cd ~/octopress
 
URL="http://davidad.github.io/"

if [[ $1 = delete ]]; then
    [[ -f $2 ]] && rm -i $2 && bundle exec rake generate && exec $0 deploy
    exit 0
elif [[ $1 = deploy ]]; then
    bundle exec rake deploy \
    && wrs $URL y && sleep 5 && osascript ./.reload.scpt \
    && rm -f ./.reload.scpt .timeref rake_preview.log \
    && git add . \
    && git commit -m "Site updated at `date -u +"%Y-%m-%d %H:%M:%S UTC"`" \
    && git push
    exit 0
fi

In the case of delete, we use rm -i to ask the user to confirm the deletion, and if they do, we generate and then call the script itself ($0) with the deploy action (so as not to duplicate code). The deploy action deploys the generated site (to GitHub Pages), writes out a refresh script for the deployed site, waits an extra few seconds for GitHub Pages to do its thing, and then runs the reload script. Finally, blog commits and pushes the source branch of the repository, after cleaning up its temporary files – the reload script, the log from Octopress’ local preview server, and the time reference (which we’ll come to shortly).

[[ -f $1 ]] && rm -f new_post.md && ln -s $1 new_post.md
[[ -f $1 ]] || bundle exec rake "new_post[$1]"

We’re managing a symbolic link called new_post.md here, which is what we’re going to call vim on. If a filename is specified, we point the link directly at that file. Otherwise, we’re going to call rake to set up the file. By default, rake won’t give any indication to our script of what file it made, so we’re going to make a tweak to the Rakefile:

@@ -104,9 +89,7 @@ task :new_post, :title do |t, args|
   raise "### You haven't set anything up yet. First run `rake install` to set up an Octopress theme." unless File.directory?(source_dir)
   mkdir_p "#{source_dir}/#{posts_dir}"
   filename = "#{source_dir}/#{posts_dir}/#{Time.now.strftime('%Y-%m-%d')}-#{title.to_url}.#{new_post_ext}"
-  if File.exist?(filename)
-    abort("rake aborted!") if ask("#{filename} already exists. Do you want to overwrite?", ['y', 'n']) == 'n'
-  end
+  if not (File.exist?(filename) and ask("#{filename} already exists. Do you want to overwrite?", ['y', 'n']) == 'n')
     puts "Creating new post: #{filename}"
     open(filename, 'w') do |post|
       post.puts "---"
       post.puts "layout: post"
       post.puts "title: \"#{title.gsub(/&/,'&amp;')}\""
       post.puts "date: #{Time.now.strftime('%Y-%m-%d %H:%M:%S %z')}"
       post.puts "comments: true"
       post.puts "categories: "
       post.puts "---"
     end
+  end
+  system "rm -f new_post.md"
+  system "ln -s #{filename} new_post.md"
 end

The first changeset handles the case where I don’t want to overwrite the existing post, but I do want to proceed to edit it (and deploy the edits). The last two lines simply point new_post.md at the right spot so our script can call vim on it. Before we call vim, though, we have to set up the deploy-on-save feature and the live(ish)-preview feature…

touch -m .timeref

.timeref is an empty file which keeps track of the time slightly before vim was launched. In a “successful” session, the modification time of the post file should be newer than .timeref, whereas if you :q! immediately, it won’t be. Now, it’s worth pointing out that the live-preview requires saving along the way, so if you want to abort after previewing, use :cq, vim’s command for exiting with a nonzero status code (so the shell script knows what’s up). The script supports both mechanisms, so that if you are aborting immediately but forget to :cq, The Right Thing should happen.
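The comparison itself is bash's `-nt` ("newer than") file test. Here is a small standalone sketch of it. The `saved_since` helper and the file names are made up, and the timestamps are set explicitly with `touch -t` so the result doesn't depend on how quickly the commands run.

```shell
#!/bin/bash
# saved_since <file> <reference>: true if <file> was modified after <reference>.
saved_since() { [[ "$1" -nt "$2" ]]; }

# Demo in a scratch directory with hand-picked modification times.
demo=$(mktemp -d)
touch -t 201402111800 "$demo/.timeref"        # "just before vim started"
touch -t 201402111810 "$demo/saved-post.md"   # written ten minutes later
touch -t 201402111750 "$demo/untouched.md"    # never saved during the session

for post in saved-post.md untouched.md; do
    if saved_since "$demo/$post" "$demo/.timeref"; then
        echo "$post: saved, go ahead and deploy"
    else
        echo "$post: not saved, treat as aborted"
    fi
done
rm -rf "$demo"
```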

manage preview processes

ps x | egrep 'rake|rackup|jekyll|sass|compass' | grep -v grep | awk '{ print $1 }' | xargs kill
ps x | egrep 'rackup' | grep -v grep | awk '{ print $1 }' | xargs kill -9
bundle exec rake preview > rake_preview.log 2>&1 &

Now we’re going to kill off any existing preview processes (they really start to pile up otherwise!) and launch a new one. We also log its stdout and stderr so you can see what the preview process is up to if you want (tail -f rake_preview.log).
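To see which processes that pipeline would pick, it can be run without the final `xargs kill`. The sketch below wraps the filter stages in a function (the name `preview_pids` is made up) and feeds it an invented sample of `ps x` output instead of the real process list.

```shell
#!/bin/bash
# Same filter as the kill pipeline, minus `xargs kill`, reading `ps x`-style
# lines on stdin. grep -E is the modern spelling of egrep.
preview_pids() {
    grep -E 'rake|rackup|jekyll|sass|compass' | grep -v grep | awk '{ print $1 }'
}

# Invented sample of `ps x` output (PID TTY STAT TIME COMMAND).
preview_pids <<'SAMPLE'
  101 s000  S      0:01.20 ruby /usr/bin/rake preview
  102 s000  S      0:00.80 ruby rackup --port 4000
  103 s001  S+     0:00.10 vim new_post.md
  104 s001  R+     0:00.00 grep -E rake|rackup|jekyll|sass|compass
SAMPLE
```

Only 101 and 102 are printed. The patterns are plain substrings, so any unrelated process whose command line happens to contain one of those words would be selected too.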

sleep 0.3
osascript ./.reload.scpt

We give the preview process a little time to get started and then display the preview in the browser so the user knows what they’re working from.

Run vim

vim -c 'set tw=80' -c 'map <C-g> :w<CR>:!osascript ./.reload.scpt<CR>' \
    -c "cd $ORIGDIR" + new_post.md
VIM_STATUS=$?
[[ `readlink new_post.md` -nt .timeref ]] || VIM_STATUS=1
[ $VIM_STATUS -eq 0 ] && osascript ./.reload.scpt && exec $0 deploy && exit 0
[ $VIM_STATUS -ne 0 ] && wrs 'http://localhost:4000/' n \
    && [ -f new_post.md ] && rm -i `readlink new_post.md` \
    && git rm --ignore-unmatch new_post.md \
    && sleep 0.4 && osascript ./.reload.scpt

This is the last piece of the script, where we actually run vim and then take the appropriate action after it exits. We’re giving vim a number of commands on the command line, including setting auto-wrapping at 80 columns (tw=80), scrolling to the bottom of the file (+), and changing to the directory the script was run from (set all the way back on line 3). Most importantly, we’re forcing a normal-mode mapping of C-g to the reload script!

Once vim exits, we capture its return code with $?. Then we check if the file has actually been saved. Either it has, or (||) the status really ought to be nonzero. If the status is still 0, then we do one final preview and shift into deploy mode. Otherwise, we remove the file that new_post.md points to, remove new_post.md itself, and reload⁶.

Putting it all together

/usr/bin/blog

#!/bin/bash
 
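# Escape spaces in the starting directory so vim's :cd can take it later.
# $(...) is used because backticks would swallow one level of backslashes.
ORIGDIR=$(pwd | sed 's/ /\\ /g')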
cd ~/octopress
 
URL="http://davidad.github.io/"
 
function wrs() {
    if [[ $2 = "y" ]]; then
        L1="set theWindow's active tab index to theTabIndex"
        L2="tell window 1 to make new tab with properties {URL:\"$1\"}"
    else
        L1=""
        L2=""
    fi
    cat >.reload.scpt <<EOF
delay 1.5
tell application "Google Chrome"
    
    if (count every window) = 0 then
        make new window
    end if
    
    set found to false
    set theTabIndex to -1
    repeat with theWindow in every window
        set theTabIndex to 0
        repeat with theTab in every tab of theWindow
            set theTabIndex to theTabIndex + 1
            if theTab's URL contains "$1" then
                set found to true
                exit
            end if
        end repeat
        
        if found then
            exit repeat
        end if
    end repeat
    
    if found then
        tell theTab to reload
        $L1
    else
        $L2
    end if
end tell
EOF
}
wrs 'http://localhost:4000/' y
 
 
if [[ $1 = delete ]]; then
    [[ -f $2 ]] && rm -i $2 && bundle exec rake generate && exec $0 deploy
    exit 0
elif [[ $1 = deploy ]]; then
    bundle exec rake deploy \
    && wrs $URL y && sleep 5 && osascript ./.reload.scpt \
    && rm -f ./.reload.scpt .timeref rake_preview.log \
    && git add . \
    && git commit -m "Site updated at `date -u +"%Y-%m-%d %H:%M:%S UTC"`" \
    && git push
    exit 0
fi
 
[[ -f $1 ]] && rm -f new_post.md && ln -s $1 new_post.md
[[ -f $1 ]] || bundle exec rake "new_post[$1]"
 
touch -m .timeref
ps x | egrep 'rake|rackup|jekyll|sass|compass' | grep -v grep | awk '{ print $1 }' | xargs kill
ps x | egrep 'rackup' | grep -v grep | awk '{ print $1 }' | xargs kill -9
sleep 0.15
bundle exec rake preview < /dev/zero > rake_preview.log 2>&1 &
sleep 0.3
osascript ./.reload.scpt
 
vim -c 'set tw=80' -c 'map <C-g> :w<CR>:!osascript ./.reload.scpt<CR>' \
    -c "cd $ORIGDIR" + new_post.md
VIM_STATUS=$?
[[ `readlink new_post.md` -nt .timeref ]] || VIM_STATUS=1
[ $VIM_STATUS -eq 0 ] && osascript ./.reload.scpt && exec $0 deploy && exit 0
[ $VIM_STATUS -ne 0 ] && wrs 'http://localhost:4000/' n \
    && [ -f new_post.md ] && rm -i `readlink new_post.md` \
    && git rm --ignore-unmatch new_post.md \
    && sleep 0.4 && osascript ./.reload.scpt

¹ All of the files for theming etc. are available here. I’ve spent way too much time tweaking the CSS, and fixing various peeves with the way Octopress renders – I could write an entire other blog post about that, but I probably won’t.
² Or :x. My muscle memory has been :wq for many years and I haven’t yet made a serious effort to retrain.
³ One example where we don’t want these actions is if the blog post was aborted. Then there’s no sense in tabbing back to the preview just to show that it’s gone, but if the user is looking at the preview anyway, may as well refresh it to reflect the abort.
⁴ Chrome will even restore your scroll position once the refresh is finished.
⁵ You can also pass AppleScript on osascript’s command line using the -e option, but only one line of AppleScript at a time. And since there’s no statement separator in AppleScript, we can’t easily transform an arbitrary script into a one-liner (like we could in bash, or many other more sensible languages).
⁶ Using a newly generated AppleScript which won’t cause Chrome to switch the active tab, in case the abort was related to something else having come up.
