“Ah, it’s time to order a more practical watch!”

Welcome to the latest edition of my ongoing series in Cray-related computational necromancy. This was another just-for-fun project. Over the many years Andras and I have been working on our Cray revival efforts, there has always been a big lag between our available software (and Andras’ amazing simulator) and my FPGA-compatible Verilog implementation. My free time and interests have oscillated wildly over the years, and implementing and debugging hardware is generally everything-intensive. My original cycle-accurate (but buggy) Cray-1 RTL gradually morphed first into a binary-compatible, but less-accurate Cray X-MP (which is mostly just a Cray-1 with 24-bit addresses instead of 22-bit), and finally into a binary-compatible and not-at-all cycle-accurate Cray Y-MP/C90/J90 core (which is mostly just a Cray-1 with 32-bit addresses). During the final transition, it got a full rewrite in SystemVerilog and a vast simplification in the hopes I’ll someday actually be able to boot up UNICOS on it. That hasn’t quite happened yet, but it’s far enough along that I felt it was ready to play around with and make into a ‘real’ project. What to do with this? A Cray Wristwatch, of course!

A benefit of the extremely slow pace and long duration of my Cray-related work is the ongoing march of Moore’s law in the background. Very capable FPGA boards are now cheaper and smaller, and it seemed time to give another scale model Cray Supercomputer a shot. As I mentioned, this one isn’t quite as cycle-accurate (or CPU-count accurate), but I’m still quite happy with how it turned out.

But . . . Why??

I had a tiny Cray supercomputer and a round OLED display just looking for a project, and a Cray C916 just happens to be the right shape for the supercomputer wristwatch you’ve always dreamed about. If you wanted something practical, go read someone else’s blog.

How is this different from the Cray-1?

The Cray “Parallel Vector Processor” (PVP) line that extended from the Cray-1A in 1976 all the way to the final Cray X1E in 2005 really did maintain quite a bit of compatibility. The main differences between them were the extension from 22 address bits with the original Cray-1 to 24 bits with the X-MP and 32 bits with the X-MP/EA and Y-MP machines. As Cray PVP machines are only 64-bit word addressable (bytes? what are those?), 32-bit addresses give you access to a respectable 32 gigabytes of memory.
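For anyone who wants to check the arithmetic, here is a minimal sketch of the word-addressing math. The function name and the table of machines are just for illustration, and the numbers describe the architectural address space at each width, not the memory any particular machine actually shipped with.

```python
WORD_BYTES = 8  # one 64-bit Cray word; memory is addressed in words, not bytes


def addressable_bytes(address_bits: int) -> int:
    """Bytes reachable with word addressing at the given address width."""
    return (1 << address_bits) * WORD_BYTES


# Address widths as quoted in the text above.
for name, bits in [("Cray-1", 22), ("X-MP", 24), ("X-MP/EA and Y-MP", 32)]:
    mib = addressable_bytes(bits) / 2**20
    print(f"{name}: {bits}-bit word addresses -> {mib:,.0f} MiB")
```

The 32-bit case works out to 2^32 words of 8 bytes each, which is where the 32 gigabytes comes from.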

Later machines also switched from 64-bit “Cray” floating point to IEEE double-precision 64-bit floating point and added a ‘bit matrix multiply’ functional unit, but aside from that there were architecturally few differences across ~30 years of development. So programming my Cray J90 core is pretty much identical to programming the Cray-1, ignoring the address widths (this means the actual Address Registers were extended from 24 bits on the Cray-1 to 32 bits in my J90 core).

This was a from-scratch re-write with the intent of it being both debuggable and maintainable by me though, so I changed quite a few of the micro-architectural details with those goals in mind. Vector chaining is gone, the vector registers are now implemented in a big 2R1W SRAM, and the instruction cache maps much more nicely to the available block RAMs in my FPGA. Generally cleaner, simpler, and a little slower (but I did manage to get the clock all the way up to the 105 MHz that the real J90 ran at!).

System Architecture

The heart of the system is a Digilent Cmod A7 FPGA board. The FPGA contains the Cray CPU core running at a respectable 105 MHz (actually matching the clock speed of the J90, although still fairly far from the 244 MHz of the real C90), a few KW of SRAM and a pair of DMA channels connected to a SPI interface via FIFOs. A Teensy 3.6 microcontroller serves as the ‘Front-End’ processor, and controls the reset signals and the SPI interface going to the J90. It also drives a neato circular OLED display I picked up (the previous generation of this one, I think). In practice, the Teensy dead-starts the J90 by initializing its memory via SPI and taking it out of reset, and then it just pulls frames of data from the J90 DMA engine, converts them to an appropriately formatted image and dumps them to the display.

Software

My work with supercomputers both modern and historical has made me quite fond of N-body gravity simulations – to show off the vector prowess of my J90 core, my wristwatch runs a full N-body simulation of Jupiter and 63 of its moons. That’s right! Each of the 64 bodies gravitationally interacts with all 63 other bodies. Remarkably, through effective use of vector registers, the whole simulation requires only 127 carefully-chosen instructions, and is able to sustain 40 MFLOPS!

As I still don’t really have a proper development environment, I first wrote my program in Python using some simple primitives mimicking the Cray vector instructions. I then modified that to spit out some simple machine code, and ran it on my RTL simulator to verify that nothing blew up numerically. At that point I was able to boot it on the real FPGA board and run it for a few hours to verify the stability.
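To give a flavor of that approach, here is a rough sketch of what a Python prototype built on vector-style primitives could look like. It is not the actual program: the helper names, the 2D simplification, the softening term `eps2` and the semi-implicit Euler integrator are all assumptions made for illustration, and the real 127-instruction version may be organized quite differently.

```python
import math

VL = 64  # elements per Cray vector register; Jupiter plus 63 moons fills one exactly

# Hypothetical stand-ins for vector instructions; the names are illustrative only.
def v_add(a, b):
    return [p + q for p, q in zip(a, b)]

def v_mul(a, b):
    return [p * q for p, q in zip(a, b)]

def v_sub_s(v, s):
    """Vector minus scalar, element by element."""
    return [e - s for e in v]

def v_add_s(v, s):
    """Vector plus scalar, element by element."""
    return [e + s for e in v]

def v_inv_r3(r2):
    """r2 ** -1.5 per element (plain Python math, not a Cray reciprocal sequence)."""
    return [1.0 / (e * math.sqrt(e)) for e in r2]

def v_sum(v):
    """Reduce a vector to a scalar."""
    return sum(v)

def accelerations(x, y, gm, eps2):
    """Acceleration on every body from all bodies; gm holds G * mass per body.

    The small softening term eps2 (an assumption) keeps the self-interaction
    finite; its contribution is zero because dx and dy are zero for j == i.
    """
    ax, ay = [], []
    for i in range(len(x)):
        dx = v_sub_s(x, x[i])
        dy = v_sub_s(y, y[i])
        r2 = v_add_s(v_add(v_mul(dx, dx), v_mul(dy, dy)), eps2)
        w = v_mul(gm, v_inv_r3(r2))
        ax.append(v_sum(v_mul(w, dx)))
        ay.append(v_sum(v_mul(w, dy)))
    return ax, ay

def step(x, y, vx, vy, gm, dt, eps2=1e-12):
    """One semi-implicit Euler step: update velocities first, then positions."""
    ax, ay = accelerations(x, y, gm, eps2)
    vx = [v + a * dt for v, a in zip(vx, ax)]
    vy = [v + a * dt for v, a in zip(vy, ay)]
    x = [p + v * dt for p, v in zip(x, vx)]
    y = [p + v * dt for p, v in zip(y, vy)]
    return x, y, vx, vy

if __name__ == "__main__":
    # Toy system in made-up units: a heavy body and one massless moon on a circular orbit.
    x, y, vx, vy = [0.0, 1.0], [0.0, 0.0], [0.0, 0.0], [0.0, 1.0]
    gm = [1.0, 0.0]
    for _ in range(100):
        x, y, vx, vy = step(x, y, vx, vy, gm, dt=1e-3)
    print("moon position after 100 steps:", x[1], y[1])
```

Each pass of the inner loop works on whole vectors of up to `VL` elements, which is the shape of computation that maps naturally onto 64-element vector registers.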

But how do you tell the time?

With great difficulty. The display shows a free-running simulation of Jupiter and 63 of its moons. For convenience, I just plot the X/Y coordinates of each moon in the ecliptic plane. The ephemerides come from the HORIZONS server that NASA operates, at a specified date and time. The J90 just dumps a new frame whenever the Teensy has pulled the previous one, so with a teensy (ha!) bit of calibration on the microcontroller side, it would be pretty easy to have the frames dumped in ‘real time’, which, knowing the starting time and date, would allow you to not-at-all-easily infer the current time by looking at the positions of Jupiter’s moons. Which is exactly what I was going for with my Supercomputer Wristwatch – using it should be as incomprehensible as my motivation for creating it in the first place!
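As a rough illustration of the two ideas in that paragraph, here is a small sketch that maps an X/Y position onto a round display and converts a frame count back into a date and time. The function names, the `size` and `max_radius` parameters, and the fixed `seconds_per_frame` are all assumptions for the example; the Teensy code presumably does something equivalent in its own way.

```python
import math
from datetime import datetime, timedelta


def to_pixel(x, y, max_radius, size):
    """Map a simulation X/Y position to (col, row) on a size-by-size round display.

    max_radius is the orbital distance (same units as x and y) that should land
    on the edge of the circle. Returns None for anything outside the circle.
    """
    if math.hypot(x, y) > max_radius:
        return None
    half = (size - 1) / 2
    col = round(half + x / max_radius * half)
    row = round(half - y / max_radius * half)  # screen rows grow downward
    return col, row


def simulated_time(epoch, frame, seconds_per_frame):
    """Date and time represented by a frame, given the ephemeris start epoch."""
    return epoch + timedelta(seconds=frame * seconds_per_frame)


if __name__ == "__main__":
    # Made-up numbers purely for demonstration.
    print(to_pixel(0.5, 0.25, max_radius=1.0, size=129))
    print(simulated_time(datetime(2020, 1, 1), frame=90, seconds_per_frame=60))
```

Reading the watch is the inverse problem: you look at the moons, work out which frame you are seeing, and add that much simulated time to the starting epoch. Good luck.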

Why a C90?

A 25:1 Replica of my wristwatch

Although it does maintain binary compatibility, the real Cray Y-MP C916 was a 244 MHz, 16-CPU monster – my 1/25-scale version is a mere single-CPU running at 105 MHz (and lacks other performance enhancements like dual-issue vector pipelines) . . . this model was mostly chosen because of the convenient circular dome in the middle that could hold my nifty OLED watch face.

I designed the model and built it with a 3D printer, carefully trimmed my circuit boards to fit, mounted the display, and still had enough room for a battery left over.

The Final Product

I love how this project turned out – it’s adorable, it’s programmable, and it pushes the boundaries of uselessness and complexity. The final case can accommodate a NATO-style wrist strap and has a built-in battery charger, but in practice it makes a better desk novelty than a watch.

The final (very trimmed down) board
A densely packed C90.
The final product