Coincidence, Colussus and Fish
The BBC reports that a rebuilt Colossus is cracking codes again at Bletchley Park.
COLOSSUS was the world's first programmable electronic computer, so this is considered something of an achievement. However, COLOSSUS was really only built to do one job.
Read on for details.
Message integrity
The BBC article notes:Speaking to the BBC, Andy Clark, one of the founders of the Trust for the National Museum of Computing, said radio problems had stopped the challenge getting under way on time. "The radio path has not been particularly good between Germany and here," he said. "We are at a bad point in the sunspot cycle."This highlights one of the trickier problems in information theory: ensuring that a message gets from Alice to Bob without loss. In the digital world, we use redundancy coding: Sending as small an amount of redundant information as necessary to at least detect message corruption (e.g. checksums, hashes, digests). If Bob finds an error in Alice's message, he can send an automatic repeat request (ARQ), and Alice can repeat the message. Some fancier schemes attempt to correct the error. This is useful in situations where ARQs are infeasable or otherwise undesirable. One example is real-time signals (e.g. mobile phones), where a retransmission would come too late. Another is stored media, such as CDs and DVDs. An exotic application is deep space probes: The last telemetry signal from Pioneer 10, for example, was at about 80AU. A round-trip signal would therefore take 21 hours or so, and that's assuming that the probe could even hear you at that distance. So if you're ever designing spacecraft, you really want to avoid ARQs if you can. If it's not important that you get all of the information just so long as you get enough of it, then another option is to use synchronising codes. If a symbol does not depend on the symbol before it, then you can drop symbols (and timing can tell you that they've been dropped), and you'll get at least some of the message. So while you may not (deleted) everything, the (deleted) can be often be (deleted) from the context.
Old-School Cryptography
The same thinking applied to cryptosystems, in the era of World War II. As the BBC article indicates, radio was a notoriously unreliable transmission medium. It was therefore very important that some kind of protocol was established for messages that didn't entirely make it. And this was before portable automatic computation devices, so it was only operators who could tell if a message made it or not. If a message was broadcast (single transmitter, multiple receivers), then one sensible approach was retransmission: send the message more than once, to make sure that everyone got it. If it was point-to-point (Alice to Bob), then ARQs were possible. However, cryptographers also designed ciphers in such a way that if certain letters were corrupted, only that letter would be corrupt in the decoded message. Sijgle lettzr errqrs are anpoying, bwt mesxages ara ofken stivl legidle. There's also a deep mathematical sense in which more true of German than English. But we'll get to that in a moment. The most sophisticated World War II-era ciphers were polyalphabetic stream ciphers. Essentially, you had a pseudo-random stream of substitution alphabets, which didn't depend on the message to be encoded in any way. You then used the substitution alphabets to encode the message. There were some messy details related to key negotiation (which is still messy, even in modern cryptography), but that's the general idea. Probably the most famous example is ENIGMA. It used stepping rotors, reflectors and plugs to form what was, essentially, a simple substitution cipher, where the substitution alphabet changed for every letter. The sequence of alphabets did not depend on the message, so if a letter was misheard, only that letter would be corrupt in the final message. These stream ciphers suffer from one gaping security hole: Reusing a key sequence for two different messages breaks the system wide open.Coincidence Counting
OK, now for some maths. Suppose, for simplicity, that you have a 26-symbol alphabet, [tex dpi=72]\{ A \ldots Z \}[/tex]. Suppose that the probability of symbol [tex dpi=72]Q[/tex] appearing is fixed, and we'll call it [tex dpi=72]p_Q[/tex]. Naturally, the probabilities sum to one: [tex]\sum_{x=A}^{Z} p_x = 1[/tex] Consider the following problem: What is the probability that any two symbols chosen at random are the same? A moment's thought shows that it's just this: [tex]\kappa = \sum_{x=A}^{Z} {p_x}^2[/tex] We call this quantity the index of coincidence (or IC) for the probability distribution. You can estimate the IC for a distribution by looking at a stream of symbols that follows the distribution. Consider, for example, the following text:YOUCA NESTI MATET HEICF ORADI STRIB UTION BYLOO KINGA TASTR EAMOF SYMBO LSTHA TFOLL OWSTH EDIST RIBUT IONCO NSIDE RFORE XAMPL ETHEF OLLOW INGTE XTThe message contains 8 A's, 4 B's, 3 C's and so on. Denote the number of instances of A as [tex dpi=72]n_A[/tex], and the length of the message as [tex dpi=72]n[/tex]. The number of pairs of symbols in the message is given by [tex dpi=72]n(n-1)/2[/tex]. Similarly, the number of pairs of A's is [tex dpi=72]n_A(n_A-1)/2[/tex]. It follows that the IC for the message as a whole is: [tex]\kappa_m = \frac{\sum_{x=A}^{Z} n_x (n_x - 1)}{n (n-1)}[/tex] Which comes out at approximately 6% for this message. The reason why the IC is interesting for cryptographers is that it is invariant under a number of common transformations. If you apply a simple substitution cipher to a message, for example, or permute the symbols arbitrarily, the measured IC of the message is the same. It's also approximately the same if you take a random sample of the symbols from the message. Moreover, the IC is approximately constant for a given language. English messages have an IC of around 6.6%, and German messages have an IC of around 7.8%. Compare this with the IC for a 26-symbol alphabet where the probabilities are equal (i.e. random noise), which is 1/26 or 3.8%. Friedman's Law states that for natural Roman alphabet-based languages, the ratio of the IC for the language to the IC for a flat frequency distribution is approximately 2. (For English it's 1.73, and for German it's 2.05.) One particularly interesting property of this is that you can often tell if you've "almost" cracked a cipher by looking at the IC. If it's close to 3.8%, you're not there yet. If it's close to 6.6%, then you might still need to undo a permutation and substitution, but you know the message is probably in English. Indeed, it's sometimes easier to tell what language some cipher is written in than to decode the message. Stream ciphers have the property that given the same position in the same stream, two plaintext symbols are encrypted to the same ciphertext symbol. It follows that if you take two different messages, written in the same language, encrypted using the same key sequence, then the proportion of symbols which match at the same positions should be roughly the same as the IC for the language. Here's an example. Two messages encrypted using the same stream cipher using the same key sequence. Message 1:
MUAOQ OPOAY GQTSS YDJEL QQNLS ZPZVS XGTOI YLCYN BKBKU CJNAT IQXPE LMSKM KPRJF GDWBA WCRYZand message 2:
XYPND HJMSM MGKSP NZYTC WDUUW VTGVV XRJSK FSYJR BNFUB IQGBH OCPIV MIUBD QHOSZ JTEFF HFUJXThe measured IC of message 1 is 3.3%, and of message 2 is 3.4%: both close to what you'd expect from a flat distribution. But if you line them up against each other, and look for places where the ciphertext letters are identical:
MUAOQOPOAYGQTSSYDJELQQNLSZPZVSXGTOIYLCYNBKBKUCJNATIQXPELMSKMKPRJFGDWBAWCRYZ
XYPNDHJMSMMGKSPNZYTCWDUUWVTGVVXRJSKFSYJRBNFUBIQGBHOCPIVMIUBDQHOSZJTEFFHFUJX
^ ^ ^ ^
The messages are 75 symbols long, and there are 4 coincidences. That gives an estimated IC of 4/75 = 5.3%. That's not a great result because the message is so short, but it's certainly closer to a natural language than to random text, so this is good evidence that the two messages were encrypted with the same stream cipher.
If you're curious, I encrypted them with an ENIGMA emulator. In the very early 1930s, mathematicians at the Polish Cipher Bureau noticed that if you took two German ENIGMA messages, intercepted on the same day, which began with the same six letters, the IC of the rest of the messages (i.e. after the first six letters) was very close to that of German. They concluded that the first six letters specified the key, and the rest was a polyalphabetic stream cipher. This was the first break in ENIGMA. But that's another story.
We're more interested in TUNNY.
TUNNY
While the German millitary used ENIGMA for field traffic (e.g. to and from U-boats), high-level traffic (e.g. between command centres) was typically sent by teleprinter. The German teleprinters used a 5-bit binary code called Baudot code (which was also famously used on the cover of a Coldplay album, X&Y). Interestingly, modern PC serial ports can still be configured to handle 5-bit teleprinter-compatible messages, if you ever feel the need to communicate in Baudot. The message was encrypted with two machines: The Siemens und Halske T52, and the Lorenz SZ 40 (later replaced with the SZ 42). The Bletchley Park operators referred to the teleprinter traffic as "FISH" and the Lorenz-encrypted messages in particular as "TUNNY". In August 1941, a German operator made a serious error. A 4000-character message was sent from Vienna to Athens. The recipient sent an unencrypted ARQ (this let the eavesdroppers knew what was going on). The first operator re-sent the message using the same key sequence (which was absolutely forbidden, but he did it), but instead of starting the message with "SPRUCHNUMMER" ("message number"), he keyed in "SPRUCHNR" (there were other differences; the retransmission contained less word spacing and punctuation than the first). The cryptanalysts now had two almost identical messages which they knew were encrypted with the same key sequence. From this break alone, John Tiltman managed to separate the message from the key stream, and W.T. Tutte managed to reconstruct the entire machine.HEATH ROBINSON and COLUSSUS
Tutte noticed a pecuiliarity of the Lorenz machines, that taking the exclusive or of consecutive symbols left statistical regularities in the bit patterns, which could be discovered using the IC on the bit stream. The prototype machines used to compute this were called HEATH ROBINSON, after the British cartoonist, and the final machines were called COLOSSUS. [As an aside, W. Heath Robinson's cartoons of machinery are as famous in Britian as Rube Goldberg's are in the US, and engineers use the names to refer to any machine design which is bad in an amusing way. The main difference is that a Rube Goldberg machine is unnecessarily complex, whereas a Heath Robinson machine is cobbled together from junk.] Bletchley Park received TUNNY intercepts punched onto paper tape, which was formed into a loop. Loops of "key tape" were also pre-prepared, which emulated some aspect of the Lorenz key streams. The two tapes were then run against each other, and COLOSSUS did an IC measurement to determine how alike the message was compared with the key tape. The key tape was then "stepped" by one position compared to the message, and the process repeated. The position with the greatest IC was then determined to be the likely wheel setting corresponding to that key tape. You can think of it by analogy to phase-locked loop. You want to find where in the key stream the message falls, so you vary the relative phase of the message and the key stream until you find the most agreement (i.e. the greatest IC). At that point, you've locked onto the phase, and you can decrypt the message.The challenge
COLUSSUS was designed, as I mentioned, to do one job. It was designed to do its computations in an embarrassingly parallel manner, and as a result, it was extremely fast, even by modern standards. With the tape moving at full speed, it could handle about 25kbps of I/O (which is only about half the speed of an ISDN line), and process it at the same rate. That's an amazing achievement, when you think about it, especially when you consider that it was really only the third genuinely new electrical computer in the world. The closest that Germany had at the time, the Zuse Z3, only had a clock speed of 5-10Hz. But then, the Z3 was Turing-complete. As such, it will be interesting to see who wins the challenge: the rebuilt Colossus or the team using modern PCs. My money is on the PC team, but unusually for a first-generation computer, it's a fair fight.0 TrackBacks
Listed below are links to blogs that reference this entry: Coincidence, Colussus and Fish.
TrackBack URL for this entry: http://andrew.bromage.org/mt/mt-tb.cgi/89

Leave a comment