Enter the maze

Byzantine birthdays

You may not think of computers as argumentative, but some of them do actually bicker quite a lot, and for good reason. Many hi-tech systems we depend on rely on the right computers getting their way, and the problems involved are surprisingly tricky to get right. Barbara Liskov’s contributions to this fiendishly difficult problem of making sure the right computers win their arguments helped her scoop the world’s top computing prize in 2009.

Two violins embracing

The ACM Turing Award, which Barbara won, is the computing equivalent of a Nobel Prize. She was awarded it for more than 40 years' work. Her early work on programming languages has been used in every significant new language that has been invented since. More recently she has moved on to problems about 'distributed systems': computer systems involving lots of separate computers that have to work together. That's where the arguing comes in.

Barbara has been working on an area of distributed computing called 'Byzantine fault tolerance'. It's all about providing a good service despite the fact that just about anything can go wrong to the computers or the network connecting them. It's so important because those computers could be doing anything from running an e- commerce site over the Internet to keeping an airliner in the air.

Happy birthday

Here's a thought experiment to show you how tricky distributed computing problems can be. Alice is a cellist in an orchestra and it turns out that the birthday of the conductor, Cassie, is on the day they are playing a concert. Jo, the conductor's partner, has asked Alice to arrange for the orchestra to play Happy Birthday mid-concert as a surprise. The rest of the orchestra were enthusiastic. The only trouble is they didn't have the chance to agree which break to do it in. In fact, no one's sure they're definitely doing it at all, and now they are now all sitting in their places ready to start.

For it to work they all have to start precisely together or it will just sound horrible. If anyone starts on their own they will just look and sound silly. Can it be done, or shouldthey all just forget the whole thing? ou may not think of computers as argumentative, but some of them do actually bicker quite a lot, and for good reason. Many hi-tech systems we depend on rely on the right computers getting their way, and the problems involved are surprisingly tricky to get right. Barbara Liskov's contributions to this fiendishly difficult problem of making sure the right computers win their arguments helped her scoop the world's top computing prize in 2009.

Alice decides to go ahead. She can tell the others it's on by whispering to those next to her and telling them to pass the message on. As long as her message gets to enough people across the different sections it will be ok. Won't it?

Actually no.

The problem is: how will she know enough people did get the message? It has to be passed when people aren't playing, so some could presumably not get it in time. How will she know? If the whispers didn't get through and she starts to play, she will be the embarrassed one.

Are you in?

Marbles forming a star network

That seems an easy thing to solve though - when each person gets the message they just send one back saying, "I'm in". If she gets enough back, she knows it's on, doesn't she? Ahh! There the problem lies. She knows, but no one else can be sure she knows. If any are in doubt they won't play and it will still go horribly wrong. How does she let everyone know that enough people are willing to go for it? Alice is in the same situation she was at the start! She doesn't know it will happen for sure and neither does anyone else. She can start whispering messages again saying that enough people have agreed but that won't help in the end either. How does she know all the new messages get through?

Change the problem

A computer scientist might have a solution for Alice - change the problem. Following this advice, she starts by whispering to everyone that she will stand up and conduct at an appointed time. Are they in? Now all she needs to be sure of is that when she stands up, enough people have agreed to play so that she won't look silly. The others send back their message saying 'I'm in', but no one else needs to know in advance whether the song is definitely going ahead. If she doesn't stand up they won't play. If she does, they go ahead.

Delivering a good service

Byzantine fault tolerance is about designing this kind of system: one that involves lots of 'agents' (people or computers) that have to come to an agreement about what they know and will do. The aim is for them to work together to deliver a service. That service might be for an orchestra to play Happy Birthday, but is more likely to be something like taking airline bookings over the Internet, or even deciding on action to take to keep the airliner they are flying in the air. The separate computers have to agree as a group even when some could stop working, make mistakes due to software bugs or even behave maliciously due to a virus at any point. Can a way be engineered that allows the system as a whole to still behave correctly and deliver that service? This is the problem Barbara Liskov has been working on recently with Miguel Castro at MIT.

It's called Byzantine Fault Tolerance after some imaginary generals in the ancient days of the Eastern Roman Empire, whose capital was Byzantium. The original problem was about how two generals could know to attack a city at the same time.

Of course they weren't thinking of birthdays and orchestras. They were interested in providing the service of looking after documents so they can be accessed anytime, anywhere. A simple computer does this with a file system. It keeps track of the names of your documents and where it has stored them in its memory. When you open a document it uses the records it has kept to go and fetch it from wherever it was put. With this kind of file system, though, if something goes wrong with your machine you could lose your documents forever.

Spread it around

A way to guard against this is to create a file system that distributes copies of each file to different computers around the Internet. When you make changes, those changes are also sent to all the other computers with copies. Then if the copy on one machine is corrupted, perhaps by a hacker or just by a disk crash, the file system as a whole can still give you the correct document. That is where the computers need to start arguing. When you ask for your document back how do all the computers with (possibly different) copies decide which is the correct, uncorrupted version? That sounds easy, but as the orchestra example shows, as soon as you create a situation where the different agents (whether human or computer) are distributed, and worse you can't trust anyone or anything for sure, there are lots of subtle ways it can all go horribly wrong.

Laptops connected wirelessly

The way Barbara and Miguel's solution to this problem works is similar to what Alice was doing. One computer acts as what is called the 'primary' (Alice played this role). It is where the request from the client (Jo) goes. The primary sends out a request to messages. All the backups reply with their copy of the document. As soon as more than some predetermined number come back with the same document, then that is the good copy.

Not so simple

Of course the detail of Barbara and Miguel's method is a lot trickier than that. They've had to figure out how to cope if something goes wrong with the primary (Alice herself) to ensure that the client still gets their document. Their version also works without any synchronisation to make things happen in lockstep (suppose Alice is at the back so can't stand up and conduct to keep things in time). There are lots of other details in Barbara and Miguel's version too. Messages are time- stamped, for example, so that the recipients can tell if a message is new or just a copy of an old one.

Practically perfect

The particularly special thing about Barbara and Miguel's way of providing fault tolerance, though, is that it doesn't take an excessively long time. Various people had come up with solutions before, but they were so slow no one could really use them. The new method is so fast it's almost as if you weren't bothering about fault tolerance at all. Better still, the fact that it doesn't need synchronisation - no conducting - means it also works when the replicated services are on the Internet where the computers act independently, and there is no way to keep them in lockstep.

Barbara's work might never actually help with an orchestral surprise like Alice's, However, because of it future computers will be in a better position to shrug off their own kind turning rogue due to hackers, cosmic rays or even just plain old breakdowns. Not a bad reason to have a byzantine argument.