You accidentally shredded an important document. Could you put it back together? Would it be worth the effort? What if it would stop a terrorist plot, or release an innocent person from jail?
Intelligence agencies around the world face questions like these, most famously when the East German secret police abandoned tons of torn pages after the fall of the Berlin Wall in late 1989. Those documents are still being reconstructed, and the creation of a computer program called the e-Puzzler in 2010 increased hopes of finishing the job by 2013. But that program was written to reconstruct those particular documents. It is optimized to handle hand-torn pieces as the East German secret police were forced to manually destroy many of the documents.
To seek “unexpected advances” that could be applied to such situations, the U.S. Defense Advanced Research Projects Agency (DARPA) issued its Shredder Challenge in October 2011, daring the public to unshred five documents and extract their hidden messages. The puzzles were designed to be progressively difficult on multiple axes, ranging from about 200 chads to more than 6,000 each. Text files accompanying the scanned chads provided questions, with points reflecting each question’s difficulty. For example, puzzle #3 asked “What is the indicated location?” while the reassembled document showed a set of geographic coordinates and a drawing of Cuba. (Naming the country was worth two points; the city of Cienfuegos was worth an additional six.) Solvers needed to both answer the questions and show how their reconstruction of the document led to that answer. More than 9,000 teams applied for the challenge, but only 69 of them answered one or more questions correctly.
The winning team was the only one to answer all of the questions in all five puzzles, taking home the $50,000 prize while ending the contest on Dec. 2, 2011, three days ahead of schedule. Like other contest leaders, the eight-person “All Your Shreds Are Belong To U.S.” team employed original computer algorithms to narrow the search space, and then relied on human observation to move the pieces into their final positions. In the process the All Your Shreds team discovered unexpected difficulties—and also a peculiarity in the contest that gave it an unexpected edge.
Play to Win
The contest’s parameters had a strong effect on how competitors approached the problem. As Don Engel of the two-person, second-place “Schroddon” team says, “If we had taken a completely manual approach, we’d lose because teams can be of any size. If we’d taken only a computer-driven approach, we’d also lose because there are people out there who are better at computer vision.” In fact, a team from University of California, San Diego attempted to crowdsource the challenge, setting it up as a sort of game and inviting the public to help them via the Internet. Saboteurs foiled the team’s strategy, accusing it of “cheating” and moving shreds to incorrect positions. The team responded by implementing security measures, which also slowed its progress, knocking it from third to sixth place.
The winning All Your Shreds team took advantage of the fact the shredded documents were apparently photocopied before being placed face-up on a solid background and scanned, resulting in a phenomenon known as printer steganography. “High-quality printers and photocopiers, in the United States at least, have a watermark on them so the Secret Service can track phony money,” explains team member Otávio Good, founder of Quest Visual. “It’s a repeating pattern of yellow dots. If you can find it, you can use that to your advantage. It’s essentially like a map of how the pieces go together.” However, Good pointed out they were “yellow dots on yellow paper, and the chads were small. It helped us tremendously, but we were pretty much in the lead even before we found the dots. In my opinion, we were still on track to win… it was part of the game.”
“The challenge and its component puzzles were designed to resemble the problem facing an intelligence analyst,” says Norman Whitaker, deputy director of DARPA’s Information Innovation Office. “That analyst is often less interested in putting the pieces together than in getting crucial information off the page in the shortest time possible.” In explaining why the challenge could only be a small representation of what is found in the field, he says, “[It’s] narrow in scope to make it tractable. It concerned English, handwritten documents shredded with a commercial shredder. Additional research and development would be required to apply this work to actual confiscated documents.” Even then, he defined the problem of document reconstruction as “just one facet of information assurance, an aspect that is often overlooked.”
Other contest features played unexpected roles in defining the winning solutions. The Schroddon team found the shredders sometimes sheared the paper in its depth. As team member Marianne Engel notes, “The computer looked at [the edges of such chads] as though they were blank areas, but the human eye could see that the chads were just torn.” In addition, the five puzzles varied in the type of shredder used; the number and size of pages; the documents’ content; and whether the pages were on lined, unlined, or graph paper. As a result, says Good, “we essentially hard-coded our program to work for each individual puzzle. We didn’t create a general ‘unscramble the puzzle’ codebase.”
One method that helped Good’s team was the use of a scoring system that improved its algorithms. “As we adjusted our algorithms, we could see whether they were working,” says Good. “At that point the efficiency of our recommendation algorithm went up by a factor of 10 within a few hours. So although each puzzle started out difficult, it got easier as we went.”
The Human Element
One fact emerged early on in the contest: The puzzles would not be solved by computers alone. Matthias Prandtstetter, a researcher at the Austria Institute of Technology who investigated document reconstruction while at the Vienna University of Technology, acknowledged that much of the scholarly literature on the subject stops where human interaction takes over, thereby failing to address certain real-world issues. For example, “[computer algorithms] might be able to reconstruct a document that’s in multiple columns or a table, but not have them in the right order,” he says, while a human could finish the job by understanding each portion in context. Prandtstetter suggests language-recognition algorithms could extend those for reconstruction, but believes the pairing of these technologies is in its infancy.
Specialized knowledge also came into play, for example, as Don Engel of the second-place Schroddon team found he was able to apply his master’s work in statistical natural language processing to the task. “One mind-set I took from my studies was that of conditional probability: When you have a few characters, what are the chances that the next one with be a specific character?” says Don Engel. “So as we were assembling shreds, I grepped the dictionary file on my computer to find words with those characters in sequence. Also, for puzzle #5 the question was, ‘What are the three translated messages?’ That implied that they were either in some foreign language or encoded. I saw a very large number of the letter ‘d’ and thought, ‘That’s way more than you’d see in English.’ The same was true with certain pairs of letters, such as ‘di’ and ‘ah’ and ‘it.’ As it turned out, the answer to puzzle #5 was in Morse code, written as ‘dit dit dah.'”
The problem of document reconstruction is an often-overlooked aspect of information assurance, says DARPA’s Norman Whitaker.
Part of the All Your Shreds solution involved team members remotely reviewing the computer’s piece-pair recommendations via the Internet, then manually making selections and placing pieces. Knowing they would ultimately spend hundreds of hours in this pursuit, its programmers created a custom client to make the interface as fast as possible. “We wanted things to be very user-friendly because the human was a big element,” says Good. “When we clicked on a recommendation, we wanted that to happen in real-time—maybe one-tenth of a second on a fast computer. We set up a node.js server to allow team members to place puzzle pieces over the Internet, even while the programmers worked on algorithms. There were about eight bottlenecks in the program, but because we wrote the client in C#, we were able to use C#’s parallel libraries to get past them by parallelizing across all our processors.”
Indeed, the subject of document shredding is one that measures human value against human trouble. We shred documents when we believe their value is greater than the reconstruction cost. As long as that cost involves human intervention—a point the Shredder Challenge did not disprove—that equation still holds. As Don Engel says, “If you look at the last 6,000-piece puzzle, that’s 6,000 times 6,000 possible pairs, which is just intractable for a human. So we needed a computer to filter for pairs that were plausible. But then, we could use our human brains to determine which ones made sense.”
Communications of the ACM