Google’s first public foray into augmented reality began with an argument in a bar. It was 2008, and David Petrou, a longtime Google engineer, was sitting in a Brooklyn watering hole explaining to his friends how someday, you’d be able to do a search just by pointing your phone’s camera at something. He likened it to pointing and asking, what’s that? It would be faster and richer than typing, and could help you ask questions you’d never be able to put into words. Based on what he’d seen within Google, Petrou thought the tech could already work. His friends, of course, said he was crazy. They thought computer vision was science fiction.
Petrou left the bar early and angry, went home, and started coding. Despite having no background in computer vision and a day job working on Google’s Bigtable database system, Petrou taught himself Java so he could write an Android app and immersed himself in Google’s latest vision research. After a month of feverish hacking, Petrou had the very first prototype of what would soon become Google Goggles.
Petrou still has a video of an early demo. He’s crammed into a Google conference room with Ted Power, a UX designer, talking into a webcam. Before he starts, Petrou explains what he’s working on. “The idea is generic image annotation, where an image can come in to Google and a number of back-ends can annotate that image with some interesting features.” Crystal clear, right?
To explain, Petrou grabs a G1, Google’s then-new Android phone, and takes a photo of a newspaper article about Congressional oversight of ExxonMobil. A moment later, the phone spits back all the article’s text, rendered in white on a black background. It looks like a DOS prompt, not a smartphone app, but it works impressively, except near the end, when it spells the company Em<onMobile. A few minutes later, Petrou shows a terribly lit photo of Power’s desk, littered with books and cables, with a MacBook in the center. The app surveys the image and returns 10 terms to describe it. Some make sense, like “room” and “interior.” Others, like “Nokia,” less so. Two terms particularly excite Petrou: “laptop” and “MacBook.” They show this camera can see objects, and understand them. But still, almost immediately after, Petrou preaches caution: “We are a long way to providing perfect results,” he says into the webcam.
The first versions of Goggles couldn’t do much, and couldn’t do it very well. But searching the web just by taking a photo still felt like magic. Over the next few years, Google Goggles would capture the imaginations of Google executives and users alike. Before Apple built ARKit and Microsoft made HoloLens, before anyone else began to publicly explore the possibilities of augmented reality, Goggles provided a crucial early example of how smartphones could interact with the real world.
Then Goggles died. The first great experiment in smartphone AR came and went before anyone else could even copy it.
Robin Williams used to joke that the Irish discovered civilization, then had a Guinness and forgot where they left it. So it was with Google and smartphone cameras. Nearly a decade ago, Google engineers were working on ideas that you’ll now find in Snapchat, Facebook, the iPhone X, and elsewhere. As the tech industry moves towards the camera-first future, in which people talk, play, and work through the lens of their smartphone, Google’s now circling back, tapping those same ideas and trying to finish what it started. This time it’s hoping it’s not too late.
Seeing Is Believing
When Petrou first started working on Goggles, he had no idea how many other Googlers were working on the same stuff, and how long they’d been at it. In 2006, Google had acquired a Santa Monica-based company called Neven Vision, which possessed some of the most advanced computer-vision tools anywhere. Google had a particular idea for where to deploy the technology: its Picasa photo-sharing app. “It could be as simple as detecting whether or not a photo contains a person, or, one day, as complex as recognizing people, places, and objects,” Adrian Graham, Picasa’s product manager, wrote in a blog post announcing the acquisition. “This technology just may make it a lot easier for you to organize and find the photos you care about.”
After a couple of years, as Neven Vision’s tech integrated further into Picasa, founder Hartmut Neven and his team started to think a little bigger. “We were all inspired by the Terminator movie, when he walks into the bar and everything gets identified,” says Shailesh Nalawadi, a former product manager on the team and now CEO at Mavin, Inc. “We thought, ‘Hey, wouldn’t it be amazing if you could have something like that, match it against a database, and it would tell you what’s in that picture?'”
Eventually the Neven Vision crew met Petrou, and they started working on a better prototype. They built an app that could identify book covers, album art, paintings, landmarks, and lots of other well-known images. You’d take a picture, and after 20 seconds or so of uploading and processing, the app would return search results for whatever you were looking at. It was primitive, but it worked.
Lots of projects within Google start the same way: one person builds something, shows it around, generates enough excitement to get a few more people interested, and they contribute resources to build it out further. For the Goggles team, that happened easily. Almost everyone who saw the app walked away amazed by it. Two execs in particular became high-level champions of the idea: Vic Gundotra, a vice president of engineering, and Alan Eustace, a senior vice president of knowledge. They brought resources, energy, and ambition to Goggles. Googlers started to talk about how great it would be when the app was universal, when it could recognize anything and everything. “Everyone at Google understood that this was possible, this was familiar, and yet transformative,” Nalawadi remembers. “That we were on the cusp of this thing, and it could be done.” He likens it to self-driving cars: wild and futuristic, but also completely natural. Why shouldn’t you be able to point your phone at something and ask, what’s that? It felt inherently Google-y.