Source: http://www.wired.com/science/discoveries/magazine/16-07/pb_theory
This article was published in Wired magazine four years ago by Chris Anderson, the magazine's editor-in-chief, and it has been fairly influential. Anderson argued that scientific models are no longer necessary, that "correlation is enough", and that in the context of massive amounts of data, "they [Google] don't have to settle for models at all".
"All models are
wrong, but some are useful." So proclaimed statistician George Box 30 years
ago, and he was right. But what choice did we have? Only models, from
cosmological equations to theories of human behavior, seemed to be able to
consistently, if imperfectly, explain the world around us. Until now. Today
companies like Google, which have grown up in an era of massively abundant
data, don't have to settle for wrong models. Indeed, they don't have to settle
for models at all.
Sixty years ago, digital computers made
information readable. Twenty years ago, the Internet made it reachable. Ten
years ago, the first search engine crawlers made it a single database. Now
Google and like-minded companies are sifting through the most measured age in
history, treating this massive corpus as a laboratory of the human condition.
They are the children of the Petabyte Age.
The Petabyte Age is different because more is
different. Kilobytes were stored on floppy disks. Megabytes were stored on hard
disks. Terabytes were stored in disk arrays. Petabytes are stored in the cloud.
As we moved along that progression, we went from the folder analogy to the file
cabinet analogy to the library analogy to — well, at petabytes we ran out of
organizational analogies.
At the petabyte scale, information is not a
matter of simple three- and four-dimensional taxonomy and order but of
dimensionally agnostic statistics. It calls for an entirely different approach,
one that requires us to lose the tether of data as something that can be
visualized in its totality. It forces us to view data mathematically first and
establish a context for it later. For instance, Google conquered the
advertising world with nothing more than applied mathematics. It didn't pretend
to know anything about the culture and conventions of advertising — it just assumed
that better data, with better analytical tools, would win the day. And Google
was right.
Google's founding philosophy is that we don't
know why this page is better than that one: If the statistics of incoming links
say it is, that's good enough. No semantic or causal analysis is required.
That's why Google can translate languages without actually "knowing"
them (given equal corpus data, Google can translate Klingon into Farsi as
easily as it can translate French into German). And why it can match ads to
content without any knowledge or assumptions about the ads or the content.
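To make the "statistics of incoming links" idea concrete, here is a minimal Python sketch that ranks pages purely by link structure, with no semantic analysis of their content. The link graph and page names are invented for illustration, and the damped iteration is only a crude PageRank-style stand-in, not Google's actual algorithm.

```python
# Minimal sketch: rank pages by the statistics of incoming links alone,
# with no semantic or causal analysis of page content.
# The link graph below is invented; this is NOT Google's real algorithm.

# Hypothetical link graph: each page lists the pages it links to.
links = {
    "page_a": ["page_b", "page_c"],
    "page_b": ["page_c"],
    "page_c": ["page_a"],
    "page_d": ["page_c"],
}

def rank_by_incoming_links(link_graph, damping=0.85, iterations=20):
    """Score each page by a damped, iteratively weighted count of its
    incoming links (a crude PageRank-style iteration)."""
    pages = list(link_graph)
    rank = {p: 1.0 / len(pages) for p in pages}
    for _ in range(iterations):  # a fixed iteration count is enough for a sketch
        new_rank = {p: (1 - damping) / len(pages) for p in pages}
        for page, outgoing in link_graph.items():
            share = rank[page] / max(len(outgoing), 1)
            for target in outgoing:
                new_rank[target] += damping * share
        rank = new_rank
    return sorted(rank.items(), key=lambda kv: kv[1], reverse=True)

print(rank_by_incoming_links(links))  # page_c, the most linked-to page, comes out on top
```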
Speaking at the O'Reilly Emerging Technology
Conference this past March, Peter Norvig, Google's research director, offered
an update to George Box's maxim: "All models are wrong, and increasingly
you can succeed without them."
This is a world where massive amounts of data and
applied mathematics replace every other tool that might be brought to bear. Out
with every theory of human behavior, from linguistics to sociology. Forget taxonomy,
ontology, and psychology. Who knows why people do what they do? The point is
they do it, and we can track and measure it with unprecedented fidelity. With
enough data, the numbers speak for themselves.
The big target here isn't advertising, though.
It's science. The scientific method is built around testable hypotheses. These
models, for the most part, are systems visualized in the minds of scientists.
The models are then tested, and experiments confirm or falsify theoretical
models of how the world works. This is the way science has worked for hundreds
of years.
Scientists are trained to recognize that
correlation is not causation, that no conclusions should be drawn simply on the
basis of correlation between X and Y (it could just be a coincidence). Instead,
you must understand the underlying mechanisms that connect the two. Once you
have a model, you can connect the data sets with confidence. Data without a
model is just noise.
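To see how little a correlation by itself tells us, here is a small Python sketch that computes a Pearson correlation between two invented series. The numbers are hypothetical; a coefficient near 1.0 says nothing about which series, if either, drives the other.

```python
# Sketch: computing a correlation is mechanical and says nothing about
# the mechanism linking X and Y. Both series below are invented.

from statistics import mean, stdev

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length series."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (len(xs) - 1)
    return cov / (stdev(xs) * stdev(ys))

ice_cream_sales = [20, 25, 32, 40, 47, 55]   # hypothetical monthly figures
drowning_deaths = [3, 4, 5, 7, 8, 10]        # hypothetical, same months

# Close to 1.0, yet neither causes the other; a third factor (summer) links them.
print(pearson(ice_cream_sales, drowning_deaths))
```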
But faced with massive data, this approach to
science — hypothesize, model, test — is becoming obsolete. Consider physics:
Newtonian models were crude approximations of the truth (wrong at the atomic
level, but still useful). A hundred years ago, statistically based quantum
mechanics offered a better picture — but quantum mechanics is yet another
model, and as such it, too, is flawed, no doubt a caricature of a more complex
underlying reality. The reason physics has drifted into theoretical speculation
about n-dimensional grand unified models over the past few decades (the
"beautiful story" phase of a discipline starved of data) is that we
don't know how to run the experiments that would falsify the hypotheses — the
energies are too high, the accelerators too expensive, and so on.
……
There is now a better way. Petabytes allow us to
say: "Correlation is enough." We can stop looking for models. We can
analyze the data without hypotheses about what it might show. We can throw the
numbers into the biggest computing clusters the world has ever seen and let
statistical algorithms find patterns where science cannot.
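As a rough illustration of letting algorithms find patterns with no hypothesis in hand, here is a toy unsupervised clustering in Python. The points and the tiny k-means routine are invented for the sketch; the algorithm only groups the numbers, and any interpretation would come afterwards.

```python
# Sketch: unsupervised pattern-finding with no hypothesis about what the
# data "means". A toy k-means clustering on invented 2-D points.

import random

def kmeans(points, k, iterations=50, seed=0):
    """Very small k-means: returns a cluster label for each point."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    labels = [0] * len(points)
    for _ in range(iterations):
        # assign each point to its nearest center (squared distance)
        labels = [min(range(k), key=lambda c: (p[0] - centers[c][0]) ** 2
                                            + (p[1] - centers[c][1]) ** 2)
                  for p in points]
        # move each center to the mean of its assigned points
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centers[c] = (sum(p[0] for p in members) / len(members),
                              sum(p[1] for p in members) / len(members))
    return labels

data = [(1.0, 1.2), (0.8, 1.1), (1.1, 0.9),    # one invented blob
        (5.0, 5.1), (5.2, 4.8), (4.9, 5.3)]    # another invented blob
print(kmeans(data, k=2))  # e.g. [0, 0, 0, 1, 1, 1] (label order may differ)
```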
......
This kind of thinking is poised to go mainstream.
In February, the National Science Foundation announced the Cluster Exploratory,
a program that funds research designed to run on a large-scale distributed
computing platform developed by Google and IBM in conjunction with six pilot
universities. The cluster will consist of 1,600 processors, several terabytes
of memory, and hundreds of terabytes of storage, along with the software,
including IBM's Tivoli and open source versions of Google File System and
MapReduce. Early CluE projects will include
simulations of the brain and the nervous system and other biological research
that lies somewhere between wetware and software.
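For readers unfamiliar with MapReduce, here is a single-machine Python sketch of the map/shuffle/reduce pattern, using the canonical word-count example. The real systems mentioned above distribute these phases across thousands of machines; this only shows the shape of the programming model.

```python
# Sketch of the MapReduce pattern on one machine: word count, the canonical
# example. Real MapReduce (and its open source counterparts) distributes the
# map and reduce phases across a cluster; this shows only the programming model.

from collections import defaultdict

def map_phase(documents):
    """Map: emit (word, 1) pairs from each document."""
    for doc in documents:
        for word in doc.split():
            yield (word.lower(), 1)

def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(grouped):
    """Reduce: sum the counts for each word."""
    return {word: sum(counts) for word, counts in grouped.items()}

docs = ["all models are wrong", "some models are useful"]  # toy input
print(reduce_phase(shuffle(map_phase(docs))))
# {'all': 1, 'models': 2, 'are': 2, 'wrong': 1, 'some': 1, 'useful': 1}
```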
Learning to use a "computer" of this
scale may be challenging. But the opportunity is great: The new availability of
huge amounts of data, along with the statistical tools to crunch these numbers,
offers a whole new way of understanding the world. Correlation supersedes
causation, and science can advance even without coherent models, unified
theories, or really any mechanistic explanation at all.
There's no reason to cling to our old ways. It's
time to ask: What can science learn from Google?