Big Data truly came of age in 2013, when the Oxford English Dictionary included the term “Big Data” for the first time. That of course raises the question ‘When was the term Big Data first used, and why?’. My curiosity led me to a lot of research material, but I relied mostly on four articles: Mr. Gil Press’s “A Very Short History of Big Data” from Forbes, Mr. Steve Lohr’s “The Origins of ‘Big Data’: An Etymological Detective Story” from The New York Times, Mr. Mark van Rijmenam’s “A Short History Of Big Data” from DataFloq.com, and “The OED, Big Data, and Crowdsourcing” from WhatstheBigData.com. Source links are at the bottom.
Before I started researching this article, I asked myself if it’s worth the effort to know who coined Big Data and why. I answered in the affirmative because history has always given me a proper framework to understand and analyze whether a trend is for real or just a fad. With so much buzz around Big Data, I needed to figure out its origins and the context in which it came about. So here we go.
I’ve reviewed the origins of Big Data from two angles: one is the first use of the term ‘Big Data’ itself, and the other is the first use of ‘Big Data’ in its modern sense, i.e. information explosion and large sets of data (as outlined in my first article ‘What is Big Data’). The New York Times article presents a fascinating ‘whodunit’ detective story of how they discovered the person who coined the term ‘Big Data’.
My version of the ‘whodunit’ starts in 1944 (yup, right in the middle of World War II) in a university library in Connecticut. The following infographic is a quick summary, and the story follows right below it.
1944, Fremont Rider, Wesleyan University Librarian, “The Scholar and the Future of the Research Library” (book)
Mr. Fremont Rider observed in 1944 that American university libraries were doubling in size every 16 years. Based on that rate, he speculated that the Yale Library in 2040 would have
“approximately 200,000,000 volumes, which will occupy over 6,000 miles of shelves…[requiring] a cataloging staff of over six thousand persons.”
Obviously, Mr. Rider didn’t predict the digitization of libraries but he accurately predicted the information explosion.
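As a rough check on the arithmetic behind that projection, here is a minimal Python sketch. It assumes the 16-year doubling period holds unchanged from 1944 to 2040 and works backward from the quoted 200,000,000 volumes. The implied 1944 figure is only what the quoted numbers entail under that assumption, not a count taken from Mr. Rider’s book, and the variable names are my own.

```python
# Back-of-the-envelope check of the doubling projection quoted above.
# Assumption: the 16-year doubling period holds for the whole 1944-2040 span.
START_YEAR = 1944
END_YEAR = 2040
DOUBLING_PERIOD_YEARS = 16
PROJECTED_VOLUMES_2040 = 200_000_000  # figure from the quoted passage

# Number of complete doublings that fit between the two years.
doublings = (END_YEAR - START_YEAR) // DOUBLING_PERIOD_YEARS

# Each doubling multiplies the collection by 2.
growth_factor = 2 ** doublings

# Collection size in 1944 implied by the 2040 projection (illustrative only).
implied_volumes_1944 = PROJECTED_VOLUMES_2040 // growth_factor

print(f"Doublings between {START_YEAR} and {END_YEAR}: {doublings}")
print(f"Growth factor: {growth_factor}x")
print(f"Implied {START_YEAR} collection: {implied_volumes_1944:,} volumes")
```

Six doublings give a 64-fold increase, which shows how quickly a steady doubling rate compounds over less than a century.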
From 1944 to 1980, there were many articles and presentations that observed the ‘information explosion’ and the need for storage capacity.
1980, Sociologist Charles Tilly, “The old new social history and the new old social history”
The Oxford English Dictionary folks found that Mr. Tilly was the first person to use the term Big Data, in this sentence from his article:
“none of the big questions has actually yielded to the bludgeoning of the big-data people.”
The WhatstheBigData.com editors argued that Mr. Tilly’s use is not in the context of what we understand Big Data to be today, and I agree with them.
1990, Peter Denning, “Saving All the Bits”, American Scientist
Mr. Denning, in my opinion, introduced the notion of what is possible with Big Data in the following sentences.
“The imperative [for scientists] to save all the bits forces us into an impossible situation”
and he went on to say
“It is possible to build machines that can recognize or predict patterns in data without understanding the meaning of the patterns. Such machines may eventually be fast enough to deal with large data streams in real time.”
Mr. Denning truly nailed what is possible with Big Data way back in 1990. Impressive!
Now we fast forward to 1997-1998 when we begin to see the actual use of the term Big Data in its modern context.
1997, Michael Cox and David Ellsworth, “Application-controlled demand paging for out-of-core visualization”
Mr. Cox and Mr. Ellsworth’s article is the first article in the ACM digital library to use the term “big data” as follows.
“Visualization provides an interesting challenge for computer systems: data sets are generally quite large, taxing the capacities of main memory, local disk, and even remote disk. We call this the problem of big data. When data sets do not fit in main memory (in core), or when they do not fit even on local disk, the most common solution is to acquire more resources.”
1998, John Mashey, Chief Scientist at SGI, “Big Data… and the Next Wave of Infrastress.”
Here’s where the plot thickens. The New York Times article credits Mr. Mashey with the first use of the term ‘Big Data’. Even though Michael Cox and David Ellsworth seem to have used the term ‘Big Data’ in print first, Mr. Mashey reportedly used the term in his various speeches, and that’s why he is credited with coming up with Big Data. In Mr. Mashey’s own words,
“I was using one label for a range of issues, and I wanted the simplest, shortest phrase to convey that the boundaries of computing keep advancing.”
2000, Francis Diebold, “‘Big Data’ Dynamic Factor Models for Macroeconomic Measurement and Forecasting”
As per the Times article, Mr. Diebold thought (maybe tongue in cheek) that he was the first one to use the term Big Data, until the Times gave the credit to Mr. Mashey. Nevertheless, Mr. Diebold should get some credit for his prescience in stating that
“Recently, much good science, whether physical, biological, or social, has been forced to confront—and has often benefited from—the “Big Data” phenomenon. Big Data refers to the explosion in the quantity (and sometimes, quality) of available and potentially relevant data, largely the result of recent and unprecedented advancements in data recording and storage technology.”
Mr. Diebold was the first to explicitly link the term Big Data to the way we understand it today.
2001, Doug Laney, Meta Group (Gartner), “3D Data Management: Controlling Data Volume, Velocity, and Variety.”
Ah, the famous three ‘V’s. I covered these in my first article, so I don’t have to say much beyond that.
2005, Tim O’Reilly, “What is Web 2.0?”
2005 is the year in which Tim O’Reilly published his groundbreaking article ‘What is Web 2.0?’, which set off the Big Data race. Additionally, Mr. van Rijmenam of Datafloq.com states that Roger Magoulas of O’Reilly Media explicitly used the term ‘Big Data’ that year to refer to a set of data so large that it is almost impossible to manage and process using traditional business intelligence tools. This is definitely the current, widely understood definition of Big Data.
2005 is also the year that work on Hadoop began. Hadoop is an open-source framework modeled on Google’s MapReduce, and Yahoo! became its major early backer. As many of you know, nowadays Hadoop is used by a lot of organizations to crunch through huge amounts of data. More on this in my subsequent articles.
So you can say 2005 is the year that the Big Data revolution truly began, and the rest, as they say, is history.
So what other Big Data questions do you have but are afraid to ask? Go ahead and comment below.