Understanding Big Data

Without getting too analytical about my statement, I like to think of “Big Data” as an all-encompassing term describing humanity’s capacity to collect and analyze the numbers we generate in our lives, and those already existing in the cosmos.

In an earlier post called “Peace, Love, and Big Data,” my colleague and friend, Amelia Vaughan (see below), highlighted some of the positive work resulting from the Big Data trend. Though the following post is also about the trend, I am going to try to build a bit on Amelia’s earlier post by looking at the meaning of Big Data, and then at some of the ways in which the data trend is being used and how it is impacting our lives.

As soon as the Big Bang gave birth to our universe, the numbers came flowing, waiting to be discovered. Since the dawn of our species, a creative/investigative organism, we’ve been gleaning data from our universe in order to understand it and our place in the world. Thus, data collection is really nothing new. In fact, it’s been an activity we have engaged in for millennia. So, given our extensive history of data collection, why is it that we now feel it is appropriate to say that we are in an era of “Big Data”? Honestly, it seems to me that many scientific disciplines have been living in an era of Big Data for centuries. Many would echo this sentiment, as they feel Big Data is nothing more than a business intelligence catchphrase conveniently putting a label on our current relationship to the technology/tools that facilitate our capacity to discover and generate vast amounts of information. Others feel that Big Data is a phenomenon worthy of special attention, seeing this time as a truly unique feature of the development of our species. And still others feel that whatever this thing is, it really does not matter as long as their internet devices allow them to remain plugged into the world online. However you do or do not want to classify it, Big Data as we know it can be boiled down to three things: Volume, Velocity, and Variety (see “Explaining Big Data” video below).

To get an idea of the magnitude of the volume and velocity of this Big Data deal, I did quite a bit of internet searching. I pulled numbers from several sources and expressed them in terms of the information stored in the United States Library of Congress, so that the magnitude is easier to grasp. Based on the amount of data stored in the Library of Congress, I carried out some pretty basic multiplication and arrived at the numbers you see below. Naturally, these are approximations, and as with everything else in life, the veracity of the information is best validated by your own research.

Unit of measurement:

60,000 Libraries of Congress = ~13 exabytes of data

Source: McKinsey Global Institute, “Big data: The next frontier for innovation, competition, and productivity” (2011) p.15


In 2011, the global data storage capacity was ~590 exabytes = ~2,723,077 Libraries of Congress

Source: (See Dr. Martin Hilbert’s video below)


The traffic flowing over the internet annually is 667 exabytes = ~3,078,462 Libraries of Congress of data and rising.

Source: The Economist, “Data, data everywhere” (2 Feb, 2010)
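For anyone who wants to check the multiplication, here is a minimal sketch in Python. It assumes only the McKinsey ratio quoted above (60,000 Libraries of Congress = ~13 exabytes); the function and constant names are my own, made up for illustration, and the results are as approximate as the ratio itself.

```python
# Assumed conversion ratio, taken from the McKinsey figure quoted above:
# roughly 60,000 Libraries of Congress hold about 13 exabytes.
LIBRARIES_IN_REFERENCE = 60_000
EXABYTES_IN_REFERENCE = 13


def exabytes_to_libraries_of_congress(exabytes: float) -> int:
    """Convert exabytes to an approximate number of Libraries of Congress."""
    return round(exabytes * LIBRARIES_IN_REFERENCE / EXABYTES_IN_REFERENCE)


if __name__ == "__main__":
    figures = [
        ("Global data storage capacity (2011)", 590),
        ("Annual internet traffic", 667),
    ]
    for label, exabytes in figures:
        count = exabytes_to_libraries_of_congress(exabytes)
        print(f"{label}: ~{exabytes} EB = ~{count:,} Libraries of Congress")
```

Swap in any other exabyte figure you come across to put it on the same Library of Congress scale.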


Unstructured data – Data that lacks an identifiable structure (e.g., words, images, video, streams of sensor data, PDF files, e-mail messages, blogs, Web pages).

Structured data – Data that resides in fixed fields within a record or file–machine readable data (e.g., spreadsheets, databases, XML).

Here’s a visual showing unstructured data getting filtered into individual components (a la bits) and then becoming structured data:


Source: hadapt.com, “Threat Detection and Analysis”
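To make the difference between the two kinds of data a little more concrete, here is a toy sketch in Python. The message text, the field names, and the pattern are all invented for illustration; real unstructured data is far messier, and this is just one simple way to pull fixed fields out of free text.

```python
import re
from typing import Optional

# A hypothetical scrap of unstructured text (made up for this example).
raw_message = "Order received from Dana on 2013-02-14: 3 books, total $42.50"

# Assumed pattern that only fits this one invented message format.
PATTERN = re.compile(
    r"from (?P<customer>\w+) on (?P<date>\d{4}-\d{2}-\d{2}): "
    r"(?P<quantity>\d+) (?P<item>\w+), total \$(?P<total>\d+\.\d{2})"
)


def to_structured(text: str) -> Optional[dict]:
    """Pull fixed fields out of free text; return None if nothing matches."""
    match = PATTERN.search(text)
    if match is None:
        return None
    record = match.groupdict()
    # Convert the numeric fields so the record is machine readable.
    record["quantity"] = int(record["quantity"])
    record["total"] = float(record["total"])
    return record


if __name__ == "__main__":
    print(to_structured(raw_message))
```

The free-text sentence goes in, and a record with fixed fields (customer, date, quantity, item, total) comes out, which is the kind of thing a spreadsheet or database can work with directly.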

Why does Big Data as we know it exist?

  • Retailers are collecting vast amounts of data on our consumer behaviors
  • Sectors such as finance, healthcare, logistics, education, and government are collecting and storing data to learn more about people and processes, and advance knowledge
  • Public social media is producing an explosive amount of digital material composed of our communications
  • Biometric technology (e.g., iris scanners, facial and voice recognition systems, DNA and fingerprint scanners)
  • The Internet of Things (e.g., networked automobiles, refrigerators, household utilities and appliances)
  • Scientific Research which has become increasingly digital

Source: “Explaining Big Data” video by Christopher Barnatt

Here are some videos I feel are helpful in making sense of the meaning of Big Data:

The best video I’ve come across demystifying Big Data is the following created by EMC Corporation:

or, maybe you’ll find Dr. Martin Hilbert’s (USC Annenberg School of Communication and Journalism) Big Data video more interesting with its focus on specific magnitudes of storage and data generation:

or, maybe the video by Christopher Barnatt, Professor of Computing and Future Studies in Nottingham University Business School, is more your style:

The numbers were always there. From the time that quantum fluctuation (aka The Big Bang) gave birth to our beautiful sea of cosmic wilderness, the numbers were running wild. For millennia, our species has been discovering numbers embedded in the Universe the way children discover the Easter eggs their parents hide for them. We’ve been discovering patterns in planetary motion, marveling at numeric sequences in flowers and other natural life, uncovering the melodies of numbers in stringed music and cosmic strings alike, excavating the secrets of our molecular form, and measuring the skies and oceans in the hope that we have not doomed our planetary home to a future of savage weather and/or extreme desertification. Yes, we are number hunters, seekers of concrete truth in the unquestionable number. Now, we have the tools to begin integrating numbers (lots of numbers = data & lots of data = information) into our lives like never before.

In his Question Concerning Technology, Martin Heidegger investigates, amongst other ideas, humanity’s relationship to technology and what that relationship seeks and is able to bring forth into existence. These days it seems that much of humanity’s bringing-forth activity is the excavation of vast quantities of numbers, especially in government and corporate sectors. Numbers can be pretty cool, as they can give us a perspective on things we might not have considered before. However, getting caught up in collecting more data and creating the technology that has the capacity to generate more data is about as useful as building an ever-expanding library of books while never reading them. Maybe we should be thinking more in terms of “Big Synthesis,” or making more sense of the data *we currently have*, as opposed to Big Data, creating new ways of generating and storing mass amounts of data that we never fully reach a place of understanding. So, in terms of Big Synthesis the question becomes the following: How do we develop the capacity to understand our relationship to the numbers we bring forth? Moreover, in understanding our relationship to the data we generate, how can it improve our understanding of the World, our universe, and our place in it?

For more information on numbers of magnitude as they pertain to data see the chart below:

Source: The Economist, “All too much: monstrous amounts of data.” (2/10/2010) <http://www.economist.com/node/15557421>

or, visit the following link belonging to the blog of a guy named Ted Holmes:


Big Data: Some stories on societal impact

Banks Using Big Data to Discover New Silk Roads (CIO Journal, 2013)

“JPMorgan Chase & Co., the largest commercial bank in the U.S., generates a vast amount of credit card information and other transactional data about U.S. consumers. Several months ago, it began to combine that database, which includes 1.5 billion pieces of information, with publicly available economic statistics from the U.S. government. Then it used new analytic capabilities to develop proprietary insights into consumer trends, and offer those reports to the bank’s clients. The technology allows the bank to break down the consumer market into smaller and more narrowly identified groups of people, perhaps even single individuals.”

How Obama’s data crunchers helped him win (CNN, 2012)

“Barack Obama’s campaign to victory noticed that George Clooney had an almost gravitational tug on West Coast females ages 40 to 49. The women were far and away the single demographic group most likely to hand over cash, for a chance to dine in Hollywood with Clooney — and Obama”

“Why did we put Barack Obama on Reddit?” an official asked rhetorically. “Because a whole bunch of our turnout targets were on Reddit.”

‘Big Data’ for Cancer Care (WSJ, 2013)

“A major oncology group is launching an ambitious project to collect data on the care of hundreds of thousands of cancer patients and use it to help guide treatment of other patients across the health-care system.”

The Promise of Big Data (HSPH News, 2012)

“What was happening in Sarah Fortune’s lab is playing out in laboratories, businesses, and government agencies everywhere. Our ability to generate data has moved light-years ahead of where it was only a few years ago, and the amount of digital information now available to us is essentially unimaginable.”

  • MINE (Harvard and MIT)

How companies are using your social media data (Mashable.com, 2010)

“Companies are mining the social web to build dossiers on you. Information posted publicly on blogs, Facebook, Twitter, forums and other sites is fair game. It is yet another reminder that people need to be aware of what they are posting on social networking sites and to whom they’re connected.”

Small devices and Big Data (American Armed Forces Journal, 2012)

“In Iraq, U.S. forces who recovered computers used by al-Qaida consistently found Google Maps information on them. Insurgents were using the same databases as U.S. forces to view streets, consider get-away routes and plan ambushes.”

The NSA is Building the country’s Biggest Spy Center (Watch What You Say) (Wired, 2012)

“In his 1941 story “The Library of Babel,” Jorge Luis Borges imagined a collection of information where the entire world’s knowledge is stored but barely a single word is understood. In Bluffdale the NSA is constructing a library on a scale that even Borges might not have contemplated. And to hear the masters of the agency tell it, it’s only a matter of time until every word is illuminated.”

Here’s my personal statement:

Many feel that the development of technology and how we apply it to our lives is something fraught with peril. Others feel that we are in a time of great discovery, and thus are hopeful. Still others couldn’t care less so long as their devices still function and no one shuts off the power. Books like Isaac Asimov’s I, Robot, George Orwell’s 1984, and Philip K. Dick’s Blade Runner, and movies like The Minority Report and The Matrix, inform us that technology brings with it some serious potential for trouble, and that’s putting it lightly. They inform us that technology is something that will eventually elude our control and/or be used for all of the wrong reasons, warning us that we’ve already opened up Pandora’s Jar (Yes, the ancient Greeks defined it as a jar, not a box.) and there’s no closing it now. The thing is that technology/tool making seems to really be a manifestation of our very essence. At our core, it seems that our species is a tool creator, an organism that can’t help but investigate and create things, whether it be for good or bad. It’s in our nature to question, create, and dream, and I’m pretty sure that others would probably add destroy to that list. Whatever we are, however we do things, and wherever we are going, we’re always going to create and bring technology with us. As for me, I’m a believer in our capacity to question, dream, and bring forth the good more than anything else we do. As for the naysayers, I leave the following Martin Heidegger quote to ponder:

Where danger grows, so too does the saving power.

Additional resources on Big Data:

Cisco Visual Networking Index: Global Mobile Data Traffic Forecast Update, 2012–2017

Big data: The next frontier for innovation, competition, and productivity (McKinsey & Co., 2011)



Out of the pool and into the ocean

“…if scientists could communicate more in their own voices-in a familiar tone, with a less specialized vocabulary-would a wide range of people understand them better?  Would their work be better understood by the general public, policy-makers, funders, and, even in some cases, other scientists?” -Alan Alda

I love science.

I have spent many years teaching science and I feel like I have a pretty broad understanding of most scientific concepts.

However, last week I attended a three-day meeting with a variety of respected researchers who are studying the effects of the Deepwater Horizon oil spill on the Gulf of Mexico, and I was struck by how little of the presentations and panel discussions I was able to absorb, especially when it came to the more abstract nature of their work.  If I, a person with a marked interest and a science background, was struggling for comprehension, what must it be like for the general public?

Now of course the audience at this meeting was made up of people who were in the know and didn’t need their science “dumbed down” if you will, but the language seldom changes when efforts are made to communicate with those outside the scientific sphere.  Science communication and outreach are fast becoming important facets of research and are starting to become tied to funding in big ways.  As the looming sequester has the potential to shrink federal support for research and development (R&D) by US$57.5 billion over the next five years (Nature), it is imperative that scientists work to build a case for the relevance of their work and that means targeting those who do not necessarily speak the same language that they do, namely taxpayers and policymakers!

Where do information professionals come into play in all this? Well, our work is intricately tied together with the researchers and academics that we serve.  Their cause is our cause and our background in outreach, communication, and research can be a valuable tool in making scientific research more accessible and relatable.

Science Communication Table

Source: Richard C. J. Somerville and Susan Joy Hassol, from the October 2011 issue of Physics Today, page 48:

Some Helpful Resources

-A self-directed course on scientific communication from Nature


-Similar resource from the American Association for the Advancement of Science


-The Journal of Science Communication


-Presentations from the National Academy of Sciences Colloquium “The Science of Science Communication”


-Science communication competitions!


-National Science Communication Institute




This is a great presentation on social media for scientists.


If taxpayers paid for it, they own it

The push for open access in research got a big boost on Friday with the release of a policy memorandum from the White House Office of Science and Technology Policy Director John Holdren, which has directed Federal agencies with more than $100M in R&D expenditures to develop plans to make the published results of federally funded research freely available to the public within one year of publication and to require researchers to better account for and manage the digital data resulting from federally funded scientific research.  Whew that was a long sentence!  This is a precursor to the hopefully eventual passage of the Fair Access to Science and Technology Research Act that was introduced in both the House and the Senate on February 14th, 2013.  Check out this Wiki from the Harvard Open Access Project for more information on that and to track the progress of FASTR in Congress.

Those of us in the know on this issue, which let’s face it should be all of us in academia, have seen the writing on the wall for quite some time.  And it’s thanks to the continued activism, advocacy, and articulate passion of so many in the academic community that we are at this point today.  If you have a moment, please read this joint letter from ARL, ALA, Creative Commons, PLOS, and many others offering support and a simple, but resonating rationale for the open access of scientific research.


Let’s Face It, No One Can Read Everything

I read.  A lot.  Still I’m constantly amazed at how little of the massive amounts of information available to me I’m actually able to absorb.  It is no surprise that this is an affliction shared by the research community.  Scientific research is available for public consumption in ways that would have seemed unimaginable just a few short years ago and the sheer volume can be overwhelming.

“In growing numbers, scholars are integrating social media tools like blogs, Twitter, and Mendeley into their professional communications. The online, public nature of these tools exposes and reifies scholarly processes once hidden and ephemeral. Metrics based on this activities could inform broader, faster measures of impact, complementing traditional citation metrics.”


These alternative metrics, or altmetrics as they are commonly called, are increasingly gaining credence as a way to track the sphere of influence of social media in the scientific community.  They also serve to help sift the wheat from the chaff, so to speak.  What is truly worth your time to read?  What are other like-minded folks in your field reading?  What is the…


The new model is starting to look like this.


Publication in a peer-reviewed journal is not the only way to effectively measure the impact of research, especially now with the push for open access and with the traditional publication model quickly becoming outdated.

Check out IU E-Science librarian Stacy Konkiel’s great talk on the potential uses of altmetrics in libraries.

Click through these slides from Heather Piwowar’s talk on altmetrics from ALA Midwinter.

And finally some recommended reading just in case you have lots of time on your hands…

-Good intro to ways in which altmetrics are being used and their potential impact


-Feel like you need a dissenting viewpoint?  Check out this editorial from Nature


-Great article on using social media to explore scholarly impact, by the founders of Impact Story


-Tools from PLOS to measure research impact


-Bibliography of articles on altmetrics from PLOS


-Article discussing the new guidelines for NSF grant applications, which ask a principal investigator to list his or her research “products” rather than “publications” in the biographical sketch section.


-Quick blog post on altmetrics as a discovery tool


Peace, Love, and Big Data

In my mind there is no question that big data is the buzzword of 2012.  Everyone from CEOs to tech geeks is batting the term around like it’s the answer to all of the world’s problems.  A few brave naysayers have said, “Big Data? Big deal…”, but for the most part there is a sense of excitement over what is possible with all of the terabytes of data that are collected daily.  Though much of the energy is business and profit driven, there are also many data scientists who are passionate about using big data for the greater good.  One such start-up, DataKind, is endeavoring to match the skills of data scientists with non-profits who could benefit from their expertise with big data.  To date they have sponsored eight Data Dives in various parts of the country where they match up non-profit social organizations with volunteer data scientists who spend a weekend tackling their data challenges.


One such event generated this map of storm surge risk in NYC.  As this was created in September, it proved to be prophetic in anticipating the impact of Hurricane Sandy the following month!


NYC Data Drive


DataKind founder Nate Porway, who was named a National Geographic Emerging Explorer for 2012, believes that this is a match that has been waiting to happen: “We’re connecting nonprofits, NGOs, and other data-rich social change organizations with data scientists willing to donate their time and knowledge to solve social, environmental, and community problems.  Data is like a bucket of crude oil. Potentially great, but only if someone knows how to refine it (data scientists) and someone else has vehicles that will run on it (the social sector).”




A recent white paper from the World Economic Forum highlights the ways in which big data can have a big impact (I’m learning that people love to find other words to attach big to when they are writing about big data!) on international and social development.


This is all well and good, but as blogger Zach Gemignani wrote recently, “All the work of collecting, combining, and modeling data is wasted if not enough attention is paid to how the data is shared. The data needs to be transformed into bite-sized (pre-chewed, even) stories that can easily stick in the brains of your audience.”


In other words the excitement over big data’s potential for change needs to be combined with practical and usable applications.  Organizations like DataKind, which has started to inspire spin-offs on college campuses across the country, can be instrumental in helping this ideal to become reality.

Some Related Reading!

5 Things That Will Change the Way Nonprofits Work in 2013

Big Data, Big Hype: Big Deal

Links to other great Forbes articles as well!

The Age of Big Data


Humanizing Big Data


“We question. We research. We catalog. We quantify. We aggregate, calculate, communicate, analyze, extrapolate and conclude. And eventually, if we’re fortunate and thoughtful, we understand.”


This quote, from Associated Press Editor-At-Large Ted Anthony, could be about librarians or even a treatise on human nature in general.  Instead it is used to describe an ambitious multifaceted project from prolific photographer Rick Smolan, who is best known for his work on the “Day in the Life” series of photo collections.  The Human Face of Big Data, which was released on December 4th and will soon be followed by a documentary, captures in photos and short articles the essence of big data’s real-world and personal applications.

In the book, big data is defined as the real-time collection, analysis, and visualization of vast amounts of information.  “In the hands of data scientists this raw information is fueling a revolution which many people believe may have as big an impact on humanity going forward as the Internet has over the past two decades. Its enable us to sense, measure, and understand aspects of our existence in ways never before possible.” (Amazon)  The following interview with the author provides some great background information.

In addition to the book, a free mobile app has been launched, “to help you learn about yourself, how you compare to others, and what your phone can tell you about your life.  Compare answers about yourself, your family, trust, sleep, sex, dating, and dreams with millions of others around the world.  Find your Data Doppelganger. Map your daily footprint, share what brings you luck, and get a glimpse into the one thing people want to experience during their lifetime.” (http://humanfaceofbigdata.com/about/)

In less than two months, more than 3 million share and compare questions have been answered, in more than 100 countries.  Through the app some interesting data insights have been extrapolated.  Check them out!


This is a cool project and I look forward to watching the documentary when it comes out.  In addition, the results of a worldwide, user-submitted video contest will be coming out shortly, which will undoubtedly provide us with some awesome snapshots of the “Human Face of Big Data”.



Just as a side note, this article on “5 Trends That Will Shape Digital Services In 2013” was pretty interesting and relevant!