Understanding Big Data

Without getting too analytical about my statement, I like to think of “Big Data” as an all-encompassing term describing humanity’s capacity to collect and analyze the numbers we generate in our lives, and those already existing in the cosmos.

In an earlier post called “Peace, Love, and Big Data,” my colleague and friend Amelia Vaughan (see below) highlighted some of the positive work resulting from the Big Data trend. Though the following post is also about the trend, I am going to try to build a bit on Amelia’s earlier post by looking at the meaning of Big Data, and then taking a look at some of the ways in which the data trend is being used and how it is impacting our lives.

As soon as the Big Bang gave birth to our universe, the numbers came flowing, waiting to be discovered. Since the dawn of our species, a creative/investigative organism, we’ve been gleaning data from our universe in order to understand it and our place in the world. Thus, data collection is really nothing new. In fact, it’s been an activity we have engaged in for millennia. So, given our extensive history of data collection, why is it that we now feel it is appropriate to say that we are in an era of “Big Data”? Honestly, it seems to me that many scientific disciplines have been living in an era of Big Data for centuries. Many would echo this sentiment, as they feel Big Data is nothing more than a business intelligence catchphrase conveniently putting a label on our current relationship to the technology/tools that facilitate our capacity to discover and generate vast amounts of information. Others feel that Big Data is a phenomenon worthy of special attention, seeing this time as a truly unique feature of the development of our species. And still others feel that whatever this thing is, it really does not matter as long as their internet devices allow them to remain plugged into the world online. However you do or do not want to classify it, Big Data as we know it can be boiled down to three things: Volume, Velocity, and Variety (see “Explaining Big Data” video below).

To get an idea of the magnitude of the volume and velocity of this Big Data deal, I did quite a bit of internet searching. I pulled figures from a number of sources and expressed them in terms of the information stored in the United States Library of Congress, so that the magnitudes are easier to grasp. Starting from the amount of data stored in the Library of Congress, I carried out some pretty basic multiplication and arrived at the numbers you see below. Naturally, these are approximations, and as with everything else in life the veracity of the information is best validated by your own research.

Unit of measurement:

60,000 Libraries of Congress = ~13 exabytes of data

Source: McKinsey Global Institute, “Big data: The next frontier for innovation, competition, and productivity” (2011) p.15

Volume:

In 2011, the global data storage capacity was ~590 exabytes = ~2,723,077 Libraries of Congress

Source: (See Dr. Martin Hilbert’s video below)

Velocity:

The traffic flowing over the internet annually is 667 exabytes = ~3,078,462 Libraries of Congress of data and rising.

Source: The Economist, “Data, data everywhere” (2 Feb, 2010)
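For anyone who wants to redo the multiplication, here is a minimal sketch of the conversion. It assumes only the reference point quoted above (60,000 Libraries of Congress = ~13 exabytes); the function and constant names are my own and purely illustrative.

```python
# Reference point quoted in the post (approximate).
LOC_PER_REFERENCE = 60_000    # Libraries of Congress in the reference point
EXABYTES_PER_REFERENCE = 13   # exabytes those 60,000 libraries represent


def exabytes_to_libraries(exabytes: float) -> int:
    """Convert exabytes to Library of Congress equivalents, rounded to a whole library."""
    return round(exabytes * LOC_PER_REFERENCE / EXABYTES_PER_REFERENCE)


if __name__ == "__main__":
    figures = [
        ("2011 global storage capacity", 590),
        ("annual internet traffic", 667),
    ]
    for label, exabytes in figures:
        libraries = exabytes_to_libraries(exabytes)
        print(f"{label}: {exabytes} EB = ~{libraries:,} Libraries of Congress")
```

Swapping in a different reference point (for example, another estimate of the Library of Congress’s holdings) only means changing the two constants at the top.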

Variety:

Unstructured data – Data that lacks an identifiable structure (e.g., words, images, video, streams of sensor data, PDF files, e-mail messages, blogs, Web pages).

Structured data – Data that resides in fixed fields within a record or file, i.e., machine-readable data (e.g., spreadsheets, databases, XML).

Here’s a visual showing unstructured data getting filtered into individual components (a la bits) and then becoming structured data:

https://i0.wp.com/hadapt.com/assets/Threat-Detection-and-Analysis.jpg

Source: hadapt.com, “Threat Detection and Analysis”
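To make the unstructured-to-structured idea a bit more concrete, here is a small sketch that pulls fixed fields out of a free-text note. The sample sentence, the field names, and the pattern are all invented for illustration; real unstructured data is far messier and usually needs much more than one regular expression.

```python
import re
from typing import Optional

# Hypothetical free-text note; the wording and field names are invented for this sketch.
RAW_NOTE = "Sensor 17 reported 21.5 C at 2013-04-02 14:05"

# Named groups mark the pieces of text that become fixed fields.
PATTERN = re.compile(
    r"Sensor (?P<sensor_id>\d+) reported (?P<temperature_c>-?\d+(?:\.\d+)?) C "
    r"at (?P<timestamp>\d{4}-\d{2}-\d{2} \d{2}:\d{2})"
)


def to_record(text: str) -> Optional[dict]:
    """Turn a free-text note into a record with fixed fields, or None if it does not match."""
    match = PATTERN.search(text)
    if match is None:
        return None
    return {
        "sensor_id": int(match["sensor_id"]),
        "temperature_c": float(match["temperature_c"]),
        "timestamp": match["timestamp"],
    }


if __name__ == "__main__":
    print(to_record(RAW_NOTE))
```

The resulting dictionary is the kind of thing that could sit in a row of a spreadsheet or database table, which is what makes it “structured.”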

Why does Big Data as we know it exist?

  • Retailers are collecting vast amounts of data on our consumer behaviors
  • Sectors such as finance, healthcare, logistics, education, and government are collecting and storing data to learn more about people and processes, and advance knowledge
  • Public social media is producing an explosive amount of digital material composed of our communications
  • Biometric technology (e.g., iris scanners, facial and voice recognition systems, DNA and fingerprint scanners)
  • The Internet of Things (e.g., networked automobiles, refrigerators, household utilities and appliances)
  • Scientific research, which has become increasingly digital

Source: “Explaining Big Data” video by Christopher Barnatt

Here are some videos I feel are helpful in making sense of the meaning of Big Data:

The best video I’ve come across demystifying Big Data is the following created by EMC Corporation:

or, maybe you’ll find Dr. Martin Hilbert’s (USC Annenberg School of Communication and Journalism) Big Data video more interesting with its focus on specific magnitudes of storage and data generation:

or, maybe the video by Christopher Barnatt, Professor of Computing and Future Studies at Nottingham University Business School, is more your style:

The numbers were always there. From the time that quantum fluctuation (aka The Big Bang) gave birth to our beautiful sea of cosmic wilderness, the numbers were running wild. For millennia, our species has been discovering numbers embedded in the Universe the way children discover the Easter eggs their parents hide for them. We’ve been discovering patterns in planetary motion, marveling at numeric sequences in flowers and other natural life, uncovering the melodies of numbers in stringed music and cosmic strings alike, excavating the secrets of our molecular form, and measuring the skies and oceans in the hope that we have not doomed our planetary home to a future of savage weather and/or extreme desertification. Yes, we are number hunters, seekers of concrete truth in the unquestionable number. Now, we have the tools to begin integrating numbers (lots of numbers = data & lots of data = information) into our lives like never before.

In his Question Concerning Technology, Martin Heidegger investigates, amongst other ideas, humanity’s relationship to technology and what that relationship seeks and is able to bring forth into existence. These days it seems that much of humanity’s bringing-forth activity is the excavation of vast quantities of numbers, especially in government and corporate sectors. Numbers can be pretty cool, as they can give us a perspective on things we might not have considered before. However, getting caught up in collecting more data and creating the technology that has the capacity to generate more data is about as useful as building an ever-expanding library of books while never reading them. Maybe we should be thinking more in terms of “Big Synthesis,” or making more sense of the data *we currently have*, as opposed to Big Data, or creating new ways of generating and storing mass amounts of data that we never fully understand. So, in terms of Big Synthesis, the question becomes the following: How do we develop the capacity to understand our relationship to the numbers we bring forth? Moreover, in understanding our relationship to the data we generate, how can it improve our understanding of the World, our universe, and our place in it?

For more information on numbers of magnitude as they pertain to data see the chart below:

Source: The Economist, “All too much: monstrous amounts of data.” (2/10/2010) <http://www.economist.com/node/15557421>

or, visit the following link belonging to the blog of a guy named Ted Holmes:

http://simplyted.blogspot.com/2005/12/how-to-visualize-data.html

Big Data: Some stories on societal impact

Banks Using Big Data to Discover New Silk Roads (CIO Journal, 2013)

“JPMorgan Chase & Co., the largest commercial bank in the U.S., generates a vast amount of credit card information and other transactional data about U.S. consumers. Several months ago, it began to combine that database, which includes 1.5 billion pieces of information, with publicly available economic statistics from the U.S. government. Then it used new analytic capabilities to develop proprietary insights into consumer trends, and offer those reports to the bank’s clients. The technology allows the bank to break down the consumer market into smaller and more narrowly identified groups of people, perhaps even single individuals.”

How Obama’s data crunchers helped him win (CNN, 2012)

“Barack Obama’s campaign to victory noticed that George Clooney had an almost gravitational tug on West Coast females ages 40 to 49. The women were far and away the single demographic group most likely to hand over cash, for a chance to dine in Hollywood with Clooney — and Obama”

“‘Why did we put Barack Obama on Reddit?’ an official asked rhetorically. ‘Because a whole bunch of our turnout targets were on Reddit.’”

‘Big Data’ for Cancer Care (WSJ, 2013)

“A major oncology group is launching an ambitious project to collect data on the care of hundreds of thousands of cancer patients and use it to help guide treatment of other patients across the health-care system.”

The Promise of Big Data (HSPH News, 2012)

“What was happening in Sarah Fortune’s lab is playing out in laboratories, businesses, and government agencies everywhere. Our ability to generate data has moved light-years ahead of where it was only a few years ago, and the amount of digital information now available to us is essentially unimaginable.”

  • MINE (Harvard and MIT)

How companies are using your social media data (Mashable.com, 2010)

“Companies are mining the social web to build dossiers on you. Information posted publicly on blogs, Facebook, Twitter, forums and other sites is fair game. It is yet another reminder that people need to be aware of what they are posting on social networking sites and to whom they’re connected.”

Small devices and Big Data (American Armed Forces Journal, 2012)

“In Iraq, U.S. forces who recovered computers used by al-Qaida consistently found Google Maps information on them. Insurgents were using the same databases as U.S. forces to view streets, consider get-away routes and plan ambushes.”

The NSA is Building the country’s Biggest Spy Center (Watch What You Say) (Wired, 2012)

“In his 1941 story ‘The Library of Babel,’ Jorge Luis Borges imagined a collection of information where the entire world’s knowledge is stored but barely a single word is understood. In Bluffdale the NSA is constructing a library on a scale that even Borges might not have contemplated. And to hear the masters of the agency tell it, it’s only a matter of time until every word is illuminated.”

Here’s my personal statement:

Many feel that the development of technology and how we apply it to our lives is something fraught with peril, while others feel that we are in a time of great discovery, and thus are hopeful, while still others couldn’t care less so long as their devices still function and no one shuts off the power. Books like Isaac Asimov’s I, Robot, George Orwell’s 1984, and Philip K. Dick’s Do Androids Dream of Electric Sheep? (filmed as Blade Runner), and movies like Minority Report and The Matrix, inform us that technology brings with it some serious potential for trouble, and that’s putting it lightly. They inform us that technology is something that will eventually elude our control and/or be used for all of the wrong reasons, warning us that we’ve already opened up Pandora’s Jar (yes, the ancient Greeks described it as a jar, not a box) and there’s no closing it now. The thing is that technology/tool making seems to really be a manifestation of our very essence. At our core, it seems that our species is a tool creator, an organism that can’t help but investigate and create things, whether it be for good or bad. It’s in our nature to question, create, and dream, and I’m pretty sure that others would probably add destroy to that list. Whatever we are, however we do things, and wherever we are going, we’re always going to create and bring technology with us. As for me, I’m a believer in our capacity to question, dream, and bring forth the good more than anything else we do. As for the naysayers, I leave the following Martin Heidegger quote to ponder:

Where danger grows, so too does the saving power.

Additional resources on Big Data:

Cisco Visual Networking Index: Global Mobile Data Traffic Forecast Update, 2012–2017

Big data: The next frontier for innovation, competition, and productivity (McKinsey & Co., 2011)

 

Welcome to Hack E-Science Librarianship!

The inspiration for this blog is taken from E-Science Librarians I have encountered over the course of 2011. Since August 2010, the start date of my E-Science Librarianship work at Florida State University, I have found that there are a number of E-Science librarians who are all struggling with similar situations at their respective institutions. Many of us are the first of our kind in the profession and with that role has come the need to hack our way through the E-Science Librarianship frontier. As we hack, we are defining our positions, seeking to skill up, providing the traditional services that science librarians have always provided, and reaching out to one another in order to learn from our shared experiences.

Though there are a number of initiatives that have been developed by scientific research institutions, library/information science organizations, and research universities worldwide, there still remains a need for librarians to share their individual experiences and knowledge of resources. It is my hope that this blog will provide a knowledge sharing forum for those E-Science librarians and information professionals who are new to this emerging profession. Naturally, experience sharing should and will not be limited to those of us who are new to the profession…ALL ARE ENCOURAGED to bring something to this forum as this forum is about COLLECTIVE INTELLIGENCE.

Let’s learn from each other so that we may push E-Science Librarianship forward!