Big Data Technology Explained

If you’re active in any field of computing, you’ve heard the term Big Data thrown around in the past couple of years. If you’re in a business that has lots of data to analyze, you should have a big interest in Big Data, but you may not fully comprehend what we tech geeks are talking about. Big Data has become something of a buzzword, and for good reason: it is an important and growing facet of the modern technological world. My goal here is to give a view of Big Data from the techie standpoint and to introduce you, in a general way, to technologies like Google’s BigQuery and Apache’s Hadoop that we techies immediately think of when we hear the term.

Big Data Defined

I get excited any time somebody mentions Big Data in connection with a project I’m working on, but I’m usually disappointed, because a lot of people use Big Data as a term to emphasize the importance of a dataset rather than to describe the nature of that dataset. The other common misconception is underestimating just how big Big Data really is. Do you have a database with 10 million customer records? To a techie that probably fits pretty squarely into the ‘regular data’ realm rather than Big Data.

I recently found a definition that I thought was good. Unfortunately it’s not concise, but I can summarize. Big Data refers not just to the size of a dataset in gigabytes, but also to the complexity of that dataset. A Big Data dataset usually has a large volume of data, and that data tends to be relatively unstructured (especially compared to the structured data usually found in a regular relational database) or to have complex relationships. The full definition and explanation is on MIKE2.0.

Big Data Concepts

To fully grasp the role of Big Data technologies you should first know what I mean when I say MapReduce and NoSQL. These are topics that can get pretty tough, but I’ll define them generally.

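MapReduce – MapReduce is a programming model developed by Google for processing large amounts of data. The work is split into a ‘map’ step that turns each input record into key/value pairs and a ‘reduce’ step that combines all of the values that share a key, which is what lets the calculation be spread across many machines. If you want to perform calculations on a large set of data, MapReduce is for you.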
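
To make the model a little more concrete, here is a minimal single-machine sketch of the classic word-count job in plain Python. The function names (map_phase, shuffle, reduce_phase, word_count) are my own for illustration and don’t come from any framework; a real MapReduce system would run the map and reduce steps on many servers and handle the grouping in between for you.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_phase(document: str) -> Iterator[Tuple[str, int]]:
    """Map step: emit a (word, 1) pair for every word in one document."""
    for word in document.lower().split():
        yield (word, 1)


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Group every emitted value under its key (a framework does this between phases)."""
    grouped: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key: str, values: List[int]) -> Tuple[str, int]:
    """Reduce step: collapse all values for one key into a single total."""
    return (key, sum(values))


def word_count(documents: Iterable[str]) -> Dict[str, int]:
    # On a real cluster each document (or chunk of one) could be mapped on a different server.
    mapped = (pair for doc in documents for pair in map_phase(doc))
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    print(word_count(["big data is big", "data is everywhere"]))
```

Because each map call only looks at its own document and each reduce call only looks at one key, neither step needs to know about the rest of the dataset, and that independence is what makes the work easy to distribute.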

NoSQL – NoSQL refers to a broad set of database technologies that break from the traditional model of storing data in a structured fashion. In NoSQL databases the emphasis is on quickly storing and reading massive amounts of data. As a trade-off, they generally give up some consistency in data access: it might take some time for new data to propagate to all of the servers, so a query can return out-of-date results. NoSQL implementers should evaluate whether or not it’s important to be able to query new data the instant it’s added to the database.
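
The consistency trade-off is easier to see in a toy model. The sketch below is purely illustrative: EventuallyConsistentStore and Replica are hypothetical classes of my own, not the API of any real NoSQL database, and the propagate method stands in for whatever background replication a real system performs.

```python
from typing import Dict, List, Optional, Tuple


class Replica:
    """One server holding its own copy of the data."""

    def __init__(self) -> None:
        self.data: Dict[str, str] = {}


class EventuallyConsistentStore:
    """Hypothetical toy store: writes land on one replica and reach the others later."""

    def __init__(self, replica_count: int) -> None:
        self.replicas: List[Replica] = [Replica() for _ in range(replica_count)]
        self.pending: List[Tuple[str, str]] = []  # writes not yet copied everywhere

    def write(self, key: str, value: str) -> None:
        # The write is acknowledged as soon as the first replica has it.
        self.replicas[0].data[key] = value
        self.pending.append((key, value))

    def read(self, key: str, replica_index: int) -> Optional[str]:
        # A read may be served by any replica, including one that is behind.
        return self.replicas[replica_index].data.get(key)

    def propagate(self) -> None:
        # Stand-in for background replication catching up.
        for key, value in self.pending:
            for replica in self.replicas[1:]:
                replica.data[key] = value
        self.pending.clear()


if __name__ == "__main__":
    store = EventuallyConsistentStore(replica_count=3)
    store.write("user:1", "Ada")
    print(store.read("user:1", replica_index=0))  # the replica that took the write
    print(store.read("user:1", replica_index=2))  # a replica that has not caught up yet
    store.propagate()
    print(store.read("user:1", replica_index=2))  # same replica after replication
```

Until propagate runs, a read served by the third replica simply doesn’t see the new record, which is exactly the window an implementer has to decide whether the application can tolerate.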

Big Data Technologies

Hopefully you’ve gathered by now that Big Data is a wide field, with a number of things to consider when picking technologies to house and serve your data. Befitting a large technological problem, there are a number of solutions available, but most of them aren’t a stand-alone answer to the Big Data problem. The software packages that make working with Big Data easier are best used in conjunction with other software and services to make up your whole data management solution. There are many to choose from, so I want to cover just a few of the most popular ones that you’re most likely to run into.

Apache’s Hadoop

Hadoop is a popular open source MapReduce framework managed and distributed by the Apache Software Foundation. At its simplest, Hadoop is a framework for distributing MapReduce work across a cluster of many servers. Individual servers can be added to or removed from a Hadoop cluster with little effort, so if you anticipate an incoming spike in data you can add servers and then remove them after the spike subsides. This model of distributed computing across a cluster of inexpensive hardware is typical of most MapReduce frameworks. Apache also distributes a NoSQL database solution and a number of other Big Data software tools as a part of the Hadoop project. The popular data analysis software Tableau can integrate with a dataset stored in a Hadoop cluster, so data analysts who already know how to use Tableau face only a small learning curve.

Google’s BigQuery

BigQuery is a very cool new service provided by Google for the storage and querying of very large datasets. Google’s goal with BigQuery is to build a database that can store vast amounts of data and very quickly return results for ad-hoc queries (their goal was to be able to scan a 1-terabyte table in one second). You can access your data with SQL through a browser-based interface or a REST-based API. It’s important to note that BigQuery is primarily a tool for analysis. You can dump in billions of rows of records and run fast ad-hoc queries that give you important, actionable information about your dataset, but it’s not meant to be a database backend for an application.

MongoDB

MongoDB is a special kind of NoSQL database called a ‘document store’: instead of rows in tables, it stores each record as a JSON-like document. MongoDB also allows you to easily ‘shard’ data, meaning split it up, across multiple servers. Much like a Hadoop cluster, you can create a MongoDB cluster and add or remove servers very easily. Unlike Hadoop, MongoDB is primarily a data storage system meant for the storage and quick retrieval of large quantities of data. In addition, MongoDB is a fairly mature technology and has many features that make it a viable potential replacement for traditional relational databases as the backend database for applications.
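
To give a feel for what sharding a document store means, here is a small sketch that routes documents to one of several shards by hashing their ID. This is an assumption-laden toy, not how MongoDB itself is implemented: ShardedCollection is a hypothetical class of my own, each ‘shard’ is just an in-memory dict standing in for a server, and the modulo-of-a-hash routing is only one simple way to pick a shard.

```python
import hashlib
from typing import Any, Dict, List, Optional

Document = Dict[str, Any]


class ShardedCollection:
    """Hypothetical toy collection that spreads documents across several shards."""

    def __init__(self, shard_count: int) -> None:
        # Each shard stands in for a separate server; here it is just a dict.
        self.shards: List[Dict[str, Document]] = [{} for _ in range(shard_count)]

    def _shard_for(self, key: str) -> int:
        # A stable hash, so the same key always routes to the same shard.
        digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
        return int(digest, 16) % len(self.shards)

    def insert(self, document: Document) -> None:
        key = str(document["_id"])
        self.shards[self._shard_for(key)][key] = document

    def find_by_id(self, doc_id: Any) -> Optional[Document]:
        # Only the one shard that owns this key has to be consulted.
        key = str(doc_id)
        return self.shards[self._shard_for(key)].get(key)


if __name__ == "__main__":
    customers = ShardedCollection(shard_count=4)
    # Documents in the same collection do not need to share the same fields.
    customers.insert({"_id": 1, "name": "Ada", "tags": ["vip"]})
    customers.insert({"_id": 2, "name": "Grace", "address": {"city": "Arlington"}})
    print(customers.find_by_id(2))
    print([len(shard) for shard in customers.shards])
```

One thing this naive version glosses over is that changing shard_count would send most keys to a different shard, so a real system needs some way to move data around when servers are added or removed.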

Redis

Redis is another NoSQL solution, but it is very different from MongoDB. Redis stores arbitrary key-value pairs primarily in memory, which is perishable (it can optionally save data to disk, but memory is where it works). The goal of Redis is super-fast lookup and read times on data, and for this reason it competes directly with Memcached as a caching solution. Because of this in-memory design, you will generally want some sort of on-disk database behind it (another NoSQL solution, or even a relational database like MySQL). Redis is a great tool for dealing with Big Data in the context of an application that delivers data to many users.
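
The usual way a cache like this sits in front of an on-disk database can be sketched in a few lines. In the example below a plain Python dict stands in for Redis and a caller-supplied function stands in for the slow database lookup; CacheAside and its method names are hypothetical and are not part of any Redis client library.

```python
from typing import Callable, Dict, Optional


class CacheAside:
    """Hypothetical helper: check a fast in-memory cache before a slow database."""

    def __init__(self, load_from_database: Callable[[str], Optional[str]]) -> None:
        self.cache: Dict[str, str] = {}  # stands in for Redis
        self.load_from_database = load_from_database
        self.hits = 0
        self.misses = 0

    def get(self, key: str) -> Optional[str]:
        if key in self.cache:
            self.hits += 1
            return self.cache[key]
        # Cache miss: fall back to the database and remember the answer.
        self.misses += 1
        value = self.load_from_database(key)
        if value is not None:
            self.cache[key] = value
        return value

    def invalidate(self, key: str) -> None:
        # Drop a cached entry when the underlying database record changes.
        self.cache.pop(key, None)


if __name__ == "__main__":
    database = {"user:1": "Ada"}  # stands in for MySQL or another on-disk store
    users = CacheAside(database.get)
    users.get("user:1")  # first read goes to the database
    users.get("user:1")  # second read is served from the cache
    print(users.hits, users.misses)
```

The database remains the source of truth here; if the cache is wiped, the only cost is that the next read for each key is slow again.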