"Big Data" is an unavoidable phrase currently, especially in online marketing circles. After you’ve tried to read the myriads of blogs, guides & whitepapers published daily on this topic you’re left with the impression that Big Data will:
- Generate you huge amounts of revenue
- Solve all your business issues
- Cure cancer, and end world hunger and wars
And if you don’t hop on the bandwagon soon, your business is doomed.
Almost everything you read about big data concerns the exciting application of the new types of data available to organisations in a world where human activity increasingly takes place online.
But what actually is Big Data, and how can you make use of it?
Photo by Camelia.boban
The Definition of Big Data
If you ask a data developer and a marketing expert what big data is, you’ll probably receive two different responses:
The marketing expert might tell you:
“Big Data is the exciting application of datasets increasingly available in the modern world”
The techie might say:
“Big Data is a methodology of storing and processing data”
Depending on your viewpoint, either of these definitions could be valid. However, the first could apply to almost anything, so in this post we’ll briefly cover what Big Data is from a technical perspective (while trying not to get too technical).
When does data become BIG?
So is 1 million rows of data BIG? How about a billion, or perhaps a trillion?
This is a logical question, but there is no strict threshold, since from a technical viewpoint Big Data is really more about the platform and methodology than just pure numbers.
One commonly cited definition of when data becomes BIG is the “Three V’s”:
|Volume||When the size of the data becomes impractical on traditional database systems|
|Velocity||When traditional database systems struggle to process and move data quickly enough|
|Variety||When you want to use messy and varied data not acceptable to traditional database systems|
Now let’s look at what a “traditional” method of storing data is, and then compare it to Big Data techniques.
The Traditional Relational Database
The relational database management system (RDBMS) is the most popular way of storing and processing data in the world today. The chances are that almost every piece of software, business tool and website you encounter in your day-to-day life uses this type of database in some way to store and retrieve data.
Relational databases are all about structured data, where you carefully organise and divide your data into tables and columns and define strict relationships between each entity (known as normalisation). A key characteristic of an RDBMS is that any data it stores has to be very clean and conform to strict pre-defined rules; if data doesn’t meet these constraints the concept falls apart.
This inflexibility is the relational database’s greatest strength: it makes these systems reliable, auditable and secure, and they always provide the systems that talk to them with consistent data.
Relational databases can deal with huge quantities of data; however, their characteristics mean they don’t scale easily without a lot of time, money and expertise. Also, due to their strict requirements, they are not suited to processing data that might contain a lot of junk or is unpredictable in nature.
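To make the strictness described above a little more concrete, here is a minimal sketch using Python's built-in sqlite3 module. The table names, column names and sample values are hypothetical and chosen only for illustration; the point is simply that a relational database refuses rows that break its pre-defined rules.

```python
import sqlite3


def build_demo_db():
    """Create an in-memory database with two related tables (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    # SQLite only enforces foreign keys when this is switched on.
    conn.execute("PRAGMA foreign_keys = ON")
    conn.execute(
        """CREATE TABLE customers (
               id INTEGER PRIMARY KEY,
               email TEXT NOT NULL UNIQUE
           )"""
    )
    conn.execute(
        """CREATE TABLE orders (
               id INTEGER PRIMARY KEY,
               customer_id INTEGER NOT NULL REFERENCES customers(id),
               amount_pence INTEGER NOT NULL CHECK (amount_pence > 0)
           )"""
    )
    return conn


def try_insert(conn, sql, params):
    """Return True if the row was accepted, False if a constraint rejected it."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(sql, params)
        return True
    except sqlite3.IntegrityError:
        return False


conn = build_demo_db()
new_customer = "INSERT INTO customers (id, email) VALUES (?, ?)"
new_order = "INSERT INTO orders (customer_id, amount_pence) VALUES (?, ?)"

clean_customer = try_insert(conn, new_customer, (1, "a@example.com"))
clean_order = try_insert(conn, new_order, (1, 1999))
negative_amount = try_insert(conn, new_order, (1, -5))    # breaks the CHECK rule
unknown_customer = try_insert(conn, new_order, (42, 500))  # breaks the relationship
missing_email = try_insert(conn, new_customer, (2, None))  # breaks NOT NULL

print(clean_customer, clean_order, negative_amount, unknown_customer, missing_email)
```

The clean rows are stored, while the three messy rows are rejected outright, which is exactly the behaviour that makes this kind of database dependable but unforgiving of junk.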
So how do these traditional databases stack up against the “Three V’s”? Not particularly well, especially if you want to keep costs down.
If you have an awful lot of unstructured data to store and process, you’re going to need something built for the job.
Big Data Technology – Many Hands Make Light Work
If we had to summarise Big Data methodology in a single word, it would be “Parallelism”.
Big data technology differs from traditional RDBMSs in that it’s designed around the concept of dividing the work and storage between as many decentralised machines as you wish.
The most popular big data platform is Hadoop, which uses its own distributed file system for storing data and a method known as MapReduce for processing it. MapReduce works on the principle that a processing job is divided up again and again and spread across many machines (or nodes). If you want to store more data or process it faster, you add more boxes, with little or no extra development work.
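The divide-and-collect idea behind MapReduce can be sketched in a few lines of Python. This is an illustrative toy, not Hadoop's API: the function names are made up for this example, and a local thread pool stands in for the separate machines (nodes) a real cluster would use.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def map_chunk(chunk):
    """Map step: emit a (word, 1) pair for every word in one chunk of text."""
    return [(word.lower(), 1) for word in chunk.split()]


def shuffle(mapped_chunks):
    """Group the emitted values by key so each word's counts sit together."""
    grouped = defaultdict(list)
    for pairs in mapped_chunks:
        for key, value in pairs:
            grouped[key].append(value)
    return grouped


def reduce_counts(grouped):
    """Reduce step: collapse each word's list of counts into a single total."""
    return {key: sum(values) for key, values in grouped.items()}


def word_count(chunks, workers=4):
    """Run the map step on each chunk in parallel, then shuffle and reduce."""
    # Assumption: each worker thread plays the role of one node in a cluster.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        mapped = list(pool.map(map_chunk, chunks))
    return reduce_counts(shuffle(mapped))


chunks = ["big data is big", "data is everywhere", "big ideas"]
counts = word_count(chunks)
print(counts)
```

Because each chunk is mapped independently, handling more data is a matter of adding more chunks and more workers rather than rewriting the job, which is the property the paragraph above describes.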
Big Data platforms are relatively cheap. With the most popular platforms being open-source, all you have to worry about is the hardware and paying the developers.
In fact, even those costs can be cut by renting resources from services like Amazon Web Services, or by paying for one of the growing number of software tools that present a user-friendly interface over the top of this technology.
All this makes Big Data reasonably accessible to anyone who has the inclination.
Practical Uses of Big Data
With the technical bit covered, what are the real-world applications for a business that operates online?
Uses of Big Data generally fall into two categories:
- Getting insights and visualisations efficiently from huge, varied datasets to inform business decisions. A great example of this can be seen with T-Mobile reducing customer churn.
- Using Big Data technology to power a functional feature or solve a technical challenge. A good example of this can be seen with the way Amazon has been increasing the personalisation of web content for the last few years.
"Thick Data" and "Thin Data"
There is also further discussion to be had over “thick and thin” data. This is the way we like to distinguish between data sets at equimedia.
Thick Data is how we refer to data that has a large number of variables and facts of real value appended to each data point (for example customer databases that collect personal contact information, financial information, specific user preferences etc.).
Thin Data is more typically associated with non-personally identifiable digital tracking events that are of little value on their own but, when aggregated, can give huge analytical value (impressions, clicks, page views etc.).
The key challenge is to unify the thick and thin data so you are aggregating the huge number of “thin” behavioural touch points with the more personal “thick” data. We are currently working on a number of projects with clients to combine this data and increase the value of their data overall. These projects may not be officially classed as “Big Data” from a technical perspective, but they will certainly deliver additional revenue for these clients.
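As a rough illustration of what unifying the two kinds of data can look like, the sketch below rolls a handful of "thin" events up per customer and attaches the totals to "thick" customer records. Everything here is hypothetical: the field names, the sample records, and in particular the assumption that each event can be tied to a customer through a shared ID, which in practice is often the hardest part.

```python
from collections import Counter

# Hypothetical "thick" data: rich records keyed by customer ID.
thick_customers = {
    "c1": {"name": "Alice", "preferred_channel": "email"},
    "c2": {"name": "Bob", "preferred_channel": "sms"},
}

# Hypothetical "thin" data: many small tracking events of (customer ID, event type).
thin_events = [
    ("c1", "impression"), ("c1", "click"), ("c1", "page_view"),
    ("c2", "impression"), ("c1", "click"), ("c3", "page_view"),
]


def aggregate_events(events):
    """Count each event type per customer ID."""
    totals = {}
    for customer_id, event_type in events:
        totals.setdefault(customer_id, Counter())[event_type] += 1
    return totals


def unify(customers, events):
    """Attach aggregated event counts to every known customer record."""
    totals = aggregate_events(events)
    return {
        customer_id: {**profile, "events": dict(totals.get(customer_id, {}))}
        for customer_id, profile in customers.items()
    }


unified = unify(thick_customers, thin_events)
print(unified["c1"])
```

Events for an ID with no matching customer record (here "c3") are left out of the unified view, so a real project would also need to decide what to do with unmatched activity.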
Big Data Jargon Buster
The following section covers some of the most common terms and products in the world of traditional and Big Data databases:
|MapReduce ||Pioneered by Google to power its search engine, MapReduce is a method of processing huge amounts of data by efficiently organising and dividing up the work across a scalable cluster of machines (or nodes).|
Each problem is divided up into smaller tasks (Map) and the results from the various nodes are collected (Reduce) to provide the solution.
A popular implementation of the MapReduce method is Hadoop.
|Hadoop ||Pioneered by employees at Yahoo, Hadoop is the most popular open-source implementation of the MapReduce method. It is written in a programming language called Java.|
Hadoop refers to an ecosystem of tools and technology including the MapReduce part for processing and the Hadoop Distributed File System (HDFS) for data storage.
Pretty much all the main online brands use Hadoop to some extent, and it is often only part of their Big Data solution, plugged into NoSQL databases or traditional relational databases to get the data where it needs to be.
Note that Hadoop isn’t the final word in open-source Big Data processing: it has its strengths and weaknesses and has been around since 2005. Alternative systems that may someday overtake Hadoop are emerging from the big web companies and social networks, such as Google’s BigQuery, Giraph, Amazon Redshift and Twitter Storm.
|NoSQL Databases ||NoSQL (Not only SQL) refers to a broad collection of scalable databases. Their emphasis is on scalable storage, fast read/write times and looser structure which makes them ideal for Big Data purposes.|
They are hugely varied in technology; the one characteristic they share is that they all depart from the traditional relational model.
|Graph Database ||Graph databases fall under the “NoSQL” umbrella and they store data as nodes and edges, rather than tables and columns. They are of particular note due to the rise of Social and Big Data.|
Graph databases are particularly suited to storing data about social connections and other relationships between entities.
Graph databases are used by all the major search engines and social networks, but they are increasingly making their way into the business world, powering recommendation engines and the social elements of retail websites. The most popular graph databases include Neo4j and Twitter’s own FlockDB. The rise of social-like data is also leading to the rise of alternatives to Hadoop such as Giraph, which is designed for processing graph data, and Twitter Storm, which processes streams of data in real time.
|Amazon Web Services ||AWS is by far the most widely used cloud service for Big Data. AWS refers to a vast range of services that all stick to the concept that you pay for what you need. AWS integrates tightly with Hadoop.|
A great example of AWS in use is The New York Times processing 4TB of images into 11 million PDFs using 24 hours of AWS time, at a cost of just $240.
AWS makes Big Data technology incredibly accessible to small businesses and individuals.
The concept of cloud computing that AWS pioneered has also led to the rise of user-friendly Big Data analytics packages that make use of the cloud behind the scenes, allowing non-techies to query big data without investing in hardware.
|Microsoft Azure ||This could be thought of as Microsoft’s version of AWS. It doesn’t offer the same breadth as Amazon’s services but aims to provide similar Big Data platforms such as cloud storage and Hadoop. Azure is growing rapidly and is great for businesses that are already tied into Microsoft technology.|
|SQL Server Parallel Data Warehouse ||SQL Server 2012 Parallel Data Warehouse (AKA Microsoft Analytics Platform System) is Microsoft’s Big Data version of SQL Server, which allows SQL Server to run across multiple machines in parallel. Perhaps the most useful element of PDW is PolyBase, which allows SQL Server to integrate with Hadoop clusters, making it highly desirable for businesses that need to keep SQL Server but also want to harness Big Data processing.|
|Oracle ||Oracle is expanding its range of Big Data technology, although it all comes with enterprise-level price tags. Oracle sells dedicated hardware systems for Big Data but, like Microsoft, is also focusing on ways to plug its traditional databases into Hadoop with products such as Oracle Big Data SQL.|
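To make the Graph Database entry above a little more concrete, here is a minimal Python sketch of the nodes-and-edges model and of the kind of "people you may know" recommendation it lends itself to. The names and connections are invented, and a plain dictionary stands in for a real graph database such as Neo4j, whose own query interface is not shown here.

```python
from collections import Counter

# Hypothetical social connections: each pair is an edge between two people (nodes).
edges = [
    ("ann", "bob"), ("ann", "cat"),
    ("bob", "dan"), ("cat", "dan"),
    ("dan", "eve"),
]


def build_graph(edge_list):
    """Store the graph as a mapping from each node to the set of its neighbours."""
    graph = {}
    for a, b in edge_list:
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)
    return graph


def recommend(graph, person):
    """Suggest friends-of-friends, ranked by the number of mutual connections."""
    direct = graph.get(person, set())
    scores = Counter()
    for friend in direct:
        for candidate in graph.get(friend, set()):
            if candidate != person and candidate not in direct:
                scores[candidate] += 1
    # Highest number of mutual connections first, ties broken alphabetically.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


graph = build_graph(edges)
suggestions = recommend(graph, "ann")
print(suggestions)
```

The query only ever follows edges outward from one node, which is why this style of storage suits relationship-heavy questions that would need several table joins in a relational database.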