What is Big Data?


"Big Data" is an unavoidable phrase at the moment, especially in online marketing circles. Once you’ve tried to read the myriad blogs, guides and whitepapers published daily on the topic, you’re left with the impression that Big Data will:

  • Generate huge amounts of revenue for you
  • Solve all your business issues
  • Cure cancer and end world hunger and war
  • Doom your business if you don’t hop on the bandwagon soon

Almost everything you read about big data concerns the exciting application of the new types of data available to organisations in a world where human activity increasingly takes place online.

But what actually is Big Data, and how can you make use of it?

Big Data word cloud

Photo by Camelia.boban

The Definition of Big Data

If you ask a data developer and a marketing expert what big data is, you’ll probably receive two different responses:

The Marketing expert might tell you:

Big Data is the exciting application of datasets increasingly available in the modern world

The techie might say:

Big Data is a methodology of storing and processing data

Depending on your viewpoint, either of these definitions could be valid. However, the first could apply to almost anything, so in this post we’ll briefly cover what Big Data is from a technical perspective (while trying not to get too technical).

When does data become BIG?

So is 1 million rows of data BIG? How about a billion, or perhaps a trillion?

This is a logical question, but there is no strict threshold, since from a technical viewpoint Big Data is really more about the platform and methodology than just pure numbers.

One commonly cited definition of when data becomes BIG is the “Three V’s”:

  • Volume: when the size of the data becomes impractical on traditional database systems
  • Velocity: when traditional database systems struggle to process and move data quickly enough
  • Variety: when you want to use messy and varied data not acceptable to traditional database systems

 

Now let’s look at what a “traditional” method of storing data is, and then compare it to Big Data techniques.

The Traditional Relational Database

The relational database management system (RDBMS) is the most popular way of storing and processing data in the world today. The chances are that almost every piece of software, business tool and website you encounter in your day-to-day life uses this type of database in some way to store and retrieve data.

Relational databases are all about structured data: you carefully organise and divide your data into tables and columns and define strict relationships between each entity (a process known as normalisation). A key characteristic of an RDBMS is that any data it stores has to be very clean and conform to strict pre-defined rules; if data doesn’t meet these constraints, the concept falls apart.
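To make those "strict pre-defined rules" concrete, here is a minimal sketch using SQLite through Python's built-in sqlite3 module. The table names, column names and sample values are invented for illustration and are not taken from any real system.

```python
import sqlite3

# Hypothetical schema: every name and value here is an assumption for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite only enforces foreign keys when asked
conn.execute("""
    CREATE TABLE customers (
        id    INTEGER PRIMARY KEY,
        email TEXT NOT NULL UNIQUE
    )
""")
conn.execute("""
    CREATE TABLE orders (
        id          INTEGER PRIMARY KEY,
        customer_id INTEGER NOT NULL REFERENCES customers(id),
        total_pence INTEGER NOT NULL CHECK (total_pence >= 0)
    )
""")

# Clean data that follows the rules is accepted.
conn.execute("INSERT INTO customers (id, email) VALUES (1, 'a@example.com')")
conn.execute("INSERT INTO orders (customer_id, total_pence) VALUES (1, 1999)")


def try_insert(sql):
    """Run an insert and report whether the database accepted it."""
    try:
        conn.execute(sql)
        return "accepted"
    except sqlite3.IntegrityError as exc:
        return f"rejected: {exc}"


results = [
    # Order for a customer that does not exist
    try_insert("INSERT INTO orders (customer_id, total_pence) VALUES (99, 500)"),
    # Negative order total
    try_insert("INSERT INTO orders (customer_id, total_pence) VALUES (1, -5)"),
    # Customer with no email address
    try_insert("INSERT INTO customers (id, email) VALUES (2, NULL)"),
]

for outcome in results:
    print(outcome)

order_count = conn.execute("SELECT COUNT(*) FROM orders").fetchone()[0]
```

All three messy rows are refused, so the only order stored is the clean one. That is the reliability the rigid structure buys, and also why junk-filled data is a poor fit.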



Diagram of Relational database design

Traditional Databases require a rigid structure of tables and fields

 

This inflexibility is the RDBMS's greatest strength: it makes these databases reliable, auditable and extremely secure, and they always provide the systems that talk to them with consistent, reliable data.

Relational databases can deal with huge quantities of data; however, their characteristics mean they don’t scale easily without a lot of time, money and expertise. Also, due to their strict requirements, they are not suited to processing data that might contain a lot of junk or is unpredictable in nature.

So how do these traditional databases stack up against the “Three V’s”? Not particularly well, especially if you want to keep costs down.

If you have an awful lot of unstructured data to store and process, you’re going to need something with the capabilities to handle it.

Big Data Technology – Many Hands Make Light Work

If we had to summarise Big Data methodology in a single word, it would be “Parallelism”.

Big data technology differs from traditional RDBMSs in that it’s designed around the whole concept of dividing the work and storage up between as many decentralised machines as you wish.

The most popular big data platform is Hadoop, which uses its own distributed file system for storing data and a method known as MapReduce for processing it. MapReduce works around the concept that a processing job is divided up again and again, often across many machines (or nodes). If you want to store more data or process it faster, you just add more boxes, with no extra development work.



diagram of Hadoop method

Hadoop divides both storage and processing into parallel nodes
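As a rough illustration of dividing both storage and work, the sketch below deals a dataset out to a configurable number of "nodes" and lets each one total its own share. It is a toy model, not how Hadoop is implemented: threads stand in for separate machines, and the function names and page-view data are made up for the example.

```python
from concurrent.futures import ThreadPoolExecutor


def partition(records, node_count):
    """Deal records out round-robin so each node stores only a share."""
    nodes = [[] for _ in range(node_count)]
    for index, record in enumerate(records):
        nodes[index % node_count].append(record)
    return nodes


def node_total(node_records):
    """Work done locally on one node: sum the page views it holds."""
    return sum(views for _, views in node_records)


def cluster_total(records, node_count):
    """Split the data, let every node work in parallel, then combine."""
    nodes = partition(records, node_count)
    # Threads stand in for separate machines in this toy model.
    with ThreadPoolExecutor(max_workers=node_count) as pool:
        partials = list(pool.map(node_total, nodes))
    return sum(partials)


# Hypothetical dataset: (url, page views) pairs.
page_views = [(f"/page-{i}", i % 7) for i in range(1000)]

# Adding "boxes" changes how the work is shared, not the code or the answer.
two_nodes = cluster_total(page_views, 2)
eight_nodes = cluster_total(page_views, 8)
print(two_nodes == eight_nodes)
```

The point of the design is that `node_count` is just a parameter: scaling out means changing a number rather than rewriting the job.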

 

Big Data platforms are relatively cheap. With the most popular platforms being open-source, all you have to worry about is the hardware and paying the developers.

In fact, even those costs can be cut by renting resources from services like Amazon Web Services, or paying for the increasing number of new software tools that present a user-friendly interface over the top of this technology.

All this makes Big Data reasonably accessible to anyone who has the inclination.

Practical Uses of Big Data

Having covered the technical bit, how are the real-world applications relevant to a business that operates online?

Uses of Big Data generally fall into 2 categories:

 

Analytics

Getting insights and visualisations efficiently from huge, varied datasets to inform business decisions. A great example of this can be seen with T-Mobile reducing customer churn.

Processing

Using Big Data technology to power a functional feature or solve a technical challenge. A good example of this can be seen with the way Amazon has been increasing the personalisation of web content for the last few years.

"Thick Data" and "Thin Data"

There is also further discussion to be had over “thick and thin” data. This is the way we like to distinguish between data sets at equimedia.

Thick Data is how we refer to data that has a large number of variables and facts of real value appended to each data point (for example customer databases that collect personal contact information, financial information, specific user preferences etc.).

Thin Data is more typically associated with non-personally identifiable digital tracking data points that are of little value on their own but, when aggregated, can give huge analytical value (impressions, clicks, page views etc.).

The key challenge is to unify the thick and thin data so you are aggregating the huge number of “thin” behavioural touch points with the more personal “thick” data. We are currently working on a number of projects with Clients to combine this data and increase the value of their data overall. These projects may not be officially classed as “Big Data” from a technical perspective, but they will certainly deliver additional revenue for these Clients.
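As a small sketch of what unifying the two might look like, the example below rolls up "thin" events per customer and attaches the totals to the matching "thick" record. It assumes the two datasets share a customer identifier, which is often the hard part in practice; every field name and value is hypothetical.

```python
from collections import Counter

# Hypothetical "thick" records: rich detail per customer, keyed by customer id.
thick = {
    "c1": {"name": "Alice", "plan": "premium"},
    "c2": {"name": "Bob", "plan": "basic"},
}

# Hypothetical "thin" events: (customer id, event type), little value one by one.
thin = [
    ("c1", "page_view"), ("c1", "click"), ("c1", "page_view"),
    ("c2", "impression"),
    ("c3", "page_view"),  # no matching thick record
]


def unify(thick_records, thin_events):
    """Aggregate thin events per customer and append them to the thick record."""
    counts = {}
    for customer_id, event_type in thin_events:
        counts.setdefault(customer_id, Counter())[event_type] += 1
    return {
        customer_id: {**profile, "events": dict(counts.get(customer_id, {}))}
        for customer_id, profile in thick_records.items()
    }


unified = unify(thick, thin)
print(unified["c1"])
```

Events for "c3" are dropped here because there is no thick record to attach them to. A real project would need to decide whether to keep such unmatched behaviour separately.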

 


 

Big Data Jargon Buster

The following section covers some of the most common terms and products in the world of traditional databases and Big Data:

MapReduce
Pioneered by Google to power its search engine, MapReduce is a method of processing huge amounts of data by efficiently organising and dividing up the work across scalable machines (or Nodes).

Each problem is divided up into smaller tasks (Map) and the results from the various nodes are collected (Reduce) to provide the solution.
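The classic teaching example is counting words. The sketch below runs the whole thing in one Python process purely to show the shape of the method; in a real cluster each chunk would be mapped on a different node. The grouping step in the middle (usually called the shuffle) is part of the MapReduce model even though it is not named in the description above, and the function names are our own.

```python
from collections import defaultdict


def map_phase(chunk):
    """Map: turn one chunk of text into (word, 1) pairs."""
    return [(word.lower(), 1) for word in chunk.split()]


def shuffle(mapped_chunks):
    """Group every value emitted for the same key, across all chunks."""
    grouped = defaultdict(list)
    for pairs in mapped_chunks:
        for key, value in pairs:
            grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Reduce: collapse each key's list of values into one result."""
    return {key: sum(values) for key, values in grouped.items()}


# Each string stands in for a chunk of data held on a different node.
chunks = ["big data is big", "data is everywhere"]

word_counts = reduce_phase(shuffle(map_phase(chunk) for chunk in chunks))
print(word_counts)
```

Because each call to `map_phase` only looks at its own chunk, the map work can be spread over as many machines as there are chunks.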

A popular implementation of the MapReduce method is Hadoop.
Hadoop
Pioneered by employees at Yahoo, Hadoop is the most popular open-source implementation of the MapReduce method. It is written in a programming language called Java.

Hadoop refers to an ecosystem of tools and technology including the MapReduce part for processing and the Hadoop Distributed File System (HDFS) for data storage.

Pretty much all the main online brands use Hadoop to some extent – and it is often only part of their Big-Data solution, being plugged into NoSQL databases or traditional relational databases to get the data where it needs to be.

Note that Hadoop isn’t the final word in open-source Big Data processing: it has its strengths and weaknesses, and has been around since 2005. Alternative systems are emerging from the basements of the big social networks that may someday overtake Hadoop, such as Google’s BigQuery, Giraph, Amazon Redshift and Twitter Storm.
NoSQL Databases
NoSQL (Not only SQL) refers to a broad collection of scalable databases. Their emphasis is on scalable storage, fast read/write times and looser structure, which makes them ideal for Big Data purposes.

They are hugely varied in technology, although the one characteristic they share is that they are all completely different from traditional relational databases.

Some well-known NoSQL databases to be aware of include HBase, MongoDB, Couchbase, Dynamo and Neo4j.
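To give a feel for "looser structure", the sketch below models a document-style collection with plain Python dictionaries. It does not use any real NoSQL product's API; the `find` helper and the sample documents are invented for illustration.

```python
# Hypothetical collection: each document carries whatever fields it needs.
documents = [
    {"_id": 1, "type": "click", "url": "/home"},
    {"_id": 2, "type": "purchase", "basket": [{"sku": "A1", "qty": 2}], "total": 39.98},
    {"_id": 3, "type": "click", "url": "/pricing", "referrer": "newsletter"},
]


def find(collection, **criteria):
    """Return documents whose fields match all criteria; missing fields never match."""
    return [
        doc for doc in collection
        if all(key in doc and doc[key] == value for key, value in criteria.items())
    ]


clicks = find(documents, type="click")
from_newsletter = find(documents, referrer="newsletter")
print(len(clicks), len(from_newsletter))
```

No schema had to change to add the `referrer` or `basket` fields, which is the flexibility a relational table would not allow without a redesign.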
Graph Database
Graph databases fall under the “NoSQL” umbrella and they store data as nodes and edges, rather than tables and columns. They are of particular note due to the rise of Social and Big Data.

Graph databases are particularly suited to storing any data relating to social connections, nodes and relationships.

Graph databases are used by all the major search engines and social networks, but they are increasingly making their way into the business world, powering recommendation engines and the social elements of retail websites. The most popular Graph databases include Neo4j and Twitter’s own FlockDB. The rise of social-like data is leading to the rise of alternatives to Hadoop such as Twitter Storm and Giraph which are ideally suited to processing graph data.
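The nodes-and-edges idea, and why it suits recommendations, can be sketched in a few lines of plain Python. This is an in-memory toy rather than a real graph database, and the names and friendships are made up.

```python
from collections import defaultdict

# Hypothetical friendships: each pair is an edge between two people (nodes).
edges = [("ann", "bob"), ("bob", "cat"), ("cat", "dan"), ("ann", "eve")]

# Store the graph as an adjacency list: node -> set of directly connected nodes.
graph = defaultdict(set)
for a, b in edges:
    graph[a].add(b)
    graph[b].add(a)


def suggestions(person):
    """Recommend friends-of-friends the person is not already connected to."""
    direct = graph.get(person, set())
    candidates = set()
    for friend in direct:
        candidates |= graph[friend]
    return sorted(candidates - direct - {person})


print(suggestions("ann"))
print(suggestions("bob"))
```

Answering "who is two steps away?" is a simple walk along the edges here. In a relational design, the same question typically needs the friendship table joined to itself once per step.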
Amazon Web Services
AWS is by far the most widely used cloud service for Big Data. AWS refers to a vast range of services that all stick to the concept that you pay for what you need. AWS integrates tightly with Hadoop.

A great example of AWS in use would be The New York Times processing 4TB of images into 11 million PDFs using 24 hours of AWS time, at a cost of just $240.
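Taking those figures at face value, a quick back-of-the-envelope calculation shows how small the unit costs are. The variable names are ours, and the only inputs are the numbers quoted above.

```python
# Figures quoted in the example above.
total_cost_usd = 240
pdf_count = 11_000_000
data_tb = 4

cost_per_pdf_usd = total_cost_usd / pdf_count  # dollars per PDF produced
cost_per_tb_usd = total_cost_usd / data_tb     # dollars per terabyte of images

print(f"Cost per PDF: ${cost_per_pdf_usd:.7f}")
print(f"Cost per TB:  ${cost_per_tb_usd:.2f}")
```

That works out at roughly two thousandths of a cent per PDF, or $60 for each terabyte of images processed.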

AWS makes Big Data technology incredibly accessible to small businesses and individuals.

The concept of cloud computing that AWS pioneered has also led to the rise of user-friendly Big-Data analytics packages that make use of the cloud behind the scenes to allow non-techies to query big data without investing in hardware.
Microsoft Azure
This could be thought of as Microsoft’s version of AWS. It doesn’t offer the same breadth as Amazon’s services but aims to provide similar Big Data platforms such as cloud storage and Hadoop. Azure is growing rapidly and is great for businesses that are already tied into Microsoft technology.
SQL Server Parallel Data Warehouse
SQL Server 2012 Parallel Data Warehouse (AKA Microsoft Analytics Platform System) is Microsoft’s Big Data version of SQL Server that allows SQL Server to run across multiple machines in parallel. Perhaps the most useful element of PDW is Polybase, which allows SQL Server to integrate with Hadoop clusters, making it highly desirable for businesses that need to keep SQL Server but also harness Big Data processing.
Oracle
Oracle is expanding its range of Big Data technology, although its products all come with enterprise-level price tags. Oracle sells dedicated hardware systems for Big Data, although, like Microsoft, it is also focusing on ways to plug its traditional databases into Hadoop with products such as Oracle Big Data SQL.


