Big data may no longer be a buzzword – the concept of big data, of creating big data stores and of analysing them effectively is now common practice. However, big data is still a bit of a ‘fuzzword’, as I like to call it, because there is no industry-accepted, concrete definition to tick off and finally say: ‘I have big data.’
Lately, I’ve been on a quest to try to understand what this monster is and what would be the key factors that would enable a successful big data implementation.
After having many discussions about the topic of big data with Entelect’s CTO, Martin Naude, I decided to visit my go-to website, Wikipedia, for a ‘tidy’, encyclopaedic definition. According to Wiki, the concept of big data has been around since 2001. Initially, there were three Vs that defined big data: volume, velocity and variety.
However, this is where the main dilemma begins. No one can define any of these Vs for me. There is no concise definition of what each one means or entails. So after all the research and more discussions, I have decided to throw the dictionary (and Wiki) out of the window on this one and come up with simpler, more realistic definitions of all the Vs and, at the same time, hopefully make sense of big data.
While researching the ‘volume’ side of big data, I found quite a few definitions. Some say the data must be terabytes, petabytes or even larger. However, ‘volume’ was set as a pillar back in 2001, and when I look back on my career, in the early 2000s, if someone had 100 gigabytes of data in a data store, they were on the bleeding edge and would probably be asked to deliver a keynote at a conference to explain how they actually got it right.
Today, we are in a more privileged environment. For just $3500 we can buy a small-footprint hard drive that can store all the music the world has ever produced. With the uptake of solid state drives, we are able to equip a server that can store and access tons of data without breaking the bank. So, my definition of volume is more about the ability to store and access large volumes of data, not purely a focus on the amount of data that constitutes big data.
‘Velocity’ refers to rapidly changing data that is processed at a very quick rate. It won’t come as a shock to find that I have an issue with this one too! Where is the line in the sand to say we are processing quickly: is it six gigabytes a second as in the case of the Large Hadron Collider, is it 1000 transactions a second, or is it once a month?
I believe that velocity in itself is less important than the ability to process, analyse and output transactions at the required rate over a specific timeframe. For example, transactional fraud detection is important and we may need near-real-time algorithms to run against transaction processing, otherwise our risk exposure would increase dramatically. Geofencing applications need near-real-time processing too. However, if we want to analyse a debtors’ book, a turnaround of a day or two to run certain scenarios and put plans in place for the results is perfectly acceptable.
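To make the point concrete, here is a minimal sketch (entirely my own illustration, with invented rules and field names) of how the same scoring logic can serve two very different velocity requirements: scoring each transaction as it arrives for near-real-time fraud detection, or scoring a whole batch when a slower turnaround is acceptable.

```python
# Illustrative sketch: velocity is about the required response time,
# not a fixed transactions-per-second threshold. All rules are toy examples.

def fraud_score(txn):
    """Toy rule-based score: large amounts and foreign transactions look riskier."""
    score = 0.0
    if txn["amount"] > 10_000:
        score += 0.5
    if txn["country"] != txn["home_country"]:
        score += 0.3
    return score

def process_stream(txn):
    # Near-real-time path: score each transaction as it arrives,
    # so risky ones can be blocked before settlement.
    return fraud_score(txn) >= 0.5

def process_batch(txns):
    # Batch path (e.g. the debtors'-book scenario): a day-or-two turnaround
    # is fine, so transactions are scored together after the fact.
    return [t["id"] for t in txns if fraud_score(t) >= 0.5]

txns = [
    {"id": 1, "amount": 12_500, "country": "GB", "home_country": "ZA"},
    {"id": 2, "amount": 80, "country": "ZA", "home_country": "ZA"},
]
print(process_stream(txns[0]))  # True: flagged immediately
print(process_batch(txns))      # [1]
```

The point is not the toy rules but the placement of the work: the same logic sits on a near-real-time path or a batch path depending on the rate the business actually requires.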
When we look at velocity, we also need to take into account several external factors. These include infrastructure and internet speeds, storage ability and processing speeds. This is where software meets hardware and, fortunately, we are in the middle of a software boom: new data storage engines are being released every month and we now have the likes of Mongo, NewSQL, NoSQL, Hadoop and many more that challenge our preconceptions of storing data in non-traditional formats.
‘Variety’ is the one ‘V’ that has really matured over the last decade. We now have the Internet of Things (IoT) and sensor-driven manufacturing, we have the ability to pull in social media information to take our understanding of our customers to the next level, and we have many more line-of-business systems than we did 10 years ago. This is where the real challenge starts from a business intelligence (BI), analytics and big data point of view. We can no longer pick a single technology and run with it.
Historically, organisations were a Microsoft or an Oracle outfit, but they are now faced with the challenge of running side-by-side appliances appropriate for the variety of data they are attempting to consolidate and relate. Today, organisations need to start diversifying from a single-vendor solution (although I don’t believe traditional relational databases will ever die) and face a new challenge: taking that traditional relational data, relating it to unstructured data (for example, on a Hadoop appliance), and still presenting the results through a mechanism that business users find intuitive and can understand.
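As a small illustration of that challenge (entirely my own sketch, with an in-memory SQLite table standing in for the relational side and raw text lines standing in for unstructured data on a Hadoop-style appliance; all names are invented), the trick is usually to extract just enough structure from the unstructured side to join it back to the relational keys that business users already know.

```python
# Hypothetical sketch of relating relational data to unstructured text.
import re
import sqlite3

# Relational side: a familiar customer table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany("INSERT INTO customers VALUES (?, ?)",
                 [(1, "Acme Ltd"), (2, "Globex")])

# Unstructured side: free-text events that happen to embed a customer id.
raw_events = [
    "2015-03-01 complaint cust=1 'late delivery'",
    "2015-03-02 praise cust=2 'great service'",
    "2015-03-03 complaint cust=1 'billing error'",
]

# Extract just enough structure from the text to join it back to the
# relational keys, so the output uses names business users recognise.
complaints = {}
for line in raw_events:
    m = re.search(r"complaint cust=(\d+)", line)
    if m:
        cid = int(m.group(1))
        complaints[cid] = complaints.get(cid, 0) + 1

report = {name: complaints.get(cid, 0)
          for cid, name in conn.execute("SELECT id, name FROM customers")}
print(report)  # {'Acme Ltd': 2, 'Globex': 0}
```

At scale the extraction step would run on the appliance holding the unstructured data, but the shape of the problem is the same: find the common key, then relate the two worlds.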
I am not sure what happened in 2012, but another V was added to the list: ‘veracity’. This means that the data needs to conform to the truth and has to be trusted. This, to me, brings everything back together and forms the foundation for a big data implementation. In everything we are doing, we need to ensure that what we are providing to business conforms to the truth and can be trusted. If we look at some successful big data applications and implementations, we just need to ask the question: ‘What if the data was wrong?’ and we will quickly understand the importance of this. When President Obama ran his micro-targeting campaigns, there was very little room for error and the data had to be correct. This is also the case with Amazon.com, which is trialling delivering goods before customers order them. To do this effectively, the company has written algorithms so good at predicting what customers are going to order (and when they will order it) that if the data is only 90% accurate, the company stands to lose a lot of money. Then we have Uber, a company that is setting a benchmark for others, innovating across several industries rather than operating as just another taxi service. Uber published its algorithm for predicting a client’s destination and can now begin to offer car-pool services. The value of accurate and correct data is easily understood.
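A back-of-the-envelope sketch makes the anticipatory-shipping point tangible. The numbers below are entirely invented (they are mine, not Amazon’s), but they show how quickly a prediction-driven model flips from profit to loss as data accuracy drops.

```python
# Hypothetical economics: each correct pre-shipment earns a margin, each
# wrong one costs a return/restocking loss. Integer arithmetic keeps it exact.

def expected_profit(accuracy_pct, margin_per_hit, loss_per_miss, shipments):
    hits = shipments * accuracy_pct // 100   # correctly predicted orders
    misses = shipments - hits                # shipped but never ordered
    return hits * margin_per_hit - misses * loss_per_miss

# With an assumed $5 margin per hit and $60 loss per miss, over 1000 shipments:
print(expected_profit(99, 5, 60, 1000))  # 4350  -> profitable at 99% accuracy
print(expected_profit(90, 5, 60, 1000))  # -1500 -> already losing money at 90%
```

Because a wrong shipment costs far more than a right one earns, even a seemingly respectable 90% accuracy can be loss-making: that asymmetry is why veracity sits at the foundation.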
With all these Vs in place, there seem to be many more marketable Vs that can easily be strapped on to the existing foundation. ‘Visualisation’ is important: Can we provide business with the information in a format that makes sense? Businesses are spoilt with the number of technologies available to use for a presentation layer. However, this also brings another conundrum: Which visualisation tool do they use? And this is where we start balancing cutting edge with bleeding edge. At least once a month, I speak to a potential client who mentions a new visualisation tool and asks if we at Entelect have expertise in it. Each tool has its own pros and cons, and here is where we need to take a step back and define what functionality we are actually looking for in a visualisation tool.
I have done countless implementations where self-service BI was a hard requirement, so the tools were selected and the universe of data provided. When I checked in a few months later, however, the company had employed report writers because it didn’t actually want to write its own reports, or its IT department was experiencing headaches because everyone was applying different filters to the data and expecting the same results. Do we actually need interactive charts and graphs, with drill-down and drill-through, bottom-up and top-down reporting? For many, these requirements are a must, until you mention the price of the potential tool, at which point they quickly move to showing value first and adding the sexy visualisations later. I have never seen a successful revenue assurance implementation run on anything other than raw data. What I am trying to emphasise is that companies need to keep their short-term goals in mind and, before they fork out a lot of money on a visualisation tool, make sure that the path they are going down is going to add value to the business.