Big Data has moved well past the hype cycle, and interesting things are already happening in the way these projects are taking shape within the enterprise.
One of the sessions at Microsoft TechEd 2013 revolved around building a Big Data practice on hosted cloud solutions to maximise output while staying cost efficient.
And that is the beauty of the public cloud. Start provisioning and start processing. The wait is over!
Raja Venkatesh, one of the founders of Qubole, a company that offers consultative services to enable Big Data in a hosted cloud environment, spoke about just that. The next few paragraphs look at how CIOs can plan to use their internal clouds, or the public cloud services they subscribe to, in order to get cost-effective data analytics going.
In order to understand where one can do things differently, one first needs to understand how Big Data is typically set up inside enterprises. What usually happens is that the IT organisation decides to spend part of the allocated budget on hardware and software (possibly Hadoop or Apache Hive). The next step, naturally, is hiring engineers who can start building an application that IT can maintain. Conventionally, enterprises built massive data warehouses that were extremely difficult to set up and maintain. Today, with the growth and evolution of technologies like the open source project Hadoop, it has become much easier to run algorithms that take advantage of multiple processing units, virtual and physical.
Once CIOs have set up the hardware and systems, they start to collect data over a period of time. The important thing to remember is that “Smaller data sets need more complex algorithms, while larger data sets can be quizzed for data using slightly simpler algorithms,” explains Venkatesh.
Typically, there are intangible costs involved, so while provisioning hardware, provision for three years into the future. It will take time to put the application into production, and CIOs must remember that RoI is not immediate.
Now, with something like Hadoop, the cost of processing petabytes of data comes down immediately because the work can be spread across a number of cheaper machines. Commodity hardware is not reliable, so if you are running hundreds of machines, one or two will fail, and you need a system in place to handle those failures. Hadoop provides a distributed file system and a MapReduce framework that automatically distribute the data and the computation across your servers, sparing you from worrying about fault tolerance, server failures and so on. Writing MapReduce jobs is still hard, so it is important to remember that some talent will have to be acquired to figure this out.
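To give a sense of what such a job involves, here is a minimal sketch of the classic word-count MapReduce pattern, written as plain Python functions in the style of Hadoop Streaming (which lets ordinary scripts act as the map and reduce stages). The file names and the sample invocation in the comments are illustrative assumptions, not details from the session.

```python
#!/usr/bin/env python
# Sketch of a word-count MapReduce job in the Hadoop Streaming style.
# In a real job the map and reduce functions would usually live in two
# separate scripts passed to the streaming jar.
import sys
from itertools import groupby

def mapper(lines):
    """Map phase: split each input line into words and emit (word, 1) pairs."""
    for line in lines:
        for word in line.strip().split():
            print("%s\t1" % word)

def reducer(lines):
    """Reduce phase: Hadoop delivers mapper output sorted by key, so
    consecutive lines sharing the same word can simply be summed."""
    pairs = (line.rstrip("\n").split("\t", 1) for line in lines)
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        total = sum(int(count) for _, count in group)
        print("%s\t%d" % (word, total))

if __name__ == "__main__":
    # Local test:   python wordcount.py map    < input.txt | sort | \
    #               python wordcount.py reduce
    # On a cluster, Hadoop Streaming would run the same logic, e.g.:
    #   hadoop jar hadoop-streaming.jar -mapper mapper.py -reducer reducer.py \
    #       -input /logs -output /wordcounts
    if sys.argv[1] == "map":
        mapper(sys.stdin)
    else:
        reducer(sys.stdin)
```

The point of the sketch is that the developer only writes the per-record logic; the framework handles splitting the input, shuffling the intermediate pairs and rerunning tasks on machines that fail.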
Another scenario that pops up is that there aren't petabytes of data, only a few terabytes, and a constrained budget for processing them. In this case, the public cloud fits the bill.
A cloud service provider will allow CIOs to provision as many machines as needed, and once the team is set up, they can start running some test algorithms in the production environment using Hadoop and Apache Hive. The cost shouldn’t go above a few hundred dollars.
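A typical first test on such a cluster is an ad hoc Hive query fired from a small client script. The sketch below uses the third-party PyHive package against a hypothetical weblogs table; the host name, table and columns are assumptions made purely for illustration.

```python
# Illustrative only: query a Hive table on a freshly provisioned cluster.
# Assumes HiveServer2 is reachable on the default port and that PyHive is
# installed (pip install 'pyhive[hive]'); the table and columns are hypothetical.
from pyhive import hive

def top_pages(host, limit=10):
    """Return the most-visited pages from a hypothetical weblogs table."""
    conn = hive.Connection(host=host, port=10000, database="default")
    cursor = conn.cursor()
    cursor.execute(
        """
        SELECT page_url, COUNT(*) AS hits
        FROM weblogs
        GROUP BY page_url
        ORDER BY hits DESC
        LIMIT %d
        """ % limit
    )
    return cursor.fetchall()

if __name__ == "__main__":
    for url, hits in top_pages("hive.example.internal"):
        print(url, hits)
```

Because Hive translates the SQL-like query into MapReduce jobs under the hood, the team gets answers from the cluster without writing any MapReduce code by hand.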
Data analysis for a few hundred dollars? It is hard to believe, but it is no longer a distant reality. Once the algorithms are in place, the business can get a number of questions answered from the data.
Many big companies can provision a number of machines and deploy applications on them to build their own Big Data environment. Whether a company uses the public cloud or its own infrastructure, the cost benefits of building that environment on open technologies like Hadoop or Apache Hive can go a long way towards delivering actionable insights. “CIOs will be surprised at the number of providers that are offering Hadoop-as-a-Service,” shares Venkatesh.
According to Venkatesh, there is a disconnect between acquiring the hardware and software for the production environment and actually building the application and putting it into production. After CIOs get the hardware (or provision it on the cloud for massive cost savings), they still need to figure out how to reduce the time and resources it takes to implement an algorithm.
Services like Qubole help provision the infrastructure so that applications can be deployed faster. The fact that companies like Microsoft are backing vendors like this says a lot about the company and the ecosystem it is trying to build.