Three Big Data Certs That Can Give Your IT Career a Spark

Needlestack

Every click, tweet, search, and post you make generates data. Welcome to the information age.

 

Over the past 30 years, information has become so easy to collect and distribute that we've gone from a society scrambling to find information to a society scrambling to sort it. Our ever-increasing connectivity means a constant flood of information pouring through our networks.

 

There's so much of it that even the most dedicated team of able professionals couldn't possibly process and sort all of it. This is the world of "Big Data," where it's not about finding a needle in a haystack, it's about finding a needle in a needlestack in a field of needlestacks.

 

The metaphors break down, but you get the point. Let's talk Spark.

 

Spark is a cluster-computing framework that allows multiple computers to work together for processing large data sets. It was developed by the University of California, Berkeley's AMPLab from 2009 to 2013, after which it was donated to the Apache Software Foundation. Shortly after that, it became a top-level Apache project.
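
To make the cluster-computing idea concrete, here's a minimal sketch in PySpark, Spark's Python API. It's a toy word count run locally across all CPU cores; the same code would scale out unchanged to a real cluster. The app name and sample text are placeholders, not anything from a real deployment.

    # Toy word count in PySpark: the data set is split into partitions
    # and processed in parallel -- locally here, across machines on a cluster.
    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .master("local[*]")           # all local cores; a cluster URL would go here
             .appName("WordCountSketch")   # hypothetical app name
             .getOrCreate())

    lines = spark.sparkContext.parallelize([
        "big data is big",
        "spark processes big data in parallel",
    ])

    counts = (lines.flatMap(lambda line: line.split())
                   .map(lambda word: (word, 1))
                   .reduceByKey(lambda a, b: a + b))

    print(counts.collect())  # e.g. [('big', 3), ('data', 2), ...]
    spark.stop()

Note that each transformation is recorded rather than run immediately; Spark only executes the work when collect() asks for a result, which is part of how it schedules jobs efficiently across a cluster.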

 

Distributing data across multiple machines not only allows for storing and processing larger data sets; Spark also has built-in fault tolerance, so the failure of any single node won't result in a loss of data. While Spark isn't the only framework of this kind out there, it's arguably the most useful, and the ability to use it is in high (and growing!) demand.

 

To be clear, Spark doesn't actually handle the filing and storage of the data—just the processing and analytics. Running Spark properly requires a distributed storage system, and the most common choice is Hadoop, Apache's distributed storage and batch-processing framework. Hadoop does a wonderful job of storing, sorting, and filing data across a cluster, but it doesn't have Spark's analytic speed.
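
In practice, the division of labor looks like this hedged sketch: HDFS (Hadoop's file system) holds the files, and Spark simply points at them for analysis. The hdfs:// address and file path below are hypothetical placeholders.

    # Spark reads from Hadoop's HDFS but never manages the storage itself.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("HdfsSketch").getOrCreate()

    # Hypothetical namenode address and log file path.
    logs = spark.read.text("hdfs://namenode:8020/data/server_logs.txt")

    # Filter and count in parallel across the cluster.
    error_count = logs.filter(logs.value.contains("ERROR")).count()
    print(f"error lines: {error_count}")
    spark.stop()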

 

Hadoop can't, for instance, analyze heavy data streams in real time, and it doesn't have the machine-learning capabilities that Spark offers through its MLlib library. Using Hadoop and Spark together, however, one can process large amounts of data quickly and in real time.
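
As an illustration of that real-time capability, here's a minimal Spark Streaming sketch using the classic micro-batch API. It assumes a text stream is available on localhost port 9999 (e.g. one started with nc -lk 9999); the host, port, and one-second batch interval are all assumptions for the demo.

    # Rolling word counts over a live text stream, in one-second micro-batches.
    from pyspark import SparkContext
    from pyspark.streaming import StreamingContext

    sc = SparkContext("local[2]", "StreamingSketch")  # 2 threads: one receives, one processes
    ssc = StreamingContext(sc, 1)                     # 1-second batch interval

    lines = ssc.socketTextStream("localhost", 9999)   # hypothetical source
    counts = (lines.flatMap(lambda line: line.split())
                   .map(lambda word: (word, 1))
                   .reduceByKey(lambda a, b: a + b))
    counts.pprint()  # print each batch's counts as they arrive

    ssc.start()
    ssc.awaitTermination()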

 

The power of this combination has led many well-known companies to use it. The breakout ride-hailing service Uber collects terabytes of data from its users every single day and uses Spark to turn that massive inbound stream into organized, usable data in real time.

 

Pinterest uses Spark Streaming to instantly recognize the relevance and trends of its thousands upon thousands of pins, which allows it to make real-time recommendations to users. And the video-optimization company Conviva uses Spark to optimize video streams, resulting in consistent, uninterrupted viewing for its customers.

 

Spark is already in fairly widespread use, but we're right around the corner from the anticipated Internet of Things, a time when every appliance you own will be constantly sending information back to its vendor's servers. Big data is only going to get bigger, and demand for fast, real-time analytics frameworks like Spark is only going to grow.

 

If you haven't already considered Big Data as a career option, now might be the time.

 

How to Get Started

 


If you're looking to get into Spark or related big data gigs, you've basically got two options: you can transition into it, or you can study your way into it.

 

Transitioning is probably your best bet, but it also has the most prerequisites. To make the switch to Spark, you need to already be working in a related field. Obviously, experience with Hadoop means you're already halfway there.

 

Yet while Hadoop is far and away the most popular Spark integration, Spark can also interface with Amazon S3 and Cassandra. If you have experience with either of those platforms, you might have an edge on the competition.

 

Spark is only a framework with an API, so coding skills are a must. If you're not already working in a field where coding ability is required, you should probably dust off your keyboard and refresh yourself. And if you've never learned to code anything ... well, you should probably look into that.

 

The languages of the day are Java, Python, Scala, and R, so if you happen to be proficient in any of those, then you're already ahead of the curve.

 

Studying into Spark is tricky. Obviously, you'll want to be proficient in Scala, and you should definitely familiarize yourself with Java and Python as well. You're also going to want to make sure you know your SQL. And if you don't already have experience in the field, you're going to want to pump up your resume with everything you can to prove you have the aptitude for it.
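
For the SQL side specifically, Spark lets you practice both skills at once. Here's a hedged sketch, with made-up table data, of running plain SQL against a distributed DataFrame in PySpark.

    # Register a DataFrame as a temporary view and query it with ordinary SQL.
    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .master("local[*]")
             .appName("SqlPractice")   # hypothetical app name
             .getOrCreate())

    people = spark.createDataFrame(
        [("Ada", 36), ("Grace", 45), ("Linus", 29)],  # made-up rows
        ["name", "age"],
    )
    people.createOrReplaceTempView("people")

    # Plain SQL runs against the distributed data set.
    spark.sql("SELECT name FROM people WHERE age > 30").show()
    spark.stop()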

 

In other words, look into certifications.

 

When it comes to Spark, there's a sort of trinity, all offered by Cloudera. While you can find other certs, most won't carry as much weight as Cloudera's with anybody who's familiar with the field, and the Cloudera certs are probably the only ones that prospective employers will ask for by name. All three of the certifications listed below are administered remotely over the internet, proctored via webcam.

 

Cloudera Certified Administrator for Apache Hadoop (CCAH)

 

Use the CCAH to show that you know how to deploy and maintain a Hadoop cluster. The exam is 90 minutes, 60 questions, and costs $295. This certification is version-specific, and applicants are encouraged to re-certify for new versions.

 

Cloudera Certified Professional: Data Scientist (CCP:DS)

 

This one's a good deal tougher: three exams, one challenge each, with eight hours to complete each exam. All three exams must be completed within 365 days of each other, and once earned, the certification is good for three years. During the exams, you'll be expected to demonstrate your ability to extract relevant features from large data sets in a variety of formats.

 

To give you an idea, the exams are called "Descriptive and Inferential Statistics on Big Data," "Advanced Analytical Techniques on Big Data," and "Machine Learning at Scale," in that order. Each exam is $600.

 

CCA Spark and Hadoop Developer Certification

 

If you've got your Scala and Python up to par and want to prove it, this is the certification to take. In the two-hour exam, you'll be given 10 to 12 hands-on tasks, some of which will require coding and some of which will test your ability to work with tools like Hive and Impala. The exam costs $295.
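
Cloudera doesn't publish the actual exam tasks, so the following is purely a hypothetical example of the style of hands-on work involved: read raw data, transform it, and write the result out. All paths and column names here are invented.

    # Hypothetical CCA-style task: total revenue per customer from a CSV.
    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col

    spark = SparkSession.builder.appName("CcaStyleTask").getOrCreate()

    # Invented input path and schema.
    orders = spark.read.option("header", "true").csv("/data/orders.csv")

    result = (orders.withColumn("amount", col("amount").cast("double"))
                    .groupBy("customer_id")
                    .sum("amount")
                    .withColumnRenamed("sum(amount)", "revenue")
                    .orderBy(col("revenue").desc()))

    # Write the answer where the grader (hypothetically) expects it.
    result.write.mode("overwrite").parquet("/results/revenue_by_customer")
    spark.stop()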

 

Happy digging!

 

About the Author
David Telford

David Telford is a short-attention-span renaissance man and university student. His current project is the card game MatchTags, which you can find on Facebook and Kickstarter.