Apache Spark – Big Data Platform for All


Apache Spark is a powerful open source in-memory cluster computing framework built around speed, ease of use, and sophisticated analytics. It runs everywhere – Hadoop (YARN), Mesos, standalone, or in the cloud. It can access diverse data sources including HDFS, Cassandra, HBase, S3 and more. Spark powers a stack of high-level tools including Spark SQL, MLlib for machine learning, GraphX for graph processing, and Spark Streaming to build scalable fault-tolerant streaming applications. These can also be combined seamlessly in an application.

Spark is engineered from the ground up for performance. By exploiting in-memory computing and other optimizations, it can run up to 100x faster than Hadoop MapReduce on some workloads, and it excels at iterative computation. It is currently a top-level Apache project and among the most active ones.

Spark is written in Scala, but its API comes in many flavours: Scala, Java, Python, and now R.

With the recently released DataFrame API, Spark brings simplicity to distributed big data processing for everyone. This API is inspired by native data frames in R and Python (Pandas), but is designed from the ground up to support modern big data and data science applications. As an extension to the existing API, DataFrames can scale from kilobytes of data on a single laptop to petabytes on a large cluster. New users familiar with data frames in languages like R and Python should feel at home with this API.

With DataFrame support and many other simple yet powerful offerings, Spark has the potential to become the de facto platform for data scientists, analysts, and developers to explore big data.

About Sumit Mund

Sumit Mund is a big data analytics consultant with about a decade of industry experience. At Mund Consulting Ltd, he is a director and acts as the lead consultant. He is an expert in machine learning, predictive analytics, Apache Spark, Python, C#, R, and Scala, and has an active interest in Artificial Intelligence. He has extensive experience working with most Microsoft data analytics tools and with big data platforms like HDInsight. He is a Certified Developer on Apache Spark and also a Microsoft Certified Solution Expert (MCSE in Business Intelligence). Sumit regularly engages on social media platforms through his tweets, blogs, and LinkedIn profile, and often gives talks at industry conferences and local user group meetings.
