Apache Spark – Big Data Platform for All

Apache Spark is a powerful open-source, in-memory cluster computing framework built around speed, ease of use, and sophisticated analytics. It runs almost anywhere: on Hadoop (YARN), on Mesos, standalone, or in the cloud. It can access diverse data sources, including HDFS, Cassandra, HBase, and S3. Spark powers a stack of high-level tools, including Spark SQL, MLlib for machine learning, GraphX for graph processing, and Spark Streaming for building scalable, fault-tolerant streaming applications. These tools can be combined seamlessly in a single application.

Spark is engineered from the ground up for performance. By exploiting in-memory computing and other optimizations, it can run up to 100x faster than Hadoop MapReduce on in-memory workloads, and it excels at iterative computation. It is currently a top-level Apache project and among the most active ones.

Spark is written in Scala, but its API comes in several flavours: Scala, Java, Python and now R.

With the recently released DataFrame API, Spark brings simplicity to distributed big data processing for everyone. This API is inspired by native data frames in R and Python (Pandas), but is designed from the ground up to support modern big data and data science applications. As an extension to the existing API, DataFrames can scale from kilobytes of data on a single laptop to petabytes on a large cluster. New users who are familiar with data frames in languages like R and Python should feel at home with this API.
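To give a feel for what this looks like in practice, here is a minimal sketch of a filter-and-aggregate query in PySpark. It assumes PySpark 2.0 or later, where `SparkSession` is the entry point (the first DataFrame release used `SQLContext` instead), and a local Spark installation. The sample rows, the column names and both function names are made up for illustration. A plain-Python version of the same computation is included only for comparison.

```python
# Hypothetical sample data: (name, age, city). Not taken from any real dataset.
PEOPLE = [
    ("Alice", 34, "London"),
    ("Bob", 45, "London"),
    ("Carol", 29, "Paris"),
    ("Dan", 19, "Paris"),
]


def average_age_by_city_plain(rows):
    """Plain-Python reference: average age of people over 21, per city."""
    totals = {}
    for _name, age, city in rows:
        if age > 21:
            age_sum, count = totals.get(city, (0, 0))
            totals[city] = (age_sum + age, count + 1)
    return {city: age_sum / count for city, (age_sum, count) in totals.items()}


def average_age_by_city_spark(rows):
    """The same query expressed with the DataFrame API (requires PySpark)."""
    # Imported here so the plain-Python part runs without Spark installed.
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    # Assumption: a local master is enough for this small illustration.
    spark = (
        SparkSession.builder.master("local[*]")
        .appName("dataframe-sketch")
        .getOrCreate()
    )
    try:
        df = spark.createDataFrame(rows, ["name", "age", "city"])
        result = (
            df.filter(F.col("age") > 21)
            .groupBy("city")
            .agg(F.avg("age").alias("avg_age"))
        )
        return {row["city"]: row["avg_age"] for row in result.collect()}
    finally:
        spark.stop()


print(average_age_by_city_plain(PEOPLE))  # {'London': 39.5, 'Paris': 29.0}
```

The DataFrame version reads much like the equivalent Pandas or R code, but the same query could be pointed at a dataset spread across a cluster without changing its shape. Calling `average_age_by_city_spark(PEOPLE)` should give the same averages as the plain-Python version.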

With support for DataFrames and many other simple yet powerful offerings, Spark has the potential to become the de facto platform for data scientists, analysts and developers working with big data.

Sumit Mund

Sumit Mund is an Artificial Intelligence Consultant with more than 12 years of experience. He has an MSc by Research degree and a B.Tech degree in Information Technology. He is also a part-time PhD scholar at the University of Huddersfield, where his research includes applications of Deep Reinforcement Learning and makes extensive use of Google TensorFlow.
