Apache Spark is a powerful open-source, in-memory cluster computing framework built around speed, ease of use, and sophisticated analytics. It runs everywhere: on Hadoop (YARN), on Mesos, standalone, or in the cloud, and it can access diverse data sources including HDFS, Cassandra, HBase, S3, and more. Spark powers a stack of high-level tools including Spark SQL, MLlib for machine learning, GraphX for graph processing, and Spark Streaming for building scalable, fault-tolerant streaming applications, all of which can be combined seamlessly within a single application.
Spark is engineered from the bottom up for performance. By exploiting in-memory computing and other optimizations it can run workloads up to 100x faster than Hadoop MapReduce, and it particularly excels at iterative computation, where the same dataset is reused across many passes. It is currently a top-level Apache project and one of the most active ones.
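To give a flavour of why iterative jobs benefit from in-memory computing, here is a minimal sketch in Scala. Everything in it is illustrative: the input file points.txt and the toy convergence loop are assumptions, not a real workload. The key call is cache(), which keeps the dataset in memory so each pass of the loop reuses it instead of re-reading the file.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object CacheExample {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("CacheExample").setMaster("local[*]")
    val sc   = new SparkContext(conf)

    // Hypothetical input: "points.txt" holds one number per line.
    val numbers = sc.textFile("points.txt").map(_.toDouble)

    // cache() keeps the parsed RDD in memory, so each iteration below
    // reuses it instead of re-reading and re-parsing the file.
    numbers.cache()

    // A toy iterative loop: every pass scans the same cached dataset.
    var threshold = 0.0
    for (_ <- 1 to 10) {
      threshold = numbers.filter(_ > threshold).mean()
    }
    println(s"Converged threshold: $threshold")

    sc.stop()
  }
}
```

Without the cache() call each pass would go back to disk; with it, Spark serves every pass from memory, and that reuse is where the speed-up over MapReduce comes from on iterative jobs.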
Spark is written in Scala, but its API comes in many flavours: Scala, Java, Python, and now R.
The recently released DataFrame API brings simplicity to distributed big data processing for everyone. It is inspired by native data frames in R and Python (Pandas), but designed from the ground up to support modern big data and data science applications. As an extension of the existing API, a DataFrame can scale from kilobytes of data on a single laptop to petabytes on a large cluster, and users already familiar with data frames in languages like R and Python should feel right at home.
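To make that concrete, here is a minimal sketch of the DataFrame API in Scala, written against the SQLContext entry point of the 1.x releases; the file people.json and its name/age fields are hypothetical:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

object DataFrameExample {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("DataFrameExample").setMaster("local[*]")
    val sc   = new SparkContext(conf)
    val sqlContext = new SQLContext(sc)

    // Hypothetical input: one JSON object per line,
    // e.g. {"name": "Ann", "age": 32}
    val people = sqlContext.read.json("people.json")

    // The schema is inferred automatically from the data.
    people.printSchema()

    // Familiar data-frame-style operations, as in R and Pandas:
    people.filter(people("age") > 21)
          .groupBy("age")
          .count()
          .show()

    sc.stop()
  }
}
```

The same filter/groupBy/count pipeline looks almost identical in Python and R, which is exactly what makes the API feel familiar across languages.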
With DataFrame support and many other simple yet powerful offerings, Spark has the potential to become the de facto platform for data scientists, analysts, and developers to work with big data.