It goes without saying that Apache Spark is becoming the de facto platform for big data analytics. At the same time, there is a notebook revolution going on. Data scientists and others who use notebooks simply love them. A notebook provides a browser-based interactive environment to write and execute code, view output, make plots […]
What is an Enterprise Data Lake? Back in 2010, Pentaho co-founder and CTO James Dixon coined the term ‘Data Lake’. While there are many interpretations of the term these days, it usually means a repository that holds a vast amount of raw data in its native format until it is needed. Raw data at […]
Apache Spark is a powerful open-source, in-memory cluster computing framework built around speed, ease of use, and sophisticated analytics. It runs everywhere – Hadoop (YARN), Mesos, standalone, or in the cloud. It can access diverse data sources including HDFS, Cassandra, HBase, S3, and more. Spark powers a stack of high-level tools including Spark SQL, […]
Yesterday, I delivered a short presentation on R support for Microsoft Azure Machine Learning at the ManchesterR user meeting. Below is the PPT, embedded from SlideShare.
When you train a machine learning algorithm, it is very important that you choose the right set of parameters. If you don’t understand the ins and outs of that algorithm, it can be very difficult to choose and fine-tune the parameters. Even if you understand the algorithm well, it might be daunting to run different iteration […]