The following data platforms are currently available in Microsoft Azure.
Azure Storage is the cloud storage solution for modern applications that rely on durability, availability, and scalability to meet the needs of their customers. A standard storage account gives you access to Blob storage, Table storage, Queue storage, and File storage.
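The four services of a standard storage account are each reachable at a predictable HTTPS endpoint derived from the account name. A minimal sketch of those endpoint patterns (the account name "mystorageacct" is a placeholder, not a real account):

```python
# Endpoint URL patterns for the four services of a standard storage
# account; "mystorageacct" is a placeholder account name.
account = "mystorageacct"
endpoints = {
    "blob":  f"https://{account}.blob.core.windows.net",
    "table": f"https://{account}.table.core.windows.net",
    "queue": f"https://{account}.queue.core.windows.net",
    "file":  f"https://{account}.file.core.windows.net",
}
for service, url in endpoints.items():
    print(service, url)
```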
Azure Data Lake Store is a cloud repository where you can easily store data of any size or type. It is a Hadoop Distributed File System for the cloud, available on demand. Data stored in Data Lake Store is readily accessible to Azure Data Lake Analytics and Azure HDInsight, and it can also be integrated with other Hadoop distributions and projects such as Hortonworks, Cloudera, Spark, Storm, and Flume.
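Files in a Data Lake Store account are addressed with the adl:// URI scheme, which HDInsight and Data Lake Analytics jobs can read directly. A small sketch of that addressing (the store name "mydatalake" and the file path are placeholder examples):

```python
# Data Lake Store paths use the adl:// scheme against the account's
# azuredatalakestore.net endpoint; "mydatalake" and the file path
# below are placeholder names for illustration.
store = "mydatalake"
uri = f"adl://{store}.azuredatalakestore.net/clickstream/2016/raw.csv"
print(uri)
```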
Below are the steps to create an Azure Data Lake Store and manage it using the Azure Portal and the Azure CLI.
What is an Enterprise Data Lake?
Way back in 2010, Pentaho co-founder and CTO James Dixon coined the term ‘Data Lake’. While many interpretations of the term exist these days, it usually means a repository that holds a vast amount of raw data in its native format until it is needed. Because raw data is stored at its most granular level, any ad-hoc analysis can be performed at any time.
Apache Spark is a powerful open source in-memory cluster computing framework built around speed, ease of use, and sophisticated analytics. It runs everywhere – on Hadoop (YARN), Mesos, standalone, or in the cloud. It can access diverse data sources including HDFS, Cassandra, HBase, S3, and more. Spark powers a stack of high-level tools including Spark SQL, MLlib for machine learning, GraphX for graph processing, and Spark Streaming for building scalable, fault-tolerant streaming applications. These can also be combined seamlessly in a single application.
To develop Apache Spark applications in IPython and Python Tools for Visual Studio, we need to set the PYTHONPATH environment variable to include the required Spark library paths.
Setting PYTHONPATH for Spark
- Go to System Properties and, in the Advanced tab, click Environment Variables.
- Create a new system variable and name it PYTHONPATH.
- Add the below paths to the value field, separated by semicolons (here c:\spark-1.3.0 is the path where Spark is installed).
- Create another system variable named SPARK_HOME, and set its value to the path of the Spark installation directory.
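The steps above can also be applied per session from Python itself. A minimal sketch, assuming Spark is installed at c:\spark-1.3.0 as in the steps above; the name of the bundled py4j zip varies between Spark releases, so the sketch globs for it rather than hard-coding a version:

```python
import glob
import os

# Assumed install directory (matches the example path in the steps above).
spark_home = r"c:\spark-1.3.0"

# PYTHONPATH needs Spark's python folder plus the bundled py4j zip;
# the py4j version differs by Spark release, so glob for it.
paths = [os.path.join(spark_home, "python")]
paths += glob.glob(os.path.join(spark_home, "python", "lib", "py4j-*.zip"))

os.environ["SPARK_HOME"] = spark_home
existing = os.environ.get("PYTHONPATH", "")
# os.pathsep is ";" on Windows, matching the semicolon-separated
# value entered in the Environment Variables dialog.
os.environ["PYTHONPATH"] = os.pathsep.join(p for p in paths + [existing] if p)

print(os.environ["SPARK_HOME"])
print(os.environ["PYTHONPATH"])
```

Note that variables set this way apply only to the current process and its children; the Environment Variables dialog in the steps above makes them persistent system-wide.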