A sample to integrate data from Hadoop (HDInsight) using SSIS

Hadoop mostly deals with unstructured data, while your structured data lives in relational databases. After the necessary processing on the Hadoop cluster, you may need to bring your analysis into your data warehouse or RDBMS tables for further analysis, so that the unstructured data can complement the structured database.

As I was playing around with HDInsight (Microsoft's implementation of Apache Hadoop) on Azure, I thought it would be useful to compile a step-by-step guide to integrating data from a Hadoop cluster (HDInsight) using SQL Server Integration Services.

Here I will import simple key-value pairs from a Hadoop cluster into a two-column SQL Server table on the local machine.

Prepare what to import from Hadoop

Here I will import the output of the most common demo of a Hadoop job: Word Count. I am using the HDInsight Preview, and WordCount ships as one of the default samples. After the job was deployed, I created an external table in Hive for the top ten words by count.

If you are a newbie and just want to get started with HDInsight on Azure from scratch, follow the step-by-step guide at http://www.windowsazure.com/en-us/manage/services/hdinsight/get-started-hdinsight/. That guide also walks you through deploying the WordCount sample job.

Create the Hive table

(Ref: http://www.windowsazure.com/en-us/manage/services/hdinsight/interactive-javascript-and-hive-consoles/#header-3)

  1. Click Hive on the upper right to open the Hive console.
  2. Enter the following command to create a two-column table named DaVinciWordCountTable from the WordCount sample output that was saved in the "DaVinciTop10Words" folder:
CREATE EXTERNAL TABLE DaVinciWordCountTable
(word STRING, count INT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE LOCATION '/user/admin/DaVinciTop10Words';

Replace admin with the login username.

Note that the table is created as an EXTERNAL table so that the targeted folder stays independent of the table. Also note that you only need to specify the folder in which the output file is located, not the file name itself.
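
One practical consequence of EXTERNAL, sketched with the table created above: dropping the table removes only Hive's metadata, so the output files survive and the table can be recreated over the same folder at any time.

-- Dropping the EXTERNAL table deletes only Hive's metadata;
-- the files in /user/admin/DaVinciTop10Words remain intact.
DROP TABLE DaVinciWordCountTable;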

  3. Click EVALUATE on the lower left.
  4. Enter the following commands to confirm that the two-column table has been created:
SHOW TABLES;
DESCRIBE DaVinciWordCountTable;
  5. Click EVALUATE.

Then test that the external table is working by running a SQL query against it.

  6. Run the following command to query for the ten words with the highest number of occurrences:
SELECT word, count
FROM DaVinciWordCountTable
ORDER BY count DESC LIMIT 10;

The results of this query are:

the 22966
of 11228
and 8428
in 5737
to 5296
a 4791
is 4261
it 3073
that 2903
which 2544

Now your data in Hadoop is ready to be imported.

Create the destination table in a SQL Server database

I just created a simple table in a SQL Server database named HadoopData with the following schema.
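
A minimal T-SQL sketch of such a table, assuming two columns that mirror the Hive table; the table and column names here are illustrative, not taken from the original post:

-- Destination table for the imported word counts.
CREATE TABLE dbo.DaVinciWordCount
(
    word    NVARCHAR(255) NOT NULL,  -- receives the Hive STRING column
    [count] INT           NOT NULL   -- receives the Hive INT column; bracketed to avoid the COUNT keyword
);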

ODBC Driver for HDInsight

You need to download and install the ODBC driver for HDInsight from the following link:

http://www.microsoft.com/en-us/download/details.aspx?id=37134

SSIS Package to Import data
  1. Create an Integration Services project using SQL Server Data Tools (or BIDS).

  2. My solution looked as shown below.

  3. Add a connection manager for Hadoop Hive (your source data!).
  4. Right-click on the Connection Managers panel and choose New Connection.

  5. Choose ODBC and click Add.

  6. On the Configure ODBC Connection Manager dialog, click New, and then select Use Connection String on the Connection Manager.

  7. On Select Data Source, choose the Machine Data Source tab and click New. Then, on Create New Data Source, choose System Data Source and click Next.

  8. Then choose HIVE, click Next, and then click Finish on the following screen.

  9. On Hive Data Source Configuration, type a data source name and set Host to your Hadoop cluster, e.g. myhadoop.azurehdinsight.net. Finish with the username and password for your Hadoop cluster.

  10. Click OK to close the screens, then test the connection. Finally, click OK to finish creating the connection manager to the Hadoop data.
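
With Use Connection String selected, a DSN-based ODBC string referencing the data source created above might look like the following (DSN, UID, and PWD are the standard ODBC keywords; the values are illustrative):

DSN=MyHadoopHiveDS;UID=admin;PWD=yourpassword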

  11. Similarly, create a connection manager for the local database table.

  12. Add a Data Flow Task to the control flow, then add an ODBC Source and an ADO.NET Destination inside the data flow.

  13. Configure the ODBC Source (the Hadoop external table) to use the MyHadoopHiveDS connection manager, and write a SQL query to pull data from the Hive table; a sketch of such a query is below.
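
A plausible source query, reusing the DaVinciWordCountTable created earlier (treat it as an assumption about what the package runs; the original query is not shown):

SELECT word, count
FROM DaVinciWordCountTable
ORDER BY count DESC LIMIT 10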

  14. Configure the destination (SQL DB table) to write the values to the SQL Server table.

  15. Run the package.

  16. Now go to the local DB table to check whether the data was written successfully; a quick check is sketched below.
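
A simple verification query against the destination table (dbo.DaVinciWordCount is the illustrative name from the sketch above, not necessarily the name used in the original package):

-- Confirm the ten rows arrived and match the Hive output.
SELECT word, [count]
FROM dbo.DaVinciWordCount
ORDER BY [count] DESC;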

