Thursday, 17 October 2013

Using the HDP Sandbox to Learn Sqoop

Once you have your HDP Sandbox up and running, you can use Sqoop to move data between your Hadoop cluster and a relational database.  The Sandbox's Hive/HCatalog environment already uses a MySQL database server to store its metadata, so you can use that built-in MySQL server to play with Sqoop.  In real life you would not experiment on the metastore's MySQL server, but I'm going to for this demo.  Credit for this demo goes to Tom Hanlon (a longtime friend and a great resource in the Hadoop space).

Be aware that Sqoop is not atomic.  After a data load, it is good practice to do a record count on both sides and make sure they match.
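
For example, once Demo One below has run, you can compare record counts on both sides (a minimal sketch; the table name and HDFS path come from that demo):
# mysql -e "SELECT COUNT(*) FROM sqoopdb.mytab"
$ hadoop fs -cat mytab/part-m-* | wc -l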

Log into your HDP Sandbox as root and bring up a terminal window (instructions are provided in the Sandbox).  The loopback address 127.0.0.1 is a non-routable IP address that refers to the local host; it is what the JDBC connect strings below point at.

Demo One:  Move data from a relational database into your Hadoop cluster.  Then use HDFS commands to verify that the files reside in your Hadoop cluster on HDFS.
Connect to MySQL using the mysql client, create a database and build a simple table.
# mysql
mysql> CREATE DATABASE sqoopdb;
mysql> USE sqoopdb;
mysql> CREATE TABLE mytab (id int not null auto_increment primary key, name varchar(20));
mysql> INSERT INTO mytab VALUES (null, 'Tom');
mysql> INSERT INTO mytab VALUES (null, 'George');
mysql> INSERT INTO mytab VALUES (null, 'Barry');
mysql> INSERT INTO mytab VALUES (null, 'Mark');
mysql> GRANT ALL ON sqoopdb.* TO root@localhost;
mysql> GRANT ALL ON sqoopdb.* TO root@'%';
mysql> exit;

-- The sqoop command requires permission to access both the database and HDFS, so run the import as the hdfs user.
# su - hdfs
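-- Optional: confirm the JDBC connection first with Sqoop's list commands (a quick sketch using
-- the same connect string as the import below).
$ sqoop list-databases --connect jdbc:mysql://127.0.0.1 --username root
$ sqoop list-tables --connect jdbc:mysql://127.0.0.1/sqoopdb --username root
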
$ sqoop import --connect jdbc:mysql://127.0.0.1/sqoopdb --username root --direct --table mytab -m 1

-- Verify the import.  With no --target-dir specified, the files land under the hdfs user's home
-- directory in HDFS (i.e. /user/hdfs/mytab).
$ hadoop fs -lsr mytab
$ hadoop fs -cat mytab/part-m-00000

Demo Two:  Load data from a relational database into Hive.  Then query the data using Hive.
# mysql
mysql> USE sqoopdb;
mysql> CREATE TABLE newtab (id int not null auto_increment primary key, name varchar(20));
mysql> INSERT INTO newtab VALUES (null, 'Tom');
mysql> INSERT INTO newtab VALUES (null, 'George');
mysql> INSERT INTO newtab VALUES (null, 'Barry');
mysql> INSERT INTO newtab VALUES (null, 'Mark');
mysql> exit;

# su - hdfs
$ sqoop import --connect jdbc:mysql://127.0.0.1/sqoopdb --username root --table newtab \
  --direct -m 1 --hive-import
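
-- By default --hive-import creates a Hive table named after the source table.  To pick a different
-- target table name, or to replace its contents on a re-run, you can add flags like these (a rough
-- sketch; newtab_copy is just a hypothetical target name).
$ sqoop import --connect jdbc:mysql://127.0.0.1/sqoopdb --username root --table newtab \
  --direct -m 1 --hive-import --hive-table newtab_copy --hive-overwrite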

-- Hive has a command-line interface for working with the data.  Using the Hive metadata, users can
-- query the data through a SQL-like interface.  The person running the hive command must have read
-- access to the data in HDFS.
$ hive
hive> show tables;
hive> SELECT * FROM newtab;
hive> exit;
$
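
-- You can also ask Hive where a table's data lives.  From the hive prompt, DESCRIBE FORMATTED
-- prints the table's HDFS location, which ties into the warehouse directory discussed next.
hive> DESCRIBE FORMATTED newtab;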


-- The physical files are stored in the HDFS directory defined by the following property in the
-- /etc/hive/conf/hive-site.xml file:
-- <property>
--   <name>hive.metastore.warehouse.dir</name>
--   <value>/apps/hive/warehouse</value>
-- </property>

-- Look at the data location in HDFS.
$ hadoop fs -lsr /apps/hive/warehouse/newtab

-- Look at the data contents.
$ hadoop fs -cat /apps/hive/warehouse/newtab/part-m-00000


-- You can use the following help commands, along with the Sqoop documentation, to work through
-- many more examples of moving data between your Hadoop cluster and a relational database.
$ sqoop help
$ sqoop help import
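
-- To move data the other direction, the sqoop export subcommand pushes files from HDFS back into
-- a relational table.  A rough sketch (assuming an empty MySQL table named mytab_copy already
-- exists in sqoopdb with the same columns as mytab):
$ sqoop export --connect jdbc:mysql://127.0.0.1/sqoopdb --username root \
  --table mytab_copy --export-dir mytab -m 1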
