Tuesday, 31 December 2013

HBase Examples

Start the HBase shell
          $ hbase shell

List all the tables
          hbase> list

Create an HBase table named 'cars' with one column family, 'vi'
          hbase> create 'cars', 'vi'

Let's insert 3 column qualifiers (make, model, year) and the associated values into the first row (row1).
          hbase> put 'cars', 'row1', 'vi:make', 'BMW'
          hbase> put 'cars', 'row1', 'vi:model', '5 series'
          hbase> put 'cars', 'row1', 'vi:year', '2012'

Now let's add a second row
          hbase> put 'cars', 'row2', 'vi:make', 'Ferrari'
          hbase> put 'cars', 'row2', 'vi:model', 'e series'
          hbase> put 'cars', 'row2', 'vi:year', '2012'

Now let's add a third row
          hbase> put 'cars', 'row3', 'vi:make', 'Honda'
          hbase> put 'cars', 'row3', 'vi:model', 'f series'
          hbase> put 'cars', 'row3', 'vi:year', '2013'

Scan the table
          hbase> scan 'cars'

The next scan we'll run will limit our results to the make column qualifier.
          hbase> scan 'cars', {COLUMNS => ['vi:make']}

Next, scan with LIMIT => 1 to return just one row and demonstrate how LIMIT works.
          hbase> scan 'cars', {COLUMNS => ['vi:make'], LIMIT => 1}

We'll start by getting all columns in row1.
          hbase> get 'cars', 'row1'

You should see output similar to:
COLUMN                   CELL
 vi:make                 timestamp=1344817012999, value=BMW
 vi:model                timestamp=1344817020843, value=5 series
 vi:year                 timestamp=1344817033611, value=2012

To get one specific column, include the COLUMN option.
         hbase> get 'cars', 'row1', {COLUMN => 'vi:make'}

You can also get two or more columns by passing an array of columns.
         hbase> get 'cars', 'row1', {COLUMN => ['vi:make', 'vi:year']}

Delete a cell (value)
         hbase> delete 'cars', 'row2', 'vi:year'

Let's check that our delete worked
         hbase> get 'cars', 'row2'

You should see output that shows only 2 columns.
 vi:make                 timestamp=1344817104923, value=Ferrari
 vi:model                timestamp=1344817115463, value=e series
2 row(s) in 0.0080 seconds

Disable and drop tables
         hbase> disable 'cars'

         hbase> drop 'cars'

Exit the shell
         hbase> exit

HBase Shell Commands

Show the current hbase user.
         hbase> whoami

Alter column family schema; pass the table name and a dictionary specifying the new column family schema. Dictionaries are described below in the GENERAL NOTES section. The dictionary must include the name of the column family to alter.
For example,

 To change or add the 'f1' column family in table 't1' from defaults to instead keep a maximum of 5 cell VERSIONS, do:
           hbase> alter 't1', {NAME => 'f1', VERSIONS => 5}

To delete the 'f1' column family in table 't1', do:
           hbase> alter 't1', {NAME => 'f1', METHOD => 'delete'}

You can also change table-scope attributes like MAX_FILESIZE.

For example, to set the table's maximum store file size (the region split threshold) to 128MB, do:
           hbase> alter 't1', {METHOD => 'table_att', MAX_FILESIZE => '134217728'}

Count the number of rows in a table. This operation may take a LONG time (Run '$HADOOP_HOME/bin/hadoop jar hbase.jar rowcount' to run a counting mapreduce job). Current count is shown every 1000 rows by default. Count interval may be optionally specified.

           hbase> count 't1'
           hbase> count 't1', 100000
Suppose you had a reference t to table 't1'; the same command can be run on it:
           hbase> t.count INTERVAL => 100000
           hbase> t.count CACHE => 1000
           hbase> t.count INTERVAL => 10, CACHE => 1000

Create table; pass table name, a dictionary of specifications per column family, and optionally a dictionary of table configuration. Dictionaries are described below in the GENERAL NOTES section.

           hbase> create 't1', {NAME => 'f1', VERSIONS => 5}
           hbase> create 't1', {NAME => 'f1'}, {NAME => 'f2'}, {NAME => 'f3'}
           hbase> # The above in shorthand would be the following:
           hbase> create 't1', 'f1', 'f2', 'f3'
           hbase> create 't1', {NAME => 'f1', VERSIONS => 1, TTL => 2592000, \
             BLOCKCACHE => true}

Describe the named table
        e.g. "hbase> describe 't1'"

Put a delete cell value at specified table/row/column and optionally timestamp coordinates. Deletes must match the deleted cell's coordinates exactly. When scanning, a delete cell suppresses older versions. Takes arguments like the 'put' command described below.
          hbase> delete 't1', 'r1', 'c1', ts1

Delete all cells in a given row; pass a table name, row, and optionally a column and timestamp.
          hbase> deleteall 't1', 'r1'
          hbase> deleteall 't1', 'r1', 'c1'
          hbase> deleteall 't1', 'r1', 'c1', ts1
The same commands can also be run on a table reference. Suppose you had a reference t to table 't1'; the corresponding commands would be:
          hbase> t.deleteall 'r1'
          hbase> t.deleteall 'r1', 'c1'
          hbase> t.deleteall 'r1', 'c1', ts1

Disable the named table:
            e.g. "hbase> disable 't1'"

Disable all tables matching the given regex
            hbase> disable_all 't.*'

Drop the named table. The table must first be disabled.
               hbase> drop 't1'

Drop all of the tables matching the given regex
              hbase> drop_all 't.*'

Enable the named table
              hbase> enable 't1'

Enable all of the tables matching the given regex
             hbase> enable_all 't.*'

Verify whether the named table is enabled
            hbase> is_enabled 't1'

Does the named table exist?
             e.g. "hbase> exists 't1'"

Type "hbase> exit" to leave the HBase Shell

Get row or cell contents; pass table name, row, and optionally a dictionary of column(s), timestamp and versions.  

           hbase> get 't1', 'r1'
           hbase> get 't1', 'r1', {COLUMN => 'c1'}
           hbase> get 't1', 'r1', {COLUMN => ['c1', 'c2', 'c3']}
           hbase> get 't1', 'r1', {COLUMN => 'c1', TIMESTAMP => ts1}
           hbase> get 't1', 'r1', {COLUMN => 'c1', TIMESTAMP => ts1, \
             VERSIONS => 4}

List all tables in HBase
           hbase> list
           hbase> list 'abc.*'

Put a cell 'value' at specified table/row/column and optionally timestamp coordinates.  To put a cell value into table 't1' at  row 'r1' under column 'c1' marked with the time 'ts1', do:
           hbase> put 't1', 'r1', 'c1', 'value', ts1

Listing of HBase surgery tools
           hbase> tools

Scan a table; pass table name and optionally a dictionary of scanner specifications.  Scanner specifications may include one or more of the following: LIMIT, STARTROW, STOPROW, TIMESTAMP, or COLUMNS.  If no columns are specified, all columns will be scanned.  To scan all members of a column family, leave the qualifier empty as in 'col_family:'.  

           hbase> scan '.META.'
           hbase> scan '.META.', {COLUMNS => 'info:regioninfo'}
           hbase> scan 't1', {COLUMNS => ['c1', 'c2'], LIMIT => 10, \
             STARTROW => 'xyz'}

 For experts, there is an additional option -- CACHE_BLOCKS -- which switches block caching for the scanner on (true) or off (false).  By default it is enabled.  

           hbase> scan 't1', {COLUMNS => ['c1', 'c2'], CACHE_BLOCKS => false}

Show cluster status. Can be 'summary', 'simple', or 'detailed'. The default is 'summary'. 

           hbase> status
           hbase> status 'simple'
           hbase> status 'summary'
           hbase> status 'detailed'

Shut down the cluster.

Disables, drops and recreates the specified table.
          hbase> truncate 't1'

Output this HBase version
          hbase> version

Show all the filters in HBase.
          hbase> show_filters

Get the status of the alter command. Indicates the number of regions of the table that have received the updated schema. Pass the table name.
         hbase> alter_status 't1'

Flush all regions in the passed table, or pass a region row to flush an individual region.
         hbase> flush 'TABLENAME'
         hbase> flush 'REGIONNAME'

Run major compaction on the passed table, or pass a region row to major-compact an individual region. To compact a single column family within a region, specify the region name followed by the column family name.
Compact all regions in a table:
         hbase> major_compact 't1'
Compact an entire region:
         hbase> major_compact 'r1'
Compact a single column family within a region:
         hbase> major_compact 'r1', 'c1'
Compact a single column family within a table:
         hbase> major_compact 't1', 'c1'

Split an entire table, or pass a region to split an individual region. With the second parameter, you can specify an explicit split key for the region.
         hbase> split 'tableName'
         hbase> split 'regionName' # format: 'tableName,startKey,id'
         hbase> split 'tableName', 'splitKey'
         hbase> split 'regionName', 'splitKey'

Dump the status of the HBase cluster as seen by ZooKeeper.
          hbase> zk_dump

Restarts all the replication features. The state in which each stream starts is undetermined.
WARNING: start/stop replication is only meant to be used in critical load situations.
       hbase> start_replication

Stops all the replication features. The state in which each stream stops is undetermined.
WARNING: start/stop replication is only meant to be used in critical load situations.
       hbase> stop_replication

Grant users specific rights.
Syntax: grant <user>, <permissions> [, <table> [, <column family> [, <column qualifier>]]]; permissions is zero or more letters from the set "RWXCA".
         hbase> grant 'bobsmith', 'RWXCA'
         hbase> grant 'bobsmith', 'RW', 't1', 'f1', 'col1'

Revoke a user's access rights.
Syntax: revoke <user> [, <table> [, <column family> [, <column qualifier>]]]
        hbase> revoke 'bobsmith', 't1', 'f1', 'col1'

Show all permissions for a particular user.
Syntax: user_permission [<table>]
      hbase> user_permission
      hbase> user_permission 'table1'

HBase Data Model

These six concepts form the foundation of HBase.

Table: HBase organizes data into tables. Table names are Strings and composed of characters that are safe for use in a file system path.

Row: Within a table, data is stored according to its row. Rows are identified uniquely by their rowkey. Rowkeys don’t have a data type and are always treated as a byte[].

Column family: Data within a row is grouped by column family. Column families also impact the physical arrangement of data stored in HBase. For this reason, they must be defined up front and aren't easily modified. Every row in a table has the same column families, although a row need not store data in all its families. Column family names are Strings and composed of characters that are safe for use in a file system path.

Column qualifier: Data within a column family is addressed via its column qualifier, or column. Column qualifiers need not be specified in advance. Column qualifiers need not be consistent between rows. Like rowkeys, column qualifiers don't have a data type and are always treated as a byte[].

Cell: A combination of rowkey, column family, and column qualifier uniquely identifies a cell. The data stored in a cell is referred to as that cell’s value. Values also don’t have a data type and are always treated as a byte[].

Version: Values within a cell are versioned. Versions are identified by their timestamp, a long. When a version isn't specified, the current timestamp is used as the basis for the operation. The number of cell value versions retained by HBase is configured via the column family. The default number of cell versions is three. Versions are stored in decreasing order of timestamp, so the most recent value is read first.
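To make the versioning rules concrete, here is a minimal Python sketch of the model, assuming the defaults above. It is illustrative only, not HBase internals: MiniTable, MAX_VERSIONS and the dict layout are stand-ins.

```python
# Hypothetical sketch (not HBase internals): a table as a map from
# (rowkey, "family:qualifier") to a list of (timestamp, value) versions,
# newest first, trimmed to the column family's VERSIONS limit.
from collections import defaultdict

MAX_VERSIONS = 3  # HBase's default number of retained cell versions

class MiniTable:
    def __init__(self):
        # (row, column) -> [(timestamp, value), ...] in decreasing ts order
        self.cells = defaultdict(list)

    def put(self, row, column, value, ts):
        versions = self.cells[(row, column)]
        versions.append((ts, value))
        versions.sort(key=lambda v: -v[0])   # decreasing timestamp order
        del versions[MAX_VERSIONS:]          # enforce the VERSIONS limit

    def get(self, row, column):
        versions = self.cells.get((row, column))
        return versions[0][1] if versions else None  # newest version wins

t = MiniTable()
t.put('row1', 'vi:year', '2011', ts=1)
t.put('row1', 'vi:year', '2012', ts=2)
print(t.get('row1', 'vi:year'))  # prints 2012: the newest timestamp wins
```

Note how a get without an explicit timestamp always returns the most recent version, matching the decreasing-timestamp ordering described above.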

HBase Architecture

      The HBase architecture consists of servers in a master-slave relationship. Typically, an HBase cluster has one master node, called HMaster, and multiple region servers, called HRegionServer. Each region server contains multiple regions (HRegions).

Just like in a Relational Database, data in HBase is stored in Tables and these Tables are stored in Regions. When a Table becomes too big, the Table is partitioned into multiple Regions. These Regions are assigned to Region Servers across the cluster. Each Region Server hosts roughly the same number of Regions.

The HMaster in HBase is responsible for
  • Performing Administration
  • Managing and Monitoring the Cluster
  • Assigning Regions to the Region Servers
  • Controlling the Load Balancing and Failover
On the other hand, the HRegionServers perform the following work
  • Hosting and managing Regions
  • Splitting the Regions automatically
  • Handling the read/write requests
  • Communicating with the Clients directly

Each Region Server contains a Write-Ahead Log (called HLog) and multiple Regions. Each Region in turn is made up of a MemStore and multiple StoreFiles (HFiles). The data lives in these StoreFiles in the form of Column Families (described above). The MemStore holds in-memory modifications to the Store (data).

The mapping of Regions to Region Servers is kept in a system table called .META. When trying to read or write data from HBase, clients read the required Region information from the .META. table and communicate directly with the appropriate Region Server. Each Region is identified by its start key (inclusive) and end key (exclusive).

HBase Tables and Regions

A table is made up of any number of regions.
A region is specified by its startKey and endKey.
  • Empty table: (Table, NULL, NULL)
  • Two-region table: (Table, NULL, "com.ABC.www") and (Table, "com.ABC.www", NULL)
Each region may live on a different node and is made up of several HDFS files and blocks, each of which is replicated by Hadoop. HBase uses HDFS as its reliable storage layer; HDFS handles checksums, replication, and failover.
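The two-region example above can be sketched in Python. This is a hypothetical stand-in for the lookup a client performs against .META.: the regions list and find_region are illustrative names, not HBase APIs.

```python
# Hypothetical sketch of the region lookup a client performs: each region
# covers [start key, end key); the empty string / None stand in for NULL.
import bisect

# The two-region table from the example above, split at "com.ABC.www".
regions = [
    {'start': '',            'end': 'com.ABC.www'},  # (Table, NULL, "com.ABC.www")
    {'start': 'com.ABC.www', 'end': None},           # (Table, "com.ABC.www", NULL)
]

def find_region(rowkey):
    starts = [r['start'] for r in regions]
    # rightmost region whose start key is <= rowkey (start key is inclusive)
    i = bisect.bisect_right(starts, rowkey) - 1
    return regions[i]

print(find_region('com.AAA.www'))  # falls in the first region
print(find_region('com.XYZ.www'))  # falls in the second region
```

Because start keys are inclusive and end keys exclusive, a row equal to the split key "com.ABC.www" lands in the second region.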

HBase Tables:
  • Tables are sorted by Row in lexicographical order
  • Table schema only defines its column families
  • Each family consists of any number of columns
  • Each column consists of any number of versions
  • Columns only exist when inserted, NULLs are free
  • Columns within a family are sorted and stored together
  • Everything except table names is a byte[]
  • HBase table format: (Row, Family:Column, Timestamp) -> Value
HBase consists of:
  • Java API, gateways for REST, Thrift, Avro
  • Master manages the cluster
  • RegionServers manage data
  • ZooKeeper acts as the "neural network" that coordinates the cluster
Data is stored in memory and flushed to disk at regular intervals or based on size
  • Small flushes are merged in the background to keep the number of files small
  • Reads check the memory store first and the disk-based files second
  • Deletes are handled with “tombstone” markers

After data is written to the WAL, the RegionServer saves KeyValues in the memory store
  • Flush to disk is based on size, set by hbase.hregion.memstore.flush.size
  • Default size is 64MB
  • Uses a snapshot mechanism to write the flush to disk while still serving from it and accepting new data at the same time
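The flush path above can be sketched as follows. This is a toy model, not HBase's API: MiniRegion, its methods, and the tiny FLUSH_SIZE (standing in for hbase.hregion.memstore.flush.size) are all illustrative assumptions.

```python
# Hypothetical sketch of the flush path: writes land in a memstore; when it
# grows past the flush size, the memstore is swapped for an empty one (the
# "snapshot"), so new writes keep flowing while the snapshot is written out
# as an immutable store file.
FLUSH_SIZE = 3  # toy stand-in for hbase.hregion.memstore.flush.size (64MB)

class MiniRegion:
    def __init__(self):
        self.memstore = {}
        self.store_files = []  # each flush produces one immutable "HFile"

    def put(self, key, value):
        self.memstore[key] = value
        if len(self.memstore) >= FLUSH_SIZE:
            self.flush()

    def flush(self):
        snapshot, self.memstore = self.memstore, {}  # swap, keep serving
        self.store_files.append(dict(sorted(snapshot.items())))

    def get(self, key):
        if key in self.memstore:                # memory store first
            return self.memstore[key]
        for sf in reversed(self.store_files):   # then disk files, newest first
            if key in sf:
                return sf[key]
        return None

r = MiniRegion()
for i in range(4):
    r.put('row%d' % i, 'v%d' % i)
print(len(r.store_files), r.get('row0'))  # prints: 1 v0
```

The get path mirrors the read order described above: memory store first, then disk-based files from newest to oldest.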
Two types: Minor and Major Compactions
Minor Compactions
  • Combine last “few” flushes
  • Triggered by number of storage files
Major Compactions
  • Rewrite all storage files
  • Drop deleted data and those values exceeding TTL and/or number of versions
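The effect of a major compaction can be sketched in Python. This is a deliberate simplification (a real delete marker only suppresses versions at or before its own timestamp); major_compact, TOMBSTONE and the store-file layout are hypothetical names.

```python
# Hypothetical sketch of a major compaction: merge all storage files into
# one, drop deleted cells, and keep at most MAX_VERSIONS values per cell.
MAX_VERSIONS = 3
TOMBSTONE = object()  # stand-in for a delete ("tombstone") marker

def major_compact(store_files):
    """store_files: list of {(row, col): [(ts, value), ...]} maps."""
    merged = {}
    for sf in store_files:                       # merge all storage files
        for cell, versions in sf.items():
            merged.setdefault(cell, []).extend(versions)
    result = {}
    for cell, versions in merged.items():
        versions.sort(key=lambda v: -v[0])       # newest first
        if versions[0][1] is TOMBSTONE:
            continue                             # drop deleted cells
        result[cell] = versions[:MAX_VERSIONS]   # drop excess versions
    return [result]                              # one rewritten storage file

old = [
    {('r1', 'vi:make'): [(1, 'BMW')]},
    {('r1', 'vi:make'): [(2, 'Honda')], ('r2', 'vi:year'): [(3, '2012')]},
    {('r2', 'vi:year'): [(4, TOMBSTONE)]},       # delete of r2's year
]
compacted = major_compact(old)[0]
print(sorted(compacted))  # r2's year is gone; r1 keeps both versions
```

A minor compaction, by contrast, would merge only the last few files and would have to keep the tombstone, since older versions of the cell might still live in files it didn't touch.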
Key Cardinality:
The best performance is gained from using row keys to narrow down what is read
  • Time-range-bound reads can skip store files
  • So can Bloom filters
  • Selecting column families reduces the amount of data to be scanned

Fold, Store, and Shift:
All values are stored with their full coordinates, including: row key, column family, column qualifier, and timestamp
  • Folds columns into "row per column"
  • NULLs are cost-free as nothing is stored
  • Versions are multiple "rows" in the folded table
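The folding described above can be sketched in Python; fold_row is a hypothetical helper, not an HBase API.

```python
# Hypothetical sketch of folding: a logical row becomes one stored entry per
# column, each carrying its full coordinates; absent columns (NULLs) simply
# produce no entries.
def fold_row(rowkey, families, ts):
    """families: {family: {qualifier: value}} -> sorted KeyValue tuples."""
    kvs = []
    for family, columns in families.items():
        for qualifier, value in columns.items():
            kvs.append((rowkey, '%s:%s' % (family, qualifier), ts, value))
    return sorted(kvs)  # sorted by full coordinates, as on disk

row = fold_row('row1', {'vi': {'make': 'BMW', 'year': '2012'}}, ts=1344817012999)
for kv in row:
    print(kv)
# ('row1', 'vi:make', 1344817012999, 'BMW')
# ('row1', 'vi:year', 1344817012999, '2012')
```

A row that stores no value for some column simply produces no KeyValue for it, which is why NULLs cost nothing.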

Apache HBase

        HBase is an open source, non-relational, distributed database modeled after Google's BigTable and is written in Java. It is developed as part of Apache Software Foundation's Apache Hadoop project and runs on top of HDFS (Hadoop Distributed Filesystem), providing BigTable-like capabilities for Hadoop.

HBase features compression, in-memory operation, and Bloom filters on a per-column basis as outlined in the original BigTable paper. Tables in HBase can serve as the input and output for MapReduce jobs run in Hadoop, and may be accessed through the Java API but also through REST, Avro or Thrift gateway APIs.

What is HBase?

        HBase is a column-oriented database management system that runs on top of HDFS. It is well suited for sparse data sets, which are common in many big data use cases. Unlike relational database systems, HBase does not support a structured query language like SQL; in fact, HBase isn't a relational data store at all. HBase applications are written in Java, much like a typical MapReduce application. HBase also supports access through Avro, REST, and Thrift gateways.

       An HBase system comprises a set of tables. Each table contains rows and columns, much like a traditional database. Each table must have an element defined as a Primary Key, and all access attempts to HBase tables must use this Primary Key. An HBase column represents an attribute of an object; for example, if the table is storing diagnostic logs from servers in your environment, where each row might be a log record, a typical column in such a table would be the timestamp of when the log record was written, or perhaps the server name where the record originated. In fact, HBase allows for many attributes to be grouped together into what are known as column families, such that the elements of a column family are all stored together. This is different from a row-oriented relational database, where all the columns of a given row are stored together. With HBase you must predefine the table schema and specify the column families. However, it’s very flexible in that new columns can be added to families at any time, making the schema flexible and therefore able to adapt to changing application requirements.

         Just as HDFS has a NameNode and slave nodes, and MapReduce has JobTracker and TaskTracker slaves, HBase is built on similar concepts. In HBase, a master node manages the cluster and region servers store portions of the tables and perform the work on the data. In the same way that HDFS has some enterprise concerns due to the availability of the NameNode, HBase is also sensitive to the loss of its master node.

What is a NoSQL Database?

           A NoSQL database provides a mechanism for storage and retrieval of data that is modeled in means other than the tabular relations used in relational databases. Motivations for this approach include simplicity of design, horizontal scaling and finer control over availability. NoSQL databases are often highly optimized key–value stores intended primarily for simple retrieval and appending operations, whereas an RDBMS is intended as a general purpose data store. There will thus be some operations where NoSQL is faster and some where an RDBMS is faster. NoSQL databases are finding significant and growing industry use in big data and real-time web applications.[1] NoSQL systems are also referred to as "Not only SQL" to emphasize that they may in fact allow SQL-like query languages to be used.