BigDataRiding: December 2013

Tuesday, 31 December 2013

HBase Examples

Start the HBase shell
$ hbase shell

List all the tables
hbase> list

Create an HBase table named 'cars' with one column family, 'vi'
hbase> create 'cars', 'vi'

Let’s insert 3 column qualifiers (make, model, year) and the associated values into the first row (row1).
hbase> put 'cars', 'row1', 'vi:make', 'BMW'
hbase> put 'cars', 'row1', 'vi:model', '5 series'
hbase> put 'cars', 'row1', 'vi:year', '2012'

Now let’s add a second row
hbase> put 'cars', 'row2', 'vi:make', 'Ferrari'
hbase> put 'cars', 'row2', 'vi:model', 'e series'
hbase> put 'cars', 'row2', 'vi:year', '2012'

Now let’s add a third row
hbase> put 'cars', 'row3', 'vi:make', 'Honda'
hbase> put 'cars', 'row3', 'vi:model', 'f series'
hbase> put 'cars', 'row3', 'vi:year', '2013'

Scan the table
hbase> scan 'cars'

The next scan we’ll run will limit our results to the make column qualifier.
hbase> scan 'cars', {COLUMNS => ['vi:make']}

The next scan limits the results to 1 row to demonstrate how LIMIT works.
hbase> scan 'cars', {COLUMNS => ['vi:make'], LIMIT => 1}

We’ll start by getting all columns in row1.
hbase> get 'cars', 'row1'

You should see output similar to:
COLUMN CELL
vi:make timestamp=1344817012999, value=BMW
vi:model timestamp=1344817020843, value=5 series
vi:year timestamp=1344817033611, value=2012

To get one specific column, include the COLUMNS option.

hbase> get 'cars', 'row1', {COLUMNS => 'vi:make'}

You can also get two or more columns by passing an array of columns.

hbase> get 'cars', 'row1', {COLUMNS => ['vi:make', 'vi:year']}

Delete a cell (value)

hbase> delete 'cars', 'row2', 'vi:year'

Let’s check that our delete worked

hbase> get 'cars', 'row2'

You should see output that shows 2 columns.

COLUMN CELL
vi:make timestamp=1344817104923, value=Ferrari
vi:model timestamp=1344817115463, value=e series
2 row(s) in 0.0080 seconds

Disable and drop the table
hbase> disable 'cars'

hbase> drop 'cars'

Exit the shell
hbase> exit

HBase Shell Commands

whoami:
Show the current hbase user.
Example:
hbase> whoami

alter:
Alter column family schema; pass table name and a dictionary specifying new column family schema. Dictionaries are described in the GENERAL NOTES section of the shell help. The dictionary must include the name of the column family to alter.
For example,

To change or add the 'f1' column family in table 't1' from defaults to instead keep a maximum of 5 cell VERSIONS, do:
hbase> alter 't1', {NAME => 'f1', VERSIONS => 5}

To delete the 'f1' column family in table 't1', do:
hbase> alter 't1', {NAME => 'f1', METHOD => 'delete'}

You can also change table-scope attributes like MAX_FILESIZE, MEMSTORE_FLUSHSIZE, and READONLY.

For example, to change the max size of a region to 128MB, do:
hbase> alter 't1', {METHOD => 'table_att', MAX_FILESIZE => '134217728'}
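
MAX_FILESIZE is given in bytes. As a quick check of the arithmetic behind the value above, here is a small Python sketch; the helper name is made up for illustration and is not part of HBase.

```python
def megabytes_to_bytes(megabytes):
    """Convert binary megabytes (1 MB = 1024 * 1024 bytes) to bytes."""
    return megabytes * 1024 * 1024


# 128MB expressed in bytes, as passed to MAX_FILESIZE in the example above.
max_filesize = megabytes_to_bytes(128)
print(max_filesize)  # 134217728
```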

count:
Count the number of rows in a table. This operation may take a LONG time (Run '$HADOOP_HOME/bin/hadoop jar hbase.jar rowcount' to run a counting mapreduce job). Current count is shown every 1000 rows by default. Count interval may be optionally specified.
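Examples:

hbase> count 't1'
hbase> count 't1', 100000

The same commands can also be run on a table reference. Suppose you had a reference t to table 't1'; the corresponding commands would be:

hbase> t.count INTERVAL => 100000
hbase> t.count CACHE => 1000
hbase> t.count INTERVAL => 10, CACHE => 1000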

create:

Create table; pass table name, a dictionary of specifications per column family, and optionally a dictionary of table configuration. Dictionaries are described in the GENERAL NOTES section of the shell help.

Examples:

hbase> create 't1', {NAME => 'f1', VERSIONS => 5}

hbase> create 't1', {NAME => 'f1'}, {NAME => 'f2'}, {NAME => 'f3'}

hbase> # The above in shorthand would be the following:

hbase> create 't1', 'f1', 'f2', 'f3'

hbase> create 't1', {NAME => 'f1', VERSIONS => 1, TTL => 2592000, BLOCKCACHE => true}
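
TTL is expressed in seconds, so the 2592000 in the last example is easier to read once converted. A minimal Python check of that conversion:

```python
from datetime import timedelta

# TTL value taken from the create example above, in seconds.
ttl_seconds = 2592000
ttl = timedelta(seconds=ttl_seconds)
print(ttl.days)  # 30
```

In other words, cells in 'f1' become eligible for removal 30 days after their timestamp.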

describe:

Describe the named table.
Example:
hbase> describe 't1'

delete:

Put a delete cell value at specified table/row/column and optionally timestamp coordinates. Deletes must match the deleted cell's coordinates exactly. When scanning, a delete cell suppresses older versions. Takes arguments like the 'put' command described below.
Example:
hbase> delete 't1', 'r1', 'c1', ts1

deleteall:

Delete all cells in a given row; pass a table name, row, and optionally a column and timestamp.
Examples:
hbase> deleteall 't1', 'r1'
hbase> deleteall 't1', 'r1', 'c1'
hbase> deleteall 't1', 'r1', 'c1', ts1
The same commands can also be run on a table reference. Suppose you had a reference t to table 't1'; the corresponding commands would be:
Example:
hbase> t.deleteall 'r1'
hbase> t.deleteall 'r1', 'c1'
hbase> t.deleteall 'r1', 'c1', ts1

disable:

Disable the named table.
Example:
hbase> disable 't1'

disable_all:
Disable all tables matching the given regex.
Example:
hbase> disable_all 't.*'

drop:

Drop the named table. Table must first be disabled.
Example:
hbase> drop 't1'

drop_all:
Drop all of the tables matching the given regex.
Example:
hbase> drop_all 't.*'

enable:

Enable the named table.
Example:
hbase> enable 't1'

enable_all:
Enable all of the tables matching the given regex.
Example:
hbase> enable_all 't.*'

is_enabled:
Check whether the named table is enabled.
Example:
hbase> is_enabled 't1'

exists:

Does the named table exist?
Example:
hbase> exists 't1'

exit:

Type "hbase> exit" to leave the HBase Shell

get:

Get row or cell contents; pass table name, row, and optionally a dictionary of column(s), timestamp and versions.

Examples:

hbase> get 't1', 'r1'

hbase> get 't1', 'r1', {COLUMN => 'c1'}

hbase> get 't1', 'r1', {COLUMN => ['c1', 'c2', 'c3']}

hbase> get 't1', 'r1', {COLUMN => 'c1', TIMESTAMP => ts1}

hbase> get 't1', 'r1', {COLUMN => 'c1', TIMESTAMP => ts1, VERSIONS => 4}

list:

List all tables in hbase
Example:
hbase> list
hbase> list 'abc.*'

put:

Put a cell 'value' at specified table/row/column and optionally timestamp coordinates. To put a cell value into table 't1' at row 'r1' under column 'c1' marked with the time 'ts1', do:

Example:

hbase> put 't1', 'r1', 'c1', 'value', ts1

tools:

Listing of hbase surgery tools

scan:

Scan a table; pass table name and optionally a dictionary of scanner specifications. Scanner specifications may include one or more of the following: LIMIT, STARTROW, STOPROW, TIMESTAMP, or COLUMNS. If no columns are specified, all columns will be scanned. To scan all members of a column family, leave the qualifier empty as in 'col_family:'.

Examples:

hbase> scan '.META.'

hbase> scan '.META.', {COLUMNS => 'info:regioninfo'}

hbase> scan 't1', {COLUMNS => ['c1', 'c2'], LIMIT => 10, STARTROW => 'xyz'}

For experts, there is an additional option -- CACHE_BLOCKS -- which switches block caching for the scanner on (true) or off (false). By default it is enabled.

Examples:

hbase> scan 't1', {COLUMNS => ['c1', 'c2'], CACHE_BLOCKS => false}

status:

Show cluster status. Can be 'summary', 'simple', or 'detailed'. The default is 'summary'.

Examples:

hbase> status

hbase> status 'simple'

hbase> status 'summary'

hbase> status 'detailed'

shutdown:

Shut down the cluster.

truncate:

Disables, drops and recreates the specified table.
Example:
hbase> truncate 't1'

version:

Output this HBase version
Example:
hbase> version

show_filters:
Show all the filters in hbase.
Example:
hbase> show_filters

alter_status:

Get the status of the alter command. Indicates the number of regions of the table that have received the updated schema. Pass the table name.

Example:

hbase> alter_status 't1'

flush:

Flush all regions in passed table or pass a region row to flush an individual region.

Example:

hbase> flush 'TABLENAME'

hbase> flush 'REGIONNAME'

major_compact:

Run major compaction on passed table or pass a region row to major compact an individual region. To compact a single column family within a region specify the region name followed by the column family name.

Examples:

Compact all regions in a table:

hbase> major_compact 't1'

Compact an entire region:

hbase> major_compact 'r1'

Compact a single column family within a region:

hbase> major_compact 'r1', 'c1'

Compact a single column family within a table:

hbase> major_compact 't1', 'c1'

split:

Split entire table or pass a region to split individual region. With the second parameter, you can specify an explicit split key for the region.

Examples:

hbase> split 'tableName'

hbase> split 'regionName' # format: 'tableName,startKey,id'

hbase> split 'tableName', 'splitKey'

hbase> split 'regionName', 'splitKey'

zk_dump:

Dump status of HBase cluster as seen by ZooKeeper.

Example:

hbase> zk_dump

start_replication:

Restarts all the replication features. The state in which each stream starts is undetermined.

WARNING: start/stop replication is only meant to be used in critical load situations.

Examples:

hbase> start_replication

stop_replication:

Stops all the replication features. The state in which each stream stops is undetermined.

WARNING: start/stop replication is only meant to be used in critical load situations.

Examples:

hbase> stop_replication

grant:

Grant users specific rights.

Syntax : grant <user> <permissions> <table> <column family> <column qualifier>

permissions is either zero or more letters from the set "RWXCA".

READ('R'), WRITE('W'), EXEC('X'), CREATE('C'), ADMIN('A')

Example:

hbase> grant 'bobsmith', 'RWXCA'

hbase> grant 'bobsmith', 'RW', 't1', 'f1', 'col1'
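
To make the permission letters concrete, the following Python sketch expands a permission string into the names listed above. The function and the dictionary are hypothetical helpers written for this post; they are not part of HBase or its shell.

```python
# Letter-to-name mapping as listed in the grant help text above.
PERMISSIONS = {
    "R": "READ",
    "W": "WRITE",
    "X": "EXEC",
    "C": "CREATE",
    "A": "ADMIN",
}


def describe_permissions(letters):
    """Expand a string such as 'RW' into ['READ', 'WRITE'].

    Raises ValueError for any letter outside the set "RWXCA".
    """
    unknown = [ch for ch in letters if ch not in PERMISSIONS]
    if unknown:
        raise ValueError("unknown permission letters: " + "".join(unknown))
    return [PERMISSIONS[ch] for ch in letters]


print(describe_permissions("RW"))     # ['READ', 'WRITE']
print(describe_permissions("RWXCA"))  # ['READ', 'WRITE', 'EXEC', 'CREATE', 'ADMIN']
```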

revoke:

Revoke a user’s access rights.

Syntax : revoke <user> <table> <column family> <column qualifier>

Example:

hbase> revoke 'bobsmith', 't1', 'f1', 'col1'

user_permission:

Show all permissions for the particular user.

Syntax : user_permission <table>

Example:

hbase> user_permission

hbase> user_permission 'table1'

HBase Data Model

These six concepts form the foundation of the HBase data model.

Table: HBase organizes data into tables. Table names are Strings and composed of characters that are safe for use in a file system path.

Row: Within a table, data is stored according to its row. Rows are identified uniquely by their rowkey. Rowkeys don’t have a data type and are always treated as a byte[].

Column family: Data within a row is grouped by column family. Column families also impact the physical arrangement of data stored in HBase. For this reason, they must be defined up front and aren’t easily modified. Every row in a table has the same column families, although a row need not store data in all its families. Column family names are Strings and composed of characters that are safe for use in a file system path.

Column qualifier: Data within a column family is addressed via its column qualifier, or column. Column qualifiers need not be specified in advance. Column qualifiers need not be consistent between rows. Like rowkeys, column qualifiers don’t have a data type and are always treated as a byte[].

Cell: A combination of rowkey, column family, and column qualifier uniquely identifies a cell. The data stored in a cell is referred to as that cell’s value. Values also don’t have a data type and are always treated as a byte[].

Version: Values within a cell are versioned. Versions are identified by their timestamp, a long. When a version isn’t specified, the current timestamp is used as the basis for the operation. The number of cell value versions retained by HBase is configured via the column family. The default number of cell versions is three in HBase releases before 0.96, and one from 0.96 onward. Versions are stored in decreasing order of timestamp.
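
One way to picture how these six concepts nest is as a map of maps. The Python sketch below is a toy in-memory model only, not how HBase is implemented: the class and method names are invented, the limit of three versions is taken from the text, and pruning old versions on every put is a simplification.

```python
class ToyTable:
    """Toy model: row -> column family -> qualifier -> {timestamp: value}."""

    def __init__(self, families, max_versions=3):
        self.families = set(families)     # families are fixed up front
        self.max_versions = max_versions  # versions retained per cell
        self.rows = {}

    def put(self, row, family, qualifier, value, timestamp):
        if family not in self.families:
            raise KeyError("unknown column family: %s" % family)
        cell = (
            self.rows.setdefault(row, {})
            .setdefault(family, {})
            .setdefault(qualifier, {})
        )
        cell[timestamp] = value
        # Simplification: drop versions beyond the newest max_versions now.
        for old_ts in sorted(cell, reverse=True)[self.max_versions:]:
            del cell[old_ts]

    def get(self, row, family, qualifier, versions=1):
        """Return up to `versions` (timestamp, value) pairs, newest first."""
        cell = self.rows.get(row, {}).get(family, {}).get(qualifier, {})
        return sorted(cell.items(), reverse=True)[:versions]


cars = ToyTable(families=["vi"])
# Rowkeys, qualifiers, and values are plain bytes, as in the text.
for ts, year in enumerate([b"2010", b"2011", b"2012", b"2013"], start=1):
    cars.put(b"row1", "vi", b"year", year, timestamp=ts)

print(cars.get(b"row1", "vi", b"year"))
# [(4, b'2013')]
print(cars.get(b"row1", "vi", b"year", versions=5))
# [(4, b'2013'), (3, b'2012'), (2, b'2011')]
```

Four values were written to the same cell, but only the three newest survive, and they come back in decreasing order of timestamp.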

HBase Architecture

The HBase Architecture consists of servers in a Master-Slave relationship. Typically, the HBase cluster has one Master node, called HMaster, and multiple Region Servers, called HRegionServer. Each Region Server contains multiple Regions (HRegions).

Just like in a Relational Database, data in HBase is stored in Tables and these Tables are stored in Regions. When a Table becomes too big, the Table is partitioned into multiple Regions. These Regions are assigned to Region Servers across the cluster. Each Region Server hosts roughly the same number of Regions.

The HMaster in the HBase is responsible for

Performing Administration
Managing and Monitoring the Cluster
Assigning Regions to the Region Servers
Controlling the Load Balancing and Failover

On the other hand, the HRegionServer performs the following work

Hosting and managing Regions
Splitting the Regions automatically
Handling the read/write requests
Communicating with the Clients directly

Each Region Server contains a Write-Ahead Log (called HLog) and multiple Regions. Each Region in turn is made up of a MemStore and multiple StoreFiles (HFile). The data lives in these StoreFiles in the form of Column Families (see the data model section). The MemStore holds in-memory modifications to the Store (data).

The mapping of Regions to Region Servers is kept in the .META. system table. When trying to read or write data, clients read the required Region information from the .META. table and communicate directly with the appropriate Region Server. Each Region is identified by its start key (inclusive) and end key (exclusive).

HBase Tables and Regions

A table is made up of any number of regions.

A region is specified by its startKey and endKey.

Empty table: (Table, NULL, NULL)
Two-region table: (Table, NULL, “com.ABC.www”) and (Table, “com.ABC.www”, NULL)
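
Because the start key is inclusive and the end key is exclusive, finding the region for a row key amounts to a search over the sorted start keys. The Python sketch below illustrates that idea for the two-region table above; it is not the actual client lookup against .META., and the function name is made up. NULL is represented as an empty byte string.

```python
import bisect


def find_region(start_keys, row_key):
    """Return the index of the region whose [startKey, endKey) holds row_key.

    start_keys must be sorted, and the first entry is b"" (NULL), meaning
    "from the beginning of the table".
    """
    return bisect.bisect_right(start_keys, row_key) - 1


# Two-region table: (NULL, "com.ABC.www") and ("com.ABC.www", NULL).
start_keys = [b"", b"com.ABC.www"]

print(find_region(start_keys, b"com.AAA.www"))  # 0
print(find_region(start_keys, b"com.ABC.www"))  # 1 (the start key is inclusive)
print(find_region(start_keys, b"org.example"))  # 1
```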

Each region may live on a different node and is made up of several HDFS files and blocks, each of which is replicated by Hadoop. HBase uses HDFS as its reliable storage layer; HDFS handles checksums, replication, and failover.

HBase Tables:

Tables are sorted by Row in lexicographical order
Table schema only defines its column families
Each family consists of any number of columns
Each column consists of any number of versions
Columns only exist when inserted, NULLs are free
Columns within a family are sorted and stored together
Everything except table names is byte[]
HBase table format: (Row, Family:Column, Timestamp) -> Value
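
Lexicographical ordering of byte[] row keys can be surprising when keys contain numbers. A short Python illustration (the row keys are made-up examples):

```python
# Row keys compare byte by byte, so "row10" sorts before "row2".
row_keys = [b"row10", b"row2", b"row1"]
print(sorted(row_keys))  # [b'row1', b'row10', b'row2']

# Zero-padding the numeric part restores the expected numeric order.
padded_keys = [b"row%04d" % n for n in (10, 2, 1)]
print(sorted(padded_keys))  # [b'row0001', b'row0002', b'row0010']
```

This is why row keys with numeric components are often zero-padded to a fixed width.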

HBase consists of:

Java API, Gateway for REST, Thrift, Avro
Master manages the cluster
RegionServers manage data
ZooKeeper is used as the “neural network” and coordinates the cluster

Data is stored in memory and flushed to disk at regular intervals or based on size

Small flushes are merged in the background to keep number of files small
Reads read memory stores first and then disk based files second
Deletes are handled with “tombstone” markers

MemStores:

After data is written to the WAL, the RegionServer saves KeyValues in the memory store

Flush to disk is based on size, set by hbase.hregion.memstore.flush.size
Default size is 64MB in older releases (newer releases default to 128MB)
Uses a snapshot mechanism to write the flush to disk while still serving from it and accepting new data at the same time
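
The write and read path described above can be sketched in a few lines of Python. This is a toy model under loud assumptions: the class is invented for illustration, the flush threshold counts entries rather than bytes, and the snapshot mechanism is not modelled.

```python
TOMBSTONE = object()  # marker standing in for a delete


class ToyStore:
    def __init__(self, flush_size=3):
        self.wal = []          # write-ahead log: every edit is appended first
        self.memstore = {}     # in-memory edits since the last flush
        self.store_files = []  # flushed, immutable files; newest is last
        self.flush_size = flush_size  # entry count here, bytes in HBase

    def put(self, key, value):
        self.wal.append((key, value))  # 1. log the edit
        self.memstore[key] = value     # 2. apply it in memory
        if len(self.memstore) >= self.flush_size:
            self.flush()               # 3. flush once the threshold is hit

    def delete(self, key):
        # A delete is just another write: a tombstone marker.
        self.put(key, TOMBSTONE)

    def flush(self):
        if self.memstore:
            # Store files are written in sorted key order.
            flushed = {k: self.memstore[k] for k in sorted(self.memstore)}
            self.store_files.append(flushed)
            self.memstore = {}

    def get(self, key):
        # Reads check memory first, then store files from newest to oldest.
        for source in [self.memstore] + self.store_files[::-1]:
            if key in source:
                value = source[key]
                return None if value is TOMBSTONE else value
        return None


store = ToyStore(flush_size=2)
store.put(b"row1", b"BMW")
store.put(b"row2", b"Honda")   # memstore reaches 2 entries and is flushed
store.delete(b"row1")          # tombstone in memory hides the flushed value

print(store.get(b"row1"))      # None
print(store.get(b"row2"))      # b'Honda'
print(len(store.store_files))  # 1
```

Note that the deleted value is still present in the flushed file; only the tombstone hides it until a later compaction rewrites the files.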

Compactions:

Two types: Minor and Major Compactions

Minor Compactions

Combine last “few” flushes
Triggered by number of storage files

Major Compactions

Rewrite all storage files
Drop deleted data and those values exceeding TTL and/or number of versions
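
The difference between the two compaction types can be illustrated with a Python sketch. Everything here is a simplified model with invented names: a store file is a list of (key, timestamp, value) tuples, a value of None stands for a tombstone, and the real selection policy for minor compactions is not modelled.

```python
TOMBSTONE = None  # a value of None stands for a delete marker


def minor_compact(store_files, count=2):
    """Merge the last `count` store files into one; nothing is dropped."""
    if len(store_files) < count:
        return list(store_files)
    merged = sorted(
        (entry for f in store_files[-count:] for entry in f),
        key=lambda e: (e[0], -e[1]),
    )
    return store_files[:-count] + [merged]


def major_compact(store_files, now, ttl, max_versions):
    """Rewrite all files into one, dropping deleted, expired, excess data."""
    entries = sorted(
        (entry for f in store_files for entry in f),
        # Per key: newest first, tombstones ahead of values on a tie.
        key=lambda e: (e[0], -e[1], e[2] is not TOMBSTONE),
    )
    compacted = []
    kept = {}        # key -> number of versions kept so far
    deleted_at = {}  # key -> timestamp of its newest tombstone
    for key, ts, value in entries:
        if value is TOMBSTONE:
            deleted_at.setdefault(key, ts)
            continue                      # tombstones are not rewritten
        if key in deleted_at and ts <= deleted_at[key]:
            continue                      # masked by a tombstone
        if now - ts > ttl:
            continue                      # older than the TTL
        if kept.get(key, 0) >= max_versions:
            continue                      # beyond the version limit
        kept[key] = kept.get(key, 0) + 1
        compacted.append((key, ts, value))
    return [compacted]


files = [
    [(b"row1", 10, b"BMW"), (b"row2", 10, b"Honda")],
    [(b"row1", 20, TOMBSTONE)],
    [(b"row2", 30, b"Ferrari"), (b"row3", 5, b"old")],
]

print(len(minor_compact(files)))  # 2
print(major_compact(files, now=40, ttl=32, max_versions=1))
# [[(b'row2', 30, b'Ferrari')]]
```

In the major compaction, row1 is dropped because of its tombstone, the older row2 value exceeds the version limit, and row3 has outlived the TTL.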

Key Cardinality:

The best performance is gained from using row keys

Time range bound reads can skip store files
So can Bloom Filters
Selecting column families reduces the amount of data to be scanned
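
As a rough illustration of how a time-range-bound read can skip store files, the Python sketch below assumes each file records the minimum and maximum timestamp it contains. The file names, field names, and the inclusive range are assumptions made for the example.

```python
def files_to_read(store_files, min_ts, max_ts):
    """Return names of store files whose timestamps overlap [min_ts, max_ts]."""
    return [
        f["name"]
        for f in store_files
        if f["min_ts"] <= max_ts and f["max_ts"] >= min_ts
    ]


store_files = [
    {"name": "file-a", "min_ts": 100, "max_ts": 199},
    {"name": "file-b", "min_ts": 200, "max_ts": 299},
    {"name": "file-c", "min_ts": 300, "max_ts": 399},
]

# Only files that can contain data in the requested range are opened.
print(files_to_read(store_files, 250, 320))  # ['file-b', 'file-c']
```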

Fold, Store, and Shift:

All values are stored with the full coordinates, including: Row Key, Column Family, Column Qualifier, and Timestamp

Folds columns into “row per column”
NULLs are cost free as nothing is stored
Versions are multiple “rows” in folded table
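
The folding described above can be sketched in Python as turning one logical row into one stored entry per cell version, each carrying its full coordinates. The function name and the sample data are made up; None stands in for a NULL column.

```python
def fold(row_key, families):
    """Fold {family: {qualifier: {timestamp: value}}} into one entry per version.

    Each entry is (row key, column family, column qualifier, timestamp, value).
    """
    cells = []
    for family, columns in families.items():
        for qualifier, versions in columns.items():
            for timestamp, value in versions.items():
                if value is None:
                    continue  # NULLs cost nothing: no entry is stored
                cells.append((row_key, family, qualifier, timestamp, value))
    # Sorted by coordinates, with newer versions of a cell first.
    return sorted(cells, key=lambda c: (c[0], c[1], c[2], -c[3]))


logical_row = {
    "vi": {
        "make": {1: "BMW"},
        "year": {1: "2011", 2: "2012"},  # two versions of the same cell
        "color": {1: None},              # NULL: nothing is stored
    }
}

for cell in fold("row1", logical_row):
    print(cell)
# ('row1', 'vi', 'make', 1, 'BMW')
# ('row1', 'vi', 'year', 2, '2012')
# ('row1', 'vi', 'year', 1, '2011')
```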

Apache HBase

HBase is an open source, non-relational, distributed database modeled after Google's BigTable and is written in Java. It is developed as part of Apache Software Foundation's Apache Hadoop project and runs on top of HDFS (Hadoop Distributed Filesystem), providing BigTable-like capabilities for Hadoop.

HBase features compression, in-memory operation, and Bloom filters on a per-column basis as outlined in the original BigTable paper. Tables in HBase can serve as the input and output for MapReduce jobs run in Hadoop, and may be accessed through the Java API but also through REST, Avro or Thrift gateway APIs.

What is HBase?

HBase is a column-oriented database management system that runs on top of HDFS. It is well suited for sparse data sets, which are common in many big data use cases. Unlike relational database systems, HBase does not support a structured query language like SQL; in fact, HBase isn’t a relational data store at all. HBase applications are written in Java much like a typical MapReduce application. HBase does support writing applications in Avro, REST, and Thrift.

An HBase system comprises a set of tables. Each table contains rows and columns, much like a traditional database. Each table must have an element defined as a Primary Key (the row key), and all access attempts to HBase tables must use this Primary Key. An HBase column represents an attribute of an object. For example, if the table stores diagnostic logs from servers in your environment, each row might be a log record, and a typical column would be the timestamp of when the log record was written, or perhaps the name of the server where the record originated.

HBase allows many attributes to be grouped together into what are known as column families, such that the elements of a column family are all stored together. This is different from a row-oriented relational database, where all the columns of a given row are stored together. With HBase you must predefine the table schema and specify the column families. However, new columns can be added to families at any time, making the schema flexible and able to adapt to changing application requirements.

Just as HDFS has a NameNode and slave nodes, and MapReduce has JobTracker and TaskTracker slaves, HBase is built on similar concepts. In HBase a master node manages the cluster and region servers store portions of the tables and perform the work on the data. In the same way HDFS has some enterprise concerns due to the availability of the NameNode, HBase is also sensitive to the loss of its master node.

What is a NoSQL Database?

A NoSQL database provides a mechanism for storage and retrieval of data that is modeled in means other than the tabular relations used in relational databases. Motivations for this approach include simplicity of design, horizontal scaling and finer control over availability. NoSQL databases are often highly optimized key–value stores intended primarily for simple retrieval and appending operations, whereas an RDBMS is intended as a general purpose data store. There will thus be some operations where NoSQL is faster and some where an RDBMS is faster. NoSQL databases are finding significant and growing industry use in big data and real-time web applications.[1] NoSQL systems are also referred to as "Not only SQL" to emphasize that they may in fact allow SQL-like query languages to be used.
