Wednesday 5 February 2014

Hive Overview

Apache Hive is a data warehouse infrastructure built on top of Hadoop for providing data summarization, query, and analysis. While initially developed by Facebook.

The Apache Hive data warehouse software facilitates querying and managing large datasets residing in distributed storage. Hive provides a mechanism to project structure onto this data and query the data using a SQL-like language called HiveQL. At the same time this language also allows traditional map/reduce programmers to plug in their custom mappers and reducers when it is inconvenient or inefficient to express this logic in HiveQL.

By default, Hive stores metadata in an embedded Apache Derby database, and other client/server databases like MySQL can optionally be used. (single user metadata stored into derby database and  multiple users metadata stored into MySQL or other databases).

Currently, there are four file formats supported in Hive, which are TEXTFILE, SEQUENCEFILE, ORC and RCFILE.

HiveQL

Hive’s SQL dialect, called HiveQL, does not support the full SQL-92 specification. There are a number of reasons for this. Being a fairly young project, it has not had time to provide the full repertoire of SQL-92 language constructs. More fundamentally, SQL-92 compliance has never been an explicit project goal; rather, as an open source project, features were added by developers to meet their users’ needs. Furthermore, Hive has some extensions that are not in SQL-92, which have been inspired by syntax from other database systems, notably MySQL. In fact, to a first-order approximation, HiveQL most closely resembles MySQL’s SQL dialect.

No comments:

Post a Comment