Thursday, 6 February 2014

Hive Data Types

Hive data types are categorized into two types. They are the primitive and complex data types.

The primitive data types include Integers, Boolean, Floating point numbers and strings. The below table lists the size of each data type:

Primitive types:

  • TINYINT
  • SMALLINT
  • INT
  • BIGINT
  • BOOLEAN
  • FLOAT
  • DOUBLE
  • BIGDECIMAL (Only available starting with Hive 0.10.0)
  • STRING
  • BINARY (Only available starting with Hive 0.8.0)
  • TIMESTAMP (Only available starting with Hive 0.8.0)
  • DATE (Only available starting with Hive 0.12.0)
  • VARCHAR (Only available starting with Hive 0.12.0)
  • CHAR (Only available starting with Hive 0.13.0)
  • DECIMAL (Introduced in Hive 0.11.0 with a precision of 38 digits, in Hive 0.13.0 introduced user definable precision and scale)

Complex Types:

  • arrays: ARRAY<data_type>
  • maps: MAP<primitive_type, data_type>
  • structs: STRUCT<col_name : data_type [COMMENT col_comment], ...>
  • union: UNIONTYPE<data_type, data_type, ...> (Note: Only available starting with Hive 0.7.0)

Integral Types (TINYINT, SMALLINT, INT, BIGINT):

Integral literals are assumed to be INT by default, unless the number exceeds the range of INT in which case it is interpreted as a BIGINT, or if one of the following postfixes is present on the number.

Type                Postfix        Example

TINYINT            Y            100Y

SMALLINT        S             100S

BIGINT               L             100L


Floating Point Types:

Floating point literals are assumed to be DOUBLE. Scientific notation is not yet supported.


String Types:

String

String literals can be expressed with either single quotes (') or double quotes ("). Hive uses C-style escaping within the strings.

Varchar (Varchar datatype was introduced in Hive 0.12.0)

Varchar types are created with a length specifier (between 1 and 65355), which defines the maximum number of characters allowed in the character string. If a string value being converted/assigned to a varchar value exceeds the length specifier, the string is silently truncated. Character length is determined by the number of code points contained by the character string.

Limitation: Non-generic UDFs cannot directly use varchar type as input arguments or return values. String UDFs can be created instead, and the varchar values will be converted to strings and passed to the UDF. To use varchar arguments directly or to return varchar values, create a GenericUDF.
There may be other contexts which do not support varchar, if they rely on reflection-based methods for retrieving type information. This includes some SerDe implementations.

Char (Char datatype will be introduced in Hive 0.13.0)

Char types are similar to Varchar but they are fixed-length meaning that values shorter than the specified length value are padded with spaces but trailing spaces are not important during comparisons. The maximum length is fixed at 255.

Decimal Types (Decimal datatype was introduced in Hive 0.11.0):

Decimal literals provide precise values and greater range for floating point numbers than the DOUBLE type. Decimal data types store exact representations of numeric values, while DOUBLE data types store very close approximations of numeric values.

Decimal types are needed for use cases in which the (very close) approximation of a DOUBLE is insufficient, such as financial applications, equality and inequality checks, and rounding operations. They are also needed for use cases that deal with numbers outside the DOUBLE range (approximately -10308 to 10308) or very close to zero (-10-308 to 10-308). For a general discussion of the limits of the DOUBLE type, see the Wikipedia article Double-precision floating-point format.

The precision of a Decimal type is limited to 38 digits in Hive

Using Decimal Types:

You can create a table in Hive that uses the Decimal type with the following syntax:
                       create table decimal_1 (t decimal);

Complex Data Types:

Arrays
Contain a list of elements of the same data type. These elements are accessed by using an index. For example an array, “fruits”, containing a list of elements [‘apple’, ’mango’, ‘orange’], the element “apple” in the array can be accessed by specifying fruits[1].

Maps
Contains key, value pairs. The elements are accessed by using the keys. For example a map, “pass_list” containing the “user name” as key and “password” as value, the password of the user can be accessed by specifying pass_list[‘username’]

Structs
Contains elements of different data types. The elements can be accessed by using the dot notation. For example in a struct, ”car”, the color of the car can be retrieved as specifying car.color



No comments:

Post a Comment