Saturday 8 February 2014

Hive Complex Data Types with Examples

There are three complex types in Hive:

arrays: It is an ordered collection of elements. The elements in the array must be of the same type.

map: It is an unordered collection of key-value pairs. Keys must be of primitive types; values can be of any type.

struct: It is a collection of elements of different types.

Examples of the complex data types:

ARRAY:

$ cat >arrayfile
1,abc,40000,a$b$c,hyd
2,def,3000,d$f,bang

hive> create table tab7(id int,name string,sal bigint,sub array<string>,city string)
    > row format delimited  
    > fields terminated by ','
    > collection items terminated by '$';
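
The listing jumps straight to the queries, so the file presumably needs to be loaded first. A minimal load step (a sketch, assuming arrayfile was saved under /home/training like the other examples):

hive> load data local inpath '/home/training/arrayfile' into table tab7;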

hive> select sub[2] from tab7 where id=1;

hive> select sub[0] from tab7;

MAP:

$ cat >mapfile
1,abc,40000,a$b$c,pf#500$epf#200,hyd
2,def,3000,d$f,pf#500,bang

hive> create table tab10(id int,name string,sal bigint,sub array<string>,dud map<string,int>,city string)
    > row format delimited
    > fields terminated by ','
    > collection items terminated by '$'
    > map keys terminated by '#';

hive> load data local inpath '/home/training/mapfile' overwrite into table tab10;

hive> select dud["pf"] from tab10;

hive> select dud["pf"],dud["epf"] from tab10;

STRUCT:

$ cat >structfile
1,abc,40000,a$b$c,pf#500$epf#200,hyd$ap$500001
2,def,3000,d$f,pf#500,bang$kar$600038

hive> create table tab11(id int,name string,sal bigint,sub array<string>,dud map<string,int>,addr struct<city:string,state:string,pin:bigint>)
    > row format delimited
    > fields terminated by ','
    > collection items terminated by '$'
    > map keys terminated by '#';

hive> load data local inpath '/home/training/structfile' into table tab11;

hive> select addr.city from tab11;
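
Several struct fields can be pulled at once with the same dot notation, for example:

hive> select addr.city, addr.state, addr.pin from tab11;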

Hive Internal & External Tables

A Hive table is a logical concept that is physically made up of a number of files in HDFS. Tables can be either internal or external.

Hive Internal Table:

Internal table—If the data is available in the local file system, a Hive internal (managed) table is the natural choice. Hive organizes internal tables inside a warehouse directory, which is controlled by the hive.metastore.warehouse.dir property and defaults to /user/hive/warehouse (in HDFS).

Note: In an internal table, the data and the table are tightly coupled; dropping the table removes the table, its data, and its metadata.

**** Internal table with LOAD is recommended. ****

Example:


1. To create the internal table
    hive> CREATE TABLE managed_table (dummy STRING)
          ROW FORMAT DELIMITED
          FIELDS TERMINATED BY '\t';

2. Load the data into the internal table
    hive> LOAD DATA INPATH '/user/tom/data.txt' INTO TABLE managed_table;


3. Display the content of the table

    hive> select * from managed_table;

4. To drop the internal table
    hive> DROP TABLE managed_table;

If you drop an internal/managed table, both its metadata and its data are deleted.
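
A quick way to verify this from the shell (a sketch, assuming the default warehouse location): after the DROP, the table's directory is gone.

$ hadoop fs -ls /user/hive/warehouse/managed_table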


Hive External Table:

External table—If the data is already available in HDFS, a Hive external table is the better fit; in this case Hive does not manage the data.

Note: In an external table, the data and the table are loosely coupled; dropping the external table removes only the table definition, while the data remains available in HDFS.

**** External table with LOCATION is recommended. ****

Internal tables are useful if you want Hive to manage the complete lifecycle of your data including the deletion, whereas external tables are useful when the files are being used outside of Hive.

Example:
 
1. Create the external table
    hive> CREATE EXTERNAL TABLE external_table (dummy STRING)
          ROW FORMAT DELIMITED
          FIELDS TERMINATED BY '\t'
          LOCATION '/user/tom/external_table';

2. If we do not specify the location at table creation time, we can load the data manually
    hive> LOAD DATA INPATH '/user/tom/data.txt' INTO TABLE external_table;


3. Display the content of the table
    hive> select * from external_table;

4. To drop the external table
    hive> DROP TABLE external_table;

When you drop an external table, Hive leaves the data untouched and deletes only the metadata.
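
Since the files survive the drop, you can still list them in HDFS afterwards (a sketch, using the location from the example above):

$ hadoop fs -ls /user/tom/external_table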


Hive Services

cli -- The command line interface to Hive (the shell). This is the default service.

hiveserver -- Runs Hive as a server exposing a Thrift service, enabling access from a range of clients written in different languages. Applications using the Thrift, JDBC, and ODBC connectors need to run a Hive server to communicate with Hive. Set the HIVE_PORT environment variable to specify the port the server will listen on (defaults to 10000).

hwi -- The Hive Web Interface.

jar -- The Hive equivalent of hadoop jar, a convenient way to run Java applications that include both Hadoop and Hive classes on the classpath.

metastore -- By default, the metastore runs in the same process as the Hive service. Using this service, it is possible to run the metastore as a standalone (remote) process. Set the METASTORE_PORT environment variable to specify the port the server will listen on.
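
Each of these can be launched with the --service option of the hive command (a sketch; available service names vary by Hive version):

$ hive --service hiveserver
$ hive --service metastore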

Hive maintains metadata in a metastore, which is stored in a relational database. This metadata contains information about what tables exist, their columns, privileges, and more.

By default Hive uses Derby to store the metastore, which is an embedded Java relational database. Because it’s embedded, Derby can’t be shared between users, and as such it can’t be used in a multiuser environment where the metastore needs to be shared

Hive can support multiple databases, which can be used to avoid table name collisions (two teams or users that have the same table name) and to allow separate databases for different users or products.


The Hive Shell

The shell is the primary way that we will interact with Hive, by issuing commands in HiveQL. HiveQL is Hive’s query language, a dialect of SQL. It is heavily influenced by MySQL, so if you are familiar with MySQL you should feel at home using Hive.

Like SQL, HiveQL is generally case insensitive (except for string comparisons). The Tab key will autocomplete Hive keywords and functions.

You can use the up and down arrow keys to scroll through previous commands. Hive saves the last 100,000 lines into a file, $HOME/.hivehistory.


If required, we can tell the CLI to print column headers, which is disabled by default. We can enable this feature by setting the hiveconf property hive.cli.print.header to true:

hive> set hive.cli.print.header=true;

The Hive shell runs in two modes:

1. Non-Interactive mode
2. Interactive mode

Hive Non-Interactive mode:

You can also run the Hive shell in non-interactive mode.
The -f option runs the commands in the specified file, script.q in this example.
Input file: 

$cat >abc
01,sam
02,ram
03,balu
^d(ctrl+d) --save the file

$cat>script.q
create table testdata(id int,str string) row format delimited fields terminated by ',';
load data local inpath '/home/training/abc' into table testdata;
^d(Ctrl+d)-- save the file

Execute the above script.q file by using -f option


$hive -f script.q

Then a directory (testdata) is created under /user/hive/warehouse.


For short scripts, you can use the -e option to specify the commands inline, in which case the final semicolon is not required:

$hive -e 'SELECT * FROM testdata'

In both interactive and non-interactive mode, Hive prints information to standard error. You can suppress these messages using the -S option at launch time, which has the effect of showing only the output of queries:

$ hive -S -e 'SELECT * FROM dummy'

Here we are storing Hive tables on the local filesystem (fs.default.name is set to its default value of file:///). Tables are stored as directories under Hive's warehouse directory, which is controlled by the hive.metastore.warehouse.dir property and defaults to /user/hive/warehouse.

Thus, the files for the records table are found in the /user/hive/warehouse/records directory on the local filesystem:
$ ls /user/hive/warehouse/records/


Hive Interactive mode:

Go directly into the Hive shell and run the queries.

Ex:

$ hive    -- enters the Hive shell

hive>

Example: 

create file

$cat >abc
01,balu
02,balu1
03,balu2

ctrl+d(save)

Create the Hive table in interactive mode:

hive>create table testdata1(id int, name string) row format delimited fields terminated by ',';

Load the data into the table:

hive>load data local inpath '/home/training/abc' into table testdata1;




 

Friday 7 February 2014

Aggregate Functions in Hive

The following built-in aggregate functions are supported in Hive:


count(*), count(expr), count(DISTINCT expr[, expr_.])

count(*) - Returns the total number of retrieved rows, including rows containing NULL values;
count(expr) - Returns the number of rows for which the supplied expression is non-NULL;
count(DISTINCT expr[, expr]) - Returns the number of rows for which the supplied expression(s) are unique and non-NULL.

sum(col), sum(DISTINCT col)

Returns the sum of the elements in the group or the sum of the distinct values of the column in the group

avg(col), avg(DISTINCT col)

Returns the average of the elements in the group or the average of the distinct values of the column in the group

min(col)

Returns the minimum of the column in the group

max(col)

Returns the maximum value of the column in the group

variance(col), var_pop(col)

Returns the variance of a numeric column in the group

var_samp(col)

Returns the unbiased sample variance of a numeric column in the group

stddev_pop(col)

Returns the standard deviation of a numeric column in the group

stddev_samp(col)

Returns the unbiased sample standard deviation of a numeric column in the group

covar_pop(col1, col2)

Returns the population covariance of a pair of numeric columns in the group

covar_samp(col1, col2)

Returns the sample covariance of a pair of numeric columns in the group

corr(col1, col2)

Returns the Pearson coefficient of correlation of a pair of numeric columns in the group

percentile(BIGINT col, p)

Returns the exact pth percentile of a column in the group (does not work with floating point types). p must be between 0 and 1. NOTE: A true percentile can only be computed for integer values. Use PERCENTILE_APPROX if your input is non-integral.

percentile(BIGINT col, array(p1 [, p2]...))

Returns the exact percentiles p1, p2, ... of a column in the group (does not work with floating point types). pi must be between 0 and 1. NOTE: A true percentile can only be computed for integer values. Use PERCENTILE_APPROX if your input is non-integral.

percentile_approx(DOUBLE col, p [, B])
Returns an approximate pth percentile of a numeric column (including floating point types) in the group. The B parameter controls approximation accuracy at the cost of memory. Higher values yield better approximations, and the default is 10,000. When the number of distinct values in col is smaller than B, this gives an exact percentile value.

percentile_approx(DOUBLE col, array(p1 [, p2]...) [, B])

Same as above, but accepts and returns an array of percentile values instead of a single one.

histogram_numeric(col, b)

Computes a histogram of a numeric column in the group using b non-uniformly spaced bins. The output is an array of size b of double-valued (x,y) coordinates that represent the bin centers and heights

collect_set(col)

Returns a set of objects with duplicate elements eliminated

collect_list(col)

Returns a list of objects with duplicates (as of Hive 0.13.0)

ntile(INTEGER x)

This function divides an ordered partition into x groups called buckets and assigns a bucket number to each row in the partition. This allows easy calculation of tertiles, quartiles, deciles, percentiles and other common summary statistics. (As of Hive 0.11.0.)
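
As a quick illustration, several of these aggregates can be combined in one query over the tab7 table created earlier (a sketch, assuming its data was loaded):

hive> select count(*), min(sal), max(sal), avg(sal) from tab7;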

Hive Built-In Functions

Functions in Hive are categorized as below.

  1. Numeric and Mathematical Functions: These functions are mainly used to perform mathematical calculations.
  2. Date Functions: These functions are used to perform operations on dates, such as adding a number of days to a date.
  3. String Functions: These functions are used to perform operations on strings, such as finding the length of a string.
  4. Conditional Functions: These functions test a condition and return a value based on whether the condition is true or false.
  5. Collection Functions: These functions are used to find the size of complex types like array and map. The most commonly used collection function is SIZE, which returns the number of elements in an array or map; its syntax is SIZE(Array<T>) or SIZE(Map<K,V>).
  6. Type Conversion Function: This function is used to convert from one data type to another. The only type conversion function is CAST, whose syntax is CAST(expr AS type); it converts expr into the specified type.
  7. Table Generating Functions: These functions transform a single row into multiple rows. The most common one is EXPLODE, which takes an array as input and outputs its elements as separate rows, as in the sketch after this list.
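
A few of these categories in action against the tab10 table defined earlier (a sketch; size, cast, and explode as named in the list above):

hive> select size(sub), cast(sal as double) from tab10;
hive> select explode(sub) as subject from tab10;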

Numeric and Mathematical Functions:

The numerical functions are listed below in alphabetical order and can be used directly in HiveQL queries.
ABS( double n )

The ABS function returns the absolute value of a number.

Example: ABS(-100)

ACOS( double n )

The ACOS function returns the arc cosine of value n. This function returns NULL if n is not in the range -1<=n<=1.

Example: ACOS(0.5)

ASIN( double n )

The ASIN function returns the arc sine of value n. This function returns NULL if n is not in the range -1<=n<=1.

Example: ASIN(0.5)

BIN( bigint n )

The BIN function returns the number n in the binary format.

Example: BIN(100)

CEIL( double n ), CEILING( double n )

The CEIL or CEILING function returns the smallest integer greater than or equal to the decimal value n.

Example: CEIL(9.5)

CONV( bigint n, int from_base, int to_base )

The CONV function converts the given number n from one base to another base.

Example: CONV(100, 10,2)

COS( double n )

The COS function returns the cosine of the value n. Here n should be specified in radians.

Example: COS(180*3.1415926/180)

EXP( double n )

The EXP function returns e to the power of n. Where e is the base of natural logarithm and its value is 2.718.

Example: EXP(50)

FLOOR( double n )

The FLOOR function returns the largest integer less than or equal to the given value n.

Example: FLOOR(10.9)

HEX( bigint n)

This function converts the value n into hexadecimal format.

Example: HEX(16)

HEX( string n )

This function converts each character into hex representation format.

Example: HEX(‘ABC’)

LN( double n )

The LN function returns the natural log of a number.

Example: LN(123.45)

LOG( double base, double n )

The LOG function returns the base logarithm of the number n.

Example: LOG(3, 66)

LOG2( double n )

The LOG2 function returns the base-2 logarithm of the number n.

Example: LOG2(44)

LOG10( double n )

The LOG10 function returns the base-10 logarithm of the number n.

Example: LOG10(100)

NEGATIVE( int n ),  NEGATIVE( double n )

The NEGATIVE function returns -n.

Example: NEGATIVE(10)

PMOD( int m, int n ), PMOD( double m, double n )

The PMOD function returns the positive modulus of a number.

Example: PMOD(3,2)

POSITIVE( int n ), POSITIVE( double n )

The POSITIVE function returns n

Example: POSITIVE(-10)

POW( double m, double n ), POWER( double m, double n )


The POW or POWER function returns m value raised to the n power.

Example: POW(10,2)

RAND( [int seed] )

The RAND function returns a random number. If you specify the seed value, the generated random number will become deterministic.

Example: RAND( )

ROUND( double value [, int n] )

The ROUND function returns the value rounded to n decimal places.

Example: ROUND(123.456,2)

SIN( double n )

The SIN function returns the sine of a number. Here n should be specified in radians.

Example: SIN(2)

SQRT( double n )

The SQRT function returns the square root of the number

Example: SQRT(4)

UNHEX( string n )

The UNHEX function is the inverse of the HEX function. It interprets each pair of characters in the string as a hexadecimal number and converts it back to the corresponding character.

Example: UNHEX(‘AB’)


TAN(DOUBLE a)

Returns the tangent of a (a is in radians)

ATAN(DOUBLE a)

Returns the arctangent of a

DEGREES(DOUBLE a)

Converts value of a from radians to degrees

RADIANS(DOUBLE a)

Converts value of a from degrees to radians

SIGN(DOUBLE a)

Returns the sign of a as '1.0' (if a is positive) or '-1.0' (if a is negative), '0.0' otherwise

e()

Returns the value of e

PI()

Returns the value of pi
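
A handful of the mathematical functions in one query (a sketch; the testdata table from earlier is used only because older Hive versions require a FROM clause):

hive> select round(123.456, 2), pow(2, 10), pmod(-7, 3) from testdata limit 1;

This returns 123.46, 1024.0, and 2.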


Date Functions:

Note: Historically, there was no separate date data type in Hive; dates were treated as strings (a DATE type was only added in Hive 0.12.0). The date functions are listed below.

UNIX_TIMESTAMP()

This function returns the number of seconds from the Unix epoch (1970-01-01 00:00:00 UTC) using the default time zone.

UNIX_TIMESTAMP( string date )

This function converts the date in format 'yyyy-MM-dd HH:mm:ss' into Unix timestamp. This will return the number of seconds between the specified date and the Unix epoch. If it fails, then it returns 0.

Example: UNIX_TIMESTAMP('2000-01-01 00:00:00') returns 946713600

UNIX_TIMESTAMP( string date, string pattern )

This function converts the date to the specified date format and returns the number of seconds between the specified date and Unix epoch. If it fails, then it returns 0.

Example: UNIX_TIMESTAMP('2000-01-01 10:20:30','yyyy-MM-dd') returns 946713600

FROM_UNIXTIME( bigint number_of_seconds  [, string format] )

The FROM_UNIXTIME function converts the specified number of seconds from the Unix epoch and returns the date in the format 'yyyy-MM-dd HH:mm:ss'.

Example: FROM_UNIXTIME( UNIX_TIMESTAMP() ) returns the current date including the time. This is equivalent to SYSDATE in Oracle.

TO_DATE( string timestamp )

The TO_DATE function returns the date part of the timestamp in the format 'yyyy-MM-dd'.

Example: TO_DATE('2000-01-01 10:20:30') returns '2000-01-01'

YEAR( string date )

The YEAR function returns the year part of the date.

Example: YEAR('2000-01-01 10:20:30') returns 2000

MONTH( string date )

The MONTH function returns the month part of the date.

Example: MONTH('2000-03-01 10:20:30') returns 3

DAY( string date ), DAYOFMONTH( date )

The DAY or DAYOFMONTH function returns the day part of the date.

Example: DAY('2000-03-01 10:20:30') returns 1

HOUR( string date )

The HOUR function returns the hour part of the date.

Example: HOUR('2000-03-01 10:20:30') returns 10

MINUTE( string date )

The MINUTE function returns the minute part of the timestamp.

Example: MINUTE('2000-03-01 10:20:30') returns 20

SECOND( string date )

The SECOND function returns the second part of the timestamp.

Example: SECOND('2000-03-01 10:20:30') returns 30

WEEKOFYEAR( string date )

The WEEKOFYEAR function returns the week number of the date.

Example: WEEKOFYEAR('2000-03-01 10:20:30') returns 9

DATEDIFF( string date1, string date2 )

The DATEDIFF function returns the number of days between the two given dates.

Example: DATEDIFF('2000-03-01', '2000-01-10')  returns 51

DATE_ADD( string date, int days )


The DATE_ADD function adds the number of days to the specified date

Example: DATE_ADD('2000-03-01', 5) returns '2000-03-06'

DATE_SUB( string date, int days )

The DATE_SUB function subtracts the number of days from the specified date

Example: DATE_SUB('2000-03-01', 5) returns '2000-02-25'
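
Several of the date functions compose naturally, for example (a sketch; testdata is used only to supply a FROM clause):

hive> select to_date(from_unixtime(unix_timestamp())), datediff('2000-03-01', '2000-01-10') from testdata limit 1;

The first column gives today's date; the second returns 51, as in the DATEDIFF example above.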


String Functions:


The string functions in Hive are listed below:

ASCII( string str )

The ASCII function converts the first character of the string into its numeric ascii value.

Example1: ASCII('hadoop') returns 104
Example2: ASCII('A') returns 65

CONCAT( string str1, string str2... )

The CONCAT function concatenates all the strings.

Example: CONCAT('hadoop','-','hive') returns 'hadoop-hive'

CONCAT_WS( string delimiter, string str1, string str2... )

The CONCAT_WS function is similar to the CONCAT function. Here you can also provide the delimiter, which can be used in between the strings to concat.

Example: CONCAT_WS('-','hadoop','hive') returns 'hadoop-hive'

FIND_IN_SET( string search_string, string source_string_list )

The FIND_IN_SET function searches for the search string in source_string_list and returns the position of its first occurrence in the source string list. The source string list must be comma-delimited. It returns 0 if the first argument contains a comma.

Example: FIND_IN_SET('ha','hao,mn,hc,ha,hef') returns 4

LENGTH( string str )

The LENGTH function returns the number of characters in a string.

Example: LENGTH('hive') returns 4

LOWER( string str ),  LCASE( string str )

The LOWER or LCASE function converts the string into lower case letters.

Example: LOWER('HiVe') returns 'hive'

LPAD( string str, int len, string pad )

The LPAD function returns the string with a length of len characters left-padded with pad.

Example: LPAD('hive',6,'v') returns 'vvhive'

LTRIM( string str )

The LTRIM function removes all the leading spaces from the string.

Example: LTRIM('   hive') returns 'hive'

REPEAT( string str, int n )


The REPEAT function repeats the specified string n times.

Example: REPEAT('hive',2) returns 'hivehive'

RPAD( string str, int len, string pad )

The RPAD function returns the string with a length of len characters right-padded with pad.

Example: RPAD('hive',6,'v') returns 'hivevv'

REVERSE( string str )

The REVERSE function gives the reversed string

Example: REVERSE('hive') returns 'evih'

RTRIM( string str )

The RTRIM function removes all the trailing spaces from the string.

Example: RTRIM('hive   ') returns 'hive'

SPACE( int number_of_spaces )

The SPACE function returns the specified number of spaces.

Example: SPACE(4) returns '    '

SPLIT( string str, string pat )

The SPLIT function splits the string around the pattern pat and returns an array of strings. You can specify regular expressions as patterns.

Example: SPLIT('hive:hadoop',':') returns ["hive","hadoop"]

SUBSTR( string source_str, int start_position [,int length]  ),  SUBSTRING( string source_str, int start_position [,int length]  )

The SUBSTR or SUBSTRING function returns a part of the source string from the start position with the specified length of characters. If the length is not given, then it returns from the start position to the end of the string.

Example1: SUBSTR('hadoop',4) returns 'oop'
Example2: SUBSTR('hadoop',4,2) returns 'oo'

TRIM( string str )

The TRIM function removes both the trailing and leading spaces from the string.

Example: TRIM('   hive   ') returns 'hive'

UPPER( string str ), UCASE( string str )

The UPPER or UCASE function converts the string into upper case letters.

Example: UPPER('HiVe') returns 'HIVE'
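
A combined example over the testdata1 table created earlier (a sketch; the cast is needed because LPAD expects a string):

hive> select concat_ws('-', upper(name), lpad(cast(id as string), 3, '0')) from testdata1;

For the row (1, 'balu') this produces 'BALU-001'.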




Conditional Functions:


Hive supports three types of conditional functions. These functions are listed below:

IF( Test Condition, True Value, False Value )


The IF condition evaluates the "Test Condition"; if it is true, it returns the "True Value", otherwise it returns the "False Value".

Example: IF(1=1, 'working', 'not working') returns 'working'

COALESCE( value1,value2,... )

The COALESCE function returns the first non-NULL value from the list of values. If all the values in the list are NULL, then it returns NULL.

Example: COALESCE(NULL,NULL,5,NULL,4) returns 5


CASE Statement

The syntax for the case statement is:

CASE   [ expression ]
       WHEN condition1 THEN result1
       WHEN condition2 THEN result2
       ...
       WHEN conditionN THEN resultN
       ELSE result
END

Here expression is optional. It is the value that you are comparing to the list of conditions (i.e., condition1, condition2, ..., conditionN).

All the conditions must be of same datatype. Conditions are evaluated in the order listed. Once a condition is found to be true, the case statement will return the result and not evaluate the conditions any further.

All the results must be of same datatype. This is the value returned once a condition is found to be true.

If no condition is found to be true, the CASE statement returns the value in the ELSE clause. If the ELSE clause is omitted and no condition is true, the CASE statement returns NULL.

Example:

CASE   Fruit
       WHEN 'APPLE' THEN 'The owner is APPLE'
       WHEN 'ORANGE' THEN 'The owner is ORANGE'
       ELSE 'It is another Fruit'
END

The other form of CASE is


CASE
       WHEN Fruit = 'APPLE' THEN 'The owner is APPLE'
       WHEN Fruit = 'ORANGE' THEN 'The owner is ORANGE'
       ELSE 'It is another Fruit'

END
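
In a query, CASE is used like any other expression. For example, against the tab7 table from earlier (a sketch; the band alias is just illustrative):

hive> select name, case when sal >= 10000 then 'high' else 'low' end as band from tab7;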


Collection Functions:

The following built-in collection functions are supported in Hive:

size(Map<K.V>)

Returns the number of elements in the map type

size(Array<T>)

Returns the number of elements in the array type

map_keys(Map<K.V>)

Returns an unordered array containing the keys of the input map

map_values(Map<K.V>)

Returns an unordered array containing the values of the input map

array_contains(Array<T>, value)

Returns TRUE if the array contains value

sort_array(Array<T>)

Sorts the input array in ascending order according to the natural ordering of the array elements and returns it (as of version 0.9.0)
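
Several collection functions applied to the tab10 table defined earlier (a sketch):

hive> select size(dud), map_keys(dud), array_contains(sub, 'a') from tab10;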


Type Conversion Function:

The following type conversion functions are supported in Hive:

binary(string|binary)

Casts the parameter into a binary

cast(expr as <type>)

Converts the result of the expression expr to <type>, e.g. cast('1' as BIGINT) converts the string '1' to its integral representation. NULL is returned if the conversion does not succeed.
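
For example (a sketch; note the NULL when a conversion cannot succeed):

hive> select cast('1' as bigint), cast('abc' as int) from testdata limit 1;

The first column returns 1; the second returns NULL.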


Table Generating Functions:

 Normal user-defined functions, such as concat(), take in a single input row and output a single output row. In contrast, table-generating functions transform a single input row to multiple output rows.

explode(ARRAY)

Returns one row for each element from the array

explode(MAP)

Returns one row for each key-value pair from the input map with two columns in each row: one for the key and another for the value. (as of Hive 0.8.0)

inline(ARRAY<STRUCT[,STRUCT]>)

Explodes an array of structs into a table (as of Hive 0.10)

json_tuple(jsonStr, k1, k2, ...)

It takes a set of names (keys) and a JSON string, and returns a tuple of values. This is a more efficient version of the get_json_object UDF because it can get multiple keys with just one call

parse_url_tuple(url, p1, p2, ...)

This is similar to the parse_url() UDF but can extract multiple parts at once out of a URL. Valid part names are: HOST, PATH, QUERY, REF, PROTOCOL, AUTHORITY, FILE, USERINFO, QUERY:<KEY>.

posexplode(ARRAY)
Behaves like explode for arrays, but includes the position of items in the original array by returning a tuple of (pos, value) (as of Hive 0.13.0)

stack(INT n, v_1, v_2, ..., v_k)

Breaks up v_1, ..., v_k into n rows. Each row will have k/n columns. n must be constant.
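
To keep the other columns of the row alongside the exploded elements, combine explode with LATERAL VIEW. For example, on the tab7 table from earlier (a sketch; subview and subject are illustrative aliases):

hive> select id, name, subject from tab7 lateral view explode(sub) subview as subject;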

 

Thursday 6 February 2014

Allowed Implicit Conversions in Hive


Hive Data Types

Hive data types are categorized into two types. They are the primitive and complex data types.

The primitive data types include integers, Booleans, floating point numbers, and strings. The available primitive types are listed below:

Primitive types:

  • TINYINT
  • SMALLINT
  • INT
  • BIGINT
  • BOOLEAN
  • FLOAT
  • DOUBLE
  • STRING
  • BINARY (Only available starting with Hive 0.8.0)
  • TIMESTAMP (Only available starting with Hive 0.8.0)
  • DATE (Only available starting with Hive 0.12.0)
  • VARCHAR (Only available starting with Hive 0.12.0)
  • CHAR (Only available starting with Hive 0.13.0)
  • DECIMAL (Introduced in Hive 0.11.0 with a precision of 38 digits, in Hive 0.13.0 introduced user definable precision and scale)

Complex Types:

  • arrays: ARRAY<data_type>
  • maps: MAP<primitive_type, data_type>
  • structs: STRUCT<col_name : data_type [COMMENT col_comment], ...>
  • union: UNIONTYPE<data_type, data_type, ...> (Note: Only available starting with Hive 0.7.0)

Integral Types (TINYINT, SMALLINT, INT, BIGINT):

Integral literals are assumed to be INT by default, unless the number exceeds the range of INT in which case it is interpreted as a BIGINT, or if one of the following postfixes is present on the number.

Type          Postfix    Example

TINYINT       Y          100Y

SMALLINT      S          100S

BIGINT        L          100L


Floating Point Types:

Floating point literals are assumed to be DOUBLE. Scientific notation is not yet supported.


String Types:

String

String literals can be expressed with either single quotes (') or double quotes ("). Hive uses C-style escaping within the strings.

Varchar (Varchar datatype was introduced in Hive 0.12.0)

Varchar types are created with a length specifier (between 1 and 65535), which defines the maximum number of characters allowed in the character string. If a string value being converted/assigned to a varchar value exceeds the length specifier, the string is silently truncated. Character length is determined by the number of code points contained by the character string.

Limitation: Non-generic UDFs cannot directly use varchar type as input arguments or return values. String UDFs can be created instead, and the varchar values will be converted to strings and passed to the UDF. To use varchar arguments directly or to return varchar values, create a GenericUDF.
There may be other contexts which do not support varchar, if they rely on reflection-based methods for retrieving type information. This includes some SerDe implementations.

Char (Char datatype will be introduced in Hive 0.13.0)

Char types are similar to Varchar but they are fixed-length meaning that values shorter than the specified length value are padded with spaces but trailing spaces are not important during comparisons. The maximum length is fixed at 255.

Decimal Types (Decimal datatype was introduced in Hive 0.11.0):

Decimal literals provide precise values and greater range for floating point numbers than the DOUBLE type. Decimal data types store exact representations of numeric values, while DOUBLE data types store very close approximations of numeric values.

Decimal types are needed for use cases in which the (very close) approximation of a DOUBLE is insufficient, such as financial applications, equality and inequality checks, and rounding operations. They are also needed for use cases that deal with numbers outside the DOUBLE range (approximately -10^308 to 10^308) or very close to zero (-10^-308 to 10^-308). For a general discussion of the limits of the DOUBLE type, see the Wikipedia article Double-precision floating-point format.

The precision of a Decimal type is limited to 38 digits in Hive

Using Decimal Types:

You can create a table in Hive that uses the Decimal type with the following syntax:
                       create table decimal_1 (t decimal);
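Starting with Hive 0.13.0 (per the note in the type list above), precision and scale can also be specified explicitly; a sketch, with decimal_2 as an illustrative name:
                       create table decimal_2 (t decimal(38,18));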

Complex Data Types:

Arrays
Contain a list of elements of the same data type. These elements are accessed by using an index. For example, for an array fruits containing the elements ['apple', 'mango', 'orange'], the element 'apple' can be accessed as fruits[0], since array indexes start at 0.

Maps
Contain key-value pairs. The elements are accessed by using keys. For example, for a map pass_list with the user name as key and the password as value, the password of a user can be accessed as pass_list['username'].

Structs
Contain elements of different data types. The elements can be accessed using dot notation. For example, in a struct car, the color of the car can be retrieved as car.color.



Wednesday 5 February 2014

SQL and HiveQL comparison




Hive Overview

Apache Hive is a data warehouse infrastructure built on top of Hadoop for providing data summarization, query, and analysis. It was initially developed by Facebook.

The Apache Hive data warehouse software facilitates querying and managing large datasets residing in distributed storage. Hive provides a mechanism to project structure onto this data and query the data using a SQL-like language called HiveQL. At the same time this language also allows traditional map/reduce programmers to plug in their custom mappers and reducers when it is inconvenient or inefficient to express this logic in HiveQL.

By default, Hive stores metadata in an embedded Apache Derby database; other client/server databases such as MySQL can optionally be used. (Single-user metadata can be stored in the embedded Derby database, while metadata shared by multiple users should be stored in MySQL or another client/server database.)

Currently, four file formats are supported in Hive: TEXTFILE, SEQUENCEFILE, ORC, and RCFILE.

HiveQL

Hive’s SQL dialect, called HiveQL, does not support the full SQL-92 specification. There are a number of reasons for this. Being a fairly young project, it has not had time to provide the full repertoire of SQL-92 language constructs. More fundamentally, SQL-92 compliance has never been an explicit project goal; rather, as an open source project, features were added by developers to meet their users’ needs. Furthermore, Hive has some extensions that are not in SQL-92, which have been inspired by syntax from other database systems, notably MySQL. In fact, to a first-order approximation, HiveQL most closely resembles MySQL’s SQL dialect.