Saturday 11 October 2014

Neo4j Overview


Neo4j is an open-source graph database implemented in Java. Its developers describe Neo4j as an "embedded, disk-based, fully transactional Java persistence engine that stores data structured in graphs rather than in tables". Neo4j is the most popular graph database.

Data Structure:

In Neo4j, everything is stored in the form of nodes and relationships. Each node and relationship can have any number of properties (attributes), and both nodes and relationships can be labelled. Labelling is useful because it lets you narrow the part of the graph a query has to search. Neo4j also supports node indexing.
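For example, labels and a schema index can be combined from the embedded Java API. The sketch below assumes the Neo4j 2.x embedded API is on the classpath; the database directory, label name and property values are placeholders for illustration. It creates an index on the name property of Person nodes and then looks nodes up by that label and property.

import java.util.concurrent.TimeUnit;
import org.neo4j.graphdb.DynamicLabel;
import org.neo4j.graphdb.GraphDatabaseService;
import org.neo4j.graphdb.Label;
import org.neo4j.graphdb.Node;
import org.neo4j.graphdb.Transaction;
import org.neo4j.graphdb.factory.GraphDatabaseFactory;

public class LabelIndexExample {
    public static void main(String[] args) {
        // Placeholder directory for the embedded database.
        GraphDatabaseService graphDb =
                new GraphDatabaseFactory().newEmbeddedDatabase("target/neo4j-demo-db");
        Label person = DynamicLabel.label("Person");

        // Create a schema index on :Person(name) so label-based lookups stay fast.
        try (Transaction tx = graphDb.beginTx()) {
            graphDb.schema().indexFor(person).on("name").create();
            tx.success();
        }

        // Wait until the index has been populated before querying it.
        try (Transaction tx = graphDb.beginTx()) {
            graphDb.schema().awaitIndexesOnline(10, TimeUnit.SECONDS);
            tx.success();
        }

        // Narrow the search to nodes carrying the Person label with name = "bala".
        try (Transaction tx = graphDb.beginTx()) {
            for (Node n : graphDb.findNodesByLabelAndProperty(person, "name", "bala")) {
                System.out.println("Found node " + n.getId());
            }
            tx.success();
        }

        graphDb.shutdown();
    }
}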

What is Neo4j?

Neo4j is an open-source graph database supported by Neo Technology. Neo4j stores data in nodes that are connected by directed relationships (-> or <-), and both the nodes and the relationships can carry properties.

Features:

  • intuitive, using a graph model for data representation
  • reliable, with full ACID transactions
  • durable and fast, using a custom disk-based, native storage engine
  • massively scalable, up to several billion nodes/relationships/properties
  • highly-available, when distributed across multiple machines
  • expressive, with a powerful, human-readable graph query language
  • fast, with a powerful traversal framework for high-speed graph queries
  • embeddable, with a few small jars
  • simple, accessible by a convenient REST interface or an object-oriented Java API


What is a Graph Database?

A graph database stores data in a graph, the most generic of data structures, capable of elegantly representing any kind of data in a highly accessible way.



A graph is made up of different records: nodes and the relationships that connect them. Both nodes and relationships can have properties, and the relationships organize the nodes into structure.
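As a rough sketch of what building such a graph looks like with the embedded Java API (assuming Neo4j 2.x on the classpath; the database path, labels, relationship type and property values below are made up for illustration):

import org.neo4j.graphdb.DynamicLabel;
import org.neo4j.graphdb.DynamicRelationshipType;
import org.neo4j.graphdb.GraphDatabaseService;
import org.neo4j.graphdb.Node;
import org.neo4j.graphdb.Relationship;
import org.neo4j.graphdb.Transaction;
import org.neo4j.graphdb.factory.GraphDatabaseFactory;

public class SmallGraphExample {
    public static void main(String[] args) {
        // The database directory is a placeholder; Neo4j creates it if it does not exist.
        GraphDatabaseService graphDb =
                new GraphDatabaseFactory().newEmbeddedDatabase("target/neo4j-graph-db");

        try (Transaction tx = graphDb.beginTx()) {
            // Two labelled nodes, each with its own properties.
            Node bala = graphDb.createNode(DynamicLabel.label("Person"));
            bala.setProperty("name", "bala");

            Node neo4j = graphDb.createNode(DynamicLabel.label("Database"));
            neo4j.setProperty("name", "Neo4j");

            // A directed relationship, which can also carry properties.
            Relationship likes =
                    bala.createRelationshipTo(neo4j, DynamicRelationshipType.withName("LIKES"));
            likes.setProperty("since", 2014);

            tx.success();
        }

        graphDb.shutdown();
    }
}

The same graph could equally be created through the REST interface or with Cypher; the embedded API is simply the most direct route from Java.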

Thursday 2 October 2014

Pig UDF


Pig provides extensive support for user defined functions (UDFs) as a way to specify custom processing. Pig UDFs can currently be implemented in five languages: Java, Python, JavaScript, Ruby and Groovy.

The most extensive support is provided for Java functions. You can customize all parts of the processing including data load/store, column transformation, and aggregation. Java functions are also more efficient because they are implemented in the same language as Pig and because additional interfaces are supported such as the Algebraic Interface and the Accumulator Interface.
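As an illustration of the Accumulator interface, here is a minimal sketch of a running-sum UDF; the package and class names are hypothetical. Implementing Accumulator lets Pig feed the grouped bag to the UDF in chunks rather than materialising the whole bag in memory.

package com.pig.bigdatariding;

import java.io.IOException;
import org.apache.pig.Accumulator;
import org.apache.pig.EvalFunc;
import org.apache.pig.data.DataBag;
import org.apache.pig.data.Tuple;

public class LongSumAcc extends EvalFunc<Long> implements Accumulator<Long> {

    private long intermediateSum = 0L;

    @Override
    public Long exec(Tuple input) throws IOException {
        // Fallback path used when Pig does not choose the accumulator interface.
        cleanup();
        accumulate(input);
        Long result = getValue();
        cleanup();
        return result;
    }

    @Override
    public void accumulate(Tuple b) throws IOException {
        // b holds a bag containing one chunk of the grouped values.
        DataBag bag = (DataBag) b.get(0);
        for (Tuple t : bag) {
            Object value = t.get(0);
            if (value != null) {
                intermediateSum += ((Number) value).longValue();
            }
        }
    }

    @Override
    public Long getValue() {
        return intermediateSum;
    }

    @Override
    public void cleanup() {
        intermediateSum = 0L;
    }
}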


Limited support is provided for Python, JavaScript, Ruby and Groovy functions. These functions are newer, still-evolving additions to the system. Currently only the basic interface is supported; load/store functions are not supported. Furthermore, JavaScript, Ruby and Groovy are provided as experimental features because they did not go through the same amount of testing as Java or Python. Note that at runtime Pig will automatically detect the use of a scripting UDF in the Pig script and will ship the corresponding scripting jar (Jython, Rhino, JRuby or Groovy-all) to the backend.

Writing Java UDF for Swap

A Pig Java UDF extends the abstract class EvalFunc. This class declares an abstract method, exec, which the user implements in a concrete subclass with the desired functionality.

package com.pig.bigdatariding; 
import java.io.IOException;
import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;
import org.apache.pig.data.TupleFactory;
import org.apache.pig.impl.logicalLayer.schema.Schema;
import org.apache.pig.data.DataType;

public class Swap extends EvalFunc<Tuple> {

    @Override
    public Tuple exec(Tuple input) throws IOException {
        // Nothing to swap if the input tuple has fewer than two fields.
        if (input == null || input.size() < 2)
            return null;
        try {
            // Build a new two-field tuple with the fields in reverse order.
            Tuple output = TupleFactory.getInstance().newTuple(2);
            output.set(0, input.get(1));
            output.set(1, input.get(0));
            return output;
        } catch (Exception e) {
            System.err.println("Failed to process input; error - " + e.getMessage());
            return null;
        }
    }
}
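Besides exec, EvalFunc also lets a UDF declare the schema of its output, which is what the Schema and DataType imports above are for. A sketch of an outputSchema method that could be added inside the Swap class; it mirrors the field swap done in exec:

    // Optional: tell Pig the schema of the tuple that Swap returns,
    // so downstream statements can refer to the swapped fields.
    @Override
    public Schema outputSchema(Schema input) {
        try {
            // The output tuple carries the two input fields in reverse order.
            Schema tupleSchema = new Schema();
            tupleSchema.add(input.getField(1));
            tupleSchema.add(input.getField(0));
            return new Schema(new Schema.FieldSchema(
                    getSchemaName(this.getClass().getName().toLowerCase(), input),
                    tupleSchema, DataType.TUPLE));
        } catch (Exception e) {
            return null;
        }
    }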



Register jar file:

grunt>register swapudf.jar;

Ex: emp.csv
      1,bala
      2,narayana
      3,reddy

grunt>A= load 'emp.csv' USING PigStorage(',') as (id: int, name: chararray);

grunt>B= foreach A generate flatten(com.pig.bigdatariding.Swap(name,id));
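Here Swap is handed (name, id) and returns the two fields in the opposite order, so dumping B should print tuples with the id first again:

grunt>DUMP B;
(1,bala)
(2,narayana)
(3,reddy)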




Monday 30 June 2014

Hive Architecture




Command line interface (CLI): the default and most common way of accessing Hive.
HiveServer: runs Hive as a server exposing a Thrift service, enabling access from a range of clients written in different languages.
HWI: the Hive web interface.



Shell: the command line interface. It allows interactive queries, much like a MySQL shell connected to a database. Hive also supports web and JDBC clients.
The driver, compiler and execution engine together take HiveQL scripts and run them in the Hadoop environment.
Driver: the component which receives the queries. It implements the notion of session handles and provides execute and fetch APIs modelled on JDBC/ODBC interfaces.
Compiler: the component that parses the query, does semantic analysis on the different query blocks and query expressions, and eventually generates an execution plan with the help of the table and partition metadata looked up from the metastore.
Execution engine: the component which executes the plan created by the compiler. The plan is a DAG of stages. The execution engine manages the dependencies between these stages and executes them on the appropriate system components.
Metastore: the system catalog. It stores all the structure information of the various tables and partitions in the warehouse, including column and column-type information, the serializers and deserializers necessary to read and write data, and the corresponding HDFS files where the data is stored.


The compiler is invoked by the driver upon receiving a HiveQL statement. The compiler translates the statement into a plan consisting of a DAG of MapReduce jobs.
The driver submits the individual MapReduce jobs from the DAG to the execution engine in topological order. Hive currently uses Hadoop as its execution engine.
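Since HiveServer exposes Hive over Thrift, clients in many languages can submit HiveQL remotely; the JDBC interface mentioned above sits on top of this. Below is a minimal sketch of a Java client using the HiveServer2 JDBC driver; the host, port, credentials and the emp table are placeholders, not part of this setup.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveJdbcClient {
    public static void main(String[] args) throws Exception {
        // Register the HiveServer2 JDBC driver.
        Class.forName("org.apache.hive.jdbc.HiveDriver");

        // Host, port and database are placeholders for a real cluster.
        Connection con =
                DriverManager.getConnection("jdbc:hive2://localhost:10000/default", "hive", "");

        try {
            Statement stmt = con.createStatement();
            // The query is compiled into a DAG of MapReduce jobs and executed on Hadoop.
            ResultSet rs = stmt.executeQuery("SELECT id, name FROM emp LIMIT 10");
            while (rs.next()) {
                System.out.println(rs.getInt(1) + "\t" + rs.getString(2));
            }
        } finally {
            con.close();
        }
    }
}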


Saturday 1 March 2014

Pig Overview

Hive Vs Pig


Feature                            Hive                 Pig
Language                           SQL-like             PigLatin
Schemas/Types                      Yes (explicit)       Yes (implicit)
Partitions                         Yes                  No
Server                             Optional (Thrift)    No
User Defined Functions (UDF)       Yes (Java)           Yes (Java)
Custom Serializer/Deserializer     Yes                  Yes
DFS Direct Access                  Yes (implicit)       Yes (explicit)
Join/Order/Sort                    Yes                  Yes
Shell                              Yes                  Yes
Streaming                          Yes                  Yes
Web Interface                      Yes                  No
JDBC/ODBC                          Yes (limited)        No



Apache Pig and Hive are two projects that layer on top of Hadoop and provide a higher-level language for using Hadoop's MapReduce library. Apache Pig provides a scripting language for describing operations like reading, filtering, transforming, joining, and writing data -- exactly the operations that MapReduce was originally designed for. Rather than expressing these operations in thousands of lines of Java code that uses MapReduce directly, Pig lets users express them in a language not unlike a Bash or Perl script. Pig is excellent for prototyping and rapidly developing MapReduce-based jobs, as opposed to coding MapReduce jobs in Java itself.
If Pig is "scripting for Hadoop", then Hive is "SQL queries for Hadoop". Apache Hive offers an even more specific, higher-level language for querying data by running Hadoop jobs, rather than directly scripting step-by-step the operation of several MapReduce jobs on Hadoop. The language is, by design, extremely SQL-like. Hive is still intended as a tool for long-running, batch-oriented queries over massive data; it is not "real-time" in any sense. Hive is an excellent tool for analysts and business development types who are accustomed to SQL-like queries and Business Intelligence systems; it will let them easily leverage your shiny new Hadoop cluster to perform ad-hoc queries or generate report data across data stored in HDFS.


WORD COUNT EXAMPLE - PIG SCRIPT

Q) How to find the number of occurrences of the words in a file using the pig script?

You can find the famous word count example written as a MapReduce program on the Apache website. Here we will write a simple Pig script for the word count problem.

The following Pig script finds the number of times each word is repeated in a file:

Word Count Example Using Pig Script




lines = LOAD '/user/hadoop/HDFS_File.txt' AS (line:chararray);
words = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) as word;
grouped = GROUP words BY word;
wordcount = FOREACH grouped GENERATE group, COUNT(words);
DUMP wordcount;



The above Pig script first splits each line into words using the TOKENIZE function, which produces a bag of words. The FLATTEN operator then un-nests the bag so that each word becomes its own tuple. In the third statement the words are grouped together so that the count can be computed in the fourth statement.

You can see that with just five lines of Pig, the word count problem is solved very easily.