Saturday, 11 October 2014

Neo4j Overview

Neo4j is an open-source graph database, implemented in Java.The developers describe Neo4j as "embedded, disk-based, fully transactional Java persistence engine that stores data structured in graphs rather than in tables". Neo4j is the most popular graph database

Data Structure:

In Neo4j, everything is stored in form of nodes and relationships. Each node and relationship can have any number of attributes. Both the nodes and relationship can be labelled. Labeling is useful, because you can narrow down your searching area using the labels. Neo4j suported node indexing.

What is Neo4j?

Neo4j is an open-source graph database supported by Neo Technology. Neo4j stores data in nodes and relationships with properties on both are connected by directed(-> or <- or -).


  • intuitive, using a graph model for data representation
  • reliable, with full ACID transactions
  • durable and fast, using a custom disk-based, native storage engine
  • massively scalable, up to several billion nodes/relationships/properties
  • highly-available, when distributed across multiple machines
  • expressive, with a powerful, human readable graph query language fast, with a powerful traversal framework for high-speed graph queries
  • embeddable, with a few small jars
  • simple, accessible by a convenient REST interface or an object-oriented Java API

What is a Graph Database?

A graph database stores data in a graph, the most generic of data structures, capable of elegantly representing any kind of data in a highly accessible way.

A graph having different records(nodes and relations), A node and relation have different properties, the relationship organizes the nodes

Thursday, 2 October 2014


Pig provides extensive support for user defined functions (UDFs) as a way to specify custom processing. Pig UDFs can currently be implemented in three languages: Java, Python, JavaScript, Ruby and Groovy.

The most extensive support is provided for Java functions. You can customize all parts of the processing including data load/store, column transformation, and aggregation. Java functions are also more efficient because they are implemented in the same language as Pig and because additional interfaces are supported such as the Algebraic Interface and the Accumulator Interface.

Limited support is provided for Python, JavaScript, Ruby and Groovy functions. These functions are new, still evolving, additions to the system. Currently only the basic interface is supported; load/store functions are not supported. Furthermore, JavaScript, Ruby and Groovy are provided as experimental features because they did not go through the same amount of testing as Java or Python. At runtime note that Pig will automatically detect the usage of a scripting UDF in the Pig script and will automatically ship the corresponding scripting jar, either Jython, Rhino, JRuby or Groovy-all, to the backend.

Writing Java UDF for Swap

Pig’s Java UDF extends functionalities of EvalFunc. This abstract class have an abstract method “exec” which user needs to implement in concrete class with appropriate functionality.

package com.pig.bigdatariding; 
import org.apache.pig.EvalFunc;
import org.apache.pig.impl.logicalLayer.schema.Schema;

public class Swap extends EvalFunc<Tuple> {
public Tuple exec(Tuple input) throws IOException {
        if (input == null || input.size() < 2)
             return null;
             Tuple output = TupleFactory.getInstance().newTuple(2);
             output.set(0, input.get(1));
             output.set(1, input.get(0));
             return output;
         } catch(Exception e){
             System.err.println("Failed to process input; error - " + e.getMessage());
             return null;


Register jar file:

grunt>register swapudf.jar;

Ex: emp.csv

grunt>A= load 'emp.csv' USING PigStorage(',') as (id: int, name: chararray);

grunt>B= foreach A generate flatten(com.pig.bigdatariding.Swap(name,id));