Thursday 2 October 2014

Pig UDF


Pig provides extensive support for user defined functions (UDFs) as a way to specify custom processing. Pig UDFs can currently be implemented in three languages: Java, Python, JavaScript, Ruby and Groovy.

The most extensive support is provided for Java functions. You can customize all parts of the processing including data load/store, column transformation, and aggregation. Java functions are also more efficient because they are implemented in the same language as Pig and because additional interfaces are supported such as the Algebraic Interface and the Accumulator Interface.


Limited support is provided for Python, JavaScript, Ruby and Groovy functions. These functions are new, still evolving, additions to the system. Currently only the basic interface is supported; load/store functions are not supported. Furthermore, JavaScript, Ruby and Groovy are provided as experimental features because they did not go through the same amount of testing as Java or Python. At runtime note that Pig will automatically detect the usage of a scripting UDF in the Pig script and will automatically ship the corresponding scripting jar, either Jython, Rhino, JRuby or Groovy-all, to the backend.

Writing Java UDF for Swap

Pig’s Java UDF extends functionalities of EvalFunc. This abstract class have an abstract method “exec” which user needs to implement in concrete class with appropriate functionality.

package com.pig.bigdatariding; 
import java.io.IOException;
import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;
import org.apache.pig.data.TupleFactory;
import org.apache.pig.impl.logicalLayer.schema.Schema;
import org.apache.pig.data.DataType;

public class Swap extends EvalFunc<Tuple> {
public Tuple exec(Tuple input) throws IOException {
        if (input == null || input.size() < 2)
             return null;
         try{
             Tuple output = TupleFactory.getInstance().newTuple(2);
             output.set(0, input.get(1));
             output.set(1, input.get(0));
             return output;
         } catch(Exception e){
             System.err.println("Failed to process input; error - " + e.getMessage());
             return null;
         }
    } 

}



Register jar file:

grunt>register swapudf.jar;

Ex: emp.csv
      1,bala
      2,narayana
      3,reddy

grunt>A= load 'emp.csv' USING PigStorage(',') as (id: int, name: chararray);

grunt>B= foreach A generate flatten(com.pig.bigdatariding.Swap(name,id));




1 comment: