DataMantra Blog

Feb 7, 2015
Pipe in Spark

Pipe operator in Spark, allows developer to process RDD data using external applications. Sometimes in data analysis, we need to use an external library which may not be written using Java/Scala. Ex: Fortran math libraries. In that case, spark’s pipe operator allows us to send the RDD data to the external application.
Nov 18, 2014
Kryo disk serialization in Spark

In apache spark, it’s advised to use the kryo serialization over java serialization for big data applications. Kryo has less memory footprint compared to java serialization which becomes very important when you are shuffling and caching large amount of data.
Sep 29, 2014
Evaluating Spark RDD's for side effects

Accumulators in Spark are highly useful to do side effect based operations. For example, the following code calculates both sum and sum of squares as a side effect.

Archives