
RDD (1)

What is RDD?

RDD stands for Resilient Distributed Dataset. An RDD is immutable (read-only): it cannot be changed in place, only transformed into a new RDD. Each RDD records how it was derived from other RDDs, which is the information needed to rebuild its data.

An RDD is made up of many partitions; each partition is a fragment of the dataset. Each partition is processed by one task.

An RDD is a fault-tolerant collection of elements that can be operated on in parallel. There are two ways to create an RDD: parallelize an existing collection, or reference a dataset in an external storage system. When you parallelize a collection, Spark cuts the dataset into partitions. By default, Spark tries to set the number of partitions automatically based on your cluster. You can also set the number of partitions yourself by passing it as the second argument, as follows.

sc.parallelize(data, 10)
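
To make the call above concrete, here is a minimal sketch that checks how many partitions the resulting RDD has. It assumes a local master (`local[2]`) so it can run without a cluster; the object name `ParallelizeSketch` and the helper `partitionCount` are made up for this illustration.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical helper object, used only for this illustration.
object ParallelizeSketch {
  // Parallelizes a small collection into `numSlices` partitions on a local
  // SparkContext and returns the number of partitions of the resulting RDD.
  def partitionCount(numSlices: Int): Int = {
    // Assumption: a local master with two threads instead of a real cluster.
    val conf = new SparkConf().setAppName("parallelize-sketch").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      val data = 1 to 100
      val rdd = sc.parallelize(data, numSlices)
      rdd.getNumPartitions
    } finally {
      sc.stop()
    }
  }

  def main(args: Array[String]): Unit =
    println(partitionCount(10)) // prints 10
}
```

Since one partition produces one task, the number passed here also decides how many tasks a stage over this RDD will run.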


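To be resilient, an RDD has to record how its data was produced, so that lost data can be rebuilt. Recording every individual update would be too much information and far too expensive. So an RDD records only the coarse-grained transformations that were applied to it: its lineage.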
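
One way to look at the lineage is `toDebugString`, which returns a text description of an RDD and the RDDs it depends on. The sketch below is only an illustration: it assumes a local master, and the object name `LineageSketch` and the sample data are invented. The exact text of the output differs between Spark versions, so none is shown.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical helper object, used only for this illustration.
object LineageSketch {
  // Builds a short chain of transformations and returns its lineage as text.
  def lineage(): String = {
    // Assumption: a local master with two threads instead of a real cluster.
    val conf = new SparkConf().setAppName("lineage-sketch").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      val lines = sc.parallelize(Seq("a b", "c d e"), 2)
      // Three coarse-grained transformations; each one adds a step to the lineage.
      val words = lines.flatMap(_.split(" ")).map(_.toUpperCase).filter(_.nonEmpty)
      words.toDebugString
    } finally {
      sc.stop()
    }
  }

  def main(args: Array[String]): Unit =
    println(lineage())
}
```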

Spark does not actually evaluate the transformations on your RDD until you run an action on it. So to print an RDD you need an action: for example, rdd.foreach(println) prints every element.

If you want to print the whole RDD on one console, collect it first, as in rdd.collect().foreach(println).

But that is not a good idea when you have millions of lines, so you can print only the first n elements instead, as in rdd.take(n).foreach(println).

If you are running this on a cluster, println inside foreach runs on the executors, so the output will not be printed back to your own session. You need to bring the RDD data to your session first: force it into a local array (with collect() or take(n)) and then print it out.
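
The printing options described above can be put together in one small program. This is a sketch under assumptions: it runs with a local master, so every println ends up on the same console, and the object name `PrintSketch`, the helper `firstLines`, and the sample lines are invented for the example.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical helper object, used only for this illustration.
object PrintSketch {
  // Prints an RDD in the ways discussed above and returns its first n elements.
  def firstLines(n: Int): Array[String] = {
    // Assumption: a local master with two threads instead of a real cluster.
    val conf = new SparkConf().setAppName("print-sketch").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      val rdd = sc.parallelize(Seq("line 1", "line 2", "line 3", "line 4"), 2)

      // Only a transformation: nothing has been computed at this point.
      val upper = rdd.map(_.toUpperCase)

      // foreach is an action. println runs wherever each partition is processed,
      // so on a real cluster this output goes to the executors, not to the driver.
      upper.foreach(println)

      // collect() brings every element to the driver as a local array.
      upper.collect().foreach(println)

      // take(n) brings only the first n elements to the driver.
      upper.take(n)
    } finally {
      sc.stop()
    }
  }

  def main(args: Array[String]): Unit =
    firstLines(2).foreach(println)
}
```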



A transformation returns a new RDD.

map[U](f: (T) => U): RDD[U]
filter(f: (T) => Boolean): RDD[T]
flatMap[U](f: (T) => TraversableOnce[U]): RDD[U]
sample(withReplacement: Boolean, fraction: Double, seed: Long): RDD[T]
union(other: RDD[T]): RDD[T]
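
Here is a short sketch that calls each of the transformations listed above on a tiny RDD. It assumes a local master, and the object name `TransformationSketch` and the keys of the returned map are invented for the example. The results are collected only so they can be inspected; the sampled elements depend on the seed and the Spark version, so none are listed.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical helper object, used only for this illustration.
object TransformationSketch {
  // Applies each transformation once and returns the collected results by name.
  def run(): Map[String, Seq[Int]] = {
    // Assumption: a local master with two threads instead of a real cluster.
    val conf = new SparkConf().setAppName("transformation-sketch").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      val nums = sc.parallelize(1 to 5, 2)

      val doubled = nums.map(_ * 2)                    // one output per input
      val evens = nums.filter(_ % 2 == 0)              // keeps matching elements
      val expanded = nums.flatMap(n => Seq(n, n * 10)) // several outputs per input
      val sampled = nums.sample(withReplacement = false, fraction = 0.5, seed = 42L)
      val combined = evens.union(doubled)              // keeps duplicates

      Map(
        "doubled" -> doubled.collect().toSeq,
        "evens" -> evens.collect().toSeq,
        "expanded" -> expanded.collect().toSeq,
        "sampled" -> sampled.collect().toSeq,
        "combined" -> combined.collect().toSeq
      )
    } finally {
      sc.stop()
    }
  }

  def main(args: Array[String]): Unit =
    run().foreach { case (name, values) => println(s"$name: ${values.mkString(", ")}") }
}
```

Each of these calls only builds a new RDD; nothing is computed until collect() runs.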


An action returns a value or a result.

collect() // when the dataset is small, collect can return all of its elements as a local array.
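
To show the difference from transformations, the sketch below runs collect() next to a few other common actions (count, first, take, reduce). It assumes a local master; the object name `ActionSketch` and the result class `Summary` are invented for the example.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical helper object, used only for this illustration.
object ActionSketch {
  // Plain values returned by the actions; no RDD is involved any more.
  case class Summary(all: Seq[Int], count: Long, first: Int, firstThree: Seq[Int], sum: Int)

  def run(): Summary = {
    // Assumption: a local master with two threads instead of a real cluster.
    val conf = new SparkConf().setAppName("action-sketch").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      val nums = sc.parallelize(1 to 10, 2)
      Summary(
        all = nums.collect().toSeq,    // every element, as a local array
        count = nums.count(),          // number of elements
        first = nums.first(),          // first element
        firstThree = nums.take(3).toSeq, // first three elements
        sum = nums.reduce(_ + _)       // combines all elements into one value
      )
    } finally {
      sc.stop()
    }
  }

  def main(args: Array[String]): Unit = {
    val s = run()
    println(s.count) // prints 10
    println(s.sum)   // prints 55
  }
}
```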