Basic
Create an RDD and split it into 5 partitions:
sc.parallelize([('a', 3), ('b', 4), ('c', 9)], 5)
repartition() is a costly operation because it performs a full shuffle of the data. Check the current number of partitions with rdd.getNumPartitions() before changing it.
coalesce() is an optimized version of repartition() for reducing the number of partitions.
# if you are increasing the number of partitions, use repartition() (performs a full shuffle)
# if you are decreasing the number of partitions, use coalesce() (minimizes shuffling)
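Since Spark is not needed to see the difference, here is a plain-Python sketch of the data-movement pattern (the function names and the round-robin merge rule are illustrative; Spark's real coalesce groups partitions by locality):

```python
# Sketch of repartition() vs coalesce() on a list-of-lists "RDD".

def repartition(partitions, n):
    """Full shuffle: every record is rehashed into one of n new partitions."""
    new = [[] for _ in range(n)]
    for part in partitions:
        for key, value in part:
            new[hash(key) % n].append((key, value))  # every record moves
    return new

def coalesce(partitions, n):
    """No full shuffle: whole existing partitions are merged into n groups."""
    assert n <= len(partitions), "coalesce only decreases the partition count"
    new = [[] for _ in range(n)]
    for i, part in enumerate(partitions):
        new[i % n].extend(part)  # partitions move intact; records stay grouped
    return new

parts = [[('a', 3)], [('b', 4)], [('c', 9)], [('d', 1)], [('e', 7)]]
merged = coalesce(parts, 2)
print(len(merged))  # 2
```

The key point the sketch shows: repartition touches every record individually, while coalesce only concatenates existing partitions, which is why it is the cheaper choice when decreasing the count.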
Advanced
Partitioning is important for reducing the amount of data copied over the network.
It is useful only when an RDD is reused multiple times, for example in key-oriented operations such as joins.
For example, when joining a big table A with a small table B, by default both tables are shuffled across the nodes. If the big table A is pre-partitioned instead, only the small table B is shuffled across the nodes, which reduces network traffic and runs much faster.
Types of partitioning
- HashPartitioner – for a pair RDD, each key is hashed, and keys with the same hash-partition value are stored in the same partition
- RangePartitioner – for a pair RDD, keys are divided into sorted ranges, so keys that fall in the same range are stored in the same partition
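The two strategies can be sketched in plain Python (the real partitioners live in Spark's core; the function names and range bounds here are illustrative):

```python
# Sketch of the two partitioner strategies.

def hash_partition(key, num_partitions):
    """HashPartitioner: partition = hash(key) mod numPartitions."""
    return hash(key) % num_partitions

def range_partition(key, bounds):
    """RangePartitioner: keys are bucketed by sorted range boundaries,
    so nearby keys land in the same partition (useful for sorted output)."""
    for i, upper in enumerate(bounds):
        if key <= upper:
            return i
    return len(bounds)  # last partition holds keys above all bounds

# With bounds ['f', 'p']: keys up to 'f' go to 0, up to 'p' to 1, rest to 2.
print(range_partition('c', ['f', 'p']))  # 0
print(range_partition('m', ['f', 'p']))  # 1
print(range_partition('x', ['f', 'p']))  # 2
```

Hash partitioning spreads keys evenly but arbitrarily; range partitioning keeps key order, which is why Spark uses it for sortByKey().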
Spark’s Java and Python APIs benefit from partitioning in the same way as the Scala API. However, in Python, you cannot pass a HashPartitioner object to partitionBy; instead, you just pass the number of partitions desired (e.g., rdd.partitionBy(100)).
*** Operation map() can change the key, so the child RDD loses the parent's partitioner. mapValues() and flatMapValues() cannot change the key, so the parent's partitioning is preserved.
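The reasoning can be checked with the hash rule directly, since a record's partition is determined purely by its key (plain Python, illustrative only):

```python
# Why mapValues() keeps the partitioner but map() does not.

def partition_of(key, n):
    return hash(key) % n

n = 4
record = ('spark', 1)

# mapValues-style transform: the value changes, the key does not,
# so the record is still in the correct partition.
after_map_values = (record[0], record[1] + 10)
print(partition_of(record[0], n) == partition_of(after_map_values[0], n))  # True

# map-style transform: the key may change, so the old placement can no
# longer be trusted and Spark must drop the parent's partitioner.
after_map = (record[0].upper(), record[1])
# partition_of('spark', n) may differ from partition_of('SPARK', n)
```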
Operations that benefit from partitioning:
cogroup(), groupWith(), join(), leftOuterJoin(), rightOuterJoin(), groupByKey(), reduceByKey(), combineByKey(), and lookup().
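A sketch of why these operations benefit: when an RDD is already hash-partitioned, all records with a given key sit in a single partition, so a reduceByKey-style aggregation can run per-partition with no cross-partition shuffle. Plain Python, with an illustrative helper name:

```python
# If data is already hash-partitioned, each key lives entirely in one
# partition, so reduction needs no shuffle across partitions.

def reduce_by_key_local(partitions, fn):
    out = []
    for part in partitions:
        acc = {}
        for key, value in part:
            acc[key] = fn(acc[key], value) if key in acc else value
        out.append(sorted(acc.items()))
    return out

# Hash-partitioned input: each key appears in exactly one partition.
parts = [[('a', 1), ('a', 2)], [('b', 3), ('b', 4), ('c', 5)]]
print(reduce_by_key_local(parts, lambda x, y: x + y))
# [[('a', 3)], [('b', 7), ('c', 5)]]
```

Without the partitioning guarantee, partial sums for the same key would land in different partitions and a shuffle would be needed to combine them.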