Problem 3:

I compared the following configurations:

One node and one core, one node and four cores, one node and eight cores, two nodes and four cores, and four nodes and two cores.

Of these, "one node and eight cores" had the best performance, with a run time of 0.027755 s.

Problem 4:

Max array size handled: 200000000

Handcrafted implementation (one node and eight cores):

Run Time: 0.108948 seconds

Reduction Results: 204296

MPI Reduce (one node and eight cores):

Run Time: 0.408144 seconds

Reduction Results: 204294

The two results differ slightly (204296 vs. 204294), but I did not have time to check my code; it is 11:58 now : )
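
One possible cause of a small mismatch like the one between the two reduction results, not verified against the actual code: if the array elements are single-precision floats, the sum depends on the order in which the additions are performed, and a handcrafted reduction and MPI Reduce may combine partial sums in different orders. The sketch below is plain Python rather than MPI code, and the helper names (f32, sequential_sum, chunked_sum) are hypothetical. It only illustrates the effect on a small made-up input; it does not reproduce the numbers above.

```python
import struct


def f32(x):
    """Round a Python float (double precision) to the nearest single-precision value."""
    return struct.unpack("f", struct.pack("f", x))[0]


def sequential_sum(values):
    """Add left to right, rounding to single precision after every addition."""
    total = 0.0
    for v in values:
        total = f32(total + v)
    return total


def chunked_sum(values, ranks):
    """Assumed reduction scheme: each of `ranks` workers sums one contiguous
    chunk, then the partial sums are added together in rank order."""
    chunk = (len(values) + ranks - 1) // ranks  # ceiling division
    partials = [
        sequential_sum(values[i:i + chunk])
        for i in range(0, len(values), chunk)
    ]
    return sequential_sum(partials)


if __name__ == "__main__":
    # 2**24 is the point where single precision can no longer represent
    # every integer, so adding 1.0 to it is lost to rounding.
    values = [16777216.0, 1.0, 1.0, 1.0, 1.0]
    print("sequential:", sequential_sum(values))
    print("chunked   :", chunked_sum(values, ranks=2))
```

For this input the sequential sum is 16777216.0 and the two-chunk sum is 16777218.0, while the exact answer is 16777220. If the array holds integers instead, the order of additions cannot change the result, and a difference would point to a bug such as a missed or double-counted element at a chunk boundary.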