WBIRK: I have used many RPi variants; my comments were intended as general remarks.
I have used Zeros, and found that (having only 1 core) they quickly get overloaded when multi-tasking, so a dedicated machine is really needed for realtime tasks (obvious really - LOL).
The RPi Bs are fine, as are the 3s, although they do have limits that can all be reached. Pushing the 3 to its limits, I found that routines written in Python (interpreted, not compiled) are quite slow - especially with any computations inside the code loop.
Running 128 microsteps at max slew rates was an issue if I was also tracking the stepper counts within the software. The fastest approach was to use the PWM output from a Pi (with the frequency set from Python) to drive the step input on the motor driver, then feed that PWM signal back into an input pin and count the pulses with an interrupt routine. This gave the best performance (and could reach very high motor speeds), but I found that the interrupt routine often dropped pulses, so count accuracy suffered.
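
To put rough numbers on why counting steps in software gets tight at 128 microsteps, here is a quick back-of-envelope sketch in Python. The 200 full steps per revolution (a typical 1.8 degree motor) and the 60 RPM speed are assumptions for illustration only, not measured values, and the function names are made up for this example.

```python
def step_pulse_rate_hz(rpm, full_steps_per_rev=200, microsteps=128):
    """Step pulses per second needed for a given motor speed.

    full_steps_per_rev=200 is an assumed 1.8 degree stepper.
    """
    return rpm / 60.0 * full_steps_per_rev * microsteps


def time_per_pulse_us(rate_hz):
    """Time available to handle each pulse, in microseconds."""
    return 1_000_000.0 / rate_hz


rate = step_pulse_rate_hz(60)      # assumed speed: 1 revolution per second
print(rate)                        # 25600.0
print(time_per_pulse_us(rate))     # 39.0625
```

With these assumed values, each pulse leaves about 39 microseconds for the counting routine, which would be consistent with an interpreted interrupt callback dropping pulses at high speeds.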

I haven't yet attempted compiled C code, but I am guessing that these limits would not pose problems, as it's a lot faster at execution. I also never exploited the multiple cores from Python, which I would expect to help by dedicating tasks to individual processors. I'm also new to Python, and very rusty with C, so it's all been a re-learning curve for me!
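
For anyone wanting to try the multi-core idea, below is a rough, untested-on-hardware sketch of one way it might be done, assuming Linux (e.g. Raspberry Pi OS) and a 4-core board such as the Pi 3. It uses a separate process rather than a thread, since threads in standard Python share one interpreter lock. `pin_to_core` and `tracking_task` are hypothetical names, and the task body is only a placeholder for a real step-tracking loop.

```python
import os
from multiprocessing import Process, Queue


def pin_to_core(core):
    """Restrict the calling process to one core (Linux only).

    Returns the resulting set of allowed cores, or None if the call
    is unavailable or the requested core is not permitted.
    """
    if not hasattr(os, "sched_setaffinity"):
        return None
    if core not in os.sched_getaffinity(0):
        return None
    os.sched_setaffinity(0, {core})
    return os.sched_getaffinity(0)


def tracking_task(core, results):
    """Hypothetical stand-in for a step-tracking loop."""
    # Pin this worker process first, then report what it got.
    results.put((core, pin_to_core(core)))


if __name__ == "__main__":
    results = Queue()
    # Core 3 is an assumption: the last core on a 4-core Pi.
    worker = Process(target=tracking_task, args=(3, results))
    worker.start()
    print(results.get())
    worker.join()
```

Pinning only controls where the process may run; it does not stop the OS from scheduling other work on the same core, so this alone would not guarantee realtime behaviour.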