It's not 40% faster, it's 40% more IPC. That's throughput, not speed. It will only be 40% faster if you actually find software that is able to use all 10 instructions the Zen core has available per cycle.
Which will be pretty difficult, since there aren't many CPUs out there (if there are any) with 10 instructions per core. I guess that's why they went with Blender instead of some "traditional" benchmark.
First off, Zen can only sustain decode of four ISA ops per cycle, not ten. Obviously the uop cache will help there, some ops will crack into multiple uops, etc., but even measured in uops, it's a maximum of six per cycle (which is still not ten). The 40% here is realistically calculated from generic single-threaded workloads, likely integer-heavy ones. The purpose of having so many functional units is to allow a favorable instruction mix within a given machine width, not to somehow force you to use all ten per cycle to hit maximum performance.
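To make the width-versus-pipes point concrete, here is a rough toy model, not a description of the real scheduler. It assumes the ten pipes split into 4 ALU, 2 AGU and 4 FP (my assumption, the numbers above only give the total), and the names `issued_per_cycle`, `PIPES` and `WIDTH` are made up for illustration.

```python
def issued_per_cycle(demand, pipes, width):
    """Ops issued in one cycle: each class is capped by its pipe count,
    and the total is capped by the machine width."""
    per_class = sum(min(demand.get(kind, 0), count) for kind, count in pipes.items())
    return min(per_class, width)

# Assumed split of the ten execution pipes (hypothetical, for illustration).
PIPES = {"alu": 4, "agu": 2, "fp": 4}
# Maximum uops per cycle, as stated above.
WIDTH = 6

# An integer-only stream is limited by the four ALU pipes.
int_only = issued_per_cycle({"alu": 10}, PIPES, WIDTH)

# A mixed stream could occupy eight pipes, but the machine width caps it at six.
mixed = issued_per_cycle({"alu": 3, "agu": 2, "fp": 3}, PIPES, WIDTH)

print(int_only, mixed)
```

In this model no instruction mix ever gets past six per cycle. The extra pipes only mean that more kinds of mixes can reach six.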
Second, on the assessment that 10 execution pipes (which seems to be what you're referring to) is unusually many:
Power8 is 8-wide at the frontend, 10-issue, and has 16 execution pipes. P9-SMT8 is wider.
Multiflow Trace went up to 28-wide, front to back, in a VLIW uarch.
Intel Poulson is 12-issue in back, backing up a 6-wide frontend.