I don't know of any cross-library comparison benchmark. And when a library does ship an internal benchmark, it usually uses a different number of data/parity shards and a different shard size than other libraries' benchmarks, which makes comparison difficult.
I wrote an MIT-licensed ReedSolomon module for Node [1], with a multi-core native C++ implementation. The internal benchmark (node benchmark.js) reports latency and throughput for data=10, parity=4 across a variety of shard sizes, and the number of data/parity shards can be adjusted for comparison.
It doesn't yet use optimized assembler or SIMD like Klaus Post's implementation, but it is compatible with it and likewise based on Backblaze's implementation, and it also includes a pure JavaScript implementation for browser use.
In my library, the SSSE3 code also has AVX and AVX2 equivalents, providing higher throughput. There are also implementations with specifically optimized instructions for ARM NEON, ARM64, PPC64/AltiVec and Power8. All of this is written in C with intrinsics rather than assembler, so it benefits from the C compiler's instruction scheduling, loop unrolling, etc. See https://github.com/NicolasT/reedsolomon/blob/master/cbits/re...
The one you linked to is excellent. It has hand-crafted SSSE3 assembler, which makes it extremely fast. It's a bit difficult to compare the benchmarks from the one in the original posting to Klaus's, though, as they aren't run with the same (N,K) parameters.
I believe it's compatible with the Java implementation used by Backblaze and others.