My guess at point 1: GMMs have many more parameters than soft k-means, since you also need to estimate the covariance matrices. In fact, soft k-means is a special case of a GMM in which you assume a fixed covariance matrix that is tied across all components (in the usual formulation, an isotropic one with equal mixing weights). Because the likelihood function is more complex, a naively initialized GMM is more prone to getting stuck in poor local optima than soft k-means.
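To make the special-case claim concrete, here is a minimal NumPy sketch. It assumes the common formulation of soft k-means in which responsibilities are a softmax over negative squared distances scaled by a stiffness `beta`; the function names are mine and purely illustrative. With equal mixing weights and a shared covariance `sigma2 * I`, the GMM posterior should coincide with the soft k-means responsibilities when `beta = 1 / (2 * sigma2)`. The two small counting helpers illustrate the parameter gap, assuming full covariance matrices for the GMM.

```python
import numpy as np

def soft_kmeans_resp(X, centers, beta):
    # Squared Euclidean distance from every point to every center: shape (n, k).
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
    logits = -beta * d2
    logits -= logits.max(axis=1, keepdims=True)  # subtract row max for stability
    r = np.exp(logits)
    return r / r.sum(axis=1, keepdims=True)

def gmm_resp(X, means, sigma2, weights):
    # E-step of a GMM whose components all share the covariance sigma2 * I.
    d = X.shape[1]
    d2 = ((X[:, None, :] - means[None, :, :]) ** 2).sum(axis=-1)
    log_pdf = -0.5 * d2 / sigma2 - 0.5 * d * np.log(2 * np.pi * sigma2)
    log_post = np.log(weights)[None, :] + log_pdf
    log_post -= log_post.max(axis=1, keepdims=True)
    r = np.exp(log_post)
    return r / r.sum(axis=1, keepdims=True)

def n_params_full_gmm(k, d):
    # k means, k symmetric d x d covariances, and k - 1 free mixing weights.
    return k * d + k * d * (d + 1) // 2 + (k - 1)

def n_params_soft_kmeans(k, d):
    # Only the k means are estimated (beta is assumed fixed in advance).
    return k * d

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))        # synthetic data, for illustration only
centers = rng.normal(size=(4, 3))
sigma2 = 0.7

r_soft = soft_kmeans_resp(X, centers, beta=1.0 / (2.0 * sigma2))
r_gmm = gmm_resp(X, centers, sigma2, weights=np.full(4, 0.25))

print("responsibilities match:", np.allclose(r_soft, r_gmm))
print("parameters (k=4, d=3):", n_params_full_gmm(4, 3), "vs", n_params_soft_kmeans(4, 3))
```

The normalizing constant of the Gaussian and the equal weights cancel in the posterior, which is why only the scaled squared distance survives.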

Further, looking at the curves, the more clusters you use, the better the classification. I would guess that with that many clusters, modeling the shape of each cluster becomes less important, since you can always capture the more complex parts of the distribution by using lots of simple clusters.
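A rough way to check this intuition is to fit mixtures of identical spherical components to a strongly elongated Gaussian and see whether the fit improves as components are added. The sketch below is a toy under my own assumptions, not the setup behind the curves: synthetic 2-D data, fixed unit variance, equal weights, and only the means updated (that is, soft k-means with a fixed stiffness). `log_joint`, `avg_loglik`, and `fit_means` are illustrative names.

```python
import numpy as np

def log_joint(X, means, sigma2=1.0):
    # log[(1/k) * N(x | mean_j, sigma2 * I)] for every point/component pair.
    n, d = X.shape
    k = means.shape[0]
    d2 = ((X[:, None, :] - means[None, :, :]) ** 2).sum(axis=-1)
    return -0.5 * d2 / sigma2 - 0.5 * d * np.log(2 * np.pi * sigma2) - np.log(k)

def avg_loglik(X, means, sigma2=1.0):
    # Average log-likelihood per point, using a stable log-sum-exp.
    lj = log_joint(X, means, sigma2)
    m = lj.max(axis=1, keepdims=True)
    return float((m[:, 0] + np.log(np.exp(lj - m).sum(axis=1))).mean())

def fit_means(X, k, rng, n_iter=50, sigma2=1.0):
    # Initialize the means at k distinct data points, then run EM on the means only.
    means = X[rng.choice(len(X), size=k, replace=False)].copy()
    for _ in range(n_iter):
        lj = log_joint(X, means, sigma2)
        lj -= lj.max(axis=1, keepdims=True)
        r = np.exp(lj)
        r /= r.sum(axis=1, keepdims=True)            # E-step: responsibilities
        means = (r.T @ X) / r.sum(axis=0)[:, None]   # M-step: weighted means
    return means

rng = np.random.default_rng(0)
# Elongated cluster: standard deviation 5 along x, 0.5 along y.
X = rng.normal(size=(2000, 2)) * np.array([5.0, 0.5])

scores = {k: avg_loglik(X, fit_means(X, k, rng)) for k in (1, 2, 4, 8)}
for k, s in scores.items():
    print(k, round(s, 3))
```

This measures density fit on the training data rather than classification accuracy, so it only supports the general point that many simple components can stand in for one component with a more expressive shape.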

I don't have any intuition for why hard k-means would perform so much worse than soft k-means.