Feature selection is a process by which you drop features for various reasons. The most common reason a feature gets dropped is that it is closely related to another feature, so you only need one of them. This makes algorithms train faster, reduces noise, and makes it easier to diagnose what the algorithm did.
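For instance, a quick correlation-threshold dropper, assuming you're working with pandas/NumPy (the 0.9 threshold and the column names are just made up for illustration):

```python
import numpy as np
import pandas as pd

def drop_correlated(df: pd.DataFrame, threshold: float = 0.9) -> pd.DataFrame:
    """Drop one feature from each pair whose absolute correlation exceeds `threshold`."""
    corr = df.corr().abs()
    # Look only at the upper triangle so each pair is considered once.
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
    return df.drop(columns=to_drop)

# Example: 'a' and 'b' are near-duplicates, so one of them gets dropped.
rng = np.random.default_rng(0)
x = rng.normal(size=200)
df = pd.DataFrame({
    "a": x,
    "b": x + rng.normal(scale=0.01, size=200),
    "c": rng.normal(size=200),
})
print(drop_correlated(df).columns.tolist())  # e.g. ['a', 'c']
```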
PCA takes some N features and compresses them into a smaller set of N-n features. This process ALSO eliminates collinearity completely, since the resulting compressed features are uncorrelated with each other. However, calling PCA a feature selection algorithm is a bit of a stretch, because you haven't really selected any of your features; you've transformed all of them into something else.
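A minimal sketch of both points with scikit-learn (synthetic data, with some collinearity injected on purpose):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5))
X[:, 1] = X[:, 0] + rng.normal(scale=0.1, size=500)  # inject collinearity

# Compress 5 features down to 2 components.
Z = PCA(n_components=2).fit_transform(X)

# The resulting components are (numerically) uncorrelated: off-diagonals ~ 0.
print(np.corrcoef(Z, rowvar=False))
```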
It doesn't seem like such a stretch to conceptualize it this way: if PCA assigns a tiny weight to a variable (assuming all variables have been standardized to mean 0, std dev 1), it is effectively saying that variable doesn't contribute much to the overall variance, and is therefore "deselecting" it by merging it into the other variables it's correlated with and downweighting it relative to them.
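A toy illustration of that reading (synthetic data, standardized first; the setup is contrived so the last variable shares nothing with the rest):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
n = 300
shared = rng.normal(size=n)
X = np.column_stack([
    shared + 0.1 * rng.normal(size=n),
    shared + 0.1 * rng.normal(size=n),
    shared + 0.1 * rng.normal(size=n),
    rng.normal(size=n),            # independent of the others
])
Xs = StandardScaler().fit_transform(X)

pca = PCA(n_components=1).fit(Xs)
# The first three variables get large loadings; the independent one is near zero,
# i.e. it has effectively been "deselected" from the retained component.
print(np.round(pca.components_, 2))
```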
Most successful techniques I see in deep nets take the incoming features and mux them into intermediate features, which are the ones actually being learned. Feature selection and PCA are, in a sense, just built into the network.
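Something like this minimal PyTorch sketch (layer sizes are arbitrary): the first linear layer mixes the raw inputs into learned intermediate features, and the prediction is made from those, so any "selection" or PCA-like recombination is implicit in the learned weights.

```python
import torch
import torch.nn as nn

net = nn.Sequential(
    nn.Linear(20, 8),   # 20 raw features mixed into 8 intermediate features
    nn.ReLU(),
    nn.Linear(8, 1),    # prediction is made from the intermediate features
)

x = torch.randn(32, 20)
print(net(x).shape)     # torch.Size([32, 1])
```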
Short answer: feature selection is one particular method of dimensionality reduction.
And when most people say feature selection, they mean deliberate, domain-driven selection of features.
That is to say, you can also create an entirely new synthesized feature from, say, 5 raw features and use it to replace those 5 features (this is ... PCA-esque.)
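As a toy sketch of that (the column names are hypothetical, and a 1-component PCA stands in as the synthesizer):

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
raw_cols = ["r1", "r2", "r3", "r4", "r5"]            # hypothetical raw features
df = pd.DataFrame(rng.normal(size=(100, 5)), columns=raw_cols)

# Collapse the 5 raw columns into one synthesized feature...
df["synth"] = PCA(n_components=1).fit_transform(df[raw_cols])[:, 0]
# ...and use it in their place.
df = df.drop(columns=raw_cols)
print(df.head())
```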
Or you could use random forest techniques, for example, which randomly restrict the set of features each individual decision tree (or each split) in the forest is allowed to consider.
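If you're using scikit-learn, that knob is `max_features`; a minimal sketch on synthetic data:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

clf = RandomForestClassifier(
    n_estimators=100,
    max_features="sqrt",   # each split considers only sqrt(20) ≈ 4 random features
    random_state=0,
).fit(X, y)

print(clf.score(X, y))
```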
I agree many people here are making mountains out of molehills of terminology.