For the sake of this explanation, let's use only 8 bits for addresses (a 256-byte address space). We'll cache 32 16-bit words (a 64-byte cache).
Let's start with the simplest cache: direct mapped. Given an address, we ignore the last bit (we always work with 16 bits / 2 bytes at a time) and take the remaining word address mod 32 to decide which of the 32 cache lines it belongs in, so address 01001101 would go in cache line 00110 = 6. Checking whether something is in the cache requires comparing only one tag against the address we want, which takes very little hardware. Implementing this cache is simple: whenever two different memory locations share the same 5 index bits, only one can be stored at a time, so when we want to cache a new address we have no choice but to overwrite the single entry where it can go. What we sacrifice: if we access location 01001101, and then 10001101, we have to drop 01001101 from the cache and replace it with 10001101. If our program tends to access locations spaced like that, cache performance will be awful.
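Here is a minimal sketch of that direct-mapped cache in C, using the same parameters (8-bit addresses, 16-bit words, 32 lines); the names and struct layout are illustrative, not any particular hardware's design:

```c
#include <stdint.h>
#include <stdbool.h>
#include <stdio.h>

#define NUM_LINES 32

typedef struct {
    bool     valid;
    uint8_t  tag;    /* upper 2 bits of the 8-bit address */
    uint16_t data;   /* the cached 16-bit word */
} CacheLine;

static CacheLine cache[NUM_LINES];

/* Split an 8-bit address: bit 0 is the byte-within-word offset (ignored),
 * bits 5..1 pick the line, bits 7..6 are the tag. */
static unsigned index_of(uint8_t addr) { return (addr >> 1) & 0x1F; }
static unsigned tag_of(uint8_t addr)   { return addr >> 6; }

bool lookup(uint8_t addr, uint16_t *out)
{
    CacheLine *line = &cache[index_of(addr)];
    if (line->valid && line->tag == tag_of(addr)) {
        *out = line->data;
        return true;            /* hit: one tag compare was enough */
    }
    return false;               /* miss: caller must fetch and call fill() */
}

void fill(uint8_t addr, uint16_t word)
{
    CacheLine *line = &cache[index_of(addr)];
    line->valid = true;         /* whatever was here before is simply evicted */
    line->tag   = tag_of(addr);
    line->data  = word;
}

int main(void)
{
    uint16_t w;
    fill(0x4D, 0xAAAA);                            /* 01001101 -> line 6 */
    printf("0x4D hit: %d\n", lookup(0x4D, &w));    /* 1 */
    fill(0x8D, 0xBBBB);                            /* 10001101 -> same line 6, evicts 0x4D */
    printf("0x4D hit: %d\n", lookup(0x4D, &w));    /* 0: conflict miss */
    return 0;
}
```

The `main` at the end reproduces the conflict from the paragraph above: 0x4D (01001101) and 0x8D (10001101) map to the same line, so caching one evicts the other.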
The other extreme is a fully associative cache: any address can go in any cache line. To check whether something is in the cache, we have to compare ALL 32 tags, which takes more time and more chip area. Also, when we want to cache a new memory location, we have to decide which existing entry to drop, since we are no longer restricted to a single possible location. Doing this well means tracking when each entry was last accessed and finding the least-recently-used (LRU) one, which also eats a lot of hardware. In exchange, we can cache ANY 32 words from memory, whatever the access pattern, so the hit rate will be very high. Note that exact LRU is too expensive in practice, so approximations are usually used.
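A rough sketch of the fully associative case, same parameters as before. The counter-based exact LRU here is trivial in software but, as noted above, is what gets expensive in hardware; all names are illustrative:

```c
#include <stdint.h>
#include <stdbool.h>

#define NUM_LINES 32

typedef struct {
    bool     valid;
    uint8_t  tag;        /* full 7-bit word address: any line can hold any word */
    uint16_t data;
    uint32_t last_used;  /* timestamp for LRU */
} AssocLine;

static AssocLine cache[NUM_LINES];
static uint32_t  clock_ticks;

static uint8_t tag_of(uint8_t addr) { return addr >> 1; }

bool lookup(uint8_t addr, uint16_t *out)
{
    uint8_t tag = tag_of(addr);
    for (int i = 0; i < NUM_LINES; i++) {          /* 32 tag compares per access */
        if (cache[i].valid && cache[i].tag == tag) {
            cache[i].last_used = ++clock_ticks;
            *out = cache[i].data;
            return true;
        }
    }
    return false;
}

void fill(uint8_t addr, uint16_t word)
{
    int victim = 0;
    for (int i = 1; i < NUM_LINES; i++)            /* invalid lines have last_used == 0, so they win */
        if (cache[i].last_used < cache[victim].last_used)
            victim = i;
    cache[victim] = (AssocLine){ true, tag_of(addr), word, ++clock_ticks };
}
```

Note that a hit now costs a scan over all 32 entries, and every miss costs another scan to find the LRU victim; in hardware both are done in parallel, which is exactly where the extra area goes.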
Set associativity is the happy middle. In an n-way set associative cache, a given memory location can go in one of n places. The LRU hardware is simpler because you only track a few entries per set. You can't cache ANY 32 words, but you can cache many of the possible subsets, and each lookup needs only n tag compares, so the hardware there isn't too bad either. Higher associativity generally gives better hit rates, but costs more.
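For concreteness, here is a sketch of one possible 4-way set associative layout of the same 32 lines: 8 sets of 4 ways, with bits 3..1 of the address choosing the set and bits 7..4 as the tag. The split is my own illustrative choice of parameters:

```c
#include <stdint.h>
#include <stdbool.h>

#define NUM_WAYS 4
#define NUM_SETS 8          /* 32 lines / 4 ways */

typedef struct {
    bool     valid;
    uint8_t  tag;
    uint16_t data;
    uint32_t last_used;
} Way;

static Way      sets[NUM_SETS][NUM_WAYS];
static uint32_t clock_ticks;

static unsigned set_of(uint8_t addr) { return (addr >> 1) & 0x7; }  /* 3 index bits */
static unsigned tag_of(uint8_t addr) { return addr >> 4; }          /* 4 tag bits */

bool lookup(uint8_t addr, uint16_t *out)
{
    Way *set = sets[set_of(addr)];
    for (int w = 0; w < NUM_WAYS; w++) {           /* only 4 tag compares */
        if (set[w].valid && set[w].tag == tag_of(addr)) {
            set[w].last_used = ++clock_ticks;
            *out = set[w].data;
            return true;
        }
    }
    return false;
}

void fill(uint8_t addr, uint16_t word)
{
    Way *set = sets[set_of(addr)];
    int victim = 0;
    for (int w = 1; w < NUM_WAYS; w++)             /* LRU over just 4 ways */
        if (set[w].last_used < set[victim].last_used)
            victim = w;
    set[victim] = (Way){ true, tag_of(addr), word, ++clock_ticks };
}
```

Compared with the fully associative sketch, both the tag comparison and the LRU bookkeeping shrink from 32 entries to 4, which is where the hardware savings come from; compared with direct mapped, the two conflicting addresses 0x4D and 0x8D can now live in different ways of the same set.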