It might also introduce spurious data dependencies
Those need to be in the smallest cache or a register anyway. If they are in registers, a modern, instruction-reordering CPU will deal with that fine.
to store a bit you now need to also read the old value of the byte that it's in.
Many architectures read the cache line on write-miss.
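To make the two store paths concrete, here is a minimal C sketch. The helper names `set_bit`, `get_bit` and `set_byte_bool` are made up for illustration. The packed version has to load the containing byte, merge the bit and store it back; the byte-sized version is a plain store. On a write-allocate cache, though, either one will typically pull in the whole cache line on a miss.

```c
#include <stddef.h>
#include <stdint.h>

/* Packed: storing one bit is a read-modify-write of the containing byte. */
void set_bit(uint8_t *bits, size_t i, int value)
{
    uint8_t mask = (uint8_t)(1u << (i % 8));
    uint8_t old = bits[i / 8];                 /* load the old byte */
    bits[i / 8] = value ? (uint8_t)(old | mask)
                        : (uint8_t)(old & ~mask);
}

/* Packed: reading one bit is a load, a shift and a mask. */
int get_bit(const uint8_t *bits, size_t i)
{
    return (bits[i / 8] >> (i % 8)) & 1;
}

/* Byte-sized: a plain store, no load in the instruction stream. */
void set_byte_bool(uint8_t *bools, size_t i, int value)
{
    bools[i] = value ? 1 : 0;
}
```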
The only cases I can see where byte-sized bools seem better are either using so few that they all fit in one cache line anyway (in which case the performance will be great either way), or repeatedly accessing a bitvector from multiple threads, in which case you should make sure that's actually what you want to be doing.
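As a rough back-of-the-envelope check, assuming 64-byte cache lines (common, but an assumption here) and arrays that start on a line boundary, the number of lines each layout touches can be sketched like this. The function names are hypothetical.

```c
#include <stddef.h>

#define CACHE_LINE_BYTES 64  /* assumed line size; differs on some CPUs */

/* Cache lines covered by n bools stored one per byte. */
size_t lines_for_byte_bools(size_t n)
{
    return (n + CACHE_LINE_BYTES - 1) / CACHE_LINE_BYTES;
}

/* Cache lines covered by n bools packed eight per byte. */
size_t lines_for_bits(size_t n)
{
    size_t bytes = (n + 7) / 8;  /* round up to whole bytes */
    return (bytes + CACHE_LINE_BYTES - 1) / CACHE_LINE_BYTES;
}
```

With 64 bools both layouts fit in a single line, so it makes no difference. With 512 bools the byte-sized layout spans 8 lines while the packed one still fits in 1.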
That requires some form of self describing format and will probably look like a sparse matrix in the end.