It's all data, whether that data is text, an image, audio, or a binary containing computer code.
Raw audio data is just a series of amplitudes. It has a bit depth (how many bits are in each amplitude sample) and a sample rate (which sets the time step going from one amplitude to the next). Using those, you can convert it to an analog signal that can be played on a speaker. And if you use the same values to convert that signal back to digital, you end up with the same input signal, though with some random noise added. If you get unlucky and your sampling phase lines up with the player's transitions between samples, you won't be able to extract the original signal, though it might sound similar. The multiple recordings help mitigate these issues.
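To put rough numbers on that, here's a minimal sketch of what bit depth and sample rate (the "frequency" between samples) work out to. The function names are made up for illustration, and the 16-bit, 44100 Hz mono figures are just common values I'm assuming, not anything specific to the recordings being discussed.

```python
def amplitude_levels(bit_depth: int) -> int:
    """Number of distinct amplitude values one sample can take."""
    return 2 ** bit_depth


def sample_period_seconds(sample_rate: int) -> float:
    """Time step between one amplitude sample and the next."""
    return 1.0 / sample_rate


def bytes_per_second(bit_depth: int, sample_rate: int, channels: int = 1) -> float:
    """How many bytes of raw data one second of audio carries."""
    return bit_depth / 8 * sample_rate * channels


def playback_seconds(file_size_bytes: int, bit_depth: int, sample_rate: int,
                     channels: int = 1) -> float:
    """How long a file would take to 'play' if treated as raw audio."""
    return file_size_bytes / bytes_per_second(bit_depth, sample_rate, channels)


if __name__ == "__main__":
    # Assumed example values: 16-bit mono at 44100 Hz, 1 MiB file.
    print(amplitude_levels(16))
    print(sample_period_seconds(44100))
    print(bytes_per_second(16, 44100))
    print(playback_seconds(1024 * 1024, 16, 44100))
```

So at those assumed settings a 1 MiB file is only around twelve seconds of "audio", before you account for any redundancy you'd want because of the noise.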
Given that data format, any arbitrary file can be treated as raw sound that can be transmitted as analog audio.
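If you want to see it in practice, here's a rough sketch using Python's standard `wave` module that wraps arbitrary bytes as 8-bit mono samples and reads them back. The helper names and the 44100 Hz default are my own assumptions, and this only shows the digital side: it round-trips perfectly because nothing ever goes through a speaker or microphone, which is exactly the noisy part.

```python
import io
import wave


def bytes_to_wav(payload: bytes, sample_rate: int = 44100) -> bytes:
    """Wrap arbitrary bytes in a WAV container, one byte per 8-bit mono sample."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as w:
        w.setnchannels(1)        # mono
        w.setsampwidth(1)        # 1 byte per sample = 8-bit depth
        w.setframerate(sample_rate)
        w.writeframes(payload)   # the file's bytes become the amplitudes
    return buf.getvalue()


def wav_to_bytes(wav_data: bytes) -> bytes:
    """Recover the original bytes from a WAV produced by bytes_to_wav."""
    with wave.open(io.BytesIO(wav_data), "rb") as w:
        return w.readframes(w.getnframes())


if __name__ == "__main__":
    original = bytes(range(256))             # stand-in for any file's contents
    as_audio = bytes_to_wav(original)
    print(wav_to_bytes(as_audio) == original)
```

Any audio player will happily play the result. It will just sound like noise unless the file happened to be audio to begin with.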
The only real difference between this and the other methods we use to transfer files is that this involves a less reliable conversion from digital to analog and back to digital, because audio playback wasn't designed for that the way USB, COM, Wi-Fi, etc. connections are.
Microsoft is looking for it, and it wouldn't surprise me if they are paying a pretty penny for it to try to stop the Linux gaming momentum the Deck is driving.
It's entirely irrelevant to me. I don't care what the specs are if it's just running Windows.