352

submitted 11 months ago by Interstellar_1 to c/curatedtumblr@sh.itjust.works

63 comments fedilink hide all child comments

you are viewing a single comment's thread
view the rest of the comments

[-] ChaoticNeutralCzech@feddit.org 3 points 11 months ago* (last edited 11 months ago)

I know that similar computational problems use indexing and vector-space representation but how would you build an index of TiBs of almost-random data that makes it faster to find the strictly closest match of an arbitrarily long sequence? I can think of some heuristics, such as bitmapping every occurrence of any 8-pair sequence across each kibibit in the list. A query search would then add the bitmaps of all 8-pair sequences within the query including ones with up to 2 errors, and using the resulting map to find "hotspots" to be checked with brute force. This will decrease the computation and storage access per query but drastically increase the storage size, which is already hard to manage.

However, efficient fuzzy string matching in giant datasets is an interesting problem that computer scientists must have encountered before. Can you find a good paper that works well with random, non-delimited data instead of just using the approach of word-based indices for human languages like Lucene and OpenFTS?

[-] sp3tr4l@lemmy.zip 5 points 11 months ago* (last edited 11 months ago)

As per my other post, this person isn't doing any of that.

But, since you asked for papers on generic matching algorithms, I found this during the silent conniption fit you sent me into after suggesting that some random tumblr user plugged a tumblr bot directly into a state of the art genomics db.

https://link.springer.com/article/10.1007/s11227-022-04673-3

Please note that while, yes, they ran this test on a standard office computer, they were only searching against 12 million characters.

A single tebibyte of characters would be more like 1 trillion characters. A pebibyte would be more like 1 ~~quintillion~~ quadrillion.

... much, much, much longer processing times.

Edit: Used the wrong word for stupendously large numbers that start with q.

[-] PotatoesFall@discuss.tchncs.de 1 points 11 months ago

Yeah good point, not a trivial undertaking. I'm not an expert in that area but maybe elasticsearch or similar technology is able to find matches. Although I have no idea how that works under the hood

this post was submitted on 25 Oct 2024

352 points (100.0% liked)

Curated Tumblr

6048 readers

395 users here now

For preserving the least toxic and most culturally relevant Tumblr heritage posts.

Here are some OCR tools to assist you in transcribing posts:

web
iOS
FOSS Android Recs per u/m_f@discuss.online: 1 , 2

Don't be mean. I promise to do my best to judge that fairly.

founded 2 years ago

MODERATORS

Apytele@sh.itjust.works

ethaver@kbin.earth