submitted 1 month ago by Mindwolf@lemm.ee to c/technology@lemmy.world

Researchers published a massive database of more than 2 billion Discord messages that they say they scraped using Discord’s public API. The data was pulled from 3,167 servers and covers posts made between 2015 and 2024, the entire time Discord has been active.

Though the researchers claim they’ve anonymized the data, it’s hard to imagine anyone is comfortable with almost a decade of their Discord messages sitting in a public JSON file online. Separately, a different programmer released a Discord tool called "Searchcord" based on a different data set that shows non-anonymized chat histories.

[-] asbestos@lemmy.world 252 points 1 month ago

Probably our only chance to find solutions to problems with open source software that uses Discord as their forum

[-] boatswain@infosec.pub 139 points 1 month ago

Seriously. It's beyond painful when some open source project only uses Discord for communication. You have to hope that you post your question at a time when the right people are online, and that there's not a more interesting conversation going on, otherwise it just gets lost. Index that whole dataset.

[-] ALostInquirer@lemm.ee 18 points 1 month ago

Given some similar issues, why is it some projects still use IRC then?

[-] Quill7513@slrpnk.net 53 points 1 month ago

there's a difference between using irc for real-time troubleshooting and not having a forum at all and directing everyone to your livechat discord. i'm sure some sicko out there has run an OSS project on only IRC, but their project likely got no traction because a history of problem-solving posts is important in open source. generally speaking, you need:

  • a wiki
  • a static indexable searchable forum
  • a live chat place for real time communication for novel problems

too many projects these days only have that last one in the form of discord

[-] boatswain@infosec.pub 12 points 1 month ago

That would be equally annoying. Probably a better signal to noise ratio on IRC though; Discord descends into memes almost instantly.

[-] AugustWest@lemm.ee 9 points 1 month ago

For projects I am involved with, all IRC chats are archived and searchable. Nothing is private, and no registration is needed.

Quite a bit different.

[-] Peffse@lemmy.world 6 points 1 month ago

I've always wanted to contribute to The Cutting Room Floor wiki but they hide registration behind a Discord server bot that will give the registration code.

[-] Ulrich@feddit.org 2 points 1 month ago* (last edited 1 month ago)

Index that whole dataset

I've seen a few projects doing just that with answeroverflow.com and they have come up in my web searches. Not really a solution but at least a stopgap.

[-] Dojan@pawb.social 18 points 1 month ago

I spent nearly three hours today between discord and matrix trying to figure out how to get these two pieces of software to talk using a certain protocol.

Imagine if there were online indexable platforms where people could publish this information so it’s easily accessible rather than having to scour through message logs hoping to find the right keywords. Such a technology surely doesn’t exist already, right?

I hate discord.

[-] dual_sport_dork@lemmy.world 36 points 1 month ago

I don't hate Discord, I simply hate that so many projects and companies have unanimously decided to use it as the wrong tool for the wrong job.

It's fine for its intended use case, which is bickering with my friends about video games and fiction, and spamming each other with .gifs and meme images.

[-] MBech@feddit.dk 20 points 1 month ago

Discord is genuinely a great tool for what I used to use Skype for. Talking to my friends, and sharing dumb memes with them in a groupchat format. Companies need to learn that using it as a forum, a Q&A service, a wiki or any other information sharing purpose, is simply fucking retarded.

[-] MDCCCLV@lemmy.ca 4 points 1 month ago

Yeah, but then you have something like when people protest-deleted their history on Reddit, which is fine as a protest tactic but leaves a hole where your specific question came up and now there's nothing there.

[-] spiderhamster@lemmy.world 1 points 1 month ago

you get it to work? i didn't have time to get it working in both directions. matrix to discord worked fine but not the other way.

[-] nawa@lemmy.world 14 points 1 month ago

Lol, I've read this headline and thought "thank fuck, probably the only option to have Discord's content readable", I like how universal this opinion is

[-] Metz@lemmy.world 120 points 1 month ago

So basically discord finally got a usable search. I count that as a win.

[-] donuts@lemmy.world 93 points 1 month ago

Well yeah, it's not encrypted. It would be the same as 10 years of Reddit posts or Lemmy posts scraped

[-] Quibblekrust@thelemmy.club 5 points 1 month ago

There's literally no difference. Each Discord server is like a tiny chunk of Reddit. If anyone expected any privacy on these servers, they're nuts.

[-] melroy@kbin.melroy.org 3 points 1 month ago

It's indeed not a miracle.

[-] CosmoNova@lemmy.world 50 points 1 month ago

That’s good news. Internet archiving is an important endeavor because you never know when they’ll pull the plug. Now it’s a little more secure and probably far more useful than in Discord’s hands alone.

[-] Mustakrakish@lemmy.world 7 points 1 month ago

Not for messages that are supposed to be private lol. Let me just make a copy of all texts you've sent over the last decade, for "archiving".

[-] MangoPenguin 17 points 1 month ago

This says it was done via the API so they wouldn't be private messages.

[-] nomy@lemmy.zip 3 points 1 month ago

Texts are sent in plain-text and I wouldn't recommend discussing anything you'd like to keep private via text.

[-] shaggyb@lemmy.world 2 points 1 month ago

If you think messages you post anywhere on the internet are private, you're in for a bad time.

[-] FaceDeer@fedia.io 40 points 1 month ago

If they aren't comfortable with their Discord messages being public, perhaps they shouldn't have posted those messages in a public forum that the public can access.

[-] sp3ctr4l@lemmy.dbzer0.com 26 points 1 month ago

So this is:

'Uh guys, Discord chats leaked...'

For... what, just literally everyone who used Discord between 2015 and 2017, everyone who was an early adopter?

Dear fucking god.

I used to say 'someday, people will learn', but fucking no obviously not, no they won't, almost everyone is an idiot and/or truly doesn't care.

... I guess this'll be fodder for a whole bunch of dramatubers / pedohunters for the next year or so...

[-] DanWolfstone@leminal.space 1 points 1 month ago

It wasn't the chats though. It was public servers that can be found through the discovery tab. I would love to be up in arms about this and convince people to switch but... Looking at it objectively, this isn't terribly different from if they'd archived public subreddits and their posts.

[-] fullsquare@awful.systems 18 points 1 month ago

🚩

marked safe

from Brazilian mass discord message leak

(never used discord)

[-] Entertain529@lemmy.ml 16 points 1 month ago

Saving this article for the next time someone says "Just message me on discord, it's easier".

[-] CosmicTurtle0@lemmy.dbzer0.com 15 points 1 month ago
[-] lefixxx@lemmy.world 15 points 1 month ago

"scraped" via API? I don't think it means what you think it means.

[-] snowsuit2654 14 points 1 month ago* (last edited 1 month ago)

"anonymized" sure. I highly doubt they read every message. I'm sure there is lots of de-anonymizing information in the messages themselves.

For example--

Anon1: "hey jeff, wanna play Minecraft?"

Anon2: "sure"

Thus we know Anon2's name is Jeff. I imagine there's a lot of this.
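
The point above can be illustrated with a small sketch. This is only an assumption about what "anonymizing" might look like (hashing the author field and leaving the text alone); it is not the researchers' actual pipeline, and the names `pseudonymize`, `leaked_names`, the salt, and the sender name "steve" are all made up for the example.

```python
import hashlib

def pseudonymize(messages, salt="example-salt"):
    """Replace each author name with a stable pseudonym; message text is left untouched."""
    out = []
    for author, text in messages:
        digest = hashlib.sha256((salt + author).encode("utf-8")).hexdigest()[:8]
        out.append((f"anon_{digest}", text))
    return out

def leaked_names(original, pseudonymized):
    """Return real author names that still appear as words in the pseudonymized text."""
    names = {author.lower() for author, _ in original}
    leaked = set()
    for _, text in pseudonymized:
        # Split on whitespace and trim common punctuation before comparing.
        words = {w.strip(".,!?\"'").lower() for w in text.split()}
        leaked |= names & words
    return leaked

# Hypothetical two-message chat mirroring the example above.
chat = [("steve", "hey jeff, wanna play Minecraft?"), ("jeff", "sure")]
anon = pseudonymize(chat)
print(leaked_names(chat, anon))  # {'jeff'}
```

The author column is fully replaced, yet the name still sits in the message body, which is why scrubbing only metadata is not enough.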

[-] Samsy@lemmy.ml 14 points 1 month ago

Meanwhile AI scrapers: This will be a fine addition to my collection.

[-] joyjoy@lemm.ee 12 points 1 month ago

Public data should be accessible anonymously. You can't change my mind.

[-] toastmeister@lemmy.ca 11 points 1 month ago

Great news for open source AI.

[-] ABetterTomorrow@lemm.ee 10 points 1 month ago

wtf…… going to get worse after IPO!

[-] gwilikers@lemmy.ml 10 points 1 month ago* (last edited 1 month ago)

So how does this work? Like how did they get those messages through API calls? Also, is this not something that Discord would dislike since it dilutes the value of their data hoard?

[-] thesohoriots@lemmy.world 10 points 1 month ago

They just wanted to find new slurs.

[-] Reygle@lemmy.world 10 points 1 month ago

Ooh! Do Teams next

[-] pelespirit@sh.itjust.works 6 points 1 month ago

Every time you post, you're posting so that Meta, Google, Reddit and every known retail store like Walmart, Target, Kroger, etc. can see it because they bought that info or harvested it themselves. I think these are great announcements so people can see who sees and manipulates you with your own contributions of data.

[-] dubyakay@lemmy.ca 4 points 1 month ago

The feedback loop is everywhere in tech.

[-] obbeel@lemmy.eco.br 5 points 1 month ago

This is just trolling, at this point.

[-] MrSoup@lemmy.zip 3 points 1 month ago

I can't find this "public" json

[-] obbeel@lemmy.eco.br 7 points 1 month ago
[-] MrSoup@lemmy.zip 2 points 1 month ago

Thanks, but the file seems to be restricted.

[-] CosmicTurtle0@lemmy.dbzer0.com 4 points 1 month ago* (last edited 1 month ago)

I skimmed through their paper and I can't seem to find the instructions to download the dataset.

I found this particularly cute:

This study introduces the Discord Unveiled Dataset, a comprehensive and ethically curated resource encompassing over 3,000 public servers and 2 billion messages exchanged on Discord.

[-] arararagi@ani.social 3 points 1 month ago

If they were on OPEN servers, I doubt they cared that much.

[-] reiterationstation@lemm.ee 2 points 1 month ago

I was hoping people would do this!!!

[-] dgdft@lemmy.world 1 points 1 month ago

I was hoping to play around with the dataset over the weekend to toy with some text-embedding techniques, but they’ve pulled the cord on the download links.

Anyone have a copy of the full archive they’re willing to share, or a magnet link?

[-] pewgar_seemsimandroid 1 points 1 month ago

404? another source please? I don't trust them on this exact thing.

this post was submitted on 21 May 2025
569 points (100.0% liked)
