[-] 19_84@lemmy.dbzer0.com 5 points 2 days ago

if you didn't notice, this project was released into the public domain

[-] 19_84@lemmy.dbzer0.com 5 points 2 days ago

those are not split by subreddit so they will not work with the tool

[-] 19_84@lemmy.dbzer0.com 3 points 2 days ago

this is one reason I support Tor deployment out of the box 😋

[-] 19_84@lemmy.dbzer0.com 6 points 2 days ago

there are the so-called activists that complain a lot, and then there are the activists that deliver projects and code... enough said

[-] 19_84@lemmy.dbzer0.com 5 points 2 days ago* (last edited 2 days ago)

that was exactly the idea, thanks for understanding..

also Reddit's ban on VPNs, also Reddit's mandatory ID verification

and the list goes on..

[-] 19_84@lemmy.dbzer0.com 44 points 3 days ago

2005-06 to 2024-12

However, the data from 2025-12 has already been released; it just needs to be split and reprocessed for 2025 by Watchful1. Once that happens, you can host an archive up to the end of 2025. I will probably add support for importing data from the Arctic Shift dumps instead, so that archives can be updated monthly.

[-] 19_84@lemmy.dbzer0.com 28 points 3 days ago

PLEASE SHARE ON REDDIT!!! I have never had a Reddit account and they will NOT let me post about this!!

[-] 19_84@lemmy.dbzer0.com 25 points 3 days ago

The torrent has data for the top 40,000 subs on Reddit. Thanks to Watchful1 splitting the data by subreddit, you can download only the subreddit you want from the torrent 🙂

[-] 19_84@lemmy.dbzer0.com 44 points 3 days ago

thanks anyway for looking at my project 🙂

[-] 19_84@lemmy.dbzer0.com 144 points 3 days ago

Yes I used AI, English is not my first language. Thank you for the kind words!

[-] 19_84@lemmy.dbzer0.com 33 points 3 days ago

Yes! Too many comments to count in a reasonable amount of time!


Reddit's API is effectively dead for archival. Third-party apps are gone. Reddit has threatened to cut off access to the Pushshift dataset multiple times. But 3.28TB of Reddit history exists as a torrent right now, and I built a tool to turn it into something you can browse on your own hardware.

The key point: This doesn't touch Reddit's servers. Ever. Download the Pushshift dataset, run my tool locally, get a fully browsable archive. Works on an air-gapped machine. Works on a Raspberry Pi serving your LAN. Works on a USB drive you hand to someone.

What it does: Takes compressed data dumps from Reddit (.zst), Voat (SQL), and Ruqqus (.7z) and generates static HTML. No JavaScript, no external requests, no tracking. Open index.html and browse. Want search? Run the optional Docker stack with PostgreSQL – still entirely on your machine.
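
For the curious, the input side looks roughly like this. A minimal sketch of reading a dump with the zstandard package – illustrative, not the tool's actual code:

```python
# Pushshift dumps are zstd-compressed newline-delimited JSON, compressed
# with a long window, so the reader needs a raised max_window_size.
# Requires the zstandard package (pip install zstandard).
import io
import json

import zstandard


def read_dump(path):
    """Yield one post/comment dict per line of a .zst NDJSON dump."""
    with open(path, "rb") as fh:
        dctx = zstandard.ZstdDecompressor(max_window_size=2**31)
        text = io.TextIOWrapper(dctx.stream_reader(fh), encoding="utf-8")
        for line in text:
            if line.strip():
                yield json.loads(line)


# Filename is an example of a per-subreddit file from the torrent.
for post in read_dump("wallstreetbets_submissions.zst"):
    print(post.get("title"))
    break
```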

API & AI Integration: Full REST API with 30+ endpoints – posts, comments, users, subreddits, full-text search, aggregations. Also ships with an MCP server (29 tools) so you can query your archive directly from AI tools.
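
The endpoint below is illustrative only (the real routes are in the repo docs), but querying a local instance looks roughly like this, assuming the Docker stack is listening on port 8080:

```python
# Purely illustrative: the path, port, and parameters are assumptions,
# not the documented API - check the repo for the real routes.
import requests

BASE = "http://localhost:8080/api"  # assumed base URL for the Docker stack

resp = requests.get(f"{BASE}/search", params={"q": "pushshift", "limit": 5})
resp.raise_for_status()
print(resp.json())
```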

Self-hosting options:

  • USB drive / local folder (just open the HTML files)
  • Home server on your LAN (see the stdlib serving sketch after this list)
  • Tor hidden service (2 commands, no port forwarding needed)
  • VPS with HTTPS
  • GitHub Pages for small archives
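
For the LAN option, a minimal sketch using only the Python standard library (the output directory name is an assumption):

```python
# Serve the generated static archive to the local network; "output" is
# an assumed directory holding the generated HTML.
from functools import partial
from http.server import HTTPServer, SimpleHTTPRequestHandler

handler = partial(SimpleHTTPRequestHandler, directory="output")
# Binding 0.0.0.0 makes the archive reachable from other LAN machines.
HTTPServer(("0.0.0.0", 8000), handler).serve_forever()
```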

Why this matters: Once you have the data, you own it. No API keys, no rate limits, no ToS changes can take it away.

Scale: Tens of millions of posts per instance. PostgreSQL backend keeps memory constant regardless of dataset size. For the full 2.38B post dataset, run multiple instances by topic.
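
The trick is streaming: never hold the whole dump in RAM, insert in fixed-size batches. A sketch of that idea with psycopg2 against a made-up table (the real tool's schema differs):

```python
# The batching idea behind the constant-memory claim. Only batch_size
# rows are ever held in memory, so RAM usage stays flat regardless of
# how large the dump is. The posts(id, subreddit, title) table and the
# connection string are assumptions for the example.
import itertools

import psycopg2
from psycopg2.extras import execute_values


def insert_posts(conn, posts, batch_size=5000):
    """Stream posts into PostgreSQL in fixed-size batches."""
    posts = iter(posts)  # ensure islice consumes the stream, not restarts
    with conn, conn.cursor() as cur:
        while True:
            batch = list(itertools.islice(posts, batch_size))
            if not batch:
                break
            execute_values(
                cur,
                "INSERT INTO posts (id, subreddit, title) VALUES %s"
                " ON CONFLICT (id) DO NOTHING",  # assumes a unique id column
                [(p["id"], p["subreddit"], p["title"]) for p in batch],
            )


conn = psycopg2.connect("dbname=archive user=archive")  # assumed DSN
insert_posts(conn, [{"id": "abc123", "subreddit": "test", "title": "hi"}])
```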

How I built it: Python, PostgreSQL, Jinja2 templates, Docker. Used Claude Code throughout as an experiment in AI-assisted development. Learned that the workflow is "trust but verify" – it accelerates the boring parts but you still own the architecture.
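
The static-generation step in miniature (template name and post fields are made up, not the project's actual templates):

```python
# Render each thread to a standalone HTML file with Jinja2. Assumes a
# templates/ directory containing a hypothetical thread.html.
from pathlib import Path

from jinja2 import Environment, FileSystemLoader, select_autoescape

env = Environment(
    loader=FileSystemLoader("templates"),
    autoescape=select_autoescape(["html"]),
)
template = env.get_template("thread.html")  # hypothetical template name

out = Path("output")
out.mkdir(exist_ok=True)
post = {"id": "abc123", "title": "hello world", "comments": []}
(out / f"{post['id']}.html").write_text(template.render(post=post))
```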

Live demo: https://online-archives.github.io/redd-archiver-example/

GitHub: https://github.com/19-84/redd-archiver (Public Domain)

Pushshift torrent: https://academictorrents.com/details/1614740ac8c94505e4ecb9d88be8bed7b6afddd4
