I Downloaded All Data from Rule34.XXX

Overview

The internet produces enormous datasets in places people don't think to look. Rule34.XXX is one of the largest publicly accessible image-board platforms, with millions of tagged entries and years of community-contributed metadata. The goal here wasn't the content. It was the dataset structure, tagging behaviour, and growth patterns. What does large-scale community data actually look like when you treat it with the rigour you'd give any enterprise source?

What I did

· Built a scraper to extract metadata at scale through the public API: tags, upload dates, scores, counts
· Loaded the full dataset into a normalised SQL schema with a proper fact/dimension model
· Ran exploratory analysis on tag frequency, upload growth, and score vs. tag-count correlation
· Mapped the long tail of community tagging: how a small fraction of tags accounts for most content
· Visualised platform growth and content category evolution year over year
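The fact/dimension layout mentioned above can be sketched roughly as follows. This is a minimal illustration using SQLite, not the project's actual schema: the table names (`fact_post`, `dim_tag`, `bridge_post_tag`), the column names, and the helpers `load_post`, `build_demo`, and `tag_counts` are all assumptions made for the example.

```python
import sqlite3

# Hypothetical schema: one fact row per post, one dimension row per tag,
# and a bridge table for the many-to-many post/tag relationship.
SCHEMA = """
CREATE TABLE dim_tag (
    tag_id INTEGER PRIMARY KEY,
    name   TEXT NOT NULL UNIQUE
);
CREATE TABLE fact_post (
    post_id     INTEGER PRIMARY KEY,
    uploaded_at TEXT    NOT NULL,  -- ISO 8601 date string
    score       INTEGER NOT NULL
);
CREATE TABLE bridge_post_tag (
    post_id INTEGER NOT NULL REFERENCES fact_post(post_id),
    tag_id  INTEGER NOT NULL REFERENCES dim_tag(tag_id),
    PRIMARY KEY (post_id, tag_id)
);
"""


def load_post(conn, post_id, uploaded_at, score, tags):
    """Insert one post and its tags; re-loading the same post is a no-op."""
    conn.execute(
        "INSERT OR IGNORE INTO fact_post VALUES (?, ?, ?)",
        (post_id, uploaded_at, score),
    )
    for name in set(tags):
        conn.execute("INSERT OR IGNORE INTO dim_tag (name) VALUES (?)", (name,))
        (tag_id,) = conn.execute(
            "SELECT tag_id FROM dim_tag WHERE name = ?", (name,)
        ).fetchone()
        conn.execute(
            "INSERT OR IGNORE INTO bridge_post_tag VALUES (?, ?)",
            (post_id, tag_id),
        )


def tag_counts(conn):
    """Return (tag name, number of posts) pairs, most frequent first."""
    return conn.execute(
        """
        SELECT t.name, COUNT(*) AS n
        FROM bridge_post_tag AS b
        JOIN dim_tag AS t ON t.tag_id = b.tag_id
        GROUP BY t.name
        ORDER BY n DESC, t.name
        """
    ).fetchall()


def build_demo():
    """Build an in-memory database with two made-up posts."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    load_post(conn, 1, "2020-01-01", 10, ["tag_a", "tag_b"])
    load_post(conn, 2, "2020-01-02", 3, ["tag_a"])
    conn.commit()
    return conn
```

Keeping tags in their own dimension table means tag frequency becomes a single `GROUP BY` over the bridge table, which is what `tag_counts` does here.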

Key findings

→ The tag distribution is a textbook power law: the top 1% of tags cover the majority of posts
→ Upload volume spikes track cleanly against major entertainment release dates
→ Community scoring is a poor proxy for content diversity: high scores cluster in narrow categories
→ The dataset shape sits closer to an e-commerce product catalogue than to social media, which makes for a useful methodological parallel
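One way to check a concentration claim like the first finding is to measure how much of all tag usage the most frequent tags account for. The sketch below is an assumption about how that could be measured, not the analysis actually run: `top_share` is a hypothetical helper, and it counts tag assignments (post/tag pairs) rather than distinct posts.

```python
import math
from collections import Counter


def top_share(tag_lists, fraction=0.01):
    """Share of all tag assignments covered by the top `fraction` of tags.

    `tag_lists` is an iterable with one list of tag names per post.
    At least one tag is always included in the "top" group.
    """
    # Count each tag at most once per post.
    counts = Counter(tag for tags in tag_lists for tag in set(tags))
    if not counts:
        return 0.0
    ranked = sorted(counts.values(), reverse=True)
    k = max(1, math.ceil(len(ranked) * fraction))
    return sum(ranked[:k]) / sum(ranked)


# Made-up example: four posts, four distinct tags, one dominant tag.
posts = [["a", "b"], ["a"], ["a", "c"], ["a", "d"]]
share = top_share(posts, fraction=0.25)
```

On real data, a heavily skewed distribution would show up as a share well above the fraction of tags considered. Plotting tag rank against frequency on log-log axes is a common complementary check.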

Stack

Python · SQL · requests · pandas · SQLite · matplotlib

