Overview
The internet produces enormous datasets in places people don't think to look. Rule34.XXX is one of the largest publicly accessible image-board platforms, with millions of tagged entries and years of community-contributed metadata. The goal here wasn't the content. It was the dataset structure, tagging behaviour, and growth patterns. What does large-scale community data actually look like when you treat it with the rigour you'd give any enterprise source?
What I did
- Built a scraper to extract metadata at scale through the public API: tags, upload dates, scores, counts
- Loaded the full dataset into a normalised SQL schema with a proper fact/dimension model
- Ran exploratory analysis on tag frequency, upload growth, and score vs. tag-count correlation
- Mapped the long tail of community tagging: how a small fraction of tags accounts for most content
- Visualised platform growth and content category evolution year over year
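As a rough illustration of the fact/dimension step, here is a minimal sketch using Python's built-in sqlite3 module. The table names (fact_post, dim_tag, bridge_post_tag), the column names, the record keys, and the sample records are all assumptions made up for this example; the project's actual schema is not shown in this write-up.

```python
import sqlite3

# Hypothetical schema: one fact table for posts, one dimension table for
# tags, and a bridge table for the many-to-many link between them.
SCHEMA = """
CREATE TABLE dim_tag (
    tag_id INTEGER PRIMARY KEY,
    name   TEXT NOT NULL UNIQUE
);
CREATE TABLE fact_post (
    post_id     INTEGER PRIMARY KEY,
    uploaded_at TEXT NOT NULL,
    score       INTEGER NOT NULL,
    tag_count   INTEGER NOT NULL
);
CREATE TABLE bridge_post_tag (
    post_id INTEGER NOT NULL REFERENCES fact_post(post_id),
    tag_id  INTEGER NOT NULL REFERENCES dim_tag(tag_id),
    PRIMARY KEY (post_id, tag_id)
);
"""


def load_posts(conn, records):
    """Insert metadata records (dicts with assumed keys) into the schema."""
    for rec in records:
        tags = sorted(set(rec["tags"]))  # de-duplicate tags within a post
        conn.execute(
            "INSERT INTO fact_post VALUES (?, ?, ?, ?)",
            (rec["id"], rec["uploaded_at"], rec["score"], len(tags)),
        )
        for name in tags:
            # Create the tag row on first sight, then look up its key.
            conn.execute(
                "INSERT OR IGNORE INTO dim_tag (name) VALUES (?)", (name,)
            )
            (tag_id,) = conn.execute(
                "SELECT tag_id FROM dim_tag WHERE name = ?", (name,)
            ).fetchone()
            conn.execute(
                "INSERT INTO bridge_post_tag VALUES (?, ?)", (rec["id"], tag_id)
            )
    conn.commit()


def tag_frequencies(conn):
    """Return (tag name, number of posts) pairs, most frequent first."""
    return conn.execute(
        """
        SELECT t.name, COUNT(*) AS n_posts
        FROM bridge_post_tag AS b
        JOIN dim_tag AS t ON t.tag_id = b.tag_id
        GROUP BY t.name
        ORDER BY n_posts DESC, t.name ASC
        """
    ).fetchall()


def build_demo():
    """Build an in-memory database from three synthetic records."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    load_posts(
        conn,
        [
            {"id": 1, "uploaded_at": "2020-01-01", "score": 5, "tags": ["tag_a", "tag_b"]},
            {"id": 2, "uploaded_at": "2020-01-02", "score": 2, "tags": ["tag_a"]},
            {"id": 3, "uploaded_at": "2021-06-15", "score": 9, "tags": ["tag_a", "tag_b", "tag_c"]},
        ],
    )
    return conn
```

Keeping tags in their own table means tag frequency becomes a single join and GROUP BY, and storing tag_count on the post row makes a score vs. tag-count comparison possible without touching the bridge table.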
Key findings
- The tag distribution is a textbook power law: 1% of tags cover the majority of posts
- Upload volume spikes line up closely with major entertainment release dates
- Community scoring is a poor proxy for content diversity: high scores cluster in narrow categories
- The dataset shape sits closer to an e-commerce product catalogue than to social media, which makes for a useful methodological parallel
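One way to check a claim like the tag-coverage finding is to measure what share of posts carry at least one of the most frequent tags. The sketch below assumes that definition of "cover"; the function name top_tag_coverage and the sample data are hypothetical, and the numbers are synthetic, not results from the project.

```python
import math
from collections import Counter


def top_tag_coverage(post_tags, fraction):
    """Share of posts carrying at least one tag from the top `fraction` of tags.

    post_tags maps a post id to its list of tags. Tags are ranked by the
    number of posts they appear on.
    """
    counts = Counter(tag for tags in post_tags.values() for tag in set(tags))
    if not counts:
        return 0.0
    # Always keep at least one tag so very small vocabularies still work.
    k = max(1, math.ceil(len(counts) * fraction))
    top = {tag for tag, _ in counts.most_common(k)}
    covered = sum(1 for tags in post_tags.values() if top & set(tags))
    return covered / len(post_tags)


# Synthetic example: four distinct tags, so the top 25% is a single tag.
sample = {
    1: ["tag_a", "tag_b"],
    2: ["tag_a"],
    3: ["tag_a", "tag_c"],
    4: ["tag_d"],
}
coverage = top_tag_coverage(sample, 0.25)
print(coverage)
```

On a real dataset the same function would be called with fraction=0.01. Coverage alone does not establish a power law, so a log-log plot of tag rank against frequency is a reasonable companion check.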
Stack
Python
SQL
requests
pandas
SQLite
matplotlib