david_s_data's comments

david_s_data · 2026-04-07T19:57:57 1775591877

Incredibly useful, thank you.

david_s_data · 2026-04-07T13:16:10 1775567770

Hi everyone. I've been spending a lot of time looking at UK real estate data and realized that the really valuable stuff (like the specific reasons why a council rejects a planning application) is buried in unstructured PDFs.

I decided to build an extraction pipeline to pull the policy breaches, officer notes, timelines, etc. out of those PDFs and into a clean CSV. I also had to write a quick script to strip out names and reduce exact addresses to postcode level to avoid GDPR issues.
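
For anyone curious what the anonymisation step could look like, here is a minimal sketch rather than the actual script. The column names `applicant_name` and `site_address` are hypothetical placeholders, and the regex is a simplified postcode pattern, not a full validator, so matches should be treated as candidates.

```python
import re
from typing import Optional

# Simplified UK postcode pattern: outward code (e.g. "SW1A"), optional
# whitespace, then inward code (e.g. "2AA"). This is an approximation and
# does not check the match against the full official format.
POSTCODE_RE = re.compile(r"\b([A-Z]{1,2}\d[A-Z\d]?)\s*(\d[A-Z]{2})\b", re.IGNORECASE)


def extract_postcode(address: str) -> Optional[str]:
    """Return the last postcode-like token in the address, or None."""
    matches = POSTCODE_RE.findall(address)
    if not matches:
        return None
    outward, inward = matches[-1]
    return f"{outward.upper()} {inward.upper()}"


def anonymise_row(row: dict) -> dict:
    """Drop direct identifiers and keep only the postcode.

    The keys "applicant_name" and "site_address" are assumed column names.
    """
    cleaned = {
        key: value
        for key, value in row.items()
        if key not in ("applicant_name", "site_address")
    }
    cleaned["postcode"] = extract_postcode(row.get("site_address", ""))
    return cleaned
```

Taking the last match is a deliberate choice, since the postcode usually sits at the end of an address line. Names that appear inside free-text fields such as officer notes are not handled here and would need a separate pass.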

I just put a 50-row sample of the schema up on Kaggle. Before I burn money on compute to scale this to 10,000+ rows across London, I'd really appreciate a sanity check from anyone who works with spatial or proptech data. Are there any obvious columns or data points I'm completely missing here?