david_s_data's comments

Incredibly useful, thank you.


Hi everyone. I've been spending a lot of time looking at UK real estate data and realized that the most valuable information (like the specific reasons a council rejects a planning application) is buried in unstructured PDFs.

I decided to build an extraction pipeline to pull the policy breaches, officer notes, timelines, etc. out of those PDFs and into a clean CSV. I also had to write a quick script to strip exact addresses and names, keeping location only at the postcode level, to avoid GDPR issues.
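
For illustration, the postcode-level stripping step could look roughly like the sketch below. This is not the actual script: the field names (`applicant_name`, `agent_name`, `site_address`, `decision_reason`) are hypothetical, and the regex is a simplified UK postcode pattern that would not catch every edge case.

```python
import csv
import io
import re

# Simplified UK postcode pattern (an assumption, not an exhaustive validator):
# outward code such as "SW1A" or "E1", then inward code such as "1AA".
POSTCODE_RE = re.compile(r"\b([A-Z]{1,2}\d[A-Z\d]?)\s*(\d[A-Z]{2})\b", re.IGNORECASE)

# Hypothetical column names holding personal data that should be dropped.
NAME_FIELDS = {"applicant_name", "agent_name"}
ADDRESS_FIELD = "site_address"


def postcode_only(address):
    """Return the first postcode found in the address, or "" if none."""
    match = POSTCODE_RE.search(address or "")
    if match is None:
        return ""
    return f"{match.group(1).upper()} {match.group(2).upper()}"


def anonymise(record):
    """Drop name fields and replace the full address with its postcode."""
    cleaned = {
        key: value
        for key, value in record.items()
        if key not in NAME_FIELDS and key != ADDRESS_FIELD
    }
    cleaned["postcode"] = postcode_only(record.get(ADDRESS_FIELD, ""))
    return cleaned


def to_csv(records):
    """Serialise anonymised records to CSV text, using the first row's columns."""
    rows = [anonymise(record) for record in records]
    if not rows:
        return ""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(rows[0].keys()))
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Made-up example row, not taken from the real dataset.
sample = [
    {
        "applicant_name": "Jane Doe",
        "site_address": "12 Example Road, London sw1a 1aa",
        "decision_reason": "Conflicts with local plan policy",
    }
]
csv_text = to_csv(sample)
```

Note that this only handles structured fields. Names or addresses that appear inside free-text columns such as officer notes would need a separate pass.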

I just put a 50-row sample of the schema up on Kaggle. Before I burn money on compute to scale this to 10,000+ rows across London, I'd really appreciate a sanity check from anyone who works with spatial or proptech data. Are there any obvious columns or data points I'm completely missing here?

