What’s the state of the art for using ML to understand HTML docs automatically when scraping them? I’ve tinkered with this: https://huggingface.co/docs/transformers/model_doc/markuplm and it seems useful, but AI and ML are changing so fast right now — anyone know what else is going on?
An example use case: given a URL, figure out the content type (article vs. product page, for example); if it's a product page, automatically extract all product details and specs without manually mapping XPath or CSS selector paths.
Probably a URL, a title, and perhaps a bit of readable content would be enough for ChatGPT to classify the page type.
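As a sketch of what that classification step could look like — the label set, helper name, and example inputs below are all made up for illustration:

```python
# Hypothetical sketch: build a cheap page-type classification prompt from the
# URL, title, and a short readable snippet, to send to an LLM of your choice.

PAGE_TYPES = ["article", "product page", "category listing", "other"]

def build_classification_prompt(url: str, title: str, snippet: str,
                                max_snippet_chars: int = 500) -> str:
    """Assemble a prompt asking the model to pick one label from PAGE_TYPES."""
    return (
        "Classify the web page below as one of: "
        + ", ".join(PAGE_TYPES) + ".\n"
        "Answer with the label only.\n\n"
        f"URL: {url}\n"
        f"Title: {title}\n"
        f"Content snippet: {snippet[:max_snippet_chars]}\n"
    )

prompt = build_classification_prompt(
    "https://example.com/shop/widget-3000",
    "Widget 3000 - Buy Online",
    "The Widget 3000 features a 2.4 GHz processor...",
)
```

The model's one-word answer then decides which extraction path to take.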
There was a Show HN here a couple of days ago that used ChatGPT to create a scraper based on a webpage. Hook the two together and you're basically there!
I keep bumping into the context window size. I'm trying to figure out a "compression" step that I can use in the general case, but nothing's been very satisfying so far.
The mozilla/readability library is a good first step though.
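For a rough idea of what that first step does, here's a minimal stdlib-only sketch — readability's real heuristics (link density, content scoring, etc.) are much smarter, this just shows the "keep content tags, drop chrome" idea:

```python
from html.parser import HTMLParser

# Toy boilerplate stripper: keep text inside content-ish tags, drop
# script/style/nav chrome. Tag sets here are illustrative guesses.

class ContentExtractor(HTMLParser):
    SKIP = {"script", "style", "nav", "header", "footer", "aside"}
    KEEP = {"p", "h1", "h2", "h3", "li", "article"}

    def __init__(self):
        super().__init__()
        self.skip_depth = 0   # inside how many SKIP tags we currently are
        self.keep_depth = 0   # inside how many KEEP tags we currently are
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self.skip_depth += 1
        elif tag in self.KEEP:
            self.keep_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self.skip_depth:
            self.skip_depth -= 1
        elif tag in self.KEEP and self.keep_depth:
            self.keep_depth -= 1

    def handle_data(self, data):
        if not self.skip_depth and self.keep_depth and data.strip():
            self.chunks.append(data.strip())

def extract_text(html: str) -> str:
    parser = ContentExtractor()
    parser.feed(html)
    return "\n".join(parser.chunks)
```

Even this naive version shrinks a typical page a lot before it hits the prompt.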
Yeah, I have the same issue with context window size. Personally I'm just waiting for future LLMs with a 10x-100x context window. However, someone else recently came up with this solution:
I feel like the next step is to quantize or otherwise downsize old tokens so that more can fit in memory at once. Not sure what the implications of a mixed-float-size model would be.
Imagine a browser plugin that pops up a modal. I write a query related to the current page, eg "Please summarize this page in three paragraphs, and translate to Turkish" or "There's a recipe somewhere on this page. Please suggest some variations on the filling" or "Can you make me a list of all the people mentioned on this page". Whole page (or at least the meat of it) gets bundled up with the query and sent to openai. What I'm trying to build is a simple in-browser swiss army knife.
Yes, I could try to figure out which bits of the page need to be sent along with the prompt, but that's hard in the general case. Squeezing a bit more out of the prompt window by stripping out unnecessary boilerplate is easy by comparison. (Multi-page articles are another headache).
Due to the constrained context window, this is indeed a problem. But I'd say solving it just by increasing the context window would really be brute-forcing the issue; I hope we can come up with something better. I'm betting on embeddings for these kinds of things in my personal projects, but that too seems like a jackhammer-for-a-nail kind of thing for single web pages.
Compression so the input data fits the context, maybe? For example, if the context is 4096 tokens and the input is 6000, figuring out the appropriate way to run an operation on all 6000 when context from anywhere in those 6000 might be relevant to the operation.
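One common workaround is overlapping windows: split the input into chunks that each fit the budget, with some overlap so context isn't lost at the seams. A sketch (token counts approximated by whitespace words here; a real tokenizer counts differently):

```python
# Hypothetical sliding-window chunker: split oversized input into overlapping
# windows so each fits the context budget. "window" and "overlap" are in
# words as a stand-in for tokens.

def chunk_words(text: str, window: int = 4096, overlap: int = 256):
    words = text.split()
    step = window - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + window]))
        if start + window >= len(words):
            break  # last window already covers the tail
    return chunks

# 6000-word input against a 4096-word window yields two overlapping chunks.
chunks = chunk_words(" ".join(str(i) for i in range(6000)))
```

You then run the operation per chunk and merge the results, which works for extraction-style tasks but not for questions that genuinely need the whole document at once.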
Maybe I'm not getting it, but I see this as an indexing problem. The goal shouldn't be to fit the entire document in the prompt; we should include the relevant parts of the doc when we query it.
Embedding chunks and retrieving them by similarity is definitely in use now. But if you can increase context size cheaply, then the model can figure out what's relevant.
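The embed-and-retrieve approach in miniature — a real system would use learned embeddings (e.g. a sentence-transformer model); a bag-of-words count stands in here just so the example is self-contained:

```python
import math
from collections import Counter

# Toy "embedding": word counts. Real pipelines swap this for a model that
# maps text to dense vectors; the retrieval logic stays the same.

def embed(text: str) -> Counter:
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def top_chunks(query: str, chunks: list, k: int = 2) -> list:
    q = embed(query)
    ranked = sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)
    return ranked[:k]

chunks = [
    "Shipping and returns policy for all orders",
    "The widget weighs 2 kg and runs at 2.4 GHz",
    "About our company history and mission",
]
best = top_chunks("how much does the widget weigh", chunks, k=1)
```

Only the top-scoring chunks go into the prompt, which is how you stay under the window without truncating blindly.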
Yeah, I get that — after all, attention is all you need. But unless you want to spend a bunch of money on the 32k-context version, I don't think there are other options than embeddings and an index.
I've worked extensively in this space. For those looking for just an OCR solution, Microsoft's "Read" offering is by far the most accurate. Key-value, table, and other information extraction is a much harder problem. Anything that can go wrong in production will: documents with extra pages, rotated, blacked out, fuzzy. Many steps go into making document extraction truly end-to-end.
The biggest enterprise users are doing thousands of pages a minute, which also turns document extraction into a scaling distributed-systems problem.
A few days ago, IBM announced a new OCR system[1]. Have you by chance compared it to Microsoft's offering? I'm currently looking for the best-in-class OCR solution for scanned PDF documents.
Call me biased, but I've learned over time that anything that comes out of the Watson team looks good only in PR statements but sucks in production — especially at tasks like OCR. YMMV.
We currently develop solutions in this area, and I believe isolated OCR is not the way to go. Things are moving rapidly towards end-to-end processing of documents with huge transformer models, and I also believe multi-modal GPT models will quickly win all use cases.
If you're interested in working on this topic and are located in northern Germany, pop me a message.
I would like to extract text from approximately 2000 PDF files (machine generated, not scanned) in which the layout can differ from file to file. Some have normal paragraphs, others two or even three columns. All contain tables, but I am not interested in those. Do you know of a good (semi-)automatic solution for this?
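For machine-generated PDFs, one semi-automatic angle is to extract positioned text blocks with a library such as PyMuPDF (`page.get_text("blocks")` gives coordinates plus text) and then linearize multi-column layouts yourself. A sketch of that ordering step — the block data and the `column_gap` threshold are hypothetical:

```python
# Hypothetical column-aware ordering: cluster text blocks into columns by
# x position, then read each column top to bottom. Input blocks are
# (x0, y0, text) tuples, e.g. from a PDF library's block extraction.

def order_blocks(blocks, column_gap=50):
    columns = []  # each entry: [representative_x, [(y0, text), ...]]
    for x0, y0, text in sorted(blocks, key=lambda b: b[0]):
        for col in columns:
            if abs(col[0] - x0) < column_gap:
                col[1].append((y0, text))  # same column
                break
        else:
            columns.append([x0, [(y0, text)]])  # new column
    out = []
    for _, items in sorted(columns, key=lambda c: c[0]):
        out.extend(t for _, t in sorted(items))  # top-to-bottom within column
    return out

# Two-column toy page: left column should be read before the right one.
blocks = [(300, 100, "right top"), (50, 100, "left top"),
          (50, 200, "left bottom"), (300, 200, "right bottom")]
```

Filtering out the tables is the harder part; one crude heuristic is to drop blocks whose text is mostly numbers and short cells, but expect to tune per layout.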
This is a hard problem and will require an enterprise solution, unfortunately. If it's only 2000 PDFs, you might be better off outsourcing to an off-shore consulting agency to do it manually.
Do you have any recommendations for OCR of receipts and grocery bills? I’ve dreamt of having a little app to analyse grocery spending and distribute bills among multiple people, but every time I checked, the state of receipt OCR was surprisingly too bad for this…
Last I checked, I saw a grocery bill example using https://github.com/mindee/doctr and it was fairly accurate. Bear in mind that was last year; hopefully it's gotten even better, or there are other libraries now.
Isn't it depressing that we live in 2023 and the predominant document format is PDF, which was invented in 1993 and is optimized for printing? I would love to have a new format that is easily parseable (like JSON) AND printable (like PDF).
At least PDF occasionally contains actual text. My organisation systematically scans everything to TIFF images for archival. So now we are embarking on a major project to OCR the TIFFs to get the text back (!).
My payroll statement is the same: an image wrapped in a PDF document.
I’m not sure if they’re being intentionally annoying or if someone thought this was actually helpful for the thousands of independent contractors who track their expenses down to the penny?
I think it’s depressing that we’re still thinking of content as containerised, as if it still had to be bound in a physical volume, instead of as addressable items of information, the way a computer naturally stores information.
I love this comment. It strikes at the heart of many things that I have been vocal about for decades. At the same time, I could take the devil's advocate approach and say: a computer naturally stores information on physical volumes, and since these have different address spaces, you will probably not get around this conundrum.
However, fundamentally I completely agree with you. The information we seek should not be bound to the medium it is stored on in this day and age. I wish we could get out of containerized knowledge, but it seems to me we are creating ever more virtual containers in which information is stored. I, for one, only get a glimpse of the vast amounts of information TikTok is making available to its users when it is posted on one of the few websites I visit.
I guess the reason we still think of information as being in books and on paper is that we are human, and it's hard to shed millennia of habits and institutions that have grown around us to accommodate our limited ability to grasp the universe.
I would pay for a simple, competent anything-to-markdown API. Something that could convert PDFs to high quality markdown with tables, etc. I'm using Document AI from Google right now and the ergonomics are awful.
Currently OCR support is limited to PDF > TXT conversion but we're hoping to add support for other output formats at some point. Feel free to shoot me an email at chris [at] zamzar [dot] com if you'd like to chat further.
How serendipitous — I was looking for something like that recently. Admittedly my use case was much simpler: detecting tables of contents in scanned PDFs, which usually aren't links you can use to navigate within the document. Will see if this could help. Is anyone using something else for my use case?
I personally use Azure, combined with OCR correction using GPT to convert a scan of my daily journal (Apple Notes creates a PDF that is nothing but a bunch of images) -> Markdown -> Extract tasks and then add them to my Reminders app using CalDav. Azure has one of the best OCR for handwritten text, but for normal document extraction (read: printed text), any service would do a reasonable job.
This is the prompt I bought from PromptBase. You basically provide GPT with some examples of possible OCR errors, then give it the OCRed text and it tries to correct them.
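The shape of such a prompt is easy to reconstruct — a few-shot list of (bad OCR, corrected) pairs followed by the real OCR output. The example pairs and function name below are made up:

```python
# Hypothetical few-shot OCR-correction prompt builder. The model sees
# typical OCR confusions (0/O, 1/l, comma/period) before the real text.

OCR_EXAMPLES = [
    ("Tbe qulck brown f0x", "The quick brown fox"),
    ("M1lk 2L ...... $3,49", "Milk 2L ...... $3.49"),
]

def build_correction_prompt(ocr_text: str) -> str:
    lines = ["Correct the OCR errors in the text below. "
             "Examples of typical mistakes:"]
    for bad, good in OCR_EXAMPLES:
        lines.append(f"OCR: {bad}\nCorrected: {good}")
    lines.append(f"OCR: {ocr_text}\nCorrected:")
    return "\n\n".join(lines)
```

For receipts specifically, seeding the examples with price and quantity errors (the kind shown above) tends to matter most.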
I’ve used this repo. It’s ok. For very simple layouts, it probably works fine. For more complex layouts it fails miserably. I’ve also had cases where it didn’t detect half the text on the page (machine generated text).
An LLM is useless (or not as useful) for OCR of forms where we are trying to extract the name "John Smith" from a name field, whereas an HMM trained specifically on name fields may do a better job.
LLMs may perform better as a post processing step for running text (such as book pages).
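To make the HMM point concrete, here is a minimal Viterbi sketch that tags tokens as NAME or OTHER. All probabilities are hand-set toy numbers; a real model would estimate them from labeled name-field data:

```python
# Toy HMM tagger: two states, hand-set transition/emission probabilities,
# standard Viterbi decoding. Illustrative only, not a trained model.

STATES = ("NAME", "OTHER")
START = {"NAME": 0.2, "OTHER": 0.8}
TRANS = {"NAME": {"NAME": 0.6, "OTHER": 0.4},
         "OTHER": {"NAME": 0.4, "OTHER": 0.6}}

def emit(state, word):
    """Toy emission model: capitalized alphabetic words look name-like."""
    namelike = word[:1].isupper() and word.isalpha()
    if state == "NAME":
        return 0.8 if namelike else 0.05
    return 0.2 if namelike else 0.5

def viterbi(tokens):
    V = [{s: START[s] * emit(s, tokens[0]) for s in STATES}]
    paths = {s: [s] for s in STATES}
    for t in range(1, len(tokens)):
        V.append({})
        new_paths = {}
        for s in STATES:
            prob, prev = max(
                (V[t - 1][p] * TRANS[p][s] * emit(s, tokens[t]), p)
                for p in STATES)
            V[t][s] = prob
            new_paths[s] = paths[prev] + [s]
        paths = new_paths
    best = max(STATES, key=lambda s: V[-1][s])
    return paths[best]

tags = viterbi(["Name:", "John", "Smith"])
```

The field label gets tagged OTHER and the two capitalized words NAME, which is the structure a name-field extractor needs and a generic LLM has no particular bias toward.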
For what it's worth, very high quality OCR from Google's Vision offering costs $0.0015 per page, with 1000 free pages per month. In my experience, it has been significantly superior to any open source solution.