DeepDoctection: Document extraction and analysis using deep learning models (github.com/deepdoctection)
191 points by bpiche on April 26, 2023 | hide | past | favorite | 62 comments


What’s the state of the art for using ML to understand HTML docs automatically when scraping them? I’ve tinkered with this: https://huggingface.co/docs/transformers/model_doc/markuplm and it seems useful, but AI and ML are changing so fast right now; does anyone know what else is going on?

An example use case: given a URL, figure out the content type (article vs. product page, for example); if it's a product page, automatically extract all product details and specs without manually mapping XPath or CSS selectors.


Probably a URL and title, and perhaps a bit of readable content, would be enough for GPT-3 to classify the page type.

There was a Show HN here a couple of days ago that used ChatGPT to create a scraper based on a webpage. Hook the two together and you're basically there!


GPT-4 presumably.


I keep bumping into the context window size. I'm trying to figure out a "compression" step that I can use in the general case, but nothing's been very satisfying so far.

The mozilla/readability library is a good first step though.
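A minimal sketch of that first step in Python, using nothing beyond the standard library (a readability-style extractor would do better; this just drops obvious boilerplate tags and truncates to a rough word budget):

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, skipping script/style/nav boilerplate."""
    SKIP = {"script", "style", "nav", "footer", "aside"}

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0 and data.strip():
            self.parts.append(data.strip())


def compress_page(html, max_words=3000):
    """Strip boilerplate and truncate to a rough word budget
    (at ~4 chars/token, ~3000 words roughly fits a 4k-token window)."""
    p = TextExtractor()
    p.feed(html)
    words = " ".join(p.parts).split()
    return " ".join(words[:max_words])
```

It won't find the "main" content the way readability does, but it squeezes a surprising amount of junk out of real pages before you spend tokens on them.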


Yeah, I have the same issue with context window size. Personally I'm just waiting for future LLMs with a 10x-100x context window. However, someone else recently came up with this solution:

https://news.ycombinator.com/item?id=35488291


I feel like the next step is to quantize or otherwise downsize old tokens so that they can fit more in memory at once. Not sure what the implications of a mixed-float-size model would be.


Why do you want to compress this data? What's the final use case here?


Imagine a browser plugin that pops up a modal. I write a query related to the current page, eg "Please summarize this page in three paragraphs, and translate to Turkish" or "There's a recipe somewhere on this page. Please suggest some variations on the filling" or "Can you make me a list of all the people mentioned on this page". Whole page (or at least the meat of it) gets bundled up with the query and sent to openai. What I'm trying to build is a simple in-browser swiss army knife.

Yes, I could try to figure out which bits of the page need to be sent along with the prompt, but that's hard in the general case. Squeezing a bit more out of the prompt window by stripping out unnecessary boilerplate is easy by comparison. (Multi-page articles are another headache).


Due to the constrained context window, this is indeed a problem. But I would say solving it by just increasing the context window is really brute-forcing the issue. I hope we can come up with something better. I'm betting on embeddings for this kind of thing in my personal projects, but that too seems like a jackhammer-for-a-nail kind of thing for single web pages.


Compression so the input data fits the context, maybe? For example, if the context is 4096 and the input size is 6000, figuring out the appropriate way to run an operation over all 6000 when context from all 6000 might be relevant to the operation.
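A rough map-reduce sketch of that idea; `op` here is a stand-in for whatever LLM call you'd run, and the sizes are characters rather than tokens for simplicity:

```python
def run_over_long_input(text, op, context_limit=4096, overlap=200):
    """Map-reduce sketch: split the input into overlapping chunks that
    each fit the context window, apply `op` to each chunk, then apply
    `op` once more to the combined partial results if they still don't
    fit. `op` stands in for an LLM call (e.g. "summarize this")."""
    stride = context_limit - overlap
    chunks = [text[i:i + context_limit] for i in range(0, len(text), stride)]
    partials = [op(chunk) for chunk in chunks]
    combined = "\n".join(partials)
    return op(combined) if len(combined) > context_limit else combined
```

The overlap keeps sentences that straddle a chunk boundary from being lost to both halves; whether the reduce step preserves what you care about depends entirely on the operation.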


Maybe I'm not getting it, but I see this as an indexing problem. The goal shouldn't be to fit the entire document into the prompt; we should include only the relevant parts of the doc when we query it.

Edit: I'm thinking of something like LlamaIndex


Embedding chunks and retrieving them by similarity is definitely in use now. But if you can increase the context size cheaply, the model can figure out what's relevant itself.
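For anyone unfamiliar, the retrieval step looks roughly like this; `embed` is a toy bag-of-words stand-in for a real embedding model, but the cosine-similarity ranking is the same either way:

```python
import math
from collections import Counter


def embed(text):
    """Toy bag-of-words 'embedding'; a real system would call an
    embedding model here and get a dense vector back."""
    return Counter(text.lower().split())


def cosine(a, b):
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def top_chunks(query, chunks, k=2):
    """Return the k chunks most similar to the query: the usual
    retrieval step before stuffing the prompt."""
    q = embed(query)
    return sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)[:k]
```

Only the top-k chunks get sent along with the prompt, which is how a 4k window can "cover" a document far larger than 4k tokens.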


Yeah, I get that; after all, attention is all you need. But unless you want to spend a bunch of money on the 32k-context version, I don't think there are other options than embeddings and an index.


What is the state of the art for running models locally in terms of context size?


I think the local-model SOTA is LLaMA, which has a 2048-token context [1].

[1] https://github.com/facebookresearch/llama/issues/16


I've worked extensively in this space. For those looking for just an OCR solution, Microsoft's "Read" offering is by far the most accurate. Key-value, table, and other information extraction is a much harder problem. Anything that can go wrong in production will: documents with extra pages, rotated, blacked out, fuzzy. Many steps go into making document extraction truly end-to-end.

The biggest enterprise users are processing thousands of pages a minute, which also turns document extraction into a distributed-systems scaling problem.


A few days ago, IBM announced a new OCR system[1]. Have you by chance compared it to Microsoft's offering? I'm currently looking for the best-in-class OCR solution for scanned PDF documents.

[1]: https://www.ibm.com/cloud/blog/exploring-ibms-new-optical-ch...


Call me biased, but I've learned over time that anything that comes out of the Watson team looks good only in PR statements but sucks in production - especially at tasks like OCR. YMMV.


We currently develop solutions in this area, and I believe isolated OCR is not the way to go. Things are moving rapidly toward end-to-end processing of documents with huge transformer models, and I also believe multi-modal GPT models will quickly win all use cases. If you're interested in working on this topic and are located in northern Germany, pop me a message.


I would like to extract text from approximately 2000 PDF files (machine-generated, not scanned) in which the layout can differ from file to file. Some have normal paragraphs, others two or even three columns. All contain tables, but I am not interested in those. Do you know a good (semi-)automatic solution for this?


This is a hard problem and will require an enterprise solution, unfortunately. If it's only 2000 PDFs, you might be better off outsourcing to an off-shore consulting agency to do it manually.


Thanks for the reply, good to know that!


Do you have any recommendations for OCR of receipts and grocery bills? I’ve dreamt of having a little app to analyse grocery spending and split bills among multiple people, but every time I checked, the state of receipt OCR was surprisingly bad for this…


Last I checked, I saw a grocery bill example using https://github.com/mindee/doctr and it was fairly accurate. Bear in mind that was last year; hopefully it has gotten even better, or there are other libraries now.


This is a really helpful find, thanks.

If there are any other libraries folks have seen out there like this, I’d love to try them out.


The PaddlePaddle project has nice models. They're not well documented, though, and can be hard to use, so proceed at your own risk. But it is popular.


I am using epap. It has pretty good OCR and you can export to CSV. https://apps.apple.com/de/app/epap-kassenbon-haushaltsbuch/i...


Do they have human workers in the loop for those hard-to-solve cases?


Yes, the solution I worked on had an interface for HITL (human-in-the-loop).


Isn't it depressing that we live in 2023 and the predominant document format is PDF, which was invented in 1993 and is optimized for printing? I would love a new format that is easily parseable (like JSON) AND printable (like PDF).


AsciiDoc, HTMLBook, and DocBook are the standards O'Reilly Media uses to create their books (and PDFs). No need to reinvent the wheel there.


At least PDF occasionally contains actual text. My organisation systematically scans everything to TIFF images for archival. So now we are embarking on a major project to OCR the TIFFs to get the text back (!).


Considering the data format hells I've had to deal with over the years, straightforward TIFF scans don't sound so bad, honestly.


My payroll statement is the same: an image wrapped in a PDF document.

I’m not sure if they’re being intentionally annoying or if someone thought this was actually helpful for the thousands of independent contractors who track their expenses down to the penny.


I think it’s depressing that we’re still thinking of content as containerised, as if it still had to be bound in a physical volume, instead of as addressable items of information, the way a computer naturally stores it.


I love this comment. It strikes at the heart of many things I have been vocal about for decades. At the same time, I could take the devil's-advocate approach and say: a computer naturally stores information on physical volumes, and since these have different address spaces, you will probably not get around this conundrum.

However, fundamentally I completely agree with you. The information we seek should not be bound to the medium it is stored on in this day and age. I wish we could get away from containerized knowledge, but it seems to me we are creating ever more virtual containers in which information is stored. I, for one, only get a glimpse of the vast amounts of information TikTok makes available to its users when it is posted on one of the few websites I visit.

I guess the reason we still think of information as being in books and on paper is that we are human, and it's hard to shed millennia of habits and institutions that have grown around us to accommodate our limited ability to grasp the universe.


Create one for us :D


I would pay for a simple, competent anything-to-markdown API. Something that could convert PDFs to high quality markdown with tables, etc. I'm using Document AI from Google right now and the ergonomics are awful.


If you just need to convert the files have you thought about using Zamzar (https://dev.zamzar.com/)?

We have a file conversion API that supports DOC/DOCX/ODT/PDF/TEX to Markdown conversion in one line of cURL (or your programming language of choice).

(Disclaimer: I'm the product lead for the Zamzar API).


Thanks, I'll check it out. What do you do with PDFs that lock text in images - are you using ML/OCR? And, as mentioned, tables?


Currently OCR support is limited to PDF-to-TXT conversion, but we're hoping to add support for other output formats at some point. Feel free to shoot me an email at chris [at] zamzar [dot] com if you'd like to chat further.


ABBYY FineReader is a fine human-in-the-loop solution. It can at least output PDF to ePub, and you can imagine going to Markdown from there.


https://pd3f.com/ is a good PDF converter for the use-cases I tested (academic papers).


Anything-to-markdown is broader than PDF-to-markdown.

Since PDFs are created in so many different ways, do you have examples or links to the PDFs that are awful?


How serendipitous - I was looking for something like this recently. Admittedly my use case is much simpler: detecting tables of contents in scanned PDFs, which usually don't have them as links for navigating within the document. I'll see if this could help. Is anyone using something else for my use case?


Not sure what your budget is, but I’ve used AWS for handling PDFs and it’s been pretty good at detecting content via bounding boxes.


AWS as in Amazon Web Services? And if so, can you be more specific?


Here are some more options:

* AWS Textract [0]

* Microsoft Azure [1]

* Google Cloud Vision [2]

I personally use Azure, combined with GPT-based OCR correction, to convert a scan of my daily journal (Apple Notes creates a PDF that is nothing but a bunch of images) -> Markdown -> extract tasks, and then add them to my Reminders app using CalDAV. Azure has one of the best OCR engines for handwritten text, but for normal document extraction (read: printed text), any of these services would do a reasonable job.

[0] https://docs.aws.amazon.com/prescriptive-guidance/latest/pat...

[1] https://learn.microsoft.com/en-us/azure/data-factory/solutio...

[2] https://cloud.google.com/vision/docs/pdf


Could you explain how you use GPT for OCR correction?


https://promptbase.com/prompt/ocr-text-fixer

This is the prompt I bought from PromptBase. You basically give GPT some examples of possible OCR errors, then give it the OCRed text, and it tries to correct it.
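The general shape of such a few-shot correction prompt is easy to sketch (this is a generic illustration, not the PromptBase prompt):

```python
def build_correction_prompt(ocr_text, examples):
    """Assemble a few-shot prompt: each example pairs a garbled OCR
    string with its corrected form, then the real text follows and the
    model is left to complete the final 'Fixed:' line."""
    lines = ["Fix OCR errors in the text. Examples:"]
    for bad, good in examples:
        lines.append(f"OCR: {bad}\nFixed: {good}")
    lines.append(f"OCR: {ocr_text}\nFixed:")
    return "\n\n".join(lines)
```

The examples prime the model on the kinds of substitutions your OCR engine actually makes (rn/m, 0/O, l/1, and so on), which works noticeably better than just asking it to "fix typos".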


Fascinating. Thanks!


I’ve used this repo. It’s OK. For very simple layouts it probably works fine; for more complex layouts it fails miserably. I’ve also had cases where it didn’t detect half the text on the page (machine-generated text).


Have you found anything better for complex layouts?


Interesting - I was just thinking an LLM would do a great job correcting OCR mistakes.


OCR engines may use an HMM (Hidden Markov Model) for OCR correction.


Yeah, but they don’t do semantic correction. LLMs are extraordinarily more powerful than HMMs.


Depends on context though.

An LLM is useless (or not as useful) for OCR on forms, where we are trying to extract the name "John Smith" from a name field; an HMM trained specifically on name fields may do a better job.

LLMs may perform better as a post processing step for running text (such as book pages).
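A toy illustration of the forms case, using stdlib fuzzy matching rather than an actual HMM; the point is just that a narrow, field-specific lexicon can resolve readings a general-purpose model has no basis for:

```python
import difflib

# A field-specific prior: the set of plausible values for this form field
# (in a real system, e.g. an employee roster or customer database).
KNOWN_NAMES = ["John Smith", "Jane Smith", "John Schmidt", "Joan Smythe"]


def correct_name_field(ocr_value, lexicon=KNOWN_NAMES, cutoff=0.6):
    """Snap a noisy OCR reading to the closest entry in a name lexicon,
    falling back to the raw reading if nothing is close enough."""
    matches = difflib.get_close_matches(ocr_value, lexicon, n=1, cutoff=cutoff)
    return matches[0] if matches else ocr_value
```

An LLM asked to "fix" a garbled name in isolation can only guess at something plausible; the lexicon knows which names can actually appear in that field.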


https://github.com/microsoft/table-transformer

This looks interesting as well; I haven't tested it yet.


I was just looking for OCR. How does this compare with easyOCR?


For what it's worth, very high-quality OCR from Google's Vision offering costs $0.0015 per page, with 1000 free pages per month. In my experience, it has been significantly superior to any open-source solution.


Why this over Document AI?


Thanks!



