What’s the state of the art for using ML to understand HTML docs automatically when scraping them? I’ve tinkered with this: https://huggingface.co/docs/transformers/model_doc/markuplm and it seems useful, but AI and ML are changing so fast right now — anyone know what else is going on?
An example use case: given a URL, figure out the content type (article vs. product page, for example); if it's a product page, automatically extract all product details and specs without manually mapping XPath or CSS selector paths.
Probably a URL, a title, and perhaps a bit of readable content would be enough for ChatGPT to classify the page type.
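As a sketch of what that classification step could look like — the label set, helper name, and example inputs below are all made up for illustration:

```python
# Hypothetical sketch: build a cheap page-type classification prompt from the
# URL, title, and a short readable snippet, to send to an LLM of your choice.

PAGE_TYPES = ["article", "product page", "category listing", "other"]

def build_classification_prompt(url: str, title: str, snippet: str,
                                max_snippet_chars: int = 500) -> str:
    """Assemble a prompt asking the model to pick one label from PAGE_TYPES."""
    return (
        "Classify the web page below as one of: "
        + ", ".join(PAGE_TYPES) + ".\n"
        "Answer with the label only.\n\n"
        f"URL: {url}\n"
        f"Title: {title}\n"
        f"Content snippet: {snippet[:max_snippet_chars]}\n"
    )

prompt = build_classification_prompt(
    "https://example.com/shop/widget-3000",
    "Widget 3000 - Buy Online",
    "The Widget 3000 features a 2.4 GHz processor...",
)
```

The model's one-word answer then decides which extraction path to take.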
There was a Show HN here a couple of days ago that used ChatGPT to create a scraper based on a webpage. Hook the two together and you're basically there!
I keep bumping into the context window size. I'm trying to figure out a "compression" step that I can use in the general case, but nothing's been very satisfying so far.
The mozilla/readability library is a good first step though.
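For a rough idea of what that first step does, here's a minimal stdlib-only sketch — readability's real heuristics (link density, content scoring, etc.) are much smarter, this just shows the "keep content tags, drop chrome" idea:

```python
from html.parser import HTMLParser

# Toy boilerplate stripper: keep text inside content-ish tags, drop
# script/style/nav chrome. Tag sets here are illustrative guesses.

class ContentExtractor(HTMLParser):
    SKIP = {"script", "style", "nav", "header", "footer", "aside"}
    KEEP = {"p", "h1", "h2", "h3", "li", "article"}

    def __init__(self):
        super().__init__()
        self.skip_depth = 0   # inside how many SKIP tags we currently are
        self.keep_depth = 0   # inside how many KEEP tags we currently are
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self.skip_depth += 1
        elif tag in self.KEEP:
            self.keep_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self.skip_depth:
            self.skip_depth -= 1
        elif tag in self.KEEP and self.keep_depth:
            self.keep_depth -= 1

    def handle_data(self, data):
        if not self.skip_depth and self.keep_depth and data.strip():
            self.chunks.append(data.strip())

def extract_text(html: str) -> str:
    parser = ContentExtractor()
    parser.feed(html)
    return "\n".join(parser.chunks)
```

Even this naive version shrinks a typical page a lot before it hits the prompt.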
Yeah, I have the same issue with context window size. Personally I'm just waiting for future LLMs with a 10x-100x context window. However, someone else recently came up with this solution:
I feel like the next step is to quantize or otherwise downsize old tokens so that more can fit in memory at once. Not sure what the implications of a mixed-float-size model would be.
Imagine a browser plugin that pops up a modal. I write a query related to the current page, eg "Please summarize this page in three paragraphs, and translate to Turkish" or "There's a recipe somewhere on this page. Please suggest some variations on the filling" or "Can you make me a list of all the people mentioned on this page". Whole page (or at least the meat of it) gets bundled up with the query and sent to openai. What I'm trying to build is a simple in-browser swiss army knife.
Yes, I could try to figure out which bits of the page need to be sent along with the prompt, but that's hard in the general case. Squeezing a bit more out of the prompt window by stripping out unnecessary boilerplate is easy by comparison. (Multi-page articles are another headache).
Due to the constrained context window, this is indeed a problem. But I'd say solving it just by increasing the context window would really be brute-forcing the issue; I hope we can come up with something better. I'm betting on embeddings for these kinds of things in my personal projects, but that too seems like a jackhammer-for-a-nail kind of thing for single web pages.
Compression so the input data fits the context, maybe? For example, if the context is 4096 tokens and the input is 6000, figuring out the appropriate way to run an operation on all 6000 when context from anywhere in those 6000 might be relevant to the operation.
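One common workaround is overlapping windows: split the input into chunks that each fit the budget, with some overlap so context isn't lost at the seams. A sketch (token counts approximated by whitespace words here; a real tokenizer counts differently):

```python
# Hypothetical sliding-window chunker: split oversized input into overlapping
# windows so each fits the context budget. "window" and "overlap" are in
# words as a stand-in for tokens.

def chunk_words(text: str, window: int = 4096, overlap: int = 256):
    words = text.split()
    step = window - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + window]))
        if start + window >= len(words):
            break  # last window already covers the tail
    return chunks

# 6000-word input against a 4096-word window yields two overlapping chunks.
chunks = chunk_words(" ".join(str(i) for i in range(6000)))
```

You then run the operation per chunk and merge the results, which works for extraction-style tasks but not for questions that genuinely need the whole document at once.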
Maybe I'm not getting it, but I see this as an indexing problem. The goal shouldn't be to fit the entire document in the prompt; we should include the relevant parts of the doc when we query it.
Embedding chunks and retrieving them by similarity is definitely in use now. But if you can increase context size cheaply, then the model can figure out what's relevant.
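The embed-and-retrieve approach in miniature — a real system would use learned embeddings (e.g. a sentence-transformer model); a bag-of-words count stands in here just so the example is self-contained:

```python
import math
from collections import Counter

# Toy "embedding": word counts. Real pipelines swap this for a model that
# maps text to dense vectors; the retrieval logic stays the same.

def embed(text: str) -> Counter:
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def top_chunks(query: str, chunks: list, k: int = 2) -> list:
    q = embed(query)
    ranked = sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)
    return ranked[:k]

chunks = [
    "Shipping and returns policy for all orders",
    "The widget weighs 2 kg and runs at 2.4 GHz",
    "About our company history and mission",
]
best = top_chunks("how much does the widget weigh", chunks, k=1)
```

Only the top-scoring chunks go into the prompt, which is how you stay under the window without truncating blindly.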
Yeah, I get that — after all, attention is all you need. But unless you want to spend a bunch of money on the 32k-context version, I don't think there are other options than embeddings and an index.
I've worked extensively in this space. For those looking for just an OCR solution, Microsoft's "Read" offering is by far the most accurate. Key-value, table, and other information extraction is a much harder problem. Anything that can go wrong in production will: documents with extra pages, rotated, blacked out, fuzzy. Many steps go into making document extraction truly end-to-end.
The biggest enterprise users are doing thousands of pages a minute, which also turns document extraction into a scaling distributed-systems problem.
A few days ago, IBM announced a new OCR system[1]. Have you by chance compared it to Microsoft's offering? I'm currently looking for the best-in-class OCR solution for scanned PDF documents.
Call me biased, but I've learned over time that anything that comes out of the Watson team looks good only in PR statements but sucks in production — especially at tasks like OCR. YMMV.
We currently develop solutions in this area, and I believe isolated OCR is not the way to go. Things are moving rapidly towards end-to-end processing of documents with huge transformer models, and I also believe multi-modal GPT models will quickly win all use cases.
If you're interested in working on this topic and are located in northern Germany, pop me a message.
I would like to extract text from approximately 2000 PDF files (machine generated, not scanned) in which the layout can differ from file to file. Some have normal paragraphs, others two or even three columns. All contain tables, but I am not interested in those. Do you know of a good (semi-)automatic solution for this?
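For machine-generated PDFs, one semi-automatic angle is to extract positioned text blocks with a library such as PyMuPDF (`page.get_text("blocks")` gives coordinates plus text) and then linearize multi-column layouts yourself. A sketch of that ordering step — the block data and the `column_gap` threshold are hypothetical:

```python
# Hypothetical column-aware ordering: cluster text blocks into columns by
# x position, then read each column top to bottom. Input blocks are
# (x0, y0, text) tuples, e.g. from a PDF library's block extraction.

def order_blocks(blocks, column_gap=50):
    columns = []  # each entry: [representative_x, [(y0, text), ...]]
    for x0, y0, text in sorted(blocks, key=lambda b: b[0]):
        for col in columns:
            if abs(col[0] - x0) < column_gap:
                col[1].append((y0, text))  # same column
                break
        else:
            columns.append([x0, [(y0, text)]])  # new column
    out = []
    for _, items in sorted(columns, key=lambda c: c[0]):
        out.extend(t for _, t in sorted(items))  # top-to-bottom within column
    return out

# Two-column toy page: left column should be read before the right one.
blocks = [(300, 100, "right top"), (50, 100, "left top"),
          (50, 200, "left bottom"), (300, 200, "right bottom")]
```

Filtering out the tables is the harder part; one crude heuristic is to drop blocks whose text is mostly numbers and short cells, but expect to tune per layout.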
This is a hard problem and will require an enterprise solution, unfortunately. If it's only 2000 PDFs, you might be better off outsourcing to an off-shore consulting agency to do it manually.
Do you have any recommendations for OCR of receipts and grocery bills? I’ve dreamt of having a little app to analyse grocery spending and distribute bills among multiple people, but every time I checked, the state of receipt OCR was surprisingly too bad for this…
Last I checked, I saw a grocery bill example using https://github.com/mindee/doctr and it was fairly accurate. Bear in mind that was last year; hopefully it's gotten even better, or there are other libraries now.
Isn't it depressing that we live in 2023 and the predominant document format is PDF, which was invented in 1993 and is optimized for printing? I would love to have a new format that is easily parseable (like JSON) AND printable (like PDF).
At least PDF occasionally contains actual text. My organisation systematically scans everything to TIFF images for archival. So now we are embarking on a major project to OCR the TIFFs to get the text back (!).
My payroll statement is the same: an image wrapped in a PDF document.
I’m not sure if they’re being intentionally annoying or if someone thought this was actually helpful for the thousands of independent contractors who track their expenses down to the penny?
I think it’s depressing that we’re still thinking of content as containerised, as if it still had to be bound in a physical volume, instead of as addressable items of information, the way a computer naturally stores information.
I love this comment. It strikes at the heart of many things that I have been vocal about for decades. At the same time, I could take the devil's advocate approach and say: a computer naturally stores information on physical volumes, and since these have different address spaces, you will probably not get around this conundrum.
However, fundamentally I completely agree with you. The information we seek should not be bound to the medium it is stored on in this day and age. I wish we could get out of containerized knowledge, but it seems to me we are creating ever more virtual containers in which information is stored. I, for one, only get a glimpse of the vast amounts of information TikTok is making available to its users when it is posted on one of the few websites I visit.
I guess the reason we still think of information as being in books and on paper is that we are human, and it's hard to shed millennia of habits and institutions that have grown around us to accommodate our limited ability to grasp the universe.
I would pay for a simple, competent anything-to-markdown API. Something that could convert PDFs to high quality markdown with tables, etc. I'm using Document AI from Google right now and the ergonomics are awful.
Currently OCR support is limited to PDF > TXT conversion but we're hoping to add support for other output formats at some point. Feel free to shoot me an email at chris [at] zamzar [dot] com if you'd like to chat further.
How serendipitous — I was looking for something like that recently. Admittedly my use case was much simpler: detecting tables of contents in scanned PDFs, which usually aren't links you can use to navigate within the document. Will see if this could help. Is anyone using something else for my use case?
I personally use Azure, combined with OCR correction using GPT to convert a scan of my daily journal (Apple Notes creates a PDF that is nothing but a bunch of images) -> Markdown -> Extract tasks and then add them to my Reminders app using CalDav. Azure has one of the best OCR for handwritten text, but for normal document extraction (read: printed text), any service would do a reasonable job.
This is the prompt I bought from PromptBase. You basically provide GPT with some examples of possible OCR errors, then give it the OCRed text and it tries to correct them.
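The shape of such a prompt is easy to reconstruct — a few-shot list of (bad OCR, corrected) pairs followed by the real OCR output. The example pairs and function name below are made up:

```python
# Hypothetical few-shot OCR-correction prompt builder. The model sees
# typical OCR confusions (0/O, 1/l, comma/period) before the real text.

OCR_EXAMPLES = [
    ("Tbe qulck brown f0x", "The quick brown fox"),
    ("M1lk 2L ...... $3,49", "Milk 2L ...... $3.49"),
]

def build_correction_prompt(ocr_text: str) -> str:
    lines = ["Correct the OCR errors in the text below. "
             "Examples of typical mistakes:"]
    for bad, good in OCR_EXAMPLES:
        lines.append(f"OCR: {bad}\nCorrected: {good}")
    lines.append(f"OCR: {ocr_text}\nCorrected:")
    return "\n\n".join(lines)
```

For receipts specifically, seeding the examples with price and quantity errors (the kind shown above) tends to matter most.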
I’ve used this repo. It’s ok. For very simple layouts, it probably works fine. For more complex layouts it fails miserably. I’ve also had cases where it didn’t detect half the text on the page (machine generated text).
An LLM is useless (or not as useful) for OCR of forms where we are trying to extract the name "John Smith" from a name field, whereas an HMM trained specifically on name fields may do a better job.
LLMs may perform better as a post processing step for running text (such as book pages).
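To make the HMM point concrete, here is a minimal Viterbi sketch that tags tokens as NAME or OTHER. All probabilities are hand-set toy numbers; a real model would estimate them from labeled name-field data:

```python
# Toy HMM tagger: two states, hand-set transition/emission probabilities,
# standard Viterbi decoding. Illustrative only, not a trained model.

STATES = ("NAME", "OTHER")
START = {"NAME": 0.2, "OTHER": 0.8}
TRANS = {"NAME": {"NAME": 0.6, "OTHER": 0.4},
         "OTHER": {"NAME": 0.4, "OTHER": 0.6}}

def emit(state, word):
    """Toy emission model: capitalized alphabetic words look name-like."""
    namelike = word[:1].isupper() and word.isalpha()
    if state == "NAME":
        return 0.8 if namelike else 0.05
    return 0.2 if namelike else 0.5

def viterbi(tokens):
    V = [{s: START[s] * emit(s, tokens[0]) for s in STATES}]
    paths = {s: [s] for s in STATES}
    for t in range(1, len(tokens)):
        V.append({})
        new_paths = {}
        for s in STATES:
            prob, prev = max(
                (V[t - 1][p] * TRANS[p][s] * emit(s, tokens[t]), p)
                for p in STATES)
            V[t][s] = prob
            new_paths[s] = paths[prev] + [s]
        paths = new_paths
    best = max(STATES, key=lambda s: V[-1][s])
    return paths[best]

tags = viterbi(["Name:", "John", "Smith"])
```

The field label gets tagged OTHER and the two capitalized words NAME, which is the structure a name-field extractor needs and a generic LLM has no particular bias toward.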
For what it's worth, very high quality OCR from Google's Vision offering costs $0.0015 per page, with 1000 free pages per month. In my experience, it has been significantly superior to any open source solution.