Table Transformer: a model for extracting tables from unstructured documents (github.com/microsoft)
27 points by bpiche on April 26, 2023 | 5 comments


Something that surprised me is how good GPT 4 is at extracting tables from unstructured data, or data with weird or messed up formatting.

For example, I copy-pasted some text from a HTML page where each table cell ended up on its own line, like so:

    Enable feature?
    Yes
    Expand things?
    No
This could be fixed with some multi-line regex, except that it was interleaved with headings that messed up the key-value pairings.
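The naive line-pairing approach the comment describes can be sketched in a few lines of Python (the sample data is from the comment above; everything else is a hypothetical illustration). It also shows exactly why interleaved headings break it: a single unpaired heading line shifts every key/value pairing after it by one.

```python
# Naive approach: pair consecutive non-empty lines as (key, value).
raw = """Enable feature?
Yes
Expand things?
No"""

lines = [line.strip() for line in raw.splitlines() if line.strip()]

# zip even-indexed lines (keys) with odd-indexed lines (values);
# an interleaved heading would land in the wrong column and
# misalign everything below it.
pairs = dict(zip(lines[::2], lines[1::2]))
print(pairs)  # {'Enable feature?': 'Yes', 'Expand things?': 'No'}
```

Distinguishing a heading from a key requires context (typography, semantics), which is why this is easy for a human or an LLM but brittle for a regex.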

I fed this into GPT 4 and it correctly surmised which rows were headings, keys, and values. This is an easy task for a human, but a shocking thing to see a computer solve without years of programming effort put into solving this specific problem!

You've got to wonder how many of these single-purpose AI models like Table Transformer are going to be subsumed into LLMs. For example, Table Transformer comes with a bunch of labelled training data. Just point the variant of GPT 4 that has vision at the training data to make a tuned version. That should outperform a "small" model because it both understands what it's seeing and has been tuned on the special-purpose data set.

What I'm saying is that if you have a terabyte of specific training data and train a model on it, that's fine, but then you'll have a model with "1 TB of knowledge" at most. If you start with a pre-trained LLM with petabytes of knowledge crammed into it, then adding that 1 TB would give you the benefits of both, but the benefit of the petabyte vastly outstrips the extra terabyte!

With LLMs being quantized down to just a few gigabytes and able to run on mobile devices, I wonder if this is what the future of AI will look like. No more training models from scratch...


Indeed! I built a system just last year with - count em - three parsers to deal with PDF table extraction, including one built on TableTransformer. And then when GPT4 came out I just copy pasted a PDF into it as-is and darned if it didn’t do at least as good a job.

Now I can’t do this in earnest because of document privacy issues, but I’ve been diving down the rabbit hole of how small we can go and still get decent results. Spoiler: gpt2 is too small. :-)


If you were asked to extract lists or tables from HTML pages only, how would you go about it?

I was thinking: a) use the approach from TableTransformer to detect the structured data; b) use the MarkupLM model, maybe combined with TableTransformer; or c) find a way to work directly with GPT-4.
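For well-formed HTML specifically, it's worth noting there's a baseline below all three options: the table structure is already explicit in the markup, so Python's stdlib parser can recover it without any model. A minimal sketch (the `TableExtractor` class and sample markup are my own illustration, not from either project):

```python
from html.parser import HTMLParser

class TableExtractor(HTMLParser):
    """Collect every <table> in a page as a list of rows of cell text."""
    def __init__(self):
        super().__init__()
        self.tables = []   # one entry per <table>: list of rows
        self._row = None   # cells of the <tr> being parsed, else None
        self._cell = None  # text fragments of the <td>/<th> being parsed

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.tables.append([])
        elif tag == "tr":
            self._row = []
        elif tag in ("td", "th"):
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._row is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None and self.tables:
            self.tables[-1].append(self._row)
            self._row = None

html = ("<table><tr><th>Setting</th><th>Value</th></tr>"
        "<tr><td>Enable feature?</td><td>Yes</td></tr></table>")
parser = TableExtractor()
parser.feed(html)
print(parser.tables[0])
# [['Setting', 'Value'], ['Enable feature?', 'Yes']]
```

Models like TableTransformer or MarkupLM earn their keep on the messy cases this can't handle: tables faked with `<div>`s, rowspans/colspans, or layout tables mixed in with data tables.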


We probably shouldn't have both this and https://news.ycombinator.com/item?id=35719937 on the front page. Which is more interesting?

Edit: the other project seems to have had updates within the last 24 hours so I'm going to guess that that submission is the one that should 'win'. It of course would be fine to link to this other project from the comments there.


Neat! Microsoft has been trying to mainstream stuff like this recently: Excel now has an "import table from image" button and Powertoys has a built-in OCR widget.



