Why LLMs suck at OCR (Pulse AI Blog)
Original link: Why LLMs suck at OCR (Pulse AI Blog)
Related: Ingesting Millions of PDFs and why Gemini 2.0 Changes Everything
My notes
This week, there was a viral blog post about using Gemini 2.0 for complex PDF parsing, leading many to the same hypothesis we arrived at nearly a year ago. Data ingestion is a multistep pipeline, and maintaining confidence in these nondeterministic outputs across millions of pages is a real problem.
When an LLM processes a document image, it first cuts the page into patches and embeds them into a high-dimensional vector space; the attention layers then operate only on those patch embeddings. This transformation is lossy by design.
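To make the lossiness concrete, here is a minimal sketch of ViT-style image preprocessing in Python. Every number in it (224x224 model input, 16x16 patches, 768-dim embeddings) and the random projection matrix are my own illustrative assumptions, not Pulse AI's or any particular model's pipeline:

```python
import numpy as np

def embed_page(page, size=224, patch=16, dim=768, seed=0):
    # Nearest-neighbour resize down to the model's fixed input resolution.
    # A 3000x2000 scan collapses to 224x224 here, so small glyphs and thin
    # table rules are discarded before any attention happens.
    ys = np.arange(size) * page.shape[0] // size
    xs = np.arange(size) * page.shape[1] // size
    small = page[np.ix_(ys, xs)]

    # Cut into non-overlapping 16x16 patches: (224, 224) -> (196, 256).
    n = size // patch
    patches = (small.reshape(n, patch, n, patch)
                    .transpose(0, 2, 1, 3)
                    .reshape(n * n, patch * patch))

    # Random stand-in for the learned linear projection into embedding space.
    W = np.random.default_rng(seed).standard_normal((patch * patch, dim))
    return patches @ W  # (196, 768): all the attention layers ever see

page = np.random.rand(3000, 2000)  # a full-resolution scanned page
tokens = embed_page(page)
print(f"{page.size:,} pixels -> {tokens.shape} patch embeddings")
```

A six-megapixel scan collapses into 196 vectors before a single attention layer runs; whatever the resize discarded (small fonts, thin rules, cell borders) is unrecoverable downstream.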
As impressive as LLM results appear to be, I can't help but wonder why this is even the direction we are headed in. Shouldn't traditional computer vision be a better option, given how spatially oriented PDFs are?
I have tried NanoNets before. I don't know what their tech stack is, but it seems to be the most reliable solution out there right now. Pulse AI might be just as good, but I can't see a free trial on their website, so I can't try it.
A few months ago I investigated CSV cleaning with LLMs and, good lord, that is an unsolved problem. I thought it would be an easy one to solve, but it turns out that containing the LLM within guardrails isn't that easy (see the sketch below).
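For what "guardrails" might look like in practice, here is a minimal sketch: re-parse and validate whatever the model returns, and fall back to the untouched row on any violation. `llm_clean_row`, the expected column count, and the numeric-column check are all hypothetical stand-ins I made up for illustration:

```python
import csv
import io

EXPECTED_COLS = 4

def llm_clean_row(raw: str) -> str:
    """Hypothetical stand-in for a model call that rewrites a broken row."""
    return raw

def validate(row: list[str]) -> bool:
    # Guardrail 1: the model must not add or drop columns.
    if len(row) != EXPECTED_COLS:
        return False
    # Guardrail 2: a column we know is numeric must stay numeric.
    try:
        float(row[2])
    except ValueError:
        return False
    return True

def clean_csv(text: str) -> list[list[str]]:
    cleaned = []
    for raw in text.splitlines():
        fixed = llm_clean_row(raw)
        parsed = next(csv.reader(io.StringIO(fixed)), [])
        original = next(csv.reader(io.StringIO(raw)), [])
        # Accept the model's row only if it survives validation.
        cleaned.append(parsed if validate(parsed) else original)
    return cleaned

print(clean_csv("1,widget,9.99,3\n2,gadget,oops,5"))
```

The point of the fallback is that the model can only ever improve a row, never silently corrupt one, and that property is surprisingly hard to keep once the cleaning logic lives inside a prompt.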
I am sure someone, somewhere, will solve both of these problems within another year or so. That would be fun: data cleaning would finally be a solved problem.