Llama-OCR: Document to Markdown

lapnect | 293 points | 7mon ago | llamaocr.com

nutlope|7mon ago

Hi all, I'm the author of llama-ocr. Thank you for sharing & for the kind comments! I built this earlier this week since I wanted a simple API to do OCR – it uses Llama 3.2 Vision (hosted on together.ai, where I work) to parse images into structured markdown. I also have it available as an npm package.

Planning to add a bunch of other features like the ability to parse PDFs, output a response in JSON, etc. If anyone has any questions, feel free to send them and I'll try to respond!

nh2|7mon ago

I put in a bill that has 3 identical line items and it didn't include them as 3 bullet points as usual, but generated a table with a "quantity" column that doesn't exist on the original paper.

Is this degree of transformation expected/desirable?

(It also means that the output is sometimes a bullet point list, sometimes a table, making further automatic processing a bit harder.)
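One way to cope with the list-or-table variation downstream is to normalize both shapes into rows before further processing. A minimal sketch, assuming the output is plain Markdown that uses either `-`/`*`/`+` bullets or a pipe table; the function name `extract_items` is hypothetical and not part of llama-ocr:

```python
import re


def extract_items(markdown: str) -> list[list[str]]:
    """Return one list of cell strings per item, for bullet lists or pipe tables."""
    rows: list[list[str]] = []
    for raw in markdown.splitlines():
        line = raw.strip()
        if not line:
            continue
        # Bullet item: "- text", "* text" or "+ text" becomes a single-cell row.
        bullet = re.match(r"^[-*+]\s+(.*)$", line)
        if bullet:
            rows.append([bullet.group(1).strip()])
            continue
        # Table row: split on pipes and trim each cell.
        if line.startswith("|"):
            cells = [c.strip() for c in line.strip("|").split("|")]
            # A separator row such as |---|:--:| marks the row before it as the header.
            if all(re.fullmatch(r":?-+:?", c) for c in cells):
                if rows:
                    rows.pop()  # drop the header row, keep only data rows
                continue
            rows.append(cells)
    return rows
```

This only unifies the shape of the output. It cannot undo an invented column such as "quantity", so the cell contents still need checking against the original document.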

zainia|7mon ago

Here's the prompt being used, tweaking that might help: https://github.com/Nutlope/llama-ocr/blob/main/src/index.ts#...

rch|7mon ago

I've had trouble with pulling scientific content out of poster PDFs, mostly because e.g. nougat falls apart with different layouts.

Have you considered that usage yet?

Szpadel|7mon ago

> Need an example image? Try ours.

Great idea, I wish more services had a similar feature.

gcr|7mon ago

How accurate is this?

When compared with existing OCR systems, what sorts of mistakes does it make?

Curiositry|7mon ago

Option to use a local LLM?

Eisenstein|7mon ago

I made a script which does exactly the same thing but locally using koboldcpp for inference. It downloads MiniCPM-V 2.6 with image projector the first time you run it. If you want to use a different model you can, but you will want to edit the instruct template to match.

* https://github.com/jabberjabberjabber/LLMOCR

nirav72|7mon ago

MiniCPM-V 2.6 is probably the best self-hosted vision model I have used so far. Not just for OCR, but also image analysis. I have it set up so that my NVR (Frigate) sends a couple of images, on a motion alert from a driveway security camera, to Ollama with minicpm-v 2.6. I’m able to get a reasonably accurate description of the vehicle that pulled into the driveway, including a description of the person who exits the vehicle and the license plate. All sent to my phone.
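A minimal sketch of the "send an image to Ollama" step, not the commenter's actual setup. It assumes a local Ollama server on its default port and a model pulled under the tag `minicpm-v`; the function names and the prompt text are hypothetical:

```python
import base64
import json
import urllib.request


def build_ollama_request(image_bytes: bytes, prompt: str, model: str = "minicpm-v") -> dict:
    """Build the JSON body for Ollama's /api/generate endpoint."""
    return {
        "model": model,  # assumed model tag; use whatever tag you pulled
        "prompt": prompt,
        # Ollama expects images as base64-encoded strings.
        "images": [base64.b64encode(image_bytes).decode("ascii")],
        "stream": False,  # ask for one JSON object instead of a stream
    }


def describe_image(image_bytes: bytes, prompt: str,
                   url: str = "http://localhost:11434/api/generate") -> str:
    """POST the request to a local Ollama server and return the generated text."""
    body = json.dumps(build_ollama_request(image_bytes, prompt)).encode("utf-8")
    req = urllib.request.Request(url, data=body,
                                 headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]
```

With a server running, something like `describe_image(open("snapshot.jpg", "rb").read(), "Describe the vehicle and read the license plate.")` would return the description as a string; `snapshot.jpg` is a placeholder file name.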

timmattison|7mon ago

I love this. Can you share the source?

Eisenstein|7mon ago

All it does is send the image to Llama 3.2 Vision and ask it to read the text.

Note that this is just as open to hallucination as any other LLM output, because it is not reading the pixels looking for text characters but describing the picture, using the images it was trained on and their captions to determine what the text is. It may completely make up words, especially if it can't read them.

M4v3R|7mon ago

This is also true for any other OCR system, we just never called these errors “hallucinations” in this context.

geysersam|7mon ago

I gave this tool a picture of a restaurant menu and it made up several additional entries that didn't exist in the picture... What other OCR system would do that?

noduerme|7mon ago

No, it's not even close to OCR systems, which are based on analyzing points in a grid for each character stroke and comparing them with known characters. Just for one thing, OCR systems are deterministic. Deterministic. Look it up.
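A toy illustration of the grid-comparison idea described in this comment, not how any particular OCR engine works. The 3x3 templates and function names are made up for the example:

```python
# Hypothetical 3x3 bitmaps for three characters ("1" = ink, "0" = blank).
TEMPLATES = {
    "I": ("010", "010", "010"),
    "L": ("100", "100", "111"),
    "T": ("111", "010", "010"),
}


def hamming(a: tuple[str, ...], b: tuple[str, ...]) -> int:
    """Count the grid cells where two bitmaps differ."""
    return sum(x != y for row_a, row_b in zip(a, b) for x, y in zip(row_a, row_b))


def classify(glyph: tuple[str, ...]) -> str:
    """Return the known character whose template differs from the glyph in the fewest cells."""
    # Iterating over sorted keys makes tie-breaking fixed, so the same
    # input always yields the same output.
    return min(sorted(TEMPLATES), key=lambda ch: hamming(glyph, TEMPLATES[ch]))
```

A matcher like this can pick the wrong character for a noisy glyph, but its output is always one character per glyph drawn from the known set, and repeated runs on the same input agree.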

visarga|7mon ago

OCR systems use vision models, and as such they can make mistakes. They don't sample, but like LLMs they produce a probability distribution over words.
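The difference between picking from a distribution deterministically and sampling from it can be sketched as follows. The candidate words and scores are invented for illustration and do not come from any real model:

```python
import math
import random


def softmax(scores: dict[str, float]) -> dict[str, float]:
    """Turn raw scores into a probability distribution over candidates."""
    top = max(scores.values())
    exps = {word: math.exp(s - top) for word, s in scores.items()}
    total = sum(exps.values())
    return {word: e / total for word, e in exps.items()}


def greedy(probs: dict[str, float]) -> str:
    """Deterministic choice: always the most probable candidate."""
    return max(probs, key=probs.get)


def sample(probs: dict[str, float], rng: random.Random) -> str:
    """Stochastic choice: draw one candidate in proportion to its probability."""
    return rng.choices(list(probs), weights=list(probs.values()), k=1)[0]


# Hypothetical scores for one blurry word in an image.
probs = softmax({"menu": 2.0, "memo": 1.5, "venue": 0.2})
```

Calling `greedy(probs)` returns the same word every time, while repeated calls to `sample(probs, random.Random())` can return different words for the same input. Both can be wrong; only the second varies from run to run.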

alex_suzuki|7mon ago

One of my worries for the coming years is that people will forget what deterministic actually means. It terrifies me!

noduerme|6mon ago

Not to get real dark and philosophical (but here goes): it took somewhere around 150,000 years for humans to go from spoken language to writing. And almost all of those words were irrational. From there to understanding and encoding what is or isn't provable, or is or isn't logically deterministic, took the last few hundred years.

And people who have been steeped in looking at the world through that lens (whether you deal with pure math or need to understand, e.g. by running a casino, what is not deterministic, so as to add it to your understanding of volatility and risk) are able to identify very quickly which factors in any scenario are deterministic and which are not. One could almost say that this ability to discern logic from fuzz is the crowning achievement of science and civilization, and the main adaptation conferred upon some humans since speech.

Unfortunately, it is very recent, and it's still an open question as to whether it's an evolutionary advantage to be able to tell the difference between magic and process. And yeah, it's scary to imagine a world where people can't; but that was practically the whole world a few centuries ago, and it wouldn't be terribly surprising if humanity regressed to that as they stopped understanding how to make tools and most people began treating tools like magic again. Sad time to be alive.

llm_trw|7mon ago

It really isn't, since those systems are character-based.

8n4vidtmkvmk|7mon ago

OCR tools sometimes make errors, but they don't make things up. There's a difference.