Mistral OCR

littlemerman | 1756 points | 3mon ago | mistral.ai

vikp|3mon ago

I ran a partial benchmark against marker - https://github.com/VikParuchuri/marker .

Across 375 samples with an LLM as judge, Mistral scores 4.32 and Marker 4.41. Marker can run inference at between 20 and 120 pages per second on an H100.

You can see the samples here - https://huggingface.co/datasets/datalab-to/marker_comparison... .

The code for the benchmark is here - https://github.com/VikParuchuri/marker/tree/master/benchmark... . Will run a full benchmark soon.

Mistral OCR is an impressive model, but OCR is a hard problem, and there is a significant risk of hallucinations/missing text with LLMs.

lolinder|3mon ago

> with LLM as a judge

For anyone else interested, prompt is here [0]. The model used was gemini-2.0-flash-001.

Benchmarks are hard, and I understand the appeal of having something that seems vaguely deterministic rather than having a human in the loop, but I have a very hard time accepting any LLM-judged benchmarks at face value. This is doubly true when we're talking about something like OCR which, as you say, is a very hard problem for computers of any sort.

I'm assuming you've given this some thought—how did you arrive at using an LLM to benchmark OCR vs other LLMs? What limitations with your benchmark have you seen/are you aware of?

[0] https://github.com/VikParuchuri/marker/blob/master/benchmark...

themanmaran|3mon ago

We also ran an OCR benchmark with LLM as judge using structured outputs. You can check out the full methodology on the repo [1]. But the general idea is:

- Every document has ground truth text, a JSON schema, and the ground truth JSON.

- Run OCR on each document and pass the result to GPT-4o along with the JSON schema.

- Compare the predicted JSON against the ground truth JSON for accuracy.

In our benchmark, ground truth text => GPT-4o gave 99.7%+ accuracy, meaning that whenever GPT-4o was given the correct text, it could extract the structured JSON values ~100% of the time. So if we pass in the OCR text from Mistral and it scores 70%, the inaccuracies are isolated to OCR errors.

[1] https://github.com/getomni-ai/benchmark
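
For illustration, the comparison step can be as simple as counting matching leaf values. A minimal sketch (my assumption of a reasonable implementation; the repo's actual metric may differ):

  def flatten(obj, prefix=""):
      # Flatten nested JSON into {dotted.path: leaf value} pairs.
      items = {}
      if isinstance(obj, dict):
          for k, v in obj.items():
              items.update(flatten(v, f"{prefix}{k}."))
      elif isinstance(obj, list):
          for i, v in enumerate(obj):
              items.update(flatten(v, f"{prefix}{i}."))
      else:
          items[prefix.rstrip(".")] = obj
      return items

  def json_accuracy(predicted: dict, ground_truth: dict) -> float:
      # Fraction of ground-truth leaf values the prediction got right.
      truth = flatten(ground_truth)
      pred = flatten(predicted)
      if not truth:
          return 1.0
      correct = sum(1 for path, value in truth.items() if pred.get(path) == value)
      return correct / len(truth)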

cdolan|3mon ago

Were you guys able to finish running the benchmark with Mistral, and did it get a 70% score? I missed that.

Edit - I see it on the Benchmark page now. Woof, low 70% scores in some areas!

https://getomni.ai/ocr-benchmark

themanmaran|3mon ago

Yup, surprising results! We were able to dig in a bit more. The main culprit is the overzealous "image extraction": if Mistral classifies something as an image, it will replace the entire section with (image)[image_002].

And it happened with a lot of full documents as well. Ex: most receipts got classified as images, and so it didn't extract any text.

cdolan|3mon ago

This sounds like a real hurdle for North American (US/CAN in particular) invoice and receipt processing?

lingjiekong|3mon ago

Where did you find this, regarding "if Mistral classifies something as an image, it will replace the entire section with (image)[image_002]"?

culi|3mon ago

themanmaran works at Omni so presumably they have access to the actual resulting data from this study

someothherguyy|3mon ago

Wouldn't that just bias itself to the shape of the text extracted from the OCR against the shape of the raw text alone? It doesn't seem like it would be a great benchmark for estimating semantic accuracy?

vikp|3mon ago

Benchmarking is hard for markdown because of the slight formatting variations between different providers. With HTML, you can use something like TEDS (although there are issues with this, too), but with markdown, you don't have a great notion of structure, so you're left with edit distance.

I think blockwise edit distance is better than full page (find the ground truth blocks, then infer each block separately and compare), but many providers only do well on full pages, which doesn't make it fair.

There are a few different benchmark types in the marker repo:

  - Heuristic (edit distance by block with an ordering score)
  - LLM judging against a rubric
  - LLM win rate (compare two samples from different providers)

None of these are perfect, but LLM against a rubric has matched visual inspection the best so far.

I'll continue to iterate on the benchmarks. It may be possible to do a TEDS-like metric for markdown. Training a model on the output and then benchmarking could also be interesting, but it gets away from measuring pure extraction quality (the model benchmarking better is only somewhat correlated with better parse quality). I haven't seen any great benchmarking of markdown quality, even at research labs - it's an open problem.
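
For concreteness, a blockwise metric along those lines might look like the sketch below (my own illustration, not Marker's actual code):

  from difflib import SequenceMatcher

  def similarity(a: str, b: str) -> float:
      # Normalized similarity in [0, 1]; a cheap stand-in for edit distance.
      return SequenceMatcher(None, a, b).ratio()

  def blockwise_score(truth_blocks: list[str], pred_blocks: list[str]) -> dict:
      # Match each ground-truth block to its most similar predicted block,
      # then score text fidelity and whether block order was preserved.
      matches = [
          max(range(len(pred_blocks)), key=lambda i: similarity(t, pred_blocks[i]))
          for t in truth_blocks
      ]
      text_score = sum(
          similarity(t, pred_blocks[i]) for t, i in zip(truth_blocks, matches)
      ) / len(truth_blocks)
      # Ordering score: fraction of adjacent matched indices in reading order.
      pairs = list(zip(matches, matches[1:]))
      order_score = sum(a <= b for a, b in pairs) / len(pairs) if pairs else 1.0
      return {"text": text_score, "order": order_score}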

arthurcolle|3mon ago

You can use structured outputs, or something like my https://arthurcolle--dynamic-schema.modal.run/ to extract real data from unstructured text (like that produced by an LLM), which makes benchmarks slightly easier if you have a schema.

cdolan|3mon ago

What is the project? It just returns a vanilla html page saying:

Dynamic Schema API API is running. See documentation for available endpoints.

arthurcolle|3mon ago

It's just a FastAPI app with endpoints I developed and deployed before OpenAI released structured outputs. It uses a custom grammar to enforce a Pydantic-like schema for chain-of-thought rollouts / structured data extraction from unstructured text. I also use it for a video-transcription knowledge-base generation API.

https://arthurcolle--dynamic-schema.modal.run/docs
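
For what it's worth, the same kind of schema-enforced extraction is now built into the openai SDK's structured outputs; a sketch (the Invoice schema and model name are illustrative):

  from openai import OpenAI
  from pydantic import BaseModel

  class Invoice(BaseModel):  # hypothetical schema for illustration
      vendor: str
      total: float
      date: str

  ocr_text = "ACME Corp -- Total: $42.00 -- 2024-01-15"  # stand-in OCR output

  client = OpenAI()
  completion = client.beta.chat.completions.parse(
      model="gpt-4o-2024-08-06",
      messages=[
          {"role": "system", "content": "Extract the invoice fields from the text."},
          {"role": "user", "content": ocr_text},
      ],
      response_format=Invoice,  # the SDK constrains output to the schema
  )
  invoice = completion.choices[0].message.parsed  # an Invoice instance (or None on refusal)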

carlgreene|3mon ago

Thank you for your work on Marker. It is the best OCR for PDFs I’ve found. The markdown conversion can get wonky with tables, but it still does better than anything else I’ve tried

vikp|3mon ago

Thanks for sharing! I'm training some models now that will hopefully improve this and more :)

netdevphoenix|3mon ago

LLM as a judge?

Isn't that a potential issue? You are assuming the LLM judge is reliable. What evidence do you have to assure yourself and/or others that this is a reasonable assumption?

bfors|3mon ago

Perhaps they already evaluated their LLM judge model (with another LLM)

ntkris|3mon ago

This is awesome. Have you seen / heard of any benchmarks where the data is actually a structured JSON vs. markdown?

ChrisRob|3mon ago

Thanks for the tip. Marker solved a table conversion without LLM that docling wasn't able to solve.

codelion|3mon ago

Really interesting benchmark, thanks for sharing! It's good to see some real-world comparisons. The hallucinations issue is definitely a key concern with LLM-based OCR, and it's important to quantify that risk. Looking forward to seeing the full benchmark results.

DeathArrow|3mon ago

>Mistral OCR is an impressive model, but OCR is a hard problem, and there is a significant risk of hallucinations/missing text with LLMs.

To fight hallucinations, can't we use more LLMs and pick blocks where the majority of LLMs agree?

boxed|3mon ago

Why wouldn't hallucinations be agreed upon if they have roughly the same training data?

TJSomething|3mon ago

A hallucination is often an indication that the model doesn't know something. Then, the internal signal gets dominated by noise from the seeded training weights. Efforts to eliminate hallucinations with a single model have found success by asking the same question in different ways and only taking answers that agree. Logically, you could get more durable results from multiple models on the same prompt.
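
A sketch of that voting idea (the helper is hypothetical, and exact string matching stands in for a fuzzier comparison):

  from collections import Counter

  def consensus_block(candidates: list[str], min_agreement: float = 0.5) -> str | None:
      # Given the same block OCR'd by several models, return the majority
      # transcription, or None if no version clears the agreement threshold.
      counts = Counter(c.strip() for c in candidates)
      text, votes = counts.most_common(1)[0]
      return text if votes / len(candidates) >= min_agreement else None

  # e.g. consensus_block([mistral_text, marker_text, tesseract_text])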

supriyo-biswas|3mon ago

We had this article the other day[1] about how multiple LLMs can hallucinate about the same thing, so this is not guaranteed to remove hallucinations that are caused by poor or insufficient training data.

[1] https://news.ycombinator.com/item?id=43222027

boxed|3mon ago

I don't see why any of that makes logical sense. These models require such enormous training data that they pretty much MUST use the same training data to a very large degree. The training data itself is what they spit out. So "hallucinations" are just the training data you get out, which is the entire point of the models in the first place. There is no difference between a hallucination and a correct answer from the perspective of the math.

neuronic|3mon ago

Isn't it just statistical word-pattern prediction based on training data? These models likely don't "know" anything anyway and cannot verify "truth" and facts. Reasoning attempts seem to me basically just looping until the model finds a self-satisfying equilibrium state with different output.

In that way, LLMs are more human than, say, a database or a book containing agreed-upon factual information which can be directly queried on demand.

Imagine if there was just ONE human with human limitations on the entire planet who was taught everything for a long time - how reliable do you think they are with information retrieval? Even highly trained individuals (e.g. professors) can get stuff wrong on their specific topics at times. But this is not what we expect and demand from computers.

stavros|3mon ago

I like the licensing options! Hopefully they make enough money to fund development.

bambax|3mon ago

It's not bad! But it still hallucinates. Here's an example of an (admittedly difficult) image:

https://i.imgur.com/jcwW5AG.jpeg

For the blocks in the center, it outputs:

> Claude, duc de Saint-Simon, pair et chevalier des ordres, gouverneur de Blaye, Senlis, etc., né le 16 août 1607 , 3 mai 1693 ; ép. 1○, le 26 septembre 1644, Diane - Henriette de Budos de Portes, morte le 2 décembre 1670; 2○, le 17 octobre 1672, Charlotte de l'Aubespine, morte le 6 octobre 1725.

This is perfect! But then the next one:

> Louis, commandeur de Malte, Louis de Fay Laurent bre 1644, Diane - Henriette de Budos de Portes, de Cressonsac. du Chastelet, mortilhomme aux gardes, 2 juin 1679.

This is really bad because

1/ a portion of the text of the previous block is repeated

2/ a portion of the next block is imported here where it shouldn't be ("Cressonsac"), as is a portion of the rightmost block ("Chastelet")

3/ but worst of all, a whole word is invented, "mortilhomme", which appears nowhere in the original. (The word doesn't exist in French, so in this case it would be easy to spot; but the risk is when invented words do exist and "feel right" in the context.)

(Correct text for the second block should be:

> Louis, commandeur de Malte, capitaine aux gardes, 2 juin 1679.)

layer8|3mon ago

> This is perfect!

Just a nit, but I wouldn’t call it perfect when using U+25CB ○ WHITE CIRCLE instead of what should be U+00BA º MASCULINE ORDINAL INDICATOR, or alternatively a superscript “o”. These are https://fr.wikipedia.org/wiki/Adverbe_ordinal#Premiers_adver....

There’s also extra spaces after the “1607” and around the hyphen in “Diane-Henriette”.

Lastly, U+2019 instead of U+0027 would be more appropriate for the apostrophe, all the more since in the image it looks like the former and not like the latter.

MatthiasPortzel|3mon ago

Slightly unrelated, but I once used Apple’s built-in OCR feature LiveText to copy a short string out of an image. It appeared to work, but I later realized it had copied “M” as U+041C (Cyrillic Capital Letter Em), causing a regex to fail to match. OCR producing visually identical characters is only good enough until it’s not.
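
One cheap guard against this class of error is to flag letters whose Unicode name doesn't match the script you expect; a standard-library sketch:

  import unicodedata

  def flag_unexpected_scripts(text, expected="LATIN"):
      # Report letters outside the expected script, e.g. a Cyrillic "М"
      # hiding in otherwise-Latin text.
      return [
          (i, ch, unicodedata.name(ch, "UNKNOWN"))
          for i, ch in enumerate(text)
          if ch.isalpha() and expected not in unicodedata.name(ch, "UNKNOWN")
      ]

  print(flag_unexpected_scripts("Мarker"))
  # [(0, 'М', 'CYRILLIC CAPITAL LETTER EM')]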

jorvi|3mon ago

> Just a nit, but I wouldn’t call it perfect when using U+25CB ○ WHITE CIRCLE instead of what should be U+00BA º MASCULINE ORDINAL INDICATOR, or alternatively a superscript “o”

Or degree symbol. Although it should be able to figure out which to use according to the context.

TeMPOraL|3mon ago

This is "reasoning model" stuff even for humans :).

layer8|3mon ago

There is OCR software that analyses which language is used, and then applies heuristics for the recognized language to steer the character recognition in terms of character sequence likelihoods and punctuation rules.

I don’t think you need a reasoning model for that, just better training; although conversely a reasoning model should hopefully notice the errors — though LLM tokenization might still throw a wrench into that.

raffraffraff|3mon ago

It feels like, after the OCR step, there should be language and subject-matter detection, with a final sweep by a spelling/grammar checker that has the right "dictionary" selected. (That, right there, is my naivety on the subject, but I would have thought that the type of problem you're describing isn't OCR but classical spelling and grammar checking?)
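
As a rough illustration of that final sweep, flagging (rather than silently auto-correcting) unknown words for review, assuming the pyspellchecker package:

  import re
  from spellchecker import SpellChecker  # pip install pyspellchecker

  def flag_suspect_words(ocr_text: str, language: str = "en") -> set[str]:
      # Words the selected dictionary doesn't know; candidates for review.
      spell = SpellChecker(language=language)
      words = re.findall(r"[A-Za-z']+", ocr_text)
      return spell.unknown(words)

  print(flag_suspect_words("a whole word is invented, mortilhomme"))
  # {'mortilhomme'}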

layer8|3mon ago

It’s OCR because the wrong characters are being recognized. This is not about fixing spelling or punctuation mistakes present in the source image, it’s that errors are being introduced, due to a lack of accuracy of this OCR with regard to punctuation and typography. The punctuation errors are not different in principle from the case of the OCR producing a misspelled word that wasn’t misspelled in the image being OCRed.

A subsequent cleanup pass that fixes grammar/spelling errors, as you propose, wouldn’t be appropriate when the goal is to faithfully reproduce the original text.

And specifically for the “white circle” character, it would be difficult to correctly infer the original ordinal markers after the fact. I myself could only do so by inspecting the original image, i.e. by having my brain redo the OCR.

raffraffraff|3mon ago

> A subsequent cleanup pass that fixes grammar/spelling errors, as you propose, wouldn’t be appropriate when the goal is to faithfully reproduce the original text

I suppose that depends on why it's wrong. Did the model accurately read a real typo in the image or did it incorrectly decipher a character? If a spelling & grammar pass fixes the latter, isn't it valid?

pbhjpbhj|3mon ago

Not unrelated - OneNote 'copy text from image' has started producing lots of incorrect OCR results, but they're all non-words.

For example, from a clear image of a printed page (in a standard font), it will give me 'cornprising' instead of 'comprising', or 'niatter' instead of 'matter'. Except for the spell-check underline, they'd be hard to spot: with relatively tight kerning, all the errors look like the originals.

I'm surprised, as 1) I've not had these sorts of errors before, and 2) they're not words, and words must be heavily weighted in the OCR engine (I'd have thought).

bambax|3mon ago

Another test with a text in English, which is maybe more fair (although Mistral is a French company ;-). This image is from parliamentary debates of the Parliament of New Zealand in 1854-55:

https://i.imgur.com/1uVAWx9.png

Here's the output of the first paragraph, with mistakes in brackets:

> drafts would be laid on the table, and a long discussion would ensue; whereas a Committee would be able to frame a document which, with perhaps a few verbal emundations [emendations], would be adopted; the time of the House would thus be saved, and its business expected [expedited]. With regard to the question of the comparative advantages of The-day [Tuesday] and Friday, he should vote for the amendment, on the principle that the wishes of members from a distance should be considered on all sensations [occasions] where a principle would not be compromised or the convenience of the House interfered with. He hoped the honourable member for the Town of Christchurch would adopt the suggestion he (Mr. Forssith [Forsaith]) had thrown out and said [add] to his motion the names of a Committee.

Some mistakes are minor (emundations/emendations or Forssith/Forsaith), but others are very bad, because they are unpredictable and don't correspond to any pattern, and therefore can be very hard to spot: "sensations" instead of "occasions", or "expected" in lieu of "expedited"... That last one really changes the meaning of the sentence.

spudlyo|3mon ago

I want to rejoice that OCR is now a "solved" problem, but I feel like hallucinations are just as problematic as the kind of stuff I have to put up with from tesseract -- both require careful manual proofreading for an acceptable degree of confidence. I guess I'll have to try it and see for myself just how much better these solutions are for my public domain archive.org Latin language reader & textbook projects.

qingcharles|3mon ago

It depends on your use-case. For mine, I'm mining millions of scanned PDF pages to get approximate short summaries of long documents. The occasional hallucination won't damage the project. I realize I'm an outlier, and I would obviously prefer a solution that was as accurate as possible.

eMPee584|3mon ago

possibly doing both & diffing the output to spot contested bits?

spudlyo|3mon ago

That’s my current idea: use two different OCR models and diff the results to spot-check for errors. At these prices, why not?
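
A minimal version of that cross-check using only the standard library (the file labels are placeholders):

  import difflib

  def contested_lines(ocr_a: str, ocr_b: str) -> list[str]:
      # Lines where two OCR models disagree; candidates for manual review.
      diff = difflib.unified_diff(
          ocr_a.splitlines(), ocr_b.splitlines(),
          fromfile="model_a", tofile="model_b", lineterm="",
      )
      return [
          line for line in diff
          if line.startswith(("+", "-")) and not line.startswith(("+++", "---"))
      ]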

thomasfromcdnjs|3mon ago

Does anyone know the correlation between our ability to parse PDFs and the quality of our LLMs' training datasets?

If a lot of scientific papers are PDFs and hitherto had bad conversions to text/tokens, can we expect to see major gains in training and therefore better outputs?

rossant|3mon ago

Your example doesn't seem that difficult to me.

samstave|3mon ago

[flagged]

Biganon|3mon ago

...are you okay?

samstave|3mon ago

[flagged]

Kokichi|3mon ago

All it ever does is hallucinate