Extract Text from Images with Python OCR (Without the OCR Pain): A Deep Dive into OVHcloud AI Endpoints


OCR is one of those problems that sounds solved until you actually try to solve it.

You grab a “classic” OCR engine, point it at a photo of a receipt, a screenshot of a dashboard, or a scanned PDF with a weird table layout… and suddenly you’re in a world of skew correction, DPI arguments, language packs, bounding boxes, and a mysterious output where “S” becomes “5” and “O” becomes “0” (or vice versa, depending on how much the OCR gods dislike you that day).

So when OVHcloud published a walkthrough on using a vision-capable LLM for OCR with Python and OVHcloud AI Endpoints, it hit a sweet spot: developer-friendly, minimal dependencies, and leveraging models that can “read” images in a way that’s often more resilient than brittle OCR pipelines. The original article was written by Stéphane Philippart and published on April 1, 2026. This piece builds on that foundation and adds industry context, practical engineering advice, and a few reality checks you’ll want before you OCR your entire company’s document archive with a multimodal model and then wonder why your finance team is suddenly “very engaged.”

What OVHcloud is actually proposing here (and why it’s interesting)

The OVHcloud approach is straightforward: instead of using a specialized OCR engine (like Tesseract or a commercial document AI product), you send an image to a vision-capable large language model hosted on OVHcloud AI Endpoints, and ask it—via a carefully written system prompt—to return the text, preserving layout as much as possible.

In Stéphane’s example, the entire app is a single Python file, and the only dependency is the official OpenAI Python library installed via pip install openai. OVHcloud can do this because AI Endpoints exposes an OpenAI-compatible API, so the OpenAI client can be pointed at OVHcloud by setting a custom base_url and using an OVH token as the API key.

Why “LLM OCR” can outperform traditional OCR (sometimes)

Traditional OCR pipelines tend to be modular: image preprocessing → text detection → character recognition → layout reconstruction. This is great when you need predictability, speed, and bounding boxes—but it can get fragile on real-world messiness: curved text, stylized fonts, screenshots with charts, documents with mixed languages, or photos taken at an angle.

Vision-language models can sometimes do better because they’re not just decoding characters; they’re using learned representations that mix visual features and language priors. In practice, that often means fewer catastrophic failures on “non-scanner-perfect” inputs (though they can still hallucinate, which we’ll discuss in the “don’t ship this blind” section).

A quick tour of the original OVHcloud workflow

Here are the key mechanics from the OVHcloud blog post, paraphrased and expanded (not copied):

  • Set environment variables for your OVH AI Endpoints access token, the model URL (as the base URL), and the model name you want to call.
  • Install the OpenAI Python library and instantiate an OpenAI client with api_key and base_url.
  • Load an image and encode it as base64 so it can be sent as a data URL in an image_url payload.
  • Use a system prompt that instructs the model to behave like an OCR engine: extract all text, preserve layout, do not interpret/summarize/translate, and output markdown if helpful (tables, lists).
  • Call chat completions with temperature=0 to reduce “creativity” and get more deterministic extraction.
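The image-encoding and message-building steps above can be sketched as two small helpers (the prompt wording and helper names are mine, paraphrasing the post’s rules, not copied from it):

```python
import base64
import mimetypes

def encode_image_to_data_url(path: str) -> str:
    """Read an image file and return a base64 data URL for an image_url payload."""
    mime = mimetypes.guess_type(path)[0] or "image/png"
    with open(path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode("ascii")
    return f"data:{mime};base64,{b64}"

# System prompt paraphrasing the post's rules: transcribe, don't interpret.
SYSTEM_PROMPT = (
    "You are an OCR engine. Extract every piece of visible text from the image. "
    "Preserve the layout as much as possible, using markdown tables and lists "
    "where helpful. Do not interpret, summarize, or translate anything."
)

def build_ocr_messages(data_url: str) -> list[dict]:
    """Build the chat-completions messages payload for one OCR request."""
    return [
        {"role": "system", "content": SYSTEM_PROMPT},
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Extract the text from this image."},
                {"type": "image_url", "image_url": {"url": data_url}},
            ],
        },
    ]

# The call itself, via an OpenAI client configured for OVHcloud:
# response = client.chat.completions.create(
#     model=MODEL_NAME, messages=build_ocr_messages(data_url), temperature=0.0
# )
```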

In the sample output, OVHcloud demonstrates running OCR through a vision model (they mention Qwen2.5-VL-72B-Instruct in the example run output) and receiving clean extracted text.

Why OVHcloud AI Endpoints matters in 2026: the “OpenAI-compatible API” era

There’s a bigger trend here: developers increasingly want to write to one client library and swap providers by changing configuration. The OpenAI client’s ability to accept a custom base_url enables this pattern, and OVHcloud is explicitly leaning into it by exposing an OpenAI-compatible interface.

OVHcloud even markets AI Endpoints as a serverless inference API providing access to a catalogue of models and emphasizes privacy and confidentiality, including a stated commitment that user data is not used to train or improve their AI models.

Model choice isn’t trivia: it defines your OCR quality

OVHcloud’s AI Endpoints catalog includes visual LLMs such as Qwen2.5-VL-72B-Instruct and other multimodal models. The catalog also lists per-model pricing units (often per million tokens) and indicates capabilities like multimodal support.

For OCR tasks, you should care about:

  • Text fidelity (does it transcribe accurately, including punctuation, IDs, and numbers?)
  • Layout faithfulness (can it preserve tables/columns reasonably?)
  • Language coverage (your receipts and invoices will inevitably show up in French, German, Japanese, or “airport kiosk English”)
  • Latency (72B models can be excellent but not always cheap or fast)
  • Hallucination risk (more on that soon)

Let’s talk engineering: how to make LLM-based OCR less scary in production

If you’re using a vision LLM as an OCR engine, you’re effectively trading one set of problems (preprocessing and brittle OCR) for another set (prompting, variability, cost control, and output validation). Here are practical techniques that make it work reliably.

1) Treat the prompt like an API contract

The original OVHcloud post includes a system prompt that explicitly says: extract every piece of visible text, preserve layout, do not interpret/summarize/translate, and output markdown for structure. That prompt design is not cosmetic; it’s the difference between “transcription” and “creative writing.”

In production, I’d tighten it further:

  • Tell the model to output only the transcription (no preamble like “Sure!”).
  • Provide explicit rules for uncertain characters (e.g., use [?] markers).
  • For forms/invoices, consider requiring JSON output with strict keys—but be aware that strict structured output may reduce layout fidelity.
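A tightened prompt along those lines might read as follows (the wording is mine, not from the original post):

```python
# Illustrative production-grade OCR system prompt; tune the rules to your documents.
OCR_SYSTEM_PROMPT = """\
You are an OCR engine, not an assistant.
Rules:
1. Output ONLY the transcription. No preamble, no commentary.
2. Preserve layout; use markdown tables and lists where the structure demands it.
3. Do not interpret, summarize, or translate anything.
4. If a character or word is unreadable, write [?] in its place.
5. Never guess digits in numbers, IDs, or amounts; use [?] instead.
"""
```

Rule 1 makes outputs machine-consumable, and rules 4 and 5 give your validation layer something concrete to count.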

2) Use temperature=0—and still validate output

OVHcloud sets temperature to 0.0 to keep output deterministic. That’s correct for OCR-style tasks.

But even with temperature at zero, you can still get surprising outputs due to:

  • Ambiguous pixels (low-res photos, motion blur)
  • Model limitations on tiny fonts
  • Compression artifacts (especially with screenshots passed around in Slack, then re-saved, then re-uploaded)

So build a validation layer. Examples:

  • Regex checks for invoice numbers, dates, IBAN formats, totals.
  • Checksum validation when applicable (VAT IDs, ISBNs).
  • Confidence heuristics: if the model output contains too many “uncertain markers,” send it to a fallback pipeline.
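As a sketch of such a validation layer (illustrative helpers: an ISO-style date pattern plus the standard mod-97 structure check for IBANs):

```python
import re

# Matches ISO-style dates such as 2026-04-01 (extend for locale-specific formats).
DATE_RE = re.compile(r"\b\d{4}-\d{2}-\d{2}\b")

def find_dates(text: str) -> list[str]:
    """Return all ISO-style dates found in the transcription."""
    return DATE_RE.findall(text)

def validate_iban(iban: str) -> bool:
    """Mod-97 structure check per ISO 13616 (not country-specific format rules)."""
    iban = re.sub(r"\s+", "", iban).upper()
    if not re.fullmatch(r"[A-Z]{2}\d{2}[A-Z0-9]{11,30}", iban):
        return False
    # Move the first four characters to the end, map letters to numbers (A=10..Z=35).
    rearranged = iban[4:] + iban[:4]
    digits = "".join(str(int(c, 36)) for c in rearranged)
    return int(digits) % 97 == 1
```

A transcription whose “IBAN” fails the checksum almost certainly contains a misread digit, which is exactly the kind of error a vision model produces silently.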

3) Use a dual-pass strategy for “hard” documents

One reliable pattern is “read twice, compare once”:

  • Pass A: ask for plain text transcription.
  • Pass B: ask for a structured extraction (e.g., invoice fields) based only on what it sees.
  • Compare for consistency: if Pass B invents a total not present in Pass A, you’ve likely hit hallucination or a misread.

This costs more, but it can be cheaper than support tickets and accountants yelling “why is the VAT 0.00 again?”
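A minimal consistency check between the two passes might look like this (a hypothetical helper; the field names are illustrative):

```python
def structured_values_present(transcription: str, fields: dict[str, str]) -> dict[str, bool]:
    """Check that each value from Pass B (structured extraction) literally appears
    in Pass A (plain transcription). A missing value is a hallucination/misread candidate."""
    return {name: value in transcription for name, value in fields.items()}
```

Any field flagged False goes to human review instead of straight into your ERP.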

4) Consider a hybrid pipeline: traditional OCR + LLM “layout repair”

For bulk document ingestion, you can reduce cost by running a conventional OCR engine as the first pass and reserving the vision LLM for cases where:

  • OCR confidence is low, or
  • the document is highly structured (tables), or
  • you need semantic understanding (e.g., which number is the total vs. subtotal).

This hybrid approach also reduces hallucination surface area because you’re using the LLM more as a post-processor than as the primary “reader.”
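The routing decision behind that hybrid can be as simple as the following (the confidence threshold is an assumption you would tune per document type):

```python
def should_escalate_to_llm(
    ocr_confidence: float,
    has_tables: bool,
    needs_semantics: bool,
    threshold: float = 0.85,  # illustrative; calibrate against your own corpus
) -> bool:
    """Send a document to the vision LLM only when the cheap OCR pass
    is not trustworthy enough or the task needs layout/semantic understanding."""
    return ocr_confidence < threshold or has_tables or needs_semantics
```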

Cost and pricing: what you should understand before scaling

AI Endpoints is marketed as pay-as-you-go, and OVHcloud’s corporate comms describe pricing varying by model and measured via consumption units (e.g., token-based).

The catalog page itself lists prices per million tokens (input/output) for many models, and you can see that visual models can be priced differently than smaller text models.

In other words: the “hello world OCR demo” is cheap; the “OCR 8 million images from the last decade of procurement emails” can become a line item.

Practical cost levers

  • Resize images sensibly: don’t send 8K images if 1600px width captures the text.
  • Crop aggressively: receipts often have huge blank margins.
  • Batch and queue: avoid spikes, respect rate limits, and improve predictability.
  • Cache results: if the same image is processed multiple times, store the extracted text and hash the file.
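The caching lever can be sketched with a content hash as the key (an in-memory dict stands in for whatever real store you’d use):

```python
import hashlib
from pathlib import Path

# image-bytes hash -> extracted text; swap for Redis/a database in production.
_CACHE: dict[str, str] = {}

def file_sha256(path: str) -> str:
    """Hash the file contents, so renamed copies of the same image still hit the cache."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()

def ocr_with_cache(path: str, run_ocr) -> str:
    """Skip the API call entirely when these exact image bytes were seen before."""
    key = file_sha256(path)
    if key not in _CACHE:
        _CACHE[key] = run_ocr(path)
    return _CACHE[key]
```

Hashing bytes rather than filenames matters: the same screenshot tends to reappear under a dozen names.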

Security and privacy: the part everyone claims they care about (and then forgets)

OCR frequently touches sensitive data: addresses, emails, phone numbers, bank details, healthcare identifiers, employee information, and whatever was scribbled on a whiteboard in the background of a “quick photo.”

OVHcloud positions AI Endpoints with an emphasis on data confidentiality and states that user data will not be used to train or improve its AI models.

That’s valuable, but you still need to design responsibly:

  • Data minimization: crop to only the region with needed text.
  • Retention controls: decide how long you store images and extracted text.
  • Access control: extracted text is often more searchable (and therefore more dangerous) than the original image.
  • Redaction: for some workflows, you can redact PII before sending images—though doing that on images can be nontrivial.

Developer experience: OpenAI Python library + custom base_url

The OVHcloud tutorial’s quiet win is developer ergonomics. Most teams already have glue code and libraries built around OpenAI-style “chat completions,” so being able to keep the same client and swap the endpoint is a huge accelerant.

And this isn’t just theoretical. The OpenAI Python library documents passing a custom base_url when instantiating the client, which is exactly what OVHcloud uses in its example.

Alternative: OVHcloud’s own Python tooling

OVHcloud also offers an ovhai SDK (Python 3.8+) for interacting with OVH AI APIs, including sync and async helpers. If you want a provider-native experience, that’s another route—though the OpenAI-compatible approach is likely to be the fastest on-ramp for teams already living in that ecosystem.

Accuracy pitfalls: where vision LLM OCR can still fail (spectacularly)

Traditional OCR fails in boring ways: garbled characters, missing words, broken columns. Vision LLM OCR can fail in interesting ways: it might confidently output text that is plausible but not present.

Here’s where to be cautious:

  • Low-resolution screenshots of dense UIs (think logs, terminals, dashboards).
  • Heavily stylized typography (brand fonts, neon signs, cursive).
  • Tables with faint gridlines or alternating row shading where alignment matters.
  • Handwriting: sometimes surprisingly good, sometimes “modern art.”

Mitigation: ask for “verbatim or mark unknown”

One of the most effective prompt tweaks is requiring explicit uncertainty markers. Humans reviewing OCR output don’t mind seeing [illegible]; they mind seeing wrong values that look right.
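A crude confidence heuristic built on those markers (the threshold is an assumption you would tune against reviewed samples):

```python
def needs_human_review(text: str, max_ratio: float = 0.02) -> bool:
    """Route transcriptions with too many uncertainty markers to a human reviewer.
    Assumes the prompt instructed the model to emit [?] / [illegible] markers."""
    markers = text.count("[?]") + text.count("[illegible]")
    words = max(len(text.split()), 1)
    return markers / words > max_ratio
```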

Use cases that actually benefit from OVHcloud’s approach

Not everything should be OCR’d with an LLM. But several use cases are a strong match:

1) Internal knowledge capture from screenshots

Teams screenshot everything: error messages, dashboards, code snippets (yes, even though you shouldn’t), and incident timelines. OCR-ing these screenshots makes them searchable and indexable in internal tools, which can improve incident response and reduce “we saw this before, where?” moments.

2) Document intake for SMB workflows

Invoices, purchase orders, delivery notes, and receipts are where layout complexity lives. A vision model can handle messy scans better than many classic engines, especially when the goal is human-readable transcription first, and structured extraction second.

3) Compliance archiving (with caution)

When you need searchable archives, OCR is essential. But compliance also requires auditability. If you use a probabilistic model, keep the original images, store the prompt version used, and track model versions so results can be reproduced (or at least explained).

Competitive context: where AI Endpoints sits versus the usual suspects

In 2026, OCR and document AI sits on a spectrum:

  • Classic OCR engines (open source and commercial) optimized for speed, local execution, and predictable outputs.
  • Document AI suites that combine OCR with form understanding, field extraction, and training pipelines.
  • Vision LLM APIs that can do OCR-like transcription and also interpret what’s in an image.

OVHcloud AI Endpoints is positioned in that third category, with the additional “drop-in client compatibility” angle and a model catalog approach.

What’s genuinely different: sovereignty narratives meet developer pragmatism

“Sovereign cloud” and “data residency” are often used as marketing confetti, but there is a real demand—especially in Europe—for AI services that align with local regulatory expectations and procurement rules. OVHcloud’s messaging around confidentiality and not using customer data for training is part of that positioning.

At the same time, OVHcloud didn’t build a bespoke developer experience that requires learning a whole new paradigm. Instead, they meet developers where they are: OpenAI-style APIs, familiar client libraries, and environment-variable-based configuration.

A more production-ready Python pattern (conceptual example)

The OVHcloud post demonstrates the essentials. If you’re going one step further, here’s what I’d add in a production script (described conceptually rather than pasting a full codebase):

  • Input normalization: convert images to a consistent format (PNG), resize, and optionally deskew.
  • Timeouts and retries: wrap the request with exponential backoff on transient failures.
  • Observability: log request IDs, latency, model name, image hash, and token counts where available.
  • Output post-processing: strip leading/trailing chatter; enforce “only text” outputs; validate key patterns.
  • Human review loop: route low-confidence outputs to review, especially for financial documents.
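For example, the timeouts-and-retries item can be sketched as a generic backoff wrapper (in real code you would catch only transient error types, not bare Exception):

```python
import random
import time

def with_retries(fn, max_attempts: int = 4, base_delay: float = 0.5):
    """Retry a flaky call with exponential backoff plus jitter.
    Narrow the except clause to transient errors (timeouts, 429s) in production."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the original error
            time.sleep(base_delay * (2 ** attempt) + random.uniform(0, 0.1))
```

Usage is just `with_retries(lambda: client.chat.completions.create(...))`, which keeps the OCR call site readable.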

What to watch next

Three trends will shape how useful this approach becomes over the next year:

  • Better multimodal OCR specialization: more models are tuned for reading order, tables, and multilingual text.
  • Structured extraction layers: vendors will increasingly bundle “OCR + schema extraction + validation” as an end-to-end workflow.
  • Standardization around OpenAI-compatible APIs: the ecosystem is converging on a small set of familiar endpoints and client patterns, which makes provider switching easier—but also makes it easy to forget that models behave differently even behind identical APIs.

Bottom line

The OVHcloud tutorial by Stéphane Philippart is a clean demonstration of a modern pattern: use a vision-capable model as an OCR engine, call it with the OpenAI Python SDK, and keep the code footprint small.

Where it gets compelling is when you combine that simplicity with real-world safeguards: strict prompts, deterministic settings, validation, and a hybrid fallback strategy. If you do that, “OCR with a vision LLM” stops being a demo trick and becomes a pragmatic option for teams that want better resilience on messy documents without building (and maintaining) an entire document AI pipeline from scratch.

Just remember: the model can read, but it can also guess. Your job is to make guessing obvious, measurable, and recoverable.


Bas Dorland, Technology Journalist & Founder of dorland.org