Python OCR with OVHcloud AI Endpoints: Extracting Text from Images Using Vision LLMs (and Why This Is Bigger Than “Just OCR”)

OCR is one of those “solved problems” that keeps being… not solved. Yes, we’ve had optical character recognition for decades. Yes, you can absolutely pipe a PDF through an engine and get text out the other side. And yes, it will still confidently turn “Invoice” into “lnv0ice” the moment you show it a rotated scan, a photo taken under bad lighting, or a document layout that dares to include a table.

That’s why a recent OVHcloud tutorial caught my eye: Extract Text from Images with OCR using Python and OVHcloud AI Endpoints, written by Stéphane Philippart and published on April 1, 2026. It shows how to perform OCR by sending an image to a vision-capable large language model hosted on OVHcloud AI Endpoints, using the OpenAI Python library against an OpenAI-compatible API.

On paper, that sounds almost too simple: “OCR, but with a VLM.” In practice, it’s part of a larger shift in how developers are approaching document understanding and data extraction: less hand-tuned OCR pipelines, more multimodal models that can interpret messy real-world visuals and return structured text (or at least text with a fighting chance).

This article expands on the OVHcloud post with background, practical architecture advice, security and cost considerations, and honest guidance on when a VLM-based OCR approach is a great idea—and when you should stick to the classic tools that don’t require a GPU farm somewhere in Europe.

What OVHcloud’s tutorial actually demonstrates (and why it matters)

Philippart’s post is straightforward: build a tiny Python script that reads a local image, base64-encodes it, and sends it to a vision model via OVHcloud AI Endpoints using the OpenAI Python SDK. The key idea is that OVHcloud AI Endpoints exposes an OpenAI-compatible API, so instead of learning a new client library and request format, you point the existing OpenAI client at a different base_url and use your OVH token as the API key.

The post’s sample run explicitly references Qwen2.5-VL-72B-Instruct as the vision model used via OVHcloud AI Endpoints. That’s significant for two reasons:

  • It’s a large, modern vision-language model (VLM) aimed at multimodal understanding—meaning it’s built to reason over images and text together, not just run character recognition on a clean scan.
  • It’s available as a managed endpoint, so you can call it like an API rather than deploying the model yourself.

OVHcloud positions AI Endpoints as a serverless inference API for a catalog of models (LLMs, visual LLMs, speech, etc.) with emphasis on privacy, European hosting, and integration via standard APIs.

VLM-based OCR vs traditional OCR: what changes?

Traditional OCR engines (Tesseract, EasyOCR, PaddleOCR, commercial OCR SDKs, etc.) are generally pipelines: detect text regions, straighten and normalize, recognize characters/words, then optionally do layout reconstruction. If you’ve ever tried to preserve a table layout from a scan, you already know the pain: you don’t just want text, you want text with structure.

VLM-based OCR flips the approach. Instead of building a pipeline that tries to model the problem explicitly, you send the image to a multimodal model and instruct it:

  • extract all text
  • preserve line breaks/columns/tables
  • avoid summarizing or translating

The OVHcloud tutorial even uses a system prompt that tells the model to behave like an “expert OCR engine,” preserve layout, and use Markdown for structure such as tables. This is the part a lot of developers miss: with VLMs, prompting is part of the OCR pipeline. In other words, you are now “tuning OCR” by writing instructions, not by tweaking thresholds in image preprocessing.

Where VLM OCR tends to shine

In real applications, vision models can be surprisingly good at things classic OCR struggles with:

  • Complex layouts (mixed columns, callouts, weird spacing)
  • Noisy photos (receipts in bad lighting, skewed angles)
  • Document understanding (recognizing that something is a table and outputting a table-like representation)

And for OCR tasks where you want the model to also do something after extraction (e.g., identify fields on an invoice), a VLM is already “in the right neighborhood” cognitively—though you should be careful not to accidentally turn OCR into free-form interpretation.

Where VLM OCR can be risky

However, VLMs bring different failure modes:

  • Hallucinated characters: the model may “guess” at unreadable parts, especially if your prompt implies it should be helpful.
  • Over-normalization: it may silently fix typos, add punctuation, or “clean up” spacing.
  • Non-determinism: unless you force deterministic settings (like temperature 0), results can vary run-to-run. The tutorial explicitly sets temperature to 0.0 for faithful extraction.

Bottom line: VLM OCR is fantastic when you need flexibility and layout fidelity across messy inputs, but it’s not automatically better for high-volume, clean-print, “just give me the characters exactly” workloads. For those, classic OCR engines remain hard to beat on cost and predictability.

OVHcloud AI Endpoints in context: “OpenAI-compatible,” but European-hosted

OVHcloud launched AI Endpoints as a way to provide managed access to a catalog of open and open-weight models for use cases like chat, speech, coding assistance, and text extraction. The company describes it as “serverless,” pay-as-you-go, and designed to remove the operational burden of running inference infrastructure.

There are two angles here that matter to developers building OCR and document workflows:

  • Integration friction drops dramatically when the API is compatible with tooling you already use. OVHcloud highlights “standard APIs (like OpenAI) for easy integration.”
  • Data governance becomes a first-class buying criterion in many organizations. OVHcloud positions AI Endpoints around data privacy and sovereignty, stating that user data is not used to train or improve models, and describing “zero data retention” beyond billing needs.

Whether you’re in Europe or the US, you may still prefer European-hosted inference for certain workloads (especially if customers, regulators, or your legal team have opinions). But be precise: “European hosting” doesn’t automatically solve compliance. It just moves you into a different set of legal, contractual, and technical constraints. Still, for many teams, it’s a meaningful option.

The model used in the example: Qwen2.5-VL-72B-Instruct

The tutorial’s demo output references Qwen2.5-VL-72B-Instruct. Qwen2.5-VL is a vision-language model family, and the Qwen Team has published a technical report describing improvements and strong performance, particularly in document and diagram understanding. OVHcloud’s AI Endpoints catalog lists Qwen2.5-VL-72B-Instruct under “Visual LLM,” including its context size and token pricing.

There’s also an important practical detail: when you move OCR into a VLM, the model size matters. A 72B-parameter model has the capacity to do complicated reasoning over messy visuals, but it also tends to cost more per request and can introduce latency. OVHcloud’s catalog page shows Qwen2.5-VL-72B-Instruct priced at 0.91€ per million input tokens and 0.91€ per million output tokens (at the time of writing).

Is that expensive? It depends on how you measure. If you’re extracting text from a handful of user-uploaded documents a day, it’s likely trivial. If you’re processing millions of pages, token-based pricing can become the core economics of your product. This is why it’s useful that the AI Endpoints catalog exposes per-model pricing and parameters up front.

Recreating the approach: a robust Python pattern for OCR via OVHcloud AI Endpoints

Let’s translate the tutorial’s spirit into a production-friendly pattern. The OVHcloud post’s code is intentionally minimal (one file, one dependency, and environment variables for token, base URL, and model name). That’s great for getting started. For real systems, you typically want a few more pieces:

  • validation and logging
  • image preprocessing (optional, but often helpful)
  • timeouts and retries
  • output post-processing (e.g., normalizing whitespace, capturing confidence signals where possible)
  • guardrails to prevent the model from “helpfully” interpreting content

1) Configuration via environment variables (the “12-factor” part)

The tutorial uses three environment variables:

  • OVH_AI_ENDPOINTS_ACCESS_TOKEN
  • OVH_AI_ENDPOINTS_MODEL_URL
  • OVH_AI_ENDPOINTS_VLLM_MODEL

That’s a clean interface for CI/CD and containerized deployments. The OVHcloud blog also points readers to the AI Endpoints catalog to find a vision-capable model and retrieve URL/name details.
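As a sketch, here is how those three variables might feed the client setup. The env var names come from the tutorial; the placeholder defaults and the lazy import are my own choices, the latter so the config can be exercised without the SDK installed:

```python
import os

# The three variables the tutorial relies on; the fallbacks are placeholders.
config = {
    "api_key": os.environ.get("OVH_AI_ENDPOINTS_ACCESS_TOKEN", "<your-token>"),
    "base_url": os.environ.get("OVH_AI_ENDPOINTS_MODEL_URL", "<endpoint-url>"),
    "model": os.environ.get("OVH_AI_ENDPOINTS_VLLM_MODEL", "<model-name>"),
}

def make_client(cfg):
    """Point the stock OpenAI client at the OVHcloud endpoint."""
    from openai import OpenAI  # lazy import: config stays testable without the SDK
    return OpenAI(base_url=cfg["base_url"], api_key=cfg["api_key"])
```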

2) Prompting: make it boring on purpose

When using a VLM as OCR, you want the model to be extremely boring. The system prompt in the OVHcloud tutorial does exactly that: “Extract every piece of text,” preserve layout, do not interpret or translate, and return “No text found” if appropriate.

In production, I’d add two small but powerful constraints:

  • Ask for a strict output format (e.g., JSON with fields like text, layout_markdown, warnings) if your endpoint/model supports structured outputs. This makes downstream parsing safer. OVHcloud has also published guidance on structured output in AI Endpoints more broadly.
  • Explicitly forbid guessing: “If a word is unclear, output [illegible].” This reduces “helpful” hallucinations in compliance workflows.
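Putting the tutorial’s instructions and the two constraints above together, an illustrative system prompt (my wording, not a quote from the post) might look like:

```python
# Illustrative "boring OCR" system prompt: extraction plus the two
# production constraints (strict format expectations, no guessing).
OCR_SYSTEM_PROMPT = """\
You are an OCR engine. Extract every piece of text from the image.
- Preserve the original layout: line breaks, columns, and tables (use Markdown tables).
- Do not interpret, summarize, or translate anything.
- If a word is unclear, output [illegible]. Never guess characters.
- If the image contains no text, reply exactly: No text found
"""
```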

3) Determinism settings

The OVHcloud demo sets temperature to 0.0 to keep extraction deterministic. That’s non-negotiable if you’re doing OCR for audit trails, back-office processing, or anything that will be compared later.
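A minimal sketch of the request, assuming the standard OpenAI chat-completions image format; the returned dict is meant to be splatted into client.chat.completions.create(**req):

```python
def build_ocr_request(model: str, system_prompt: str, image_data_url: str) -> dict:
    """Assemble kwargs for client.chat.completions.create(**...).

    Message shape follows the OpenAI chat format for image inputs;
    temperature is pinned to 0.0, as in the tutorial, for repeatable extraction.
    """
    return {
        "model": model,
        "temperature": 0.0,
        "messages": [
            {"role": "system", "content": system_prompt},
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": "Extract the text from this image."},
                    {"type": "image_url", "image_url": {"url": image_data_url}},
                ],
            },
        ],
    }
```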

4) Data URL image embedding

The tutorial base64-encodes the image and sends it as an OpenAI-style image input inside the chat payload (a data:image/png;base64,... URL). This is convenient for local scripts and small files. For large images or multi-page PDFs rendered as images, you’ll want to watch payload limits and consider storing the image temporarily in object storage and sending a signed URL—if supported by your target API.
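The encoding step itself needs only the standard library; this helper mirrors the tutorial’s approach (the function name is mine):

```python
import base64
import mimetypes
from pathlib import Path

def image_to_data_url(path: str) -> str:
    """Read a local image and embed it as a base64 data: URL."""
    mime = mimetypes.guess_type(path)[0] or "image/png"
    encoded = base64.b64encode(Path(path).read_bytes()).decode("ascii")
    return f"data:{mime};base64,{encoded}"
```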

Cost and performance: token economics for OCR

The interesting thing about VLM OCR is that you’re paying for tokens, not pixels. An image still needs to be processed by the model, but many APIs abstract that into “token” accounting or image-token equivalents. The OVHcloud catalog exposes per-model token pricing, for example listing Qwen2.5-VL-72B-Instruct at 0.91€ per million input tokens and 0.91€ per million output tokens.

So how do you think about cost?

  • Input tokens: your prompt plus whatever tokenization the API applies to the image (model dependent).
  • Output tokens: the extracted text and layout markers.

If you’re extracting dense documents (think: multi-column legal text with tables), output can be substantial. Ironically, “preserve layout” tends to generate more tokens because of newlines, list markers, and Markdown tables. That’s usually worth it, but it’s a knob you should know exists.

Practical tip: for “OCR only,” cap output length. Your OCR doesn’t need to generate a novel; it needs to return what it sees. If your API supports it, set max tokens. If not, consider chunking by page or region.
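A back-of-the-envelope calculator at the catalog prices quoted above; the token counts in the example comment are illustrative, since actual image-token accounting is model dependent:

```python
def estimate_cost_eur(input_tokens: int, output_tokens: int,
                      in_price_per_m: float = 0.91,
                      out_price_per_m: float = 0.91) -> float:
    """Rough per-request cost at per-million-token prices."""
    return (input_tokens * in_price_per_m + output_tokens * out_price_per_m) / 1_000_000

# Example: ~2,000 image/prompt tokens in, ~1,500 tokens of Markdown out
# comes to roughly 0.0032 EUR per page, i.e. a few euros per thousand pages.
```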

Security and privacy: OCR is often the most sensitive workload you run

Many OCR projects begin innocently (“we just need to parse receipts”) and quickly end up handling:

  • names, addresses, phone numbers
  • healthcare documents
  • bank details
  • IDs and passports
  • internal corporate documents

That means your OCR architecture is now a security architecture.

What OVHcloud claims about privacy

OVHcloud’s AI Endpoints page highlights data confidentiality and states that customer/user data is not used to train or improve models. It also describes “zero data retention,” keeping only what is required for billing.

Those are helpful assurances, but treat them as starting points. You still need to evaluate:

  • where the endpoint runs (region/data center)
  • how long logs and metadata are retained
  • how you manage API tokens
  • what encryption is used in transit and at rest (if you store any artifacts)

Threat model checklist (practical and slightly paranoid)

  • Token leakage: rotate keys; never hardcode; scope permissions where possible.
  • PII exfiltration: consider redacting images before sending (e.g., blur faces, mask account numbers) if your use case allows it.
  • Prompt injection via images: yes, images can contain “instructions.” If your system uses the OCR output downstream (e.g., feeding into an agent), treat extracted text as untrusted input.
  • Output validation: if you’re extracting structured fields, validate formats (IBAN checksum, invoice total numeric parse, date formats) before writing to databases.

That last point matters because a VLM is not a deterministic parser. It’s a probabilistic model that can be remarkably accurate and occasionally weird. Your job is to build systems that handle “occasionally weird” without accidentally mailing someone a refund of $9,999 instead of $99.99.
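For instance, strict parsers like these (hypothetical helpers, stdlib only) refuse anything that doesn’t match an exact format rather than coercing it, which is exactly the behavior you want downstream of a probabilistic extractor:

```python
import re
from datetime import datetime

# Accepts "99.99" or "1,234.50"; rejects "9,999", "99.9", and anything stranger.
AMOUNT_RE = re.compile(r"^\d{1,3}(?:,\d{3})*\.\d{2}$|^\d+\.\d{2}$")

def parse_amount(text: str):
    """Return a float only for strictly formatted monetary amounts, else None."""
    text = text.strip().lstrip("$€").strip()
    if not AMOUNT_RE.match(text):
        return None
    return float(text.replace(",", ""))

def parse_date(text: str, fmt: str = "%Y-%m-%d"):
    """Return a date only if the text parses exactly in the given format."""
    try:
        return datetime.strptime(text.strip(), fmt).date()
    except ValueError:
        return None
```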

Comparisons: when you should still use Tesseract/EasyOCR/PaddleOCR

There’s a reason traditional OCR isn’t dead: it’s cheap, fast, and predictable. Open-source OCR engines can run on your own CPU boxes with no API calls, which is hard to beat for throughput and data control. Academic and practitioner comparisons show meaningful differences between OCR tools depending on the dataset and preprocessing; for example, some studies report stronger results for certain engines in specific contexts, and emphasize that preprocessing can significantly affect accuracy.

Here’s a pragmatic decision guide:

Use classic OCR when…

  • your documents are clean, printed, and standardized
  • you process very high volume and cost per page matters most
  • you need strict reproducibility across versions
  • you can’t send data outside your environment

Use VLM OCR when…

  • images are messy (photos, skew, noise)
  • layout matters (tables, columns, forms)
  • you want a simpler integration path than building a full OCR+layout pipeline
  • you can tolerate some probabilistic behavior (with guardrails)

The hybrid approach (often the best approach)

Many production systems end up hybrid:

  • Try classic OCR first (cheap/fast).
  • If confidence is low or layout reconstruction fails, fall back to VLM OCR.
  • Optionally, ask the VLM to reconcile or “repair” OCR output rather than reading the image from scratch (less image-token overhead, depending on your setup).

This is also a nice way to control costs while still having a “get out of jail” card for difficult documents.

Case study patterns: where OCR via AI Endpoints fits nicely

Let’s talk about real workflows where this approach makes sense.

1) Support ticket attachments (screenshots everywhere)

Customers love screenshots. Support teams love searchable text. VLM OCR can pull text from UI screenshots, including error messages, button labels, and dialog boxes—things that may confuse classic OCR due to anti-aliased fonts and busy backgrounds.

Pair it with a log enrichment pipeline: extract the error code from the screenshot, then auto-route to the right team. Just remember to sanitize output before feeding it into any automated agentic system.

2) Invoice and receipt processing (semi-structured chaos)

Receipts are the natural habitat of skewed photos, crumples, and “creative” typography. A VLM can often produce a more readable text block than a traditional engine, and it can preserve table-like structures (items, prices) in Markdown tables—exactly what Philippart’s prompt encourages.

Even if you still use a classic invoice extraction model later, getting a clean first-pass text representation can simplify downstream parsing and human review.

3) Knowledge base digitization (layout preservation matters)

If you’re digitizing internal SOPs, old PDFs, or documentation scans, classic OCR may produce text but lose structure. VLM OCR’s ability to output lists and tables in a readable way can make the difference between “searchable” and “actually usable.”

Operational notes: reliability, latency, and “serverless inference” reality

OVHcloud markets AI Endpoints as a managed service with a stated 99.5% SLA on the AI Endpoints product page. In practice, your user experience depends on:

  • model load and capacity
  • request concurrency
  • image sizes
  • token output limits

For a user-facing OCR feature, you’ll likely want asynchronous processing:

  • upload image
  • enqueue job
  • call AI Endpoints
  • store extracted text
  • notify user / update UI

This design turns variable inference latency into a background task, which is both more reliable and kinder to your frontend timeout settings (and your blood pressure).
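The queue shape above can be sketched with stdlib threads; this is a stand-in for a real job queue (Celery, a cloud queue, etc.), and ocr_fn is whatever wraps the endpoint call:

```python
import queue
import threading

jobs = queue.Queue()   # the upload handler puts work here and returns immediately
results = {}           # stand-in for "store extracted text" (use a DB in practice)

def worker(ocr_fn):
    """Drain the queue, calling the slow OCR endpoint off the request path."""
    while True:
        job = jobs.get()
        if job is None:                    # sentinel: shut the worker down
            break
        results[job["id"]] = ocr_fn(job["image"])  # ...then notify user / update UI
        jobs.task_done()
```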

Developer experience: why the OpenAI Python library matters here

The sneaky brilliance in the OVHcloud tutorial is not that it does OCR—it’s that it does OCR with almost no new mental overhead. The OpenAI Python library is widely used, and the post leans into the idea that “the OpenAI library is compatible with any OpenAI compatible API.”

This is part of a broader industry pattern: OpenAI-style request/response shapes have become a de facto lingua franca for LLM inference APIs, even when the underlying models are not OpenAI models. OVHcloud explicitly lists “Standard APIs (like OpenAI) for easy integration” as a key feature.

From a product perspective, that’s huge:

  • it reduces switching costs
  • it expands the ecosystem of compatible tools (gateways, tracing, eval frameworks)
  • it makes multi-provider strategies more realistic

And yes, it also means you can accidentally point your OpenAI client at three different providers across dev/staging/prod and spend a weekend wondering why latency “changed.” Welcome to modern cloud.

Practical improvements you can add immediately

If you’re inspired by the OVHcloud demo and want to build something sturdier, here are upgrades that pay off quickly:

1) Add a “no guessing” policy

Update your prompt: tell the model to output [illegible] for unclear text, and never invent characters. This is essential for legal/financial documents.

2) Request bounding boxes (if supported)

Some VLMs can return coordinates or refer to regions. If your model supports it, you can ask for JSON output with text segments and approximate bounding boxes. That enables highlighting in a UI and human verification workflows.
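If, and only if, your model documents spatial grounding, the request prompt might look something like this. It is entirely hypothetical, and any coordinates a model returns should be verified against ground truth before you trust them in a UI:

```python
# Hypothetical bounding-box prompt: only meaningful for models that
# support spatial grounding; treat returned coordinates as approximate.
BBOX_PROMPT = """\
Return a JSON array. For each text segment, output one object:
  {"text": "...", "bbox": [x0, y0, x1, y1]}
Coordinates are pixel positions in the original image.
If you cannot locate a segment, set "bbox" to null instead of guessing.
"""
```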

3) Build evaluation sets

Don’t ship OCR without a test suite. Collect a small representative set of images (with permissions) and maintain expected outputs. Re-run whenever you change models, prompts, or preprocessing. AI Endpoints also highlights “lifecycle management” and model version transparency as part of reproducibility.

4) Introduce a fallback strategy

Use classic OCR first. Fall back to VLM OCR when:

  • classic OCR returns too little text
  • too many non-alphanumeric artifacts appear
  • layout parsing fails (e.g., table not reconstructed)

This hybrid approach often yields better cost-performance than going all-in on VLMs.
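Those triggers can be encoded as a small triage function; the thresholds below are made up, so tune them on your own evaluation set:

```python
def needs_vlm_fallback(ocr_text: str, min_chars: int = 40,
                       max_junk_ratio: float = 0.3) -> bool:
    """Heuristic triage for the hybrid pipeline: True -> send the image to the VLM."""
    stripped = ocr_text.strip()
    if len(stripped) < min_chars:      # classic OCR recovered too little text
        return True
    # Ratio of characters that are neither alphanumeric, whitespace,
    # nor common punctuation: a crude proxy for OCR garbage.
    junk = sum(1 for c in stripped
               if not (c.isalnum() or c.isspace() or c in ".,;:%-€$"))
    return junk / len(stripped) > max_junk_ratio
```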

What this means for the future of document AI

The deeper story here isn’t “OCR via an API.” It’s that we’re watching the boundary blur between OCR, layout analysis, and document understanding.

Historically, these were separate components:

  • OCR engine extracts text
  • layout engine reconstructs structure
  • IE/NLP models extract fields

VLMs can do all three in one pass (with varying reliability), which changes how teams design products. It also changes vendor landscapes: you no longer need a specialized OCR SDK if a general multimodal model can do “good enough” OCR plus structure. But you also inherit new responsibilities: cost controls, prompt design, and guardrails against hallucination.

OVHcloud’s AI Endpoints approach—catalog of models, OpenAI-compatible APIs, and developer-friendly integration—fits neatly into that shift.

Getting started: the simplest path (and where to look next)

If you want the original walk-through, start with the OVHcloud post by Stéphane Philippart: Extract Text from Images with OCR using Python and OVHcloud AI Endpoints.

Then explore:

  • OVHcloud’s AI Endpoints overview for features like standard APIs, privacy positioning, and integrations.
  • The model catalog to compare vision models and pricing.

If your team is serious about OCR in production, treat VLM-based OCR as a tool in a toolbox, not a religion. It’s incredibly useful, occasionally magical, and still very capable of making up an extra digit in an invoice total if you don’t keep it on a leash.

Bas Dorland, Technology Journalist & Founder of dorland.org