Why finance teams keep losing this game
Accounts payable still runs on a copy-paste loop. Vendor sends a PDF. AP analyst opens it. Types vendor, invoice number, date, amount, tax, currency, PO, line items into NetSuite or QuickBooks. Saves it. Files it. Repeats. Five hundred times a month. Mid-market companies pay full-time salaries to people doing this. Enterprise companies build entire AP automation programs around it.
Generic OCR tools (Adobe, Tesseract, ABBYY) get you the text but not the structure. Template-based extractors work until vendor 47 sends an invoice in a new layout. Modern invoice OCR + LLM extraction is template-free: you describe what you want, the system reads each invoice page-by-page and returns structured JSON.
How Aarkiv's invoice extractor works
- Drop in your files. One PDF, fifty PDFs, or a ZIP with all of them. We accept PDF, PNG, JPG, WEBP, and zipped batches.
- Define the fields. The defaults are vendor, invoice_number, invoice_date, total, and currency. Add tax, PO number, billing_address, line_items, or anything else. Up to twelve fields per job.
- Hit Extract. Aarkiv validates every file, safely unpacks any ZIP, sniffs magic bytes (no executables get in), reads each page with a vision pipeline, and pulls the fields you asked for.
- Download Excel. One row per invoice, one column per field. Plus the original filename so the audit trail is clean.
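The field list and the one-row-per-invoice output can be pictured with a short Python sketch. This is an illustration only, not Aarkiv's code: the function names are hypothetical, CSV from the standard library stands in for the Excel export, and it assumes the twelve-field cap counts the default fields.

```python
import csv
import io

# Default fields named above; the cap of twelve is the per-job limit.
DEFAULT_FIELDS = ["vendor", "invoice_number", "invoice_date", "total", "currency"]
MAX_FIELDS = 12  # assumption: the cap includes the defaults


def build_field_list(extra_fields=()):
    """Return the defaults plus any extras, de-duplicated, enforcing the cap."""
    fields = list(DEFAULT_FIELDS)
    for name in extra_fields:
        if name not in fields:
            fields.append(name)
    if len(fields) > MAX_FIELDS:
        raise ValueError(f"{len(fields)} fields requested; the limit is {MAX_FIELDS}")
    return fields


def rows_to_csv(fields, extracted):
    """Write one row per invoice and one column per field, plus the filename.

    `extracted` maps each source filename to a dict of field values
    (hypothetical shape). Missing fields are left blank.
    """
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer,
        fieldnames=["filename", *fields],
        restval="",
        extrasaction="ignore",
    )
    writer.writeheader()
    for filename, values in extracted.items():
        writer.writerow({"filename": filename, **values})
    return buffer.getvalue()
```

Calling build_field_list(["tax", "po_number"]) gives seven columns, and rows_to_csv then emits a header row followed by one line per invoice, with the filename in the first column.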
Formats and edge cases we handle
- Born-digital PDFs generated by QuickBooks, Stripe, Razorpay, Zoho, NetSuite, etc.
- Scanned PDFs from copiers and mobile scans.
- Image invoices (PNG, JPG, WEBP) from phone cameras and email attachments.
- ZIPs of bills from monthly vendor dumps or Outlook exports.
- Multi-currency invoices (INR, USD, EUR, GBP, AED). The model returns the symbol or ISO code as printed.
- Hotel folios, ride invoices, equipment bills, telco statements: anything with a vendor, an amount, and a date.
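Each accepted format above can be recognized from its first few bytes, which is what magic-byte sniffing means. The sketch below shows the general idea with well-known file signatures. It is an assumption about how such a check might look, not Aarkiv's actual implementation, and sniff_type is a hypothetical name.

```python
def sniff_type(header: bytes):
    """Return a format label based on leading bytes, or None if not allowed.

    Pass at least the first 12 bytes of the file. The file extension is
    deliberately ignored, so an executable renamed to .pdf is not accepted.
    """
    if header.startswith(b"%PDF-"):
        return "pdf"
    if header.startswith(b"\x89PNG\r\n\x1a\n"):
        return "png"
    if header.startswith(b"\xff\xd8\xff"):
        return "jpg"
    # WEBP is a RIFF container: "RIFF", a 4-byte size, then "WEBP".
    if header[:4] == b"RIFF" and header[8:12] == b"WEBP":
        return "webp"
    # Local file header of a non-empty ZIP archive.
    if header.startswith(b"PK\x03\x04"):
        return "zip"
    return None
```

An allowlist like this rejects anything it does not recognize, so a Windows executable (which starts with the bytes "MZ") returns None whatever its filename says.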
Built secure by default
Invoice extraction touches every vendor relationship and a lot of regulated data. Aarkiv treats uploads as hostile until proven otherwise. ZIPs are walked entry-by-entry with symlink, path-traversal, encrypted-entry, and zip-bomb defenses. Every saved file is magic-byte sniffed, so an executable renamed to .pdf is rejected. Files land outside the web root with 0600 permissions, owned by a non-privileged user. A hard cap of ten pages per turn keeps abuse small and predictable.
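For readers who want to see what an entry-by-entry ZIP walk involves, here is a minimal Python sketch using the standard zipfile module. It is a sketch under stated assumptions, not Aarkiv's code: check_zip is a hypothetical name and the three limits are illustrative values. It also trusts the sizes declared in the archive, so a production extractor would count bytes again while streaming each entry.

```python
import stat
import zipfile

MAX_ENTRIES = 500                  # assumed limit, for illustration only
MAX_TOTAL_BYTES = 200 * 1024 ** 2  # assumed ceiling on declared uncompressed size
MAX_RATIO = 100                    # assumed ceiling on per-entry compression ratio


def check_zip(source):
    """Return the names of file entries that pass every check.

    `source` is a path or binary file object accepted by zipfile.ZipFile.
    Raises ValueError on the first entry that looks unsafe.
    """
    safe_names = []
    total_bytes = 0
    with zipfile.ZipFile(source) as archive:
        entries = archive.infolist()
        if len(entries) > MAX_ENTRIES:
            raise ValueError("too many entries")
        for info in entries:
            name = info.filename
            parts = name.replace("\\", "/").split("/")
            # Path traversal: absolute paths, drive letters, or any ".." part.
            if name.startswith("/") or ".." in parts or ":" in parts[0]:
                raise ValueError(f"unsafe path: {name!r}")
            # Symlink: the Unix file type sits in the high 16 bits of external_attr.
            if stat.S_ISLNK(info.external_attr >> 16):
                raise ValueError(f"symlink entry: {name!r}")
            # Encrypted entry: bit 0 of the general-purpose flags.
            if info.flag_bits & 0x1:
                raise ValueError(f"encrypted entry: {name!r}")
            if info.is_dir():
                continue
            # Zip bomb: cap the declared total size and the compression ratio.
            total_bytes += info.file_size
            ratio = info.file_size / max(info.compress_size, 1)
            if total_bytes > MAX_TOTAL_BYTES or ratio > MAX_RATIO:
                raise ValueError(f"suspicious size or ratio: {name!r}")
            safe_names.append(name)
    return safe_names
```

Rejecting the whole archive on the first bad entry is a deliberate choice in this sketch: a ZIP that contains one traversal path or symlink is unlikely to be an honest vendor dump.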
Aarkiv vs. classic invoice OCR tools
No per-template setup
Aarkiv reads layout from the page itself. New vendors work on day one, with no template authoring and no field mapping rules.
Structured output, not text dumps
Adobe gives you a searchable PDF. Aarkiv gives you the fields, in Excel, ready for your ERP.
Vision + reasoning, not just OCR
Reading the characters is the easy half. Knowing what is the invoice number vs. the PO number vs. the customer ref is the hard half.
Try it on your own invoices
Ten free pages on sign-up. Drop in your messiest scan. If it does not return what you need, talk to sales. Teams processing more than a few hundred invoices a month usually want a private deployment.
FAQ
How do I convert a PDF invoice to Excel?
Sign in, open Invoices, define the fields you want, drop in your PDFs, hit Extract, download the Excel sheet.
Can I process hundreds of invoices at once?
Yes. Upload multiple PDFs, or a ZIP of bills. Aarkiv parallelizes them.
Does it work on scanned PDFs?
Yes. The vision pipeline reads scans and image-only PDFs.
What fields can I extract?
Anything you can name. Up to twelve fields per job.
Is there a free tier?
Ten free pages on every new account, no credit card.