AI Dev World 2025: A Wake-Up Call on the Importance of Document Data for AI

Today, I had the opportunity to attend AI Dev World at the Santa Clara Convention Center. I'll admit, I walked in with low expectations; after all, we've been inundated with generative AI hype over the past few years. But after just two expo booth visits and conversations with vendors, my skepticism turned into sheer excitement. The relevance of 'data from documents' to AI and Large Language Models (LLMs) is not just theoretical: it's happening, and it's significant!

A Vision for TWAIN Direct and PDF/R in AI

Over the past few weeks and especially in preparation for the upcoming AI+IM Global Summit in Atlanta, I’ve been socializing a concept that integrates TWAIN Direct and PDF/R technologies into an AI/LLM solution. To date, the response has been lukewarm at best—many see it as too ambitious, outside our scope, or better left to systems integrators. But after what I saw at AI Dev World, I believe that anyone who isn’t seriously considering participating in some form or fashion in this reference solution is making a huge mistake.

Real-World Validation: Edge AI Innovations and Moorche

Let me share one real example that underscores the importance of document data in AI: a company called Edge AI Innovations. They've developed a semantic search engine called Moorche, designed specifically to ground AI in your own documents as the primary data source. Right on their sandbox UI, there's a button that says, "Upload Document." That's how integral document data is to their solution!

[Image: Edge AI Innovations]

Here’s how they describe their technology:

“Introducing Moorche Serverless RAG—the simplest way to build secure, high-performance AI chatbots and assistants. Designed by Edge AI Innovations, our platform removes the usual hurdles of setting up and scaling AI systems. With just a few steps—log in, upload your documents, pick a model, and start chatting—Moorche makes it effortless to bring your own data to life.”

After a great discussion at their booth, I couldn’t wait to test it myself. So, I uploaded a white paper on proposed SAML 2.0 support for TWAIN Direct that the TWAIN Working Group will soon publish. The result? I asked a natural language question, and Moorche not only understood but generated a well-reasoned answer: “Yes, TWAIN Direct should support SAML 2.0,” providing a rationale based on the single document I uploaded!
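
For readers curious what a test like this does under the hood, here is a minimal retrieve-then-answer sketch in Python. It is illustrative only and does not reflect Moorche's actual API or models: retrieval is naive term overlap rather than embeddings, the white-paper file name is a hypothetical stand-in, and in a real service the assembled prompt would be sent to whichever model the user selected.

```python
# Illustrative "upload a document, ask a question" loop. This is NOT
# Moorche's implementation: retrieval here is naive term overlap, and the
# final prompt would be handed to whatever model the user picked.

import re
from collections import Counter

def chunk(text: str, size: int = 400) -> list[str]:
    """Split the uploaded document into roughly fixed-size word chunks."""
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]

def overlap(query: str, passage: str) -> int:
    """Crude relevance score: shared term count between question and chunk."""
    q = Counter(re.findall(r"\w+", query.lower()))
    p = Counter(re.findall(r"\w+", passage.lower()))
    return sum(min(q[t], p[t]) for t in q)

def retrieve(query: str, chunks: list[str], k: int = 3) -> list[str]:
    """Pick the k chunks most relevant to the question."""
    return sorted(chunks, key=lambda c: overlap(query, c), reverse=True)[:k]

def build_prompt(query: str, context: list[str]) -> str:
    """Assemble the grounded prompt a RAG service would hand to its LLM."""
    return (
        "Answer the question using only the context below.\n\n"
        "Context:\n" + "\n\n".join(context) +
        f"\n\nQuestion: {query}\nAnswer:"
    )

# Hypothetical local copy of the uploaded SAML 2.0 white paper.
document_text = open("twain_direct_saml2_whitepaper.txt", encoding="utf-8").read()
question = "Should TWAIN Direct support SAML 2.0?"
prompt = build_prompt(question, retrieve(question, chunk(document_text)))
# prompt would now go to the selected model, which generates the reasoned answer.
```

Whatever the real implementation, the key property is the one the screenshot below illustrates: the answer and its rationale are grounded entirely in the uploaded document.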

[Screenshot: Moorche answers, "Yes, TWAIN Direct should support SAML 2.0," with rationale]

The Takeaway: Document Data is Critical to AI’s Future

This simple yet powerful example reinforces key points I’ve been advocating for months:

  1. Private Small Language Models (SLMs) built from your own data are critical—relying solely on public LLMs is not the best path forward.
  2. Edge-based AI models for IoT devices are not just possible; they’re highly desirable.
  3. Optimizing networks and reducing energy consumption are priorities for vendors looking to gain a foothold in the AI market.
  4. Data from documents, with the volume, variety, and real-time distributed capture that document scanners and copy machines provide, is a crucial onboarding channel for AI systems (a minimal sketch of this onboarding step follows the list).
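
To ground that last point, here is a minimal sketch of how a scanned page might be onboarded into an AI pipeline: OCR the image and split the recovered text into chunks that an embedding or indexing step could consume. It assumes Tesseract OCR plus the pytesseract and Pillow packages are installed; the file name and chunk size are arbitrary placeholders.

```python
# Minimal sketch of onboarding a scanned page into an AI pipeline:
# OCR the image, normalize the text, and split it into chunks that an
# embedding/indexing step could consume.

from PIL import Image
import pytesseract

def ocr_page(path: str) -> str:
    """Extract raw text from a scanned page image."""
    return pytesseract.image_to_string(Image.open(path))

def to_chunks(text: str, size: int = 300) -> list[str]:
    """Normalize whitespace and split the text into indexable chunks."""
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]

if __name__ == "__main__":
    page_text = ocr_page("scanned_page1.png")  # hypothetical scanner output
    for i, piece in enumerate(to_chunks(page_text)):
        print(f"chunk {i}: {piece[:60]}...")  # ready for embedding/indexing
```

A step like this is exactly where scanner and copy machine output becomes AI-ready data.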

Call to Action: Our Industry Must Act Now

To my colleagues in the document scanning, capture, and intelligent document processing (IDP) industry: our expertise is more valuable than ever. AI developers are hungry for structured, high-quality data, and we are the ones who can provide it.

If this resonates with you and you don’t want to be left behind, let’s talk! I am actively seeking collaborators for our proposed AI/LLM reference solution project. Contact me, and let’s explore how you can play a role in shaping the future of AI through document data.
