The logic of document capture

Indexing, Metadata, Keyword, SharePoint, Capture, Scanner, Documents, ECM, Content Management

What is wrong with the collection of words above?  The terms are closely related, but they have no logical structure, so they offer little value to anyone reading them.  To be readable in context, they need to be logically organized into a sentence.  The logic of document capture and Enterprise Content Management is much the same.  In this blog post, instead of going into the nuts and bolts of document capture, I thought it more important to discuss two components that are critical to the success, or failure, of your content management strategy: taxonomy and metadata.  This is philosophy, not technology.

To break down document capture into its simplest form, think of it as the process of extracting information from a document and making that information available in the future.  The future could be immediate, where a scanned invoice, for example, immediately kicks off a payment process.  Or it could be two weeks from now, when a customer service agent needs to retrieve a signed airbill as proof of delivery.  The point is that document retrieval is based on a unique keyword, or a set of keywords, related to a particular document.  In the case of the invoice it could be the invoice number, and in the case of the airbill it could be the shipping tracking number.

Without a well thought-out strategy, your organization may accomplish nothing more than taking a paper mess and converting it into an electronic mess.

Establish a well thought-out taxonomy

Taxonomy is defined as classifying organisms into groups based on similarities; applied to documents, it means grouping them into categories based on what they have in common.  Why is taxonomy relevant for document capture?  For several reasons, including security, quicker access to information and retention policies.  If you work backwards when deciding how, and with what technology, to implement your document capture solution, a solid consensus on the end result is of paramount importance.  The end result is typically a high-quality scanned image conducive to data capture (OCR, ICR, OMR, bar code, etc.) and the metadata itself.  If your taxonomy is methodically organized, it should make your document capture strategy fairly obvious.

Let’s take security as the first benefit of a well thought-out taxonomy.  By segregating documents based on a logical taxonomy, organizations gain an additional level of comfort knowing that a restrictive set of security policies can be applied to, for example, Human Resources documents, while everyone is allowed access to a general set of scanned documents such as the café menu, which is clearly not a sensitive document.

Another benefit of a well thought-out taxonomy is quicker access to information for users.  Many content management software applications and search engines use a ‘crawl’ method to check for newly added content and add it to an index (database), which is then searchable.  As you can imagine, ‘crawling’ a narrower scope keeps the index up to date much more quickly, and access times can also be considerably shorter because a search covers only the relevant indexed data rather than the entire database.

Lastly, with regard to retention policies, having your data well organized is a major benefit.  Imagine that an organization has all of its tax documents properly stored electronically, via a well thought-out taxonomy, in its content management system.  The organization can then easily, and within corporate governance standards and policies, remove these images from its repository based on a retention schedule.  As illustrated, investing the time to develop a strong taxonomy is important for many reasons, including security, searchability and retention.

It is extremely important not to overlook this concept when planning a document capture strategy.  A simple taxonomy might be organized like this:

  • Accounting
    • Accounts Receivable
      • Check
      • Statement
    • Accounts Payable
      • Invoice
      • Receipt
  • Human Resources
    • Applications
    • Resumes
    • W2 Forms
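
As a rough sketch of how a taxonomy like this could drive security and retention, the sample tree can be written as a nested structure and each document type resolved to its branch.  The `TAXONOMY` and `POLICIES` names, the reader groups and the retention periods are all assumptions made up for illustration; they do not come from any particular content management product.

```python
# Hypothetical representation of the sample taxonomy as a nested dict.
TAXONOMY = {
    "Accounting": {
        "Accounts Receivable": {"Check": {}, "Statement": {}},
        "Accounts Payable": {"Invoice": {}, "Receipt": {}},
    },
    "Human Resources": {
        "Applications": {},
        "Resumes": {},
        "W2 Forms": {},
    },
}

# Assumed policy table keyed by top-level branch (illustrative values only).
POLICIES = {
    "Accounting": {"readers": {"finance"}, "retention_years": 7},
    "Human Resources": {"readers": {"hr"}, "retention_years": 5},
}


def find_path(doc_type, tree=TAXONOMY, prefix=()):
    """Return the tuple of category names leading to doc_type, or None."""
    for name, children in tree.items():
        path = prefix + (name,)
        if name == doc_type:
            return path
        found = find_path(doc_type, children, path)
        if found is not None:
            return found
    return None


def policy_for(doc_type):
    """Look up the policy of the top-level branch that contains doc_type."""
    path = find_path(doc_type)
    if path is None:
        return None  # unknown document type: no policy can be inferred
    return POLICIES[path[0]]


print(find_path("Invoice"))
print(policy_for("Resumes"))
```

Because the policy hangs off the branch rather than the individual document, adding a new document type under an existing branch would inherit that branch's security and retention settings without further work.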


Developing a well thought-out strategy might seem cumbersome in the initial stages of establishing your document capture process, but it can save organizations significant time, money and aggravation in the long run.  As a document capture best practice, establish a solid taxonomy for scanned documents and re-evaluate it whenever new types of documents are introduced within your organization.

 

Consider what information is important, and what is not

Creating Searchable PDFs is one form of document capture; however, it is not always an ideal document capture strategy.  While creating Searchable PDF images of your scanned documents is sometimes the right approach for an organization, this technique often creates inefficiencies.  You might be asking yourself how creating a fully Searchable PDF, with all the words of the document indexed, could be construed as inefficient.  Let me elaborate.  When creating a Searchable PDF, the scanning software does its best to recognize every single character and every single word on a page.  This might sound appealing, but let’s consider the possible results in real-world applications.  Imagine that an organization in the insurance business scans as few as 100 single-page documents and creates Searchable PDF documents.  Later, a user wants to retrieve one of those documents by keyword and uses the word “claim” as the search criterion.  The user would most likely be presented with a long list of links to possible documents, of which only one is the document they are looking for; the rest is “irrelevant search”.  This is because the entire page was indexed via the Searchable PDF method.  Alternatively, if your data capture strategy had extracted only the “relevant search” terms that apply to a particular document, the organization would be much more efficient, because users could find the data they requested with the first search.
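
To make the difference concrete, here is a minimal sketch that compares a full-text search with a search on one extracted metadata field.  The 100 sample documents, the `claim_number` field and the `CLM-0042` value are invented for illustration and are not taken from a real system.

```python
# Hypothetical corpus: 100 single-page insurance documents, each with
# its full recognized text and one extracted metadata field.
documents = [
    {
        "id": n,
        "full_text": f"Insurance claim form. Claim number CLM-{n:04d}. "
                     "Please review this claim.",
        "metadata": {"claim_number": f"CLM-{n:04d}"},
    }
    for n in range(1, 101)
]


def full_text_search(term):
    """Return ids of documents whose recognized text contains the term."""
    term = term.lower()
    return [d["id"] for d in documents if term in d["full_text"].lower()]


def metadata_search(field, value):
    """Return ids of documents whose metadata field equals the value."""
    return [d["id"] for d in documents if d["metadata"].get(field) == value]


# Every page mentions "claim", so the full-text search matches all 100.
print(len(full_text_search("claim")))               # 100
# The extracted claim number identifies exactly one document.
print(metadata_search("claim_number", "CLM-0042"))  # [42]
```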

Another significant benefit of an integrated document capture/content management strategy is that metadata fields created, and rules applied, in the content management system can often be brought forward and applied in the document capture system itself.  For example, if an organization’s policy dictates that the Social Security number field on a healthcare insurance form is required and must be exactly nine numeric characters, these rules can be enforced directly in the document capture system.  This allows for great consistency and continuity in your data capture process.
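
A rule like this could be expressed once and checked at capture time.  The sketch below is only an illustration: the `FIELD_RULES` table and the `validate_field` helper are hypothetical names, not the API of any capture or content management product.

```python
import re

# Hypothetical rule table, as it might be shared between the content
# management system and the capture system.
FIELD_RULES = {
    "social_security_number": {"required": True, "pattern": r"[0-9]{9}"},
}


def validate_field(name, value):
    """Return True if the captured value satisfies the rule for the field."""
    rule = FIELD_RULES[name]
    if value is None or value == "":
        # An empty value is acceptable only when the field is optional.
        return not rule["required"]
    # fullmatch requires the whole value to match: exactly nine digits.
    return re.fullmatch(rule["pattern"], value) is not None


print(validate_field("social_security_number", "123456789"))  # True
print(validate_field("social_security_number", "12345678A"))  # False
```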

An analogy I like to use: go to your favorite internet search engine and enter a vague term such as “taxonomy for document capture”.  You will get a long list of ‘hits’ that are probably not of interest, because you are looking for a specific piece of information or a scanned image.  By contrast, if you enter a more specific term such as “aim document taxonomy”, the search is narrowed down to a more relevant list of results.  This is an example of relevant search versus irrelevant search, and it’s all related to applying metadata to web pages, electronic documents and, yes, especially scanned images.

Summary: Organized taxonomy + relevant metadata = Efficient process

In summary, my point is to carefully plan out your document capture process.  Pay close attention to developing an effective taxonomy for your documents.  Determine what information is important on a particular document and what is not.  Document capture technology has evolved to nearly magical proportions, but the truth is that organizations can still greatly improve their efficiency and content management effectiveness through careful planning; after all, there still is logic to document capture.

Do you have thoughts on the topic of document capture, taxonomy or classification?  Please share your comments.

Building an effective capture solution – Part 3 of 3 (Storage/Business Policy/Workflow)


 

The real value of capture is realized when the information extracted from images is used within a business process, whether that information kicks off an approval process for expense reports, for example, or is a Social Security Number used to retrieve your medical records.  The ‘index values’, ‘metadata’ or ‘tags’ (whatever you would like to call these extracted keywords) help create the workflow that makes processes more efficient.  After all, an image without recognized characters, numbers or words tells a computer nothing about what information the document contains.  It’s the information on the document that is of most importance, not just the image.

These days there are many great storage options for captured images and metadata, but not all are created equal.  Below are a few considerations for storage as it directly relates to document capture.

Storage considerations for document capture applications:

  • Does your storage, and image viewer, support well-known document formats such as TIFF, PDF, JPEG, DOC, XLS and others, as well as emerging formats such as PDF/A or XML?  A universal viewer that supports a wide range of formats is preferable because you never know how requirements might change in the future.  You might also want to consider a viewer that allows for annotation, or markup, of images with items such as sticky notes, highlighting or shapes, if your process requirements dictate these needs.
  • The capture process is all about extracting metadata from images, so does your storage provide a metadata framework in which you can store this information to enhance search and retrieval?  Basically, this means: does the storage provider offer a method to map captured index fields to database storage fields?
  • Security.  Of course security should be a major concern if your information is not intended for public consumption.  While it’s an important issue, in general, if you ensure three simple features of your solution then you will address 80% of potential problems:  (1) secure disk-wiping of temporary image files, (2) encryption of data in motion and (3) encryption of data at rest.  Of course these are not the only three items to consider, but start with these and research other security techniques based on the sensitivity of your information.
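
The metadata point above, mapping captured index fields to database storage fields, might look something like the following in practice.  This is a minimal sketch under assumed names: the `FIELD_MAP` entries and the `map_index_fields` helper are hypothetical and do not correspond to any specific storage provider.

```python
# Assumed mapping from capture index field names to storage column names.
FIELD_MAP = {
    "Invoice Number": "invoice_no",
    "Invoice Date": "invoice_date",
    "Vendor": "vendor_name",
}


def map_index_fields(captured):
    """Translate captured index fields into a storage record.

    Returns the mapped record plus a list of captured fields that have
    no storage field, so they can be reviewed instead of silently lost.
    """
    record, unmapped = {}, []
    for field, value in captured.items():
        column = FIELD_MAP.get(field)
        if column is None:
            unmapped.append(field)
        else:
            record[column] = value
    return record, unmapped


record, unmapped = map_index_fields(
    {"Invoice Number": "INV-1001", "Vendor": "Acme", "PO Number": "77"}
)
print(record)
print(unmapped)
```

Reporting unmapped fields rather than dropping them is a design choice for the sketch; it gives an early signal when the capture side and the storage side have drifted apart.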

Now that we have covered two of the three basic components of ‘Building an effective capture solution’ (User Experience and Processing) and have just outlined some Storage considerations, we should focus on the main theme of these posts: ‘Capture begins with process’.  In other words, and as I stated in the prelude to this series of blog posts, before considering all the technology and architectural options you should carefully consider the business process, or process workflow, first.  Capture does not begin with a scan of a paper or a picture taken with a smartphone; it begins with process.

Below are a few considerations for business application providers as they relate specifically to document capture:

Business rule considerations for capture:

  • Data type constraints.  If the field is a ‘Date’ field, restrict the data in this field to date values only.  If the field is a ‘Social Security Number’ or ‘Phone Number’, allow only numbers rather than letters.  Conversely, if the field is a ‘Name’ field, the data type should allow letters rather than numbers.
  • Database validation.  One of the greatest ways to ensure business continuity, as well as reduce errors in your document capture solution, is to perform database validation.  In other words, when a particular piece of information, such as a Phone Number, is extracted from a document, a database lookup is executed to confirm that the captured Address field matches the address on record for that Phone Number.  If it doesn’t, or there are multiple matches, the capture workflow can automatically send the information to a validation station where a human will verify the correct data.  This helps to achieve the highest level of accuracy.
  • Exception handling.  Handling exceptions is a critical, yet often overlooked, part of the overall capture strategy.  We all hope our system works 100 percent perfectly, but this is just not reality, for many reasons.  After all, there are a lot of moving parts in these types of solutions:  people, process, hardware, software, client, server, etc.  Be prepared, and actually expect that ‘things’ will happen.  Try to define the possibilities.  For example, if you are automatically classifying documents, expect that the system will encounter unrecognized documents and be prepared to send those to an exception queue for manual classification.  This is also a great opportunity to ‘tune’ the system by adding a classification technique to recognize this document type in the future.  An activity that might have been perceived as a negative, had exceptions not been considered, becomes a process for improving system accuracy over time.
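
The three rules above can be sketched together: type checks first, then a database lookup, with anything doubtful routed to a human.  This is an illustration only.  The `customers` table, its sample rows and the `route` function are assumptions, and an in-memory SQLite database stands in for whatever business system would actually be queried.

```python
import sqlite3
from datetime import datetime


def is_valid_date(value, fmt="%Y-%m-%d"):
    """Data type constraint: accept only values that parse as a date."""
    try:
        datetime.strptime(value, fmt)
        return True
    except ValueError:
        return False


def is_digits(value):
    """Data type constraint: accept only ASCII digits (e.g. phone number)."""
    return value.isascii() and value.isdigit()


def route(conn, phone, address):
    """Return 'accept' or 'validation_station' for one captured record."""
    if not is_digits(phone):
        return "validation_station"
    rows = conn.execute(
        "SELECT address FROM customers WHERE phone = ?", (phone,)
    ).fetchall()
    # Only a single match whose address agrees with the capture passes;
    # no match, a mismatch or multiple matches all go to a human.
    if len(rows) == 1 and rows[0][0] == address:
        return "accept"
    return "validation_station"


# Hypothetical lookup data standing in for a real customer database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (phone TEXT, address TEXT)")
conn.executemany(
    "INSERT INTO customers VALUES (?, ?)",
    [
        ("5551230000", "1 Main St"),
        ("5559990000", "2 Oak Ave"),
        ("5559990000", "3 Elm Rd"),  # duplicate phone: ambiguous match
    ],
)

print(route(conn, "5551230000", "1 Main St"))   # accept
print(route(conn, "5551230000", "9 Wrong St"))  # validation_station
print(route(conn, "5559990000", "2 Oak Ave"))   # validation_station
```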

Now that we have discussed some of the high-level concepts of building an effective capture solution, I invite you to dig a bit deeper into specifics of each area of interest to you.  We have many educational articles to supplement each of these three components of a solution including some of the following:

Building an effective capture solution:

Part 1 of 3 (User Experience/Device/Interface):  Network scanning, mobile, multistream/color dropout
Part 2 of 3 (Capture/Processing/Transformation):  High resolution scanning, forms processing, As a Service
Part 3 of 3 (Storage/Business Policy/Workflow):  SharePoint, cloud computing, taxonomies/metadata

Finally, if I could leave you with one bit of advice from my industry experience, it is this: to build a highly effective capture solution, reverse-engineer the solution starting from the process, and the choice of device and other considerations should ultimately be fairly obvious.  Go from process to device, not from device to process.  Start by defining the process, then build accordingly.  This will ensure the highest level of success, efficiency and user adoption.
