Through-The-Web Document Viewing

How CastleCMS handles viewing PDFs and other document formats in the browser without plugins

At the core of document viewing in CastleCMS is an integration with DocumentCloud's DocumentViewer. The viewer presents a clean, simple, and accessible interface for navigating uploaded documents. A document can be viewed as page images or as pages of text (generated automatically by optical character recognition [OCR]), and it can be searched by keyword.

Beyond PDF documents, the DocumentViewer can also display Word, Excel, PowerPoint, HTML, RTF, and other document formats.

When OCR is enabled, documents presented with the DocumentViewer are processed automatically by the Tesseract Open Source OCR Engine.

How does it work in CastleCMS?

First, an administrator or content editor of a CastleCMS website uploads a document. If configured to do so, the site then attempts to strip metadata from File and Image objects, as far as the metadata can be identified. Next, the document is converted to a series of page images, and text is extracted where possible. Text extraction first checks the document for usable text and falls back to OCR only if none is found. Once extracted, the text is indexed and becomes searchable.
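
The text-extraction fallback described above can be sketched in a few lines of Python. This is only an illustration of the decision order, not CastleCMS code: the function name extract_text, its parameters, and the run_ocr callback are all hypothetical.

```python
from typing import Callable, Optional


def extract_text(
    embedded_text: Optional[str],
    run_ocr: Callable[[], str],
    ocr_enabled: bool = True,
) -> str:
    """Return usable embedded text if present, otherwise fall back to OCR.

    All names here are hypothetical; this only mirrors the order of steps
    described in the article.
    """
    # Step 1: prefer text that is already present in the document.
    if embedded_text and embedded_text.strip():
        return embedded_text
    # Step 2: no usable text was found, so try OCR if it is enabled.
    if ocr_enabled:
        return run_ocr()
    # Step 3: nothing to index for this document.
    return ""
```

Passing OCR in as a callback keeps the expensive step from running at all when the document already contains usable text.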

Once the document has been processed, it can be viewed directly, or a page can be created with a DocumentViewer tile that links to the document.

Installation and Configuration Tips

Docsplit

After a fresh install of CastleCMS, make sure to install the Docsplit utility, which allows the Document Viewer product to generate searchable plain text, images, and thumbnails from uploaded documents.

Docsplit has many optional dependencies. For the broadest format support, it’s recommended to install all of them: GraphicsMagick, Poppler, Ghostscript, Tesseract, pdftk, and LibreOffice.

On Ubuntu, all of the above can be installed with two commands:

$ sudo apt update
$ sudo apt install graphicsmagick poppler-utils poppler-data ghostscript tesseract-ocr pdftk libreoffice
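
After installing, it can be worth confirming that the tools are actually on the PATH. The following is a minimal sketch using Python's standard library; the mapping of each dependency to a command name (gm, pdftotext, gs, tesseract, pdftk, soffice) reflects the usual executable names for these packages and is an assumption about this particular system, and the helper name missing_tools is hypothetical.

```python
import shutil

# Dependency name -> executable usually provided by its package (assumed names).
DOCSPLIT_TOOLS = {
    "GraphicsMagick": "gm",
    "Poppler": "pdftotext",
    "Ghostscript": "gs",
    "Tesseract": "tesseract",
    "pdftk": "pdftk",
    "LibreOffice": "soffice",
}


def missing_tools(tools=DOCSPLIT_TOOLS, which=shutil.which):
    """Return the names of dependencies whose executable is not on the PATH."""
    return [name for name, exe in tools.items() if which(exe) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Missing optional dependencies:", ", ".join(missing))
    else:
        print("All optional Docsplit dependencies were found.")
```

The which parameter is only there so the lookup can be swapped out when testing; in normal use the default shutil.which is sufficient.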

Optical Character Recognition

OCR support is disabled by default, even if all of the Docsplit dependencies are installed. It requires the Tesseract Open Source OCR Engine to be installed (the tesseract-ocr package on Ubuntu) and the OCR option to be enabled in the Document Viewer settings within CastleCMS.

To enable OCR support:

Navigate to Site Setup
Navigate to Document Viewer Settings
Find the “OCR” option, and make sure the checkbox is checked
Save the settings

Serving Files with a Static File Server

In many circumstances, resource limitations and traffic patterns make it preferable to serve documents from a dedicated static file server. CastleCMS and Document Viewer make this easy to accommodate.

First, make sure documents are stored with the ‘File’ storage type instead of the ‘Blob’ type (the default), and that a valid path is specified for where the files are stored:

Navigate to Site Setup
Navigate to Document Viewer Settings
Select the ‘File’ option in the ‘Storage Type’ dropdown
Enter a valid file system path to where files should be saved in the ‘Storage Location’ text box
Save
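
Before entering the Storage Location, the path can be sanity-checked from the account that runs CastleCMS. The sketch below is one interpretation of "valid" (an absolute path to an existing, writable directory); that interpretation and the helper name is_usable_storage_dir are assumptions, not something Document Viewer itself defines.

```python
import os


def is_usable_storage_dir(path: str) -> bool:
    """Check that path is an absolute, existing directory this user can write to.

    This is an assumed definition of a "valid" storage location.
    """
    return (
        os.path.isabs(path)
        and os.path.isdir(path)
        # Write and search permission are both needed to create files inside.
        and os.access(path, os.W_OK | os.X_OK)
    )
```

Run it as the same system user as the CastleCMS process, since the permission check applies to whoever executes it.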

Then install a static file server such as nginx (shown here for Ubuntu):

$ sudo apt install nginx

The nginx site configuration might look something like this:

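server {
    listen 80 default_server;
    server_name _;

    # Serve Document Viewer files straight from disk. The alias path is an
    # example and is assumed to match the configured Storage Location.
    location /@@dvpdffiles/ {
        alias /opt/dvpdffiles/;
        index index.htm;
    }
}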

And now, URLs pointing at files generated by Document Viewer will be served by nginx instead of a CastleCMS client process!
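
To see which file nginx will look for, it helps to know that an alias directive replaces the matched location prefix with the alias path. The Python sketch below mimics that substitution for the example values used in this article; resolve_alias is a hypothetical helper, and the file name in the usage note is made up for illustration rather than taken from Document Viewer's actual layout.

```python
from typing import Optional


def resolve_alias(
    uri: str,
    location: str = "/@@dvpdffiles/",
    alias: str = "/opt/dvpdffiles/",
) -> Optional[str]:
    """Mimic nginx alias handling for a simple prefix location.

    Returns the file system path nginx would try to serve, or None if the
    URI does not fall under the location prefix.
    """
    if not uri.startswith(location):
        return None
    # The location prefix is replaced by the alias path; the rest is kept.
    return alias + uri[len(location):]
```

For example, resolve_alias("/@@dvpdffiles/example/page_1.gif") gives "/opt/dvpdffiles/example/page_1.gif", while a URI outside the prefix gives None and would be handled by whatever else is configured for the site.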

To learn more about CastleCMS Document Viewer, or about building your new site using CastleCMS, email sales@wildcardcorp.com or call (715) 869-3440.