Sitefinity Text Extraction from Documents

This post explains how to use Sitefinity’s native API to extract text from files stored in document libraries. Common file types with text content include .docx, .xls, .pdf, and .csv. Text extraction is particularly useful if you are implementing a custom search module. Another use is generating summaries and other metadata for the files for SEO. We previously tried Adobe iFilter for text extraction, but it handles only PDF documents. It also has to be installed as a plugin on the server, which adds overhead.

Progress provides a few related Knowledge Base articles on this topic. It makes sense to use Sitefinity’s native API because it is the same one Sitefinity uses internally to index these documents for its search indexes. Note the disclaimer in the KB articles: this code relies on libraries shipped with Sitefinity, and those libraries can change without notice. If you want API stability, I recommend using a third-party PDF library.

Note: Additional libraries are sometimes required to get this working, for example, to process Excel files:

Telerik.Windows.Documents.Spreadsheet.dll
Telerik.Windows.Documents.Spreadsheet.FormatProviders.OpenXml.dll
Telerik.Windows.Documents.Core.dll

A full source code example is available here:

https://docs.sitefinity.com/tutorial-index-the-contents-of-excel-files
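
As a rough illustration of what the spreadsheet libraries listed above are for, here is a minimal sketch that flattens a workbook into plain text. It assumes the stream holds an .xlsx file and that the three Telerik assemblies are referenced. The class and method names (ExcelTextSketch, ExtractText) are hypothetical, and this is not the code from the linked tutorial.

```csharp
using System.IO;
using System.Text;
using Telerik.Windows.Documents.Spreadsheet.FormatProviders.OpenXml.Xlsx;
using Telerik.Windows.Documents.Spreadsheet.Model;

// Hypothetical helper for illustration; not part of Sitefinity or the linked tutorial.
public static class ExcelTextSketch
{
    public static string ExtractText(Stream xlsxStream)
    {
        // XlsxFormatProvider reads the OpenXML (.xlsx) format only.
        var provider = new XlsxFormatProvider();
        Workbook workbook = provider.Import(xlsxStream);

        var text = new StringBuilder();
        foreach (Worksheet sheet in workbook.Worksheets)
        {
            // Walk only the rectangle of cells that actually hold content.
            CellRange used = sheet.UsedCellRange;
            for (int row = used.FromIndex.RowIndex; row <= used.ToIndex.RowIndex; row++)
            {
                for (int col = used.FromIndex.ColumnIndex; col <= used.ToIndex.ColumnIndex; col++)
                {
                    ICellValue value = sheet.Cells[row, col].GetValue().Value;

                    // RawValue is the stored text; for formula cells it is the formula itself.
                    string raw = value == null ? null : value.RawValue;
                    if (!string.IsNullOrWhiteSpace(raw))
                    {
                        text.Append(raw).Append(' ');
                    }
                }
            }
        }

        return text.ToString().TrimEnd();
    }
}
```

Because the sketch appends raw cell values, formula cells contribute their formula text rather than the calculated result. That may or may not be what you want in a search index.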

Here are the related KB articles:

https://knowledgebase.progress.com/articles/Article/How-to-programmatically-parse-PDF-documents

https://knowledgebase.progress.com/articles/Article/search-index-csv-files

The following code snippets download the content of a document stored in Sitefinity and convert it to text:

PDF:
[image: pdf]

Word:
[image: word]

Excel files:
[image: excel files]

CSV:
[image: CSV]
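
For the CSV case, the conversion step is simple enough to sketch with the .NET base class library alone. The example below is an illustration under stated assumptions, not the code from the KB article. The class and method names (CsvTextSketch, ExtractText) are hypothetical, the input is assumed to be UTF-8 (or to carry a byte order mark), and the comma split is deliberately naive.

```csharp
using System.IO;
using System.Text;

// Hypothetical helper for illustration; not a Sitefinity API.
public static class CsvTextSketch
{
    // Flattens CSV content into a single space-separated string suitable for indexing.
    public static string ExtractText(Stream csvStream)
    {
        var text = new StringBuilder();

        // leaveOpen: true so the caller stays responsible for disposing the stream.
        using (var reader = new StreamReader(csvStream, Encoding.UTF8, true, 4096, leaveOpen: true))
        {
            string line;
            while ((line = reader.ReadLine()) != null)
            {
                // Naive split: quoted fields that contain commas are not handled.
                foreach (string field in line.Split(','))
                {
                    string value = field.Trim().Trim('"');
                    if (value.Length > 0)
                    {
                        text.Append(value).Append(' ');
                    }
                }
            }
        }

        return text.ToString().TrimEnd();
    }
}
```

In a Sitefinity project, the input stream would typically come from the libraries API, for example LibrariesManager.GetManager().Download(document). Treat that call as an assumption and verify it against your Sitefinity version.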

Here are the default Sitefinity settings for DocumentService text extraction. As you can see, PDF text extraction libraries are available by default:

[image: Sitefinity Settings]


About Author

Mani Balasubramaniyan
Mani Bala is a technical lead at Americaneagle.com with over a dozen years of experience in tech. He has a master's degree in Information Systems and an engineering background. He currently works on enterprise websites using Sitefinity. He loves problem solving and learning new technologies.

