Sitefinity Text Extraction from Documents

This post explains how to use Sitefinity’s native API to extract text from files stored in document libraries. Common file types with text content include .docx, .xls, .pdf, and .csv. This is particularly useful if you are implementing a custom search module. Another use is extracting text to update the summary and other metadata of the files for SEO. We previously tried ‘Adobe iFilter’ to extract text from PDF documents, but it only handles PDFs. Adobe iFilter also has to be installed as a plugin on the server, and it takes additional overhead to get it working.

Progress provides a few related Knowledge Base articles on this topic. It makes sense to use Sitefinity’s native API, because it is the same API Sitefinity uses internally to index these documents for its search indexes. Note the disclaimer mentioned in the KB articles: this code relies on libraries shipped with Sitefinity. Those libraries can, however, change without notice. If you want API stability, I recommend using a third-party PDF library.

Note: Additional libraries are sometimes required to get this working. For example, processing Excel files requires:

Telerik.Windows.Documents.Spreadsheet.dll
Telerik.Windows.Documents.Spreadsheet.FormatProviders.OpenXml.dll
Telerik.Windows.Documents.Core.dll

A full source code example for Excel files is available here:

https://docs.sitefinity.com/tutorial-index-the-contents-of-excel-files

Here are the related KB articles:

https://knowledgebase.progress.com/articles/Article/How-to-programmatically-parse-PDF-documents

https://knowledgebase.progress.com/articles/Article/search-index-csv-file
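
To make the CSV case concrete, here is a minimal, language-neutral sketch of what "extracting text for indexing" amounts to. It is written in Python using only the standard library and does not use Sitefinity's API. The names extract_csv_text, EXTRACTORS, and extract_text are hypothetical and chosen for illustration only; the idea is simply a lookup from file extension to an extractor function that flattens the file into plain text.

```python
import csv
import io
from pathlib import Path
from typing import Callable, Dict, Optional


def extract_csv_text(data: bytes, encoding: str = "utf-8") -> str:
    """Flatten CSV bytes into space-separated text suitable for a search index."""
    # newline="" lets the csv module handle line endings inside quoted fields.
    reader = csv.reader(io.StringIO(data.decode(encoding), newline=""))
    cells = (cell.strip() for row in reader for cell in row)
    # Skip empty cells so the result has no runs of blank tokens.
    return " ".join(cell for cell in cells if cell)


# Hypothetical registry: file extension -> extractor function.
# This is an illustration, not Sitefinity's configuration model.
EXTRACTORS: Dict[str, Callable[[bytes], str]] = {
    ".csv": extract_csv_text,
}


def extract_text(filename: str, data: bytes) -> Optional[str]:
    """Return extracted text, or None when no extractor handles the extension."""
    extractor = EXTRACTORS.get(Path(filename).suffix.lower())
    return extractor(data) if extractor else None


if __name__ == "__main__":
    print(extract_text("Report.CSV", b"name,qty\r\nwidget, 3\r\n"))
```

Returning None for an unsupported extension, rather than raising, lets a caller skip files it cannot index. In a real Sitefinity project the equivalent work is done by the platform's own extractors, as described in the KB articles above.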


Here are the default Sitefinity settings for DocumentService text extraction. You can see that PDF text extraction libraries are available by default.

About the Author

Americaneagle.com Staff

Americaneagle.com has a dedicated team of strategists, technologists, and content writers to help you stay up to date with the latest and greatest trends in the technology industry. We cover a wide variety of topics on a regular basis, some of which include website design, website development, digital marketing services, ecommerce, accessibility, website hosting and security, and so much more. Educating our clients, prospects, and readers is very important to us and we appreciate the opportunity to be an authoritative voice in the industry.