This blog post explains how to use Sitefinity’s native API to extract text from files stored in document libraries. Common file types with text content include .docx, .xls, .pdf, and .csv. Text extraction is particularly useful if you are implementing a custom search module. Another use is updating the summary and other metadata of the files for SEO. We previously tried ‘Adobe iFilter’ for this, but it only handles PDF documents. It also has to be installed as a plugin on the server, which adds setup and maintenance overhead.
Progress provides a few related Knowledge Base articles on this topic. It makes sense to use Sitefinity’s native API, because Sitefinity itself uses the same API internally to index these documents for its search indexes. Note the disclaimer in the KB articles: this code relies on libraries shipped with Sitefinity. Those libraries can, however, change without notice. If you want API stability, I recommend using a third-party PDF library.
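One way to limit the impact of the disclaimer above is to keep the extraction call behind a small lookup keyed by file extension, so the underlying library can be swapped later. The following is only a minimal sketch in Python using the standard library; it is not Sitefinity's API, and the names register, extract_text, and extract_csv_text are hypothetical. Only a CSV extractor is shown, and UTF-8 input is assumed.

```python
import csv
import io
from pathlib import PurePath
from typing import BinaryIO, Callable, Dict

# Hypothetical registry: lowercase file extension -> extractor function.
Extractor = Callable[[BinaryIO], str]
_EXTRACTORS: Dict[str, Extractor] = {}


def register(extension: str, extractor: Extractor) -> None:
    """Associate an extension such as '.csv' with an extractor."""
    _EXTRACTORS[extension.lower()] = extractor


def extract_csv_text(stream: BinaryIO) -> str:
    """Flatten CSV cells into plain text, one output line per row.

    Assumes UTF-8 input (a leading byte order mark is tolerated).
    """
    text = stream.read().decode("utf-8-sig")
    reader = csv.reader(io.StringIO(text, newline=""))
    lines = []
    for row in reader:
        cells = [cell.strip() for cell in row if cell.strip()]
        lines.append(" ".join(cells))
    return "\n".join(lines)


def extract_text(filename: str, stream: BinaryIO) -> str:
    """Pick an extractor by file extension; return '' if none is registered."""
    extension = PurePath(filename).suffix.lower()
    extractor = _EXTRACTORS.get(extension)
    if extractor is None:
        return ""  # unsupported type: the caller can index metadata only
    return extractor(stream)


register(".csv", extract_csv_text)
```

With this shape, moving from one PDF or spreadsheet library to another would mean registering a different function for that extension, while the search or SEO code that calls extract_text stays unchanged.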
Note: Additional libraries are sometimes required to get this working. For example, processing Excel files requires:
Telerik.Windows.Documents.Spreadsheet.dll
Telerik.Windows.Documents.Spreadsheet.FormatProviders.OpenXml.dll
Telerik.Windows.Documents.Core.dll
A full source code example for Excel files is available here:
https://docs.sitefinity.com/tutorial-index-the-contents-of-excel-files
Here are the related KB articles:
https://knowledgebase.progress.com/articles/Article/How-to-programmatically-parse-PDF-documents
https://knowledgebase.progress.com/articles/Article/search-index-csv-file
The default Sitefinity settings for DocumentService text extraction show that PDF text extraction libraries are available out of the box.