How to read files in Elasticsearch? (doc, docx, pdf)

Maciej Szymczyk
5 min read · May 27, 2020

You will probably be surprised to hear this: Elasticsearch is used for … searching. Yes, really. It turns out that it can also index the contents of doc, docx, and pdf files, among others. In this post, we’ll look at how to do it, how to change the analyzer, and how to “lose” a file if we keep it on S3 or another filesystem.

What for?

The phrase we are looking for is not always in the file name, title, or other metadata. Imagine a portal that lets you collect and search scientific articles, where each article is a separate PDF file. Making even the abstracts searchable would noticeably increase the usability of the portal. ScienceDirect uses Elasticsearch for a reason.

Environment

To enable file analysis in Elasticsearch, you need the Ingest Attachment Processor plugin. With Docker, we could open a terminal inside the container with the following command and install the plugin manually.

sudo docker exec -it container-name bash

This is a bad idea. If the container is removed, we have to repeat this step. To avoid this, I slightly modified the Docker Compose file with ELK from the Docker post and added a simple Dockerfile.

FROM docker.elastic.co/elasticsearch/elasticsearch:7.6.0
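Once the plugin is available in the image, indexing a file's contents usually means two requests: create an ingest pipeline that uses the attachment processor, then index a document whose file content is base64-encoded and routed through that pipeline. Below is a minimal sketch of the two request bodies, built with the Python standard library only. The field name `data`, the extra `filename` field, and the helper function names are assumptions made for illustration, not something taken from this post.

```python
import base64
import json


def build_attachment_pipeline(field="data"):
    """Body for PUT _ingest/pipeline/<name> (pipeline name is up to you)."""
    return {
        "description": "Extract text from binary files",
        # The attachment processor reads base64 content from `field`.
        "processors": [{"attachment": {"field": field}}],
    }


def build_document(raw_bytes, filename, field="data"):
    """Body for PUT <index>/_doc/<id>?pipeline=<name>."""
    return {
        # `filename` is a hypothetical metadata field, kept for searching.
        "filename": filename,
        # The processor expects the file content as a base64 string.
        field: base64.b64encode(raw_bytes).decode("ascii"),
    }


if __name__ == "__main__":
    pipeline = build_attachment_pipeline()
    doc = build_document(b"Hello from a fake file", "hello.txt")
    print(json.dumps(pipeline, indent=2))
    print(json.dumps(doc, indent=2))
```

With the processor's default settings, the extracted text should end up in the `attachment.content` field of the indexed document, which is the field to search on.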

