How to read files in Elasticsearch? (doc, docx, pdf)

This may surprise you: Elasticsearch is used for … searching. Yes, really. It turns out that it can also index the contents of doc, docx, pdf, and similar files. In this post, we’ll look at how to do that, how to change the analyzer, and how to “lose” the file itself if we keep it on S3 or another filesystem.

What for?

The phrase we are looking for is not always in the file name, title, or other provided metadata. Imagine a portal that lets you collect and search scientific articles, where each article is a separate PDF file. Adding just the abstracts to the searchable content would already noticeably improve the usability of the portal. ScienceDirect uses Elasticsearch for a reason.

Environment

To enable file analysis in Elasticsearch, you need the Ingest Attachment Processor Plugin. With Docker, we could open a terminal inside the container with the following command and install the plugin by hand.

sudo docker exec -it container-name bash

This is a bad idea: if the container is removed, we will have to repeat the installation. To avoid this, I slightly modified the Docker Compose file with ELK from the Docker post and added a simple Dockerfile.

FROM docker.elastic.co/elasticsearch/elasticsearch:7.6.0
RUN bin/elasticsearch-plugin install --batch ingest-attachment

So docker-compose.yml looks something like this:

version: '2.2'
services:
  elasticsearch:
    build: ./custom-elasticsearch/
#    image: docker.elastic.co/elasticsearch/elasticsearch:7.6.0
    restart: unless-stopped
    ...
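
Once the container is rebuilt, it is worth checking that the plugin actually made it into the image. The sketch below is one possible way to do that, not part of the original setup: it parses the JSON returned by GET _cat/plugins?format=json and looks for ingest-attachment. The function names and the http://localhost:9200 address are assumptions, and the cluster is assumed to be reachable without authentication.

```python
import json
import urllib.request

PLUGIN = "ingest-attachment"


def plugin_installed(cat_plugins_json: str, plugin: str = PLUGIN) -> bool:
    """Return True if `plugin` is listed in the output of GET _cat/plugins?format=json."""
    rows = json.loads(cat_plugins_json)
    # Each row describes one plugin on one node; the plugin name is in "component".
    return any(row.get("component") == plugin for row in rows)


def fetch_plugins(base_url: str = "http://localhost:9200") -> str:
    """Fetch the plugin list as JSON text.

    Assumption: base_url points at an unsecured local node; adjust for your setup.
    """
    with urllib.request.urlopen(f"{base_url}/_cat/plugins?format=json") as resp:
        return resp.read().decode("utf-8")
```

With the stack running, plugin_installed(fetch_plugins()) should return True if the Dockerfile build step succeeded.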

Pipeline preparation

What is a pipeline, anyway? It is a definition of a series of processors that are executed in the order in which they were declared. In other words, every document we send to the index is passed through each defined processor. This lets us enrich the document with new fields, transform it, or even drop it if a defined condition is met.
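
To make the ordering concrete, here is a minimal sketch of a pipeline body with two processors: attachment, which extracts text from a base64-encoded field, followed by remove, which deletes the raw field afterwards. The field name data, the helper function names, and the choice to remove the source field are assumptions made for illustration, not necessarily the pipeline this post ends up with.

```python
import base64
import json


def build_attachment_pipeline(source_field: str = "data") -> dict:
    """Build a request body for PUT _ingest/pipeline/<name>.

    Processors run top to bottom, so "remove" only runs after
    "attachment" has already extracted the text.
    """
    return {
        "description": "Extract text from binary attachments",
        "processors": [
            # Reads base64 content from source_field, writes results under "attachment".
            {"attachment": {"field": source_field}},
            # Drops the (potentially large) base64 payload from the stored document.
            {"remove": {"field": source_field}},
        ],
    }


def build_document(raw: bytes, filename: str, source_field: str = "data") -> dict:
    """Wrap file bytes as a JSON-serializable document; the processor expects base64."""
    return {
        "filename": filename,  # hypothetical extra field, kept for searching by name
        source_field: base64.b64encode(raw).decode("ascii"),
    }


if __name__ == "__main__":
    print(json.dumps(build_attachment_pipeline(), indent=2))
```

Assuming the pipeline is registered under a hypothetical name such as attachment, a document built this way could be indexed with PUT my_index/_doc/1?pipeline=attachment, and the extracted text would show up in the attachment.content field.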

Written by Maciej Szymczyk

Software Developer, Big Data Engineer, Blogger (https://wiadrodanych.pl), Amateur Cyclist & Triathlete, @maciej_szymczyk