I Scored 100% on a Safaricom Codility Test. The Practical Part Was Elasticsearch. Let's Talk About It.


I'm a cloud engineer. My 2026 goal is to land an internship at a Big Tech company. That's what I've been building toward. Upskilling in tech, grinding DSA, the whole thing.

Applying for an attachment at Safaricom was almost a side quest. I was so nervous going in, with imposter syndrome (I have it even more rn as I'm waiting for the next steps). I had been grinding arrays, trees, and dynamic programming for weeks. None of that came up.

There were 2 tasks, and task 2 was about the "Elasticsearch Search API."

I paused. Then I smiled. Elasticsearch had been on my learning track for a while. Funny how that works.

All tests passed. 100%.

But this post isn't really about the test. It's about Elasticsearch. What it is, how it actually works, and why I think every developer building anything at scale should understand it.


What is Elasticsearch?

Elasticsearch is a distributed, open-source search and analytics engine built on top of Apache Lucene. It stores data as JSON documents and lets you search and analyze it at scale, in near real time. Visualising the results is Kibana's job.

It's the E in the ELK Stack (Elasticsearch, Logstash, Kibana).

Why Does Elasticsearch Even Exist?

Here's the problem it solves.

You have a blog. Thousands of posts. A user types "nairobi tech" in your search bar. The naive solution:

SELECT * FROM posts WHERE body LIKE '%nairobi%' OR body LIKE '%tech%';

That works. Until you have 10 million posts. Then it crawls. Then it times out. Then your users leave and never come back.

The issue isn't hardware. It's the approach. A relational database is optimised for retrieving exact records. It's not built for finding everything relevant to a phrase across millions of rows.

Elasticsearch solves this with an inverted index.

Think of the index at the back of a textbook. Instead of reading every page to find where "Nairobi" is mentioned, you flip to the index, and it tells you exactly which pages. Elasticsearch builds that index on your data automatically and keeps it updated in near real time (by default, new documents become searchable within about a second). When you search, it doesn't scan. It looks up.

That's why it returns results in milliseconds even at massive scale.


The Inverted Index, Visualised

Here's what happens when you index three blog posts:

Word        Documents
nairobi     Post 1, Post 2
tech        Post 1, Post 3
startups    Post 2
booming     Post 1
mombasa     Post 3

Search "nairobi tech" → hit the index → instantly know Post 1 contains both words, while Post 2 and Post 3 contain one each. Post 1 ranks first. No scanning. No guessing. That's the entire trick.
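Here is a minimal Python sketch of that lookup. The three post texts are made up to reproduce the table above, and the tokenizer is a simplification of what Elasticsearch really does, so treat it as an illustration of the idea rather than of Lucene's internals.

```python
import re
from collections import defaultdict

# Hypothetical post texts, chosen only to reproduce the table above.
POSTS = {
    "Post 1": "Nairobi tech booming",
    "Post 2": "Nairobi startups",
    "Post 3": "Mombasa tech",
}

def tokenize(text):
    """Lowercase the text and split on anything that is not a letter or digit."""
    return re.findall(r"[a-z0-9]+", text.lower())

def build_inverted_index(posts):
    """Map each word to the set of post names that contain it."""
    index = defaultdict(set)
    for name, text in posts.items():
        for word in tokenize(text):
            index[word].add(name)
    return index

def search(index, query):
    """Count how many query words each post contains, using lookups only."""
    hits = defaultdict(int)
    for word in tokenize(query):
        for name in index.get(word, ()):
            hits[name] += 1
    # Most matching words first; ties broken by name for stable output.
    return sorted(hits.items(), key=lambda item: (-item[1], item[0]))

index = build_inverted_index(POSTS)
results = search(index, "nairobi tech")
print(results)  # [('Post 1', 2), ('Post 2', 1), ('Post 3', 1)]
```

The search never touches the post texts. It only reads two entries of the index, which is why the cost depends on the number of query words rather than the number of posts.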


Elasticsearch vs SQL

Before touching any code, lock this in:

Elasticsearch    SQL          In context
Index            Table        blog_posts
Document         Row          One blog post
Field            Column       title, body, author
Mapping          Schema       Field type definitions
Query            SELECT       What you ask
Shard            Partition    A chunk of the index

The key difference: in SQL, you retrieve. In Elasticsearch, you search and rank. Results come back with a relevance score. Not just found or not found, but how well does this match?


Setting Up Locally

Fastest way to get running, one command:

curl -fsSL https://elastic.co/start-local | sh

This spins up Elasticsearch on port 9200 and Kibana on port 5601. Kibana is the UI that sits on top. Browse your data, build dashboards, and run queries visually. Start with raw HTTP requests first. Understanding what's happening underneath makes Kibana make sense.

Verify it's alive (I used Postman):

GET http://localhost:9200

You'll get back cluster info. Remember to authenticate with the basic credentials or API key provided during setup.
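If you'd rather script the check than click through Postman, here is a rough sketch using only Python's standard library. The username and password below are placeholders I made up, not values from the setup script, so swap in whatever yours printed.

```python
import base64
import json
import urllib.request

def build_request(url, username, password):
    """Create a GET request carrying an HTTP Basic Authorization header."""
    raw = f"{username}:{password}".encode("utf-8")
    token = base64.b64encode(raw).decode("ascii")
    request = urllib.request.Request(url, method="GET")
    request.add_header("Authorization", f"Basic {token}")
    return request

def fetch_cluster_info(request):
    """Send the request and decode the JSON body (needs a running cluster)."""
    with urllib.request.urlopen(request, timeout=5) as response:
        return json.load(response)

# Placeholder credentials: replace them with the ones from your own setup.
request = build_request("http://localhost:9200", "elastic", "changeme")
```

Once the cluster is up, calling fetch_cluster_info(request) should return the same cluster info as a Python dict. Nothing is sent over the network until you make that call.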


Creating an Index (and Why Mappings Matter)

Creating an index is like defining a schema, except you're also telling Elasticsearch how to treat each field. This is the most important decision you'll make.

PUT /blog_posts
{
  "mappings": {
    "properties": {
      "title":        { "type": "text" },
      "body":         { "type": "text" },
      "author":       { "type": "keyword" },
      "tags":         { "type": "keyword" },
      "published_at": { "type": "date" },
      "views":        { "type": "integer" }
    }
  }
}

Two field types are doing completely different jobs:

text gets analysed before indexing. With the default analyzer, "Building Tech Communities in Nairobi" is split into words and lowercased, becoming ["building", "tech", "communities", "in", "nairobi"]. This powers natural language search.

keyword is stored exactly as-is. No analysis. Used for exact matching, filtering, sorting, and aggregations.

If you map author as text when you meant keyword, filtering by exact author name won't work cleanly. And you can't change field types on an existing index without reindexing everything. Map it right the first time.
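To make the difference concrete, here is a small Python sketch. The analyze function is my rough stand-in for the default analyzer (split into words, lowercase, keep short words like "in"), not Elasticsearch's actual implementation, and the two match helpers are hypothetical names used only for illustration.

```python
import re

def analyze(text):
    """Rough stand-in for the default analyzer: split into words, lowercase."""
    return re.findall(r"\w+", text.lower())

def text_matches(query, tokens):
    """A text field matches if any analyzed query word is among its tokens."""
    return any(word in tokens for word in analyze(query))

def keyword_matches(query, stored):
    """A keyword field matches only the exact stored value."""
    return query == stored

title = "Building Tech Communities in Nairobi"
author = "Alex Nyambura"

# The text field is indexed as individual lowercase tokens.
title_tokens = analyze(title)

print(title_tokens)  # ['building', 'tech', 'communities', 'in', 'nairobi']
print(text_matches("NAIROBI", title_tokens))     # True
print(keyword_matches("alex nyambura", author))  # False
print(keyword_matches("Alex Nyambura", author))  # True
```

The text field finds "NAIROBI" because both sides were lowercased during analysis. The keyword field refuses "alex nyambura" because nothing was analysed, which is exactly what you want for filters and aggregations.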


Now insert some documents to play with. Each POST to _doc creates one document with an auto-generated ID:

POST /blog_posts/_doc
{
  "title": "Building tech communities in Nairobi",
  "body": "The Nairobi tech ecosystem has grown significantly. Developer communities like GDG are driving this growth through events and mentorship.",
  "author": "Alex Nyambura",
  "tags": ["community", "nairobi", "devrel"],
  "published_at": "2025-03-01",
  "views": 340
}

POST /blog_posts/_doc
{
  "title": "Getting started with Kubernetes on GKE",
  "body": "Google Kubernetes Engine makes deploying containerized apps straightforward. This guide walks through setting up your first cluster.",
  "author": "Alex Nyambura",
  "tags": ["devops", "kubernetes", "gcp"],
  "published_at": "2025-04-15",
  "views": 820
}

POST /blog_posts/_doc
{
  "title": "ISP infrastructure in rural Kenya",
  "body": "Last-mile internet connectivity remains a challenge. Companies like Wakanet are bridging the gap in underserved regions.",
  "author": "Felix Jumason",
  "tags": ["infrastructure", "kenya", "internet"],
  "published_at": "2025-05-10",
  "views": 210
}

Confirm they're in:

GET /blog_posts/_count

You should see "count": 3. Now you have something real to search against.


Searching: The Three Patterns You'll Use Most

1. Full-text search with match:

GET /blog_posts/_search
{
  "query": {
    "match": {
      "body": "nairobi tech community"
    }
  }
}

match analyzes your query with the same analyzer the field used at index time, then scores matches using BM25. By default, a document matches if it contains any of the query terms. Results come back ranked by relevance, highest first.

2. Search across multiple fields with multi_match:

GET /blog_posts/_search
{
  "query": {
    "multi_match": {
      "query": "kenya",
      "fields": ["title", "body", "tags"]
    }
  }
}

Real users don't know which field their keyword lives in. multi_match is what you actually want behind a search bar.

3. Combining conditions with bool:

GET /blog_posts/_search
{
  "query": {
    "bool": {
      "must": [
        { "match": { "body": "kubernetes" } }
      ],
      "filter": [
        { "term": { "author": "Alex Nyambura" } },
        { "range": { "views": { "gte": 500 } } }
      ]
    }
  }
}

This is the pattern you'll use 80% of the time in production. must is for relevance and affects the score. filter is for hard conditions that don't affect scoring. It is often faster too, because Elasticsearch skips the scoring work and can cache frequently used filters.

Rule of thumb: relevance goes in must, hard conditions go in filter.
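In application code you usually assemble this body from user input. Below is a minimal sketch of how that could look in Python. build_bool_query is a hypothetical helper of mine, not part of any Elasticsearch client, and it only covers the three conditions used in the example above.

```python
import json

def build_bool_query(text, author=None, min_views=None):
    """Build a bool query: scored match in must, unscored checks in filter."""
    filters = []
    if author is not None:
        # Exact match on a keyword field, no scoring involved.
        filters.append({"term": {"author": author}})
    if min_views is not None:
        filters.append({"range": {"views": {"gte": min_views}}})

    bool_clause = {"must": [{"match": {"body": text}}]}
    if filters:
        bool_clause["filter"] = filters
    return {"query": {"bool": bool_clause}}

body = build_bool_query("kubernetes", author="Alex Nyambura", min_views=500)
print(json.dumps(body, indent=2))
```

The printed JSON is the same request body as the bool example above, ready to send to GET /blog_posts/_search. Leaving out author and min_views drops the filter section entirely.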


Understanding Scores (BM25 in Plain English)

Every result has a _score. Three things push it up:

Term frequency: "Nairobi" appears 5 times in document A, once in document B. A wins.

Inverse document frequency: "Nairobi" only appears in 2 of 1000 documents, so it's a strong signal. "The" appears in all 1000, so searching for it tells you nothing. Common words carry low weight automatically.

Field length: "Primo Levi" in a 2-word author field is a stronger match than "Primo Levi" buried in a 500-word body.

Scores are only meaningful relative to each other within the same query. There's no fixed scale. A score of 1.9 in one query means nothing compared to 1.9 in a different query.
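Here is a sketch of the textbook BM25 formula for a single term, so you can watch the three effects in action. k1 = 1.2 and b = 0.75 are Elasticsearch's default parameters, but Lucene's real implementation differs in some details, so don't expect these numbers to match a live _score. The statistics passed in are made up.

```python
import math

K1 = 1.2   # controls how quickly repeated terms stop adding to the score
B = 0.75   # controls how strongly field length is taken into account

def idf(total_docs, docs_with_term):
    """Rarer terms get a larger weight."""
    return math.log(1 + (total_docs - docs_with_term + 0.5) / (docs_with_term + 0.5))

def bm25_term_score(term_freq, field_len, avg_field_len, total_docs, docs_with_term):
    """Score one term in one field using the textbook BM25 formula."""
    length_norm = 1 - B + B * field_len / avg_field_len
    tf_part = term_freq * (K1 + 1) / (term_freq + K1 * length_norm)
    return idf(total_docs, docs_with_term) * tf_part

# Made-up statistics, purely for illustration.
five_mentions = bm25_term_score(5, 100, 100, 1000, 2)
one_mention = bm25_term_score(1, 100, 100, 1000, 2)
common_term = bm25_term_score(1, 100, 100, 1000, 1000)
short_field = bm25_term_score(1, 2, 100, 1000, 2)
long_field = bm25_term_score(1, 500, 100, 1000, 2)

print(five_mentions > one_mention)  # True: term frequency
print(one_mention > common_term)    # True: inverse document frequency
print(short_field > long_field)     # True: field length
```

Notice that five mentions do not score five times higher than one. The term-frequency part flattens out as the count grows, which keeps keyword-stuffed documents from dominating the results.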


Pagination, Sorting, and Field Filtering

Return only the first 2 results:

GET /blog_posts/_search
{
  "size": 2,
  "from": 0,
  "query": { "match_all": {} }
}

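size = how many to return. from = offset. Page 2 would be "from": 2. The total hit count reflects the full match count, not just what's shown on the current page. One caveat: by default Elasticsearch stops counting exactly at 10,000 hits unless you set "track_total_hits": true.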
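The from value is just arithmetic on the page number. A small sketch, assuming 1-based page numbers (page_params and search_body are hypothetical helper names):

```python
def page_params(page, size):
    """Translate a 1-based page number into Elasticsearch's from/size pair."""
    if page < 1 or size < 1:
        raise ValueError("page and size must both be at least 1")
    return {"from": (page - 1) * size, "size": size}

def search_body(page, size):
    """Attach the paging parameters to a match_all query."""
    body = page_params(page, size)
    body["query"] = {"match_all": {}}
    return body

print(search_body(1, 2))  # {'from': 0, 'size': 2, 'query': {'match_all': {}}}
print(search_body(2, 2))  # {'from': 2, 'size': 2, 'query': {'match_all': {}}}
```

This style of paging is meant for the first pages of results. By default Elasticsearch rejects requests where from plus size goes past 10,000.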

Sort by publish date, newest first:

GET /blog_posts/_search
{
  "sort": [{ "published_at": { "order": "desc" } }],
  "query": { "match_all": {} }
}

When you sort by a field, _score becomes null. There's no relevance calculation. You're ordering, not ranking.

Return only specific fields:

GET /blog_posts/_search
{
  "_source": ["author", "title"],
  "query": { "match_all": {} }
}

Never return entire documents when you only need two fields. Especially at scale.


What's Next From Here

If this sparked something:

  • Fuzzy search: add "fuzziness": "AUTO" to a match query and Elasticsearch tolerates small typos

  • Aggregations: the analytics side. "Top 5 authors by post count", "views by month". SQL's GROUP BY but distributed and fast

  • Analyzers: customise how text gets tokenized. Build autocomplete, handle edge cases

  • Kibana: the UI that ships with Elasticsearch. Great for exploring data and visualising aggregations once you understand the raw query layer


One More Thing

I'm still in the middle of the Safaricom internship application process. There's more to that story and I'll write the full thing once it's wrapped up. The application process, what the assessment was actually like, what I'd tell someone going in cold.

For now, Elasticsearch isn't magic. It's a really smart index with a clean query API. Once that clicks, you'll start seeing use cases for it everywhere.

And if you're preparing for a technical assessment at any company that deals with large amounts of data, add it to your list. It came up for me. It might come up for you.


Let's connect on LinkedIn or check out more posts at blog.lxmwaniky.me. I write about cloud, software, and building things in the Kenyan tech ecosystem.
