Optimizing Search Architecture
What I learned rebuilding search in my job board app, cleaning up noisy indexes, adding stemming and field weights, and why OR logic in fuzzy search is a trap.
I spent a session rebuilding the search in my job board app. The old version had a handful of problems that were easy to miss individually but added up to genuinely bad results. Here's what I found and fixed.
The Setup
The app crawls Swedish job sites, extracts structured data (tech stack, company info) using an LLM, and lets you search for companies by technology. Search runs on PostgreSQL only (no Elasticsearch, no external search service). Just two tiers: full-text search (FTS) for exact keyword matches, and trigram similarity as a fuzzy fallback for typos and partial words.
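A minimal sketch of the two tiers, assuming a `companies` table with a precomputed `search_vec` tsvector and a plain `search_text` column (the table and column names here are illustrative, not the app's actual schema):

```sql
-- Tier 1: full-text search against a precomputed tsvector
SELECT name
FROM companies
WHERE search_vec @@ websearch_to_tsquery('english', $1);

-- Tier 2, run only if tier 1 comes back empty: trigram fuzzy match
-- (requires the pg_trgm extension)
SELECT name
FROM companies
WHERE search_text % $1
ORDER BY similarity(search_text, $1) DESC
LIMIT 20;
```

The application tries the FTS query first and only falls back to the trigram query when it returns nothing.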
Category Labels in the Search Index
The search text for each company is extracted from the tech profile produced
by the LLM and stored in a database column. It used to look like this:
Languages: Java 21. Frameworks: Spring Boot. Tools: Docker. That meant
searching for "tools" matched every single company that had any tools listed,
because the word "tools" was in the index for all of them. Same for
"languages", "frameworks", and so on. The fix: strip all category labels. The
search text is now just a flat list: Java 21 Spring Boot Docker. The labels
were for humans reading the data; the search engine doesn't need them.
No Stemming
I was using PostgreSQL's simple text search config, which does zero linguistic
processing. Searching "developing" wouldn't match a company description
containing "develop". Switching to the english config enabled stemming, so
word variations now match correctly.
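The difference is easy to see in psql: the `english` config stems words to a common root, while `simple` only lowercases them.

```sql
SELECT to_tsvector('simple', 'developing');
-- 'developing':1   (no stemming, so a query for "develop" won't match)

SELECT to_tsvector('english', 'developing');
-- 'develop':1      (stemmed to the root)

SELECT to_tsvector('english', 'developing') @@ to_tsquery('english', 'develop');
-- t                (the variants now match)
```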
Side note
I was originally using a text-embedding model running locally on Ollama
(nomic-embed-text:v1.5) and the pgvector extension to produce a similarity
vector from the entire search text. But it was the wrong choice for this
problem: it gave nearly random results in my trial runs (the way I actually
wanted to use the app). The cosine distance between the vectors simply didn't
capture what I wanted in terms of search efficacy. I'll have to investigate
other use cases for it, because it was quite simple to set up, the embedding
model was lightning fast, and so was the Postgres query for finding similarly
angled vectors (cosine distance).
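For reference, the pgvector query really is a one-liner. Assuming an `embedding` vector column on the same hypothetical `companies` table, nearest neighbors by cosine distance are a single ORDER BY:

```sql
-- <=> is pgvector's cosine distance operator; $1 is the query embedding
SELECT name
FROM companies
ORDER BY embedding <=> $1
LIMIT 10;
```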
Everything Ranked Equally
Company name, tech stack items, and a 200-word company description all had the
same weight in search results. A company called "Spring" should rank higher for
the query "Spring" than a company that happens to mention spring in a paragraph
about their office renovation. PostgreSQL's tsvector supports field-level
weights (A through D). Company name gets the highest weight, tech stack gets
medium, description gets the lowest.
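A sketch of the weighted vector, assuming columns named `name`, `tech_stack`, and `description` (the actual column names in the app may differ):

```sql
UPDATE companies
SET search_vec =
    setweight(to_tsvector('english', coalesce(name, '')), 'A') ||        -- highest
    setweight(to_tsvector('english', coalesce(tech_stack, '')), 'B') ||  -- medium
    setweight(to_tsvector('english', coalesce(description, '')), 'D');   -- lowest

-- ts_rank then favors matches in higher-weighted fields
SELECT name, ts_rank(search_vec, query) AS rank
FROM companies, websearch_to_tsquery('english', 'spring') AS query
WHERE search_vec @@ query
ORDER BY rank DESC;
```

With this in place, a company named "Spring" outranks one that merely mentions spring in its description.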
The OR vs AND Trap in Fuzzy Search
This was the sneaky one. The trigram tier tokenizes multi-word queries and fuzzy-matches each word separately. My first implementation used OR logic: if any word matched, the company showed up. Searching "health tech" returned basically every company in the database. Why? Because the word "tech", a 4-letter prefix of "technology", fuzzy-matched against nearly everything. The "health" part was ignored entirely. Switching to AND logic fixed it immediately. Now all words must fuzzy-match, consistent with how FTS works. Trigram's job is to catch typos like "Postgre" -> "PostgreSQL", not to do semantic search. If you search "health tech" and a company has nothing related to health, it shouldn't show up, even if it's a tech company.
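In SQL terms the fix is just joining the per-token conditions with AND instead of OR. Here `<%` is pg_trgm's word-similarity operator, which checks whether the token fuzzy-matches some word within the text (table and column names are again illustrative):

```sql
-- Before (OR): "tech" alone fuzzy-matches nearly every company
SELECT name FROM companies
WHERE 'health' <% search_text
   OR 'tech'   <% search_text;

-- After (AND): every query token must fuzzy-match somewhere in the text
SELECT name FROM companies
WHERE 'health' <% search_text
  AND 'tech'   <% search_text;
```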
The Takeaway
Search is deceptively hard. Each of these issues seemed small in isolation, but together they made results feel random. The biggest lesson: be very deliberate about what goes into your search index and how multi-word queries combine. OR semantics in fuzzy search sounds generous and user-friendly until you realize it makes every short common word a wildcard that matches everything.
