Optimizing Search Architecture
What I learned rebuilding search in my job board app, cleaning up noisy indexes, adding stemming and field weights, and why OR logic in fuzzy search is a trap.
I spent a session rebuilding the search in my job board app. The old version had a handful of problems that were easy to miss individually but added up to genuinely bad results. Here's what I found and fixed.
The Setup
The app crawls Swedish job sites, extracts structured data (tech stack, company info) using an LLM, and lets you search for companies by technology. Search runs on PostgreSQL only (no Elasticsearch, no external search service). Just two tiers: full-text search (FTS) for exact keyword matches, and trigram similarity as a fuzzy fallback for typos and partial words.
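A minimal sketch of the two tiers, assuming a `companies` table with a precomputed `search_vec` tsvector and a plain `search_text` column (the table and column names here are illustrative, not the app's actual schema):

```sql
-- Tier 1: full-text search against a precomputed tsvector
SELECT name
FROM companies
WHERE search_vec @@ websearch_to_tsquery('english', $1);

-- Tier 2, run only if tier 1 comes back empty: trigram fuzzy match
-- (requires the pg_trgm extension)
SELECT name
FROM companies
WHERE search_text % $1
ORDER BY similarity(search_text, $1) DESC
LIMIT 20;
```

The application tries the FTS query first and only falls back to the trigram query when it returns nothing.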
Category Labels in the Search Index
The search text for each company is extracted from the tech profile produced
by the LLM and stored in a database column. It used to look like this:
Languages: Java 21. Frameworks: Spring Boot. Tools: Docker. That meant
searching for "tools" matched every single company that had any tools listed,
because the word "tools" was in the index for all of them. Same for
"languages", "frameworks", and so on. The fix: strip all category labels. The
search text is now just a flat list: Java 21 Spring Boot Docker. The labels
were for humans reading the data; the search engine doesn't need them.
No Stemming
I was using PostgreSQL's simple text search config, which does zero linguistic
processing. Searching "developing" wouldn't match a company description
containing "develop". Switching to the english config enabled stemming, so
word variations now match correctly.
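The difference is easy to see in psql: the `english` config stems words to a common root, while `simple` only lowercases them.

```sql
SELECT to_tsvector('simple', 'developing');
-- 'developing':1   (no stemming, so a query for "develop" won't match)

SELECT to_tsvector('english', 'developing');
-- 'develop':1      (stemmed to the root)

SELECT to_tsvector('english', 'developing') @@ to_tsquery('english', 'develop');
-- t                (the variants now match)
```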
Side note
I was originally using a text-embedding model running locally on Ollama
(nomic-embed-text:v1.5) and the pgvector extension to produce a similarity
vector from the entire search text. But it was the wrong choice for this
problem: it gave nearly random results in my trial runs (the way I actually
wanted to use the app). The cosine distance between the vectors simply didn't
capture what I wanted in terms of search efficacy. I'll have to investigate
other use cases for it, because it was quite simple to set up, the embedding
model was lightning fast, and so was the Postgres query for finding similarly
angled vectors (cosine distance).
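For reference, the pgvector query really is a one-liner. Assuming an `embedding` vector column on the same hypothetical `companies` table, nearest neighbors by cosine distance are a single ORDER BY:

```sql
-- <=> is pgvector's cosine distance operator; $1 is the query embedding
SELECT name
FROM companies
ORDER BY embedding <=> $1
LIMIT 10;
```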
Everything Ranked Equally
Company name, tech stack items, and a 200-word company description all had the
same weight in search results. A company called "Spring" should rank higher for
the query "Spring" than a company that happens to mention spring in a paragraph
about their office renovation. PostgreSQL's tsvector supports field-level
weights (A through D). Company name gets the highest weight, tech stack gets
medium, description gets the lowest.
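A sketch of the weighted vector, assuming columns named `name`, `tech_stack`, and `description` (the actual column names in the app may differ):

```sql
UPDATE companies
SET search_vec =
    setweight(to_tsvector('english', coalesce(name, '')), 'A') ||        -- highest
    setweight(to_tsvector('english', coalesce(tech_stack, '')), 'B') ||  -- medium
    setweight(to_tsvector('english', coalesce(description, '')), 'D');   -- lowest

-- ts_rank then favors matches in higher-weighted fields
SELECT name, ts_rank(search_vec, query) AS rank
FROM companies, websearch_to_tsquery('english', 'spring') AS query
WHERE search_vec @@ query
ORDER BY rank DESC;
```

With this in place, a company named "Spring" outranks one that merely mentions spring in its description.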
The OR vs AND Trap in Fuzzy Search
This was the sneaky one. The trigram tier tokenizes multi-word queries and fuzzy-matches each word separately. My first implementation used OR logic: if any word matched, the company showed up. Searching "health tech" returned basically every company in the database. Why? Because the word "tech", a 4-letter prefix of "technology", fuzzy-matched against nearly everything. The "health" part was ignored entirely. Switching to AND logic fixed it immediately. Now all words must fuzzy-match, consistent with how FTS works. Trigram's job is to catch typos like "Postgre" -> "PostgreSQL", not to do semantic search. If you search "health tech" and a company has nothing related to health, it shouldn't show up, even if it's a tech company.
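In SQL terms the fix is just joining the per-token conditions with AND instead of OR. Here `<%` is pg_trgm's word-similarity operator, which checks whether the token fuzzy-matches some word within the text (table and column names are again illustrative):

```sql
-- Before (OR): "tech" alone fuzzy-matches nearly every company
SELECT name FROM companies
WHERE 'health' <% search_text
   OR 'tech'   <% search_text;

-- After (AND): every query token must fuzzy-match somewhere in the text
SELECT name FROM companies
WHERE 'health' <% search_text
  AND 'tech'   <% search_text;
```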
The Takeaway
Search is deceptively hard. Each of these issues seemed small in isolation, but together they made results feel random. The biggest lesson: be very deliberate about what goes into your search index and how multi-word queries combine. OR semantics in fuzzy search sounds generous and user-friendly until you realize it makes every short common word a wildcard that matches everything.
