Building DevJobs
Scraping, LLM integration and Search
I have long wanted to get my hands dirty with web scraping. Now, with the almost ubiquitous integration of LLMs in web applications, I thought I would combine the two in a tech exploration. For this I needed a nice little project to work on and ended up with the idea of a "company freedom tracker." In retrospect, it was completely ridiculous, but it was something to get me started. The gist of it was a scoreboard of companies that are explicit about remote work in their job ads. An LLM can quickly parse a job ad and determine whether the company says outright that it allows remote work, or whether it uses a hybrid work mode approach. The worst score in this list would go to companies mandating onsite work. All this is, of course, stupid. If the culture is onsite, then it is there for a reason, and working remotely is not a measure of freedom. Ask me: I work remotely the majority of the time but feel trapped like a bird in a cage. Anyway, this was the core idea, so I got to work.
To keep it simple and familiar, I decided to go with Java (25) and Spring Boot (4) as the main development platform (which is what I work with in my day job). For the frontend, I went with the good old Thymeleaf HTML templating engine, paired with HTMX for a snappy dynamic SPA feel. I had not worked with HTMX before either, so this was a fun new technology to try out.
Scraping
So to have something to feed to the LLM, I would need data. Where to get that data from? Initially, I looked at a few different sources to get a good spread of the Swedish job market: Platsbanken (the Swedish government employment agency), Ledigajobb.se, and Jobbsafari.se. I briefly researched whether LinkedIn was an option, but their anti-scraping measures quickly proved to be an absolute Fort Knox, so I dropped that idea.
For the three sites I did target, I built a generic crawler interface with
different parser implementations for each site. A quick search told me that
Jsoup is the go-to library in Java for scraping and HTML parsing. So I added
it to the POM and got to work. But I quickly hit my first bump in the road: many
of these modern sites, like Platsbanken, are React SPAs (Single Page
Applications), which meant that when I tried to fetch a job ad, I only got an
empty HTML shell and a bunch of JavaScript. If you are unfamiliar with SPAs,
they work like this: you fetch index.html, which is a bare-bones HTML document
with an empty body tag and a JS script reference. The JavaScript then executes
and builds up the web page by appending elements to the DOM programmatically,
all the while making AJAX calls to the backend to fetch content in JSON format.
Jsoup was not able to handle this, so I
needed a new approach. Enter Playwright: a headless browser automation
library that downloads and uses Chromium (among others) to simulate browser
interactions and get me the behavior I needed, i.e., execute the JavaScript to
build the DOM so that I get a populated HTML document to parse.
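Before reaching for that heavy machinery on every fetch, it helps to detect when a page is just such an empty shell. The following is a sketch of my own devising (not the project's actual code): a heuristic that checks whether the body contains any meaningful visible text. The Playwright fallback in the comment follows its Java API (com.microsoft.playwright).

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Heuristic: an SPA shell has an (almost) empty <body>. When this returns
// true, fall back to rendering via Playwright, roughly:
//   try (Playwright pw = Playwright.create()) {
//       Browser browser = pw.chromium().launch();
//       Page page = browser.newPage();
//       page.navigate(url);
//       String renderedHtml = page.content(); // DOM after JS has run
//   }
public class SpaShellDetector {
    private static final Pattern BODY =
        Pattern.compile("<body[^>]*>(.*?)</body>", Pattern.DOTALL | Pattern.CASE_INSENSITIVE);

    public static boolean looksLikeSpaShell(String html) {
        Matcher m = BODY.matcher(html);
        if (!m.find()) return true;
        // Strip tags and whitespace; a shell page has virtually no visible text.
        String visible = m.group(1).replaceAll("<[^>]+>", " ").replaceAll("\\s+", "");
        return visible.length() < 50;
    }
}
```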
Parsing
Well, now we have the raw HTML document. But it contains hundreds of thousands of characters, not something you want to feed to an LLM. So I needed to trim it down to the essentials. This meant looking through the structure of each particular site and making some decisions about what to keep and what to discard. It required some ugly coding, but once the site-specific parsing was encapsulated in my parser implementations, I had something I was happy with and could move on to the next phase.
The Gold Mine: Platsbanken REST API
Just as I was getting comfortable with this rather heavy Playwright setup and my multi-site scraping architecture, I accidentally stumbled upon an absolute gold mine. It turned out that Platsbanken has a publicly available REST API! Instead of downloading a headless browser, waiting for JavaScript to execute, and scraping the DOM, I could simply make a direct HTTP request to their API and get all the job ads perfectly structured in JSON format.
This was a massive game changer. Platsbanken is the absolute behemoth of job ads in Sweden, largely because many government agencies are mandated to post there, which has caused larger companies to follow suit. Relying solely on their REST API would still easily give me access to around 80% of the entire Swedish job ad market.
I threw out the entire Playwright scraping implementation along with the custom parsers for Ledigajobb.se and Jobbsafari.se. Using structured JSON from a single API not only removed the need for complex and brittle HTML parsing, but it also solved another massive headache: deduplication. When scraping multiple job boards, you constantly run into the same ad posted in different places, which requires complex logic to detect and consolidate. With a single reliable API covering the vast majority of the market, this problem vanished. What started as an ugly scraping workaround turned into a clean API integration, letting me focus on the really fun part: the LLM analysis.
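A minimal client for such an API needs nothing beyond Java's built-in HttpClient. The sketch below is my illustration, not the project's code, and the endpoint and parameter names (jobsearch.api.jobtechdev.se, q/offset/limit) are my assumption based on JobTech Dev's public documentation; verify against the current API before relying on them.

```java
import java.net.URI;
import java.net.URLEncoder;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;

public class PlatsbankenClient {
    // Assumed public JobTech Dev search endpoint -- check the official docs.
    private static final String BASE = "https://jobsearch.api.jobtechdev.se/search";

    // Build a search URI with an URL-encoded free-text query and paging.
    static URI searchUri(String query, int offset, int limit) {
        String q = URLEncoder.encode(query, StandardCharsets.UTF_8);
        return URI.create(BASE + "?q=" + q + "&offset=" + offset + "&limit=" + limit);
    }

    // Fetch one page of job ads as raw JSON.
    static String fetch(URI uri) throws Exception {
        HttpRequest req = HttpRequest.newBuilder(uri)
            .header("Accept", "application/json")
            .GET()
            .build();
        return HttpClient.newHttpClient()
            .send(req, HttpResponse.BodyHandlers.ofString())
            .body();
    }
}
```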
LLM integration
So which model should I use for this? OpenAI seems to have some generous offerings with their cloud API, but that would mean setting up a developer account and getting an API key. During development I didn't want to waste any money or worry about depleting my quota (which you pre-pay for). So I decided to try my hand at a locally running open source model instead.
Ollama
Fortunately, there is a straightforward way to download an open source model and start using it: Ollama. Install its CLI tool and pull down one of the MANY models that they make available.
ollama pull gemma3:27b
ollama run gemma3:27b "What is the meaning of life?"
Then you can start chatting with it directly in the terminal or make a curl
POST-request to localhost:11434 and have the data streamed to you. This is
also how you integrate with it in your application (in my case, using the Spring
AI chat client library).
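For illustration, here is what such a raw call to Ollama's /api/generate endpoint could look like with Java's stdlib HttpClient (a sketch of my own; in the actual app, Spring AI's chat client handles this). Setting stream to false returns one complete JSON response instead of a chunk stream.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class OllamaClient {
    // Minimal JSON body for Ollama's /api/generate endpoint.
    static String generateBody(String model, String prompt) {
        return "{\"model\": \"" + model + "\", \"prompt\": \""
            + prompt.replace("\"", "\\\"") + "\", \"stream\": false}";
    }

    // POST the prompt to the locally running Ollama server.
    static String generate(String model, String prompt) throws Exception {
        HttpRequest req = HttpRequest.newBuilder(URI.create("http://localhost:11434/api/generate"))
            .header("Content-Type", "application/json")
            .POST(HttpRequest.BodyPublishers.ofString(generateBody(model, prompt)))
            .build();
        return HttpClient.newHttpClient()
            .send(req, HttpResponse.BodyHandlers.ofString())
            .body();
    }
}
```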
I played around with the following models:
❯ ollama list
NAME ID SIZE MODIFIED
qwen2.5:14b 7cdf5a0187d5 9.0 GB 6 days ago
nomic-embed-text:v1.5 0a109f422b47 274 MB 8 days ago
gemma3:12b f4031aab637d 8.1 GB 9 days ago
qwen3:8b 500a1f067a9f 5.2 GB 9 days ago
deepseek-r1:8b 6995872bfe4c 5.2 GB 9 days ago
gemma3:4b a2af6cc3eb7f 3.3 GB 9 days ago
gpt-oss:20b 17052f91a42e 13 GB 9 days ago
gemma3:27b a418f5838eaf 17 GB 10 days ago
The one that distinguished itself was gpt-oss:20b because it is a reasoning model that was able to perform some more complex tasks. However, that was not what I needed for the simple text parsing in this project. I ended up using Google's gemma3:27b, mainly because it was faster (all the reasoning by gpt-oss took quite a lot of time). Still, even Gemma was pretty slow AND, it turns out, really compute heavy. The entire model is loaded into my poor MacBook's RAM, and the CPU does all the heavy linear algebra for the forward passes. Ideally, you would have a powerful NVIDIA GPU with its CUDA cores.
In any case, this worked. After having set up a scheduler to fetch the data, I fed the job ad text synchronously to Gemma (with a well-formed prompt). It responded with a verdict about the provided company's work mode (along with its name and a brief description).
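Whichever integration route you take (Spring AI's ChatClient can even map a JSON response straight onto a Java record via its entity mapping), the verdict coming back from the model benefits from defensive parsing. A minimal sketch of that idea, with an illustrative enum rather than the app's actual code:

```java
public class WorkModeParser {
    public enum WorkMode { REMOTE, HYBRID, ONSITE, UNSPECIFIED }

    // LLMs occasionally wrap the verdict in prose or change the case,
    // so parse defensively and fall back to UNSPECIFIED.
    public static WorkMode parse(String raw) {
        if (raw == null) return WorkMode.UNSPECIFIED;
        String s = raw.trim().toUpperCase();
        for (WorkMode mode : WorkMode.values()) {
            if (s.contains(mode.name())) return mode;
        }
        return WorkMode.UNSPECIFIED;
    }
}
```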
Running hot
While this worked, and I was pretty happy with the model's ability to determine the work mode (REMOTE, HYBRID, ONSITE, or UNSPECIFIED if there was not enough data for a verdict), my MacBook ran very hot when I left it chewing on the ads the scheduler was feeding it. It was actually the first time I had ever heard its fans spin up, so much so that I started worrying about it. It is a pretty expensive computer, so the argument of saving some money on a cloud bill did not make sense anymore.
Gemini Flash
To save my MacBook from burning up, and to be able to parse a meaningful number of ads within a reasonable timeframe, I concluded that I would have to use one of the many cloud LLM offerings available. Since I already have a Google One subscription, I went with Google and chose gemini-2.5-flash-lite because it is rapid and the cheapest model Google has. It also has a large context window and, according to the specifications, is perfectly suited for my use case.
Pivot
I now realized that I had quite a lot of data accumulated and that simply using it for a work mode score board was not living up to its potential. A thought lurked in my head about job search sites and the fact that I find them pretty mediocre. Usually what I want is to search for tech that I know well and am familiar with, and what kind of product or service the company in question is building. I wanted to put the company front and center, not the specific role that you usually see on job boards. The preview cards should give a brief overview of what the company does, its sector/industry, and its tech stack. All of these things are something that the LLM could easily pick out from the ad and return to me in a structured JSON format (in addition to the work mode, of course). So I went to work on this new idea instead.
Search
Search is something I had been looking at from afar, with a vague notion that it is more challenging than it looks on the surface. Firstly, if you store a large amount of text in a database field and want to search for a keyword inside it, finding it must be pretty compute expensive unless you can index it. But such an index could grow extremely large unless it is built in a smart way: a normal B-tree index would have to match against the whole text of the field, which would not work. And how can you be sure of the results' relevance? I have used OpenSearch and Elasticsearch (full-text search engines) in my day job, but never bothered to research them in detail. Could there be a simpler and smarter way?
Text Embeddings
I had heard about vector embeddings and semantic search. Could that be something
that I could apply to this site? Without doing barely any research, I deleted
my current PostgreSQL database and spun up a new one with the
pgvector extension (deployed in a Docker container, of course). Then,
for each ad I ingested and parsed with the LLM, I computed its corresponding
(768-dimensional) vector using nomic-embed-text:v1.5 (also via Ollama
running locally) and stored it in a vector(768) column on the company table.
The idea was that someone could do a semantic search, e.g.:
I'm looking for a company in the defence industry which uses java and springboot as a stack.
Then I would run this search term through the embedding model, get a
corresponding vector, and compare it to the vector stored on that company
(derived from the formatted LLM output) using one of the vector
distance functions that the pgvector extension supports. I went
with cosine distance, which compares the angle between the vectors in this
high-dimensional space: irrespective of length (the theory went), similar
texts would point in a similar direction. This turned out to be completely
wrong. The results were practically random. Back to the drawing board.
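For intuition, cosine distance boils down to a few lines. This plain-Java sketch mirrors what pgvector's <=> operator computes server-side:

```java
public class Cosine {
    // Cosine distance = 1 - cos(theta): 0 means same direction,
    // 1 means orthogonal, 2 means opposite. Length cancels out,
    // since both norms appear in the denominator.
    public static double distance(double[] a, double[] b) {
        double dot = 0, normA = 0, normB = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        return 1.0 - dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }
}
```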
PostgreSQL FTS
So I would have to go back to "normal" search, whatever that was. Remember the
indexing problem I anticipated above? Well, it turns out some smart people have
already thought about this. Postgres natively supports Full Text Search (FTS)
with Generalized Inverted Indexes (GIN). When you store the blob
of text in a field (I chose the optimized TSVECTOR type), Postgres
tokenizes the blob, and each word is indexed against the corresponding row id.
Still, the index would become huge if we store a lot of disparate words, right?
Well, someone has thought of that too. Enter stemming. Before indexing, and
given that you use the english text search configuration, Postgres will
automatically stem each word to its simplest form (e.g., running -> run) and
normalize the case, in addition to removing all stop words ("like", "the",
"a", etc.). This significantly reduces the index size. And we get all this
for free by using Postgres.
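Conceptually, to_tsvector does something like the following toy sketch. This is purely illustrative: the real english configuration uses the Snowball stemmer and a much larger stop-word list, and produces positioned lexemes rather than a plain list.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Toy model of: SELECT to_tsvector('english', 'The developer is running tests')
// Steps: lowercase, split on non-letters, drop stop words, stem to a root form.
public class ToyTsVector {
    private static final Set<String> STOP = Set.of("the", "a", "like", "and", "is", "in");

    // Naive suffix stripping; Postgres uses the Snowball stemmer instead.
    static String stem(String w) {
        if (w.endsWith("ning")) return w.substring(0, w.length() - 4); // running -> run
        if (w.endsWith("ing")) return w.substring(0, w.length() - 3);
        if (w.endsWith("s") && w.length() > 3) return w.substring(0, w.length() - 1);
        return w;
    }

    static List<String> toTsVector(String text) {
        List<String> lexemes = new ArrayList<>();
        for (String raw : text.toLowerCase().split("[^a-z]+")) {
            if (raw.isEmpty() || STOP.contains(raw)) continue;
            lexemes.add(stem(raw));
        }
        return lexemes;
    }
}
```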
I implemented this, and now we had a basic search in place. Of course, it was just a normal search, i.e., you would have to spell every word correctly, and it would not be smart in any sense of the word.
Trigram fuzzy matching
With the advent of LLMs, we have become pretty spoiled in terms of our spelling. An LLM will understand you perfectly even if you spell like a rake (yeah, that is a Swedish expression, but I think you get the gist). Could I do something to make the search a little less strict? After some research I found pg_trgm, Postgres trigram matching. This was what I was looking for. It works by breaking each stored word into three-character sequences, padded with blanks at the edges: "robin" → "  r", " ro", "rob", "obi", "bin", "in ". For matching, it measures the overlap with the search term, which is split up in the same manner. The overlap is compared against a threshold that I define myself (depending on how loose you want the matching to be). As with FTS, trigram search can be applied to a blob of text, which can also be indexed with GIN.
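As a toy illustration of the mechanics (not pg_trgm's actual implementation, which also splits words on non-alphanumerics), trigram extraction and similarity can be sketched like this:

```java
import java.util.HashSet;
import java.util.Set;

public class Trigram {
    // pg_trgm pads each word with two blanks before and one after,
    // then slides a 3-character window across it.
    static Set<String> trigrams(String word) {
        String padded = "  " + word.toLowerCase() + " ";
        Set<String> out = new HashSet<>();
        for (int i = 0; i + 3 <= padded.length(); i++) {
            out.add(padded.substring(i, i + 3));
        }
        return out;
    }

    // similarity = |shared trigrams| / |all distinct trigrams|,
    // analogous to pg_trgm's similarity() function.
    static double similarity(String a, String b) {
        Set<String> ta = trigrams(a), tb = trigrams(b);
        Set<String> shared = new HashSet<>(ta);
        shared.retainAll(tb);
        Set<String> union = new HashSet<>(ta);
        union.addAll(tb);
        return (double) shared.size() / union.size();
    }
}
```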
After some experimentation I ended up setting the threshold pretty high, opting to catch fewer spelling mistakes but still getting the nice benefit of prefix matching: e.g., the user types "Friday" and still matches a company named "Friday Väst AB". Spelling mistakes in tech terms turned out to require some extra work (done independently in the Java code before hitting the database, but for brevity I will omit that part from this blog post).
So now I had a two-tiered search strategy: first, check for matches with full text search; if there are none, run the term through the trigram search and hopefully catch prefix matches and spelling mistakes, for a better search experience.
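The fallback logic itself is tiny. Here is a sketch (names illustrative, not the app's code) with the two search tiers injected as functions, which keeps the strategy trivial to test:

```java
import java.util.List;
import java.util.function.Function;

// Tier 1: exact Postgres FTS. Tier 2: pg_trgm fuzzy search,
// consulted only when the first tier comes up empty.
public class TwoTierSearch<T> {
    private final Function<String, List<T>> fullText;
    private final Function<String, List<T>> trigram;

    public TwoTierSearch(Function<String, List<T>> fullText,
                         Function<String, List<T>> trigram) {
        this.fullText = fullText;
        this.trigram = trigram;
    }

    public List<T> search(String term) {
        List<T> hits = fullText.apply(term);
        return hits.isEmpty() ? trigram.apply(term) : hits;
    }
}
```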
Prompt Engineering
The temptation with LLMs is to extract everything you can imagine because the model will always give you something. But an LLM that confidently outputs wrong data is worse than no data at all. Users can tolerate missing fields. They cannot tolerate fields that are obviously wrong — that's when "AI slop" becomes the brand. I quickly ran into this issue with my data. Sometimes it would even mislabel a company because the ad was a consultancy firm recruiting for another company. This made me almost give up the project entirely. However, the trick is to give very precise instructions to the LLM and a set of examples for it to work with. Below is an excerpt from my prompt regarding the rules for extracting the tech stack:
CRITICAL — Stack extraction rules:
- ONLY include technologies that are EXPLICITLY MENTIONED in the ad text.
- Do NOT guess, infer, or assume technologies based on the company name, industry, or your world knowledge.
- Each stack array should reflect ONLY what THIS SPECIFIC job posting asks for.
- If the ad does not mention any databases, leave databases as []. Same for all categories.
- If the ad says "we work with modern web technologies" without naming specifics, do NOT fill in guesses.
- Include version numbers when explicitly stated (e.g. "Java 21" not just "Java").
CRITICAL — Category classification rules:
- "languages" = programming languages: Java, Python, C#, JavaScript, TypeScript, Go, Rust, Kotlin, Scala, Ruby, PHP, etc.
- "frameworks" = frameworks, runtimes, platforms, libraries: Spring Boot, React, .NET, Node.js, Django, Angular, Vue.js, Next.js, Express, Flask, etc.
- Node.js is a RUNTIME — put it in "frameworks", NEVER in "languages".
- .NET is a PLATFORM/FRAMEWORK — put it in "frameworks", NEVER in "languages".
- React, Angular, Vue.js are FRAMEWORKS, NEVER "languages".
- Docker and Kubernetes are TOOLS, NEVER "frameworks".
- AWS, Azure, GCP are CLOUD providers, NEVER "tools" or "other".
- PostgreSQL, MongoDB, Redis are DATABASES, NEVER "tools" or "other".
Second guard: Tech validator and normalizer
However, even this was not enough, so I needed a second guard against ingesting bad data. The solution was a TechValidator implementation that keeps the 400 most common technologies in a hashmap, mapped to the category each belongs to. If the LLM mistakenly put a database in the framework category, the validator would catch it and move it to the correct category. In addition, the output was first run through a normalizer that catches spelling variations of different technologies and maps them to the one true representation. Every user search is also run through the normalizer before hitting the db. If a technology output by the LLM did not match anything, it would go into the junk bucket and not be recorded. However, I did make sure to also save these to an unknown-technology database table with a counter, so that I could monitor it and see if a frequently occurring term needed to be added to the validator.
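A minimal sketch of that validator/normalizer idea (the entries, aliases, and method names here are illustrative; the real map holds roughly 400 technologies):

```java
import java.util.Map;
import java.util.Optional;

public class TechValidator {
    // Canonical technology name -> category it belongs in.
    private static final Map<String, String> CATEGORY = Map.of(
        "java", "languages",
        "spring boot", "frameworks",
        "postgresql", "databases",
        "docker", "tools");

    // Spelling/alias variants -> canonical name.
    private static final Map<String, String> ALIASES = Map.of(
        "postgres", "postgresql",
        "springboot", "spring boot");

    // Returns the canonical name, or empty if the term is unknown
    // (unknown terms go to the junk bucket and a counter table).
    public static Optional<String> normalize(String raw) {
        String key = raw.trim().toLowerCase();
        key = ALIASES.getOrDefault(key, key);
        return CATEGORY.containsKey(key) ? Optional.of(key) : Optional.empty();
    }

    // The authoritative category, overriding whatever the LLM claimed.
    public static Optional<String> categoryOf(String canonical) {
        return Optional.ofNullable(CATEGORY.get(canonical));
    }
}
```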
The Swedish consultancy firm issue
Perhaps the hardest and most valuable feature of this app is the consultancy categorization it does. One thing I often want to omit when I search for jobs is consultancy firms. I have absolutely no interest in being employed by such a firm again (I have my own if I want to rent myself out as a consultant). The issue is that the Swedish tech job market is crawling with these firms. To further complicate things, they sometimes post ads for positions in their own firm (the client you will be hired out to is resolved after they employ you), and sometimes they post ads recruiting on behalf of a product company. The latter can be categorized as either named or anonymous recruitment. If the company is named and we have it in our database, I can fuzzy match on the company name, attribute the extracted technology to that company, and add an indirect link from that company's page to the consultancy firm's ad. If the company is not named, the ad is only listed in a "Recruiting for" section under the consultancy firm, but the tech is still registered for the general trends and technology pages (more on that below).
To omit consultancies from a search, I explicitly added a filter for it. If you
don't want to see consultancy firms, just add no consultancy to your search
and they will not be shown.
Trends and tech
All the data collected from these ads can also be used to visualize trends over time, so I added a trends page that displays useful information at an aggregate level. There is also a tech page for each of the technologies we persist, which shows aggregate data and lists all companies currently using that tech.
Frontend GUI
The frontend is server-side rendered with the Thymeleaf templating engine, using HTMX to give it a snappy SPA-like feel. I had not used HTMX before, as I'm quite fluent in React, but it was great not to have to bring in such a heavy lib/framework. This is basically a pure Spring Boot app.
Deployment
When it came time to put this into production, I wanted a lean, cost-effective setup that didn't feel like a full-time DevOps job to maintain. I rented a beefy yet affordable Hetzner CPX32 VPS (4 vCPUs, 8GB RAM, 160GB NVMe SSD), giving the application plenty of breathing room. Far more bang for the buck compared to an AWS EC2 instance for example.
For orchestration, I opted for Dokploy — an excellent, lightweight PaaS alternative that runs directly on your own hardware.
CI/CD Pipeline
To automate deployments, I set up a straightforward continuous integration pipeline using GitHub Actions. Whenever code is merged to the main branch, GitHub spins up a runner to run the test suite and build the Docker image.
Once the image is ready, GitHub pushes it to the GitHub Container Registry (GHCR) and fires off a webhook to my Dokploy server. Dokploy catches the webhook, pulls the fresh image, and restarts the application seamlessly.
Infrastructure as Code
Under the hood, everything is orchestrated via a single
docker-compose.prod.yml file passed directly into Dokploy. This composition
includes both the Spring Boot app and our pgvector-enabled PostgreSQL 18
database (yes, I kept that image in case I need it in the future).
Because everything is defined via Compose, I was able to attach native Traefik
labels directly to the application container. This lets Traefik (which Dokploy
uses internally) automatically handle the reverse proxy routing for
devjobs.docksidelabs.se and dynamically provision Let's Encrypt SSL
certificates.
Backups and Observability
Data resilience is key, so I mounted the Postgres volume out to the host disk and configured automated backups that are shipped off-site to an S3 bucket.
For monitoring, I set up Grafana Alloy as an agent alongside the stack. It scrapes the Prometheus metrics exposed by Spring Boot's Actuator (tracking JVM stats, database connection pools, and custom LLM extraction counters) and forwards them straight to my Grafana dashboard. Finally, I configured Docker log rotation to ensure that the server's SSD doesn't quietly fill up over time.
It's a tightly integrated, highly observable, and practically hands-off production environment.
Full stack
See the final result here: devjobs.docksidelabs.se.
- Language: Java 25
- Framework: Spring Boot 4.0.2
- Database: PostgreSQL 18
- Frontend: Thymeleaf + HTMX 2.0 (High-Performance Hypermedia)
- Analysis Model: gemini-2.5-flash-lite (via Google GenAI & Spring AI)
- Infrastructure: Docker & Docker Compose
- Observability: Micrometer & Prometheus
- Security: Spring Security and Bucket4j (Rate Limiting)
