I Built a Telegram Bot to Help Me Search for Jobs

"Looking for a job is a full-time job." It's true, kind of. Being on the hunt for a new position, I realized I was opening the same websites, sometimes multiple times per day, applying the correct filters, trying to find which jobs are relevant to me. This process was both time and energy-consuming, and being a developer, I thought there has to be a way for me to automate this process. I decided to try to create a telegram bot that would automatically scrape my favourite job sites for relevant jobs and send them to me once per day. One thing about web scraping is that websites often take measures to make it hard for you. They don't want to make their precious data available to the public without you having to go through their official channels. Challenge accepted.

I started out googling solutions for specific websites. If someone else has already figured it out, why not use their code as a starting point? Luckily, I found a blog post with a solution for scraping LinkedIn jobs, so I decided to try that approach and see what data I could get back.

How to Scrape LinkedIn Legally

Without going too much into detail (you can read the whole thing in the link above), I created a new Node project and started playing around. As it turns out, this solution worked very well! In this approach, we use cheerio and axios to access the public LinkedIn jobs page. From there, we open the network tab and look for the request that LinkedIn uses to load more jobs when you scroll down the page. We can take that request URL and create a for loop that pages through the jobs, 25 posts at a time. Here is an example of the URL I ended up using:

for (let pageNumber = 0; pageNumber < 250; pageNumber += 25) {
    // The template literal below contains newlines and indentation,
    // which would end up inside the URL, so strip all whitespace
    // before using it.
    const url = `
        https://www.linkedin.com/jobs-guest/jobs/api/seeMoreJobPostings/search
        ?keywords=Javascript
        &location=Stockholm%2C%2BStockholm%2BCounty%2C%2BSweden
        &locationId=&geoId=100907646
        &f_TPR=r604800&f_PP=103264854
        &f_WT=1%2C3&f_JT=F&start=
        ${pageNumber}
    `.replace(/\s+/g, '');

    // DO SCRAPING HERE
}

As you can see, the URL contains my keywords and the location, followed by some codes representing LinkedIn's search filters. In my case, they filter based on when the job was posted and whether the jobs should be sorted by date or relevance. This URL will change depending on what search filters you set on the jobs page. We send one request per iteration, and I noticed that LinkedIn started complaining after too many requests, so I decided to limit the number of pages scraped to 10 (250 / 25), whereas in the original article that number is 40 (1000 / 25). This makes more sense for me since I'm only searching for recently posted jobs, and 40 pages is a bit too much.
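If you prefer named parameters over a hand-assembled string, the same URL can also be built with URLSearchParams. The filter codes below are the ones from my own search; yours will differ depending on the filters you pick on the jobs page:

```typescript
const buildJobsUrl = (start: number): string => {
    // Filter codes taken from my own search; they encode things like
    // how recently the job was posted (f_TPR) and workplace type (f_WT).
    const params = new URLSearchParams({
        keywords: 'Javascript',
        location: 'Stockholm, Stockholm County, Sweden',
        geoId: '100907646',
        f_TPR: 'r604800',
        f_PP: '103264854',
        f_WT: '1,3',
        f_JT: 'F',
        start: String(start)
    });
    return `https://www.linkedin.com/jobs-guest/jobs/api/seeMoreJobPostings/search?${params}`;
};
```

Each loop iteration then calls buildJobsUrl(pageNumber), and URLSearchParams takes care of the encoding.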

From here, I used axios to fetch the pages and cheerio to parse the page contents. I then extracted the relevant information (Title, Date, Link, Company) and sent it back as a JSON response. Nice, all done, right? Wrong.

I quickly noticed that the jobs being returned were not necessarily relevant to me. I got quite a few postings asking for DevOps, Java developers, Backend engineers, etc. On top of this, the sorting was off, and I got back duplicates when jobs had been posted multiple times. I created a few functions to solve this:

const jobHasRelevantWordsInTitle = (title: string): boolean => {
    // ADD MORE KEYWORDS HERE IF NEEDED

    const wordsToCheck = [
        'Frontend',
        'Front end',
        'Javascript',
        'Typescript',
        'React'
    ];

    const wordsToExclude = [
        'Backend',
        'Junior',
        'Lead',
        'Devops',
        'C#',
        '.NET'
    ];

    const lowerTitle = title.toLowerCase();

    const containsRelevantKeyWord = wordsToCheck.some((word) =>
        lowerTitle.includes(word.toLowerCase())
    );

    const doesNotContainExcludedWord = wordsToExclude.every(
        (word) => !lowerTitle.includes(word.toLowerCase())
    );

    return containsRelevantKeyWord && doesNotContainExcludedWord;
};
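Plugged into Array.filter with a few made-up titles, it behaves like this (a compact copy of the predicate, so the snippet runs on its own):

```typescript
// A compact copy of the predicate above, for a standalone demo.
const jobHasRelevantWordsInTitle = (title: string): boolean => {
    const wordsToCheck = ['Frontend', 'Javascript', 'React'];
    const wordsToExclude = ['Backend', 'Devops'];
    const lowerTitle = title.toLowerCase();
    return (
        wordsToCheck.some((word) => lowerTitle.includes(word.toLowerCase())) &&
        wordsToExclude.every((word) => !lowerTitle.includes(word.toLowerCase()))
    );
};

const titles = [
    'Senior React Developer',
    'Backend Engineer (Java)',
    'Frontend Developer'
];

const relevantTitles = titles.filter(jobHasRelevantWordsInTitle);
// relevantTitles → ['Senior React Developer', 'Frontend Developer']
```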

I simply created a function that checks each returned job's title for relevant and irrelevant keywords and keeps only the ones I'm interested in. This worked perfectly! Now I just had to remove the duplicates:

const removeDuplicates = (
    array: Job[],
    key1: keyof Job,
    key2: keyof Job
): Job[] => {
    const uniqueCombinations = new Set<string>();

    return array.filter((item) => {
        const combination = `${item[key1]}|${item[key2]}`;
        if (uniqueCombinations.has(combination)) {
            return false;
        }
        uniqueCombinations.add(combination);
        return true;
    });
};

This function filters out items that have the same values for the two keys provided. In my case I used "Title" and "Company". So far, so good!
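To illustrate with made-up data (a compact copy of the function, so the snippet runs on its own):

```typescript
interface Job {
    Title: string;
    Company: string;
}

// Compact copy of removeDuplicates above, for a standalone demo.
const removeDuplicates = (
    array: Job[],
    key1: keyof Job,
    key2: keyof Job
): Job[] => {
    const seen = new Set<string>();
    return array.filter((item) => {
        const combination = `${item[key1]}|${item[key2]}`;
        if (seen.has(combination)) return false;
        seen.add(combination);
        return true;
    });
};

const jobs: Job[] = [
    { Title: 'Frontend Developer', Company: 'Acme' },
    { Title: 'Frontend Developer', Company: 'Acme' }, // reposted ad
    { Title: 'Frontend Developer', Company: 'Globex' }
];

const uniqueJobs = removeDuplicates(jobs, 'Title', 'Company');
// uniqueJobs keeps the first Acme posting and the Globex one
```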

const ONE_AND_A_HALF_DAYS_IN_MILLISECONDS = 36 * 60 * 60 * 1000;

const jobIsNew = (datePosted: string): boolean => {
    const currentDate = new Date();
    const thirtySixHoursAgo = new Date(
        currentDate.getTime() - ONE_AND_A_HALF_DAYS_IN_MILLISECONDS
    );
    return new Date(datePosted) > thirtySixHoursAgo;
};

Lastly, I set up a function that makes sure the job was not posted more than 36 hours ago. This was originally 48 hours, but I found myself receiving the same jobs two days in a row (obviously). I still wanted some buffer, so I adjusted it to 36 hours.

Now I needed an endpoint to request this information from. I wanted to avoid using Express and setting up a Node server, so I decided to go for a serverless approach using Vercel edge functions. This allows us to add an api folder in the root of the project and create a file inside it, named after whatever we want the route to be, and an API route is set up for us, much like how we create routes in Next.js. This required me to install the Vercel CLI and run the command vercel dev. My endpoint can now be reached via localhost:3000/api/notify, magic! 🧙
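As a sketch, an api/notify.ts edge function can look something like this. Here `fetchJobs` is a hypothetical stand-in for the scraping and filtering code above, and in the real project this is also where the Telegram message gets sent:

```typescript
// api/notify.ts: minimal sketch of the route.
export const config = { runtime: 'edge' };

interface Job {
    Title: string;
    Company: string;
    Date: string;
    Link: string;
}

// Placeholder: the real implementation scrapes and filters the jobs.
const fetchJobs = async (): Promise<Job[]> => [];

export default async function handler(_req: Request): Promise<Response> {
    const jobs = await fetchJobs();
    // In the real project, the Telegram notification is sent here too.
    return new Response(JSON.stringify(jobs), {
        headers: { 'Content-Type': 'application/json' }
    });
}
```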

So, where are we at now? I have created an endpoint that, when called, will scrape LinkedIn for recent, relevant jobs and return them as a JSON response. The next step was to set up a Telegram bot that sends me a notification when I hit the endpoint. I found a great article that explained how to set everything up here and used it as a starting point:

Creating a Simple Telegram Bot using Node.js

const sendTelegramMessage = async (jobs: Job[]): Promise<void> => {
    const token = process.env.TELEGRAM_API_KEY || '';
    const chatId = process.env.TELEGRAM_CHAT_ID || '';

    const bot = new TelegramBot(token, { polling: false });

    const message = jobs
        .map(
            (job) =>
                `${job.Date}\n[${job.Title}](${job.Link}) at *${job.Company}*`
        )
        .join('\n\n');

    const markdownMessage = `🚀 *New relevant jobs (past 36h):* 🚀\n\n${message}`;
    const messageConfig: TelegramBot.SendMessageOptions = {
        parse_mode: 'Markdown',
        disable_web_page_preview: true
    };

    await bot.sendMessage(chatId, markdownMessage, messageConfig);
};

Using the guide above, I ended up with this function, which is called after I have fetched the jobs. It takes the array of jobs as a parameter and sends me a custom Telegram message. One thing to note here is that I started a chat with the bot after I created it and then used that specific chat ID in the code above. You can get the chat ID by sending a GET request to "https://api.telegram.org/bot<BOT_TOKEN>/getUpdates". This worked perfectly:

[Image: the resulting Telegram message from the bot]

At this point I was happy with the result, but I wanted to add data from one more source: the Swedish public employment service. This one was tricky, but after looking through the network requests tab I found a URL I could use to get the latest jobs posted. What's great about this website is that it also fetches jobs from external sites, and I found a way to tap into that endpoint as well. I now had multiple new sources of job ads, so I wrote a function to fetch this data too. I applied the same keyword and duplicate filters and returned the new jobs with the same structure as the LinkedIn jobs. I then merged the arrays and voilà! My project was ready to push to the web.
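The merge step itself is small. Here is a sketch, where fetchLinkedInJobs and fetchEmploymentServiceJobs are hypothetical stand-ins for the two source-specific fetchers:

```typescript
interface Job {
    Title: string;
    Company: string;
    Date: string;
    Link: string;
}

// Hypothetical stand-ins for the real source-specific fetchers.
const fetchLinkedInJobs = async (): Promise<Job[]> => [];
const fetchEmploymentServiceJobs = async (): Promise<Job[]> => [];

// Fetch both sources in parallel and flatten into one array; the
// keyword, duplicate and freshness filters then run on the merged list.
const fetchAllJobs = async (): Promise<Job[]> => {
    const results = await Promise.all([
        fetchLinkedInJobs(),
        fetchEmploymentServiceJobs()
    ]);
    return results.flat();
};
```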

The last step was to set up a cron job that runs once per day. I found this to be very simple using Vercel.

How to Setup Cron Jobs on Vercel

I created a job that calls my notify endpoint every day at 10 am, excluding Sundays and Mondays. I did some testing and everything worked surprisingly well!
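On Vercel, the schedule lives in vercel.json. This is roughly what mine looks like (note that Vercel evaluates cron schedules in UTC, so the hour may need adjusting for your timezone):

```json
{
    "crons": [
        {
            "path": "/api/notify",
            "schedule": "0 10 * * 2-6"
        }
    ]
}
```

Here `0 10 * * 2-6` means 10:00 on Tuesday through Saturday, i.e. every day except Sunday and Monday.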

This project was relatively quick and easy to set up and will save me a bunch of time and energy in my job search. I also got the chance to play around with Telegram bots and cron jobs using Vercel.