Indeed Job Scraper and Analysis

8 min readNov 16, 2020

INDEED JOB SEARCH

How to find and analyse a job in the UK market. Using the code below you too can start collecting and analysing the results in your hometown.

How to predict a data related jobs salary in the UK

The first task at hand was to collect data that might be relevant to the question. The following fields that were most applicable after looking through the Indeed webpage were:

Job title:
Location:
City: (this relates to the city which the data was scraped from)
Company:
Salary:
Review:
Link: (in order to get full descriptions of the job)

Beautiful Soup is a good tool to read the data, although I used Selenium to scrape all of the data. This allowed me to experiment by entering data straight from the Jupyter Notebook into the fields of the webpage, as well as explore the advanced options in the Search Engine. I did, however, go back to Beautiful Soup to scrape common words and coding languages, mainly due to the speed and ease of that toolkit.

import requests
import bs4
from bs4 import BeautifulSoupr = requests.get(r'https://www.indeed.co.uk/jobs?q=machine+learning&l=london&sort=date&limit=30&radius=25&start=0')soup = BeautifulSoup(r.text, 'html.parser')job_cards = soup.find_all('div',
                          attrs={'class' : 'jobsearch-SerpJobCard'})## Then to test if the data is being collected correctlyfor job in job_cards:
    try:
        print('Title:',
              job.find('h2',
                            attrs={'class' : 'title'}).text)
    except:
        print('Title2:',
              job.find('h2',
                            attrs={'element' : 'job_title'}))

    try:
        print('company:', job.find('span', attrs={'class' :'company'}).text)except:
        print('no company')
        
    try:
        
        print('location:',job.find('span', attrs={'class' :'location'}).text)
    except:
        print('no location')try:
        print('rating:',job.find('span', attrs={'class' : 'ratingsContent'}).text)
    except:
        print('no rating')
        
    try:
        print('salary:',job.find('span', attrs={'class' : 'salaryText'}).text)
    except:
        print('no salary')
    print('=' *20)

I then switched over to Selenium

from selenium.webdriver.common.by import By
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

I defined the functions used to collect the data using Xpaths.

# Using Selenium, and X-paths result in each individual clickcarddef extract_location(result):
    try:
        location = result.find_element_by_xpath('.//span[contains(@class,"location")]').text
    except:
        location = "None"
    return locationdef extract_salary(result):
    try:
        salary = result.find_element_by_xpath('.//span[@class="salaryText"]').text
    except:
        salary = "None"
    return salarydef extract_job_title(result):
    try:
        title  = result.find_element_by_xpath('.//h2[@class="title"]//a').text
    except:
        title = result.find_element_by_xpath('.//h2[@class="title"]//a').get_attribute(name="title")
    return titledef extract_company(result):
    try:
        company = result.find_element_by_xpath('.//span[@class="company"]').text
    except:
        try:
            company = result.find_element_by_xpath('//*[@id="p_9d31f359e164959f"]/div[1]/div[1]/span[1]/a')
        except:
            company = "None"
    return companydef extract_link(result):
    try:
        link = result.find_element_by_xpath('.//h2[@class="title"]//a').get_attribute(name="href")
    except:
        link = "None"
    return linkdef extract_rating(result):
    try:
        rating = job.find_element_by_xpath('.//span[@class="ratingsContent"]').text
    except:
        rating = "None"
    return rating

The code below allows for any search to be used with any job and location parameters in the UK. It should be easily changeable for use in other regions.

# Chromedriver locationDRIVER_PATH = '/Users/finnalexander/Desktop/chromedriver'# Loading the driverdriver = webdriver.Chrome(executable_path = DRIVER_PATH)# Load the websitedriver.get('https://indeed.co.uk')# Location choose search citysearch_job = input('Enter job search: ')# Location choose search locationsearch_location = input('Enter city : ')find_job = '+'.join(search_job.split(' '))find_location = '+'.join(search_location.split(' '))# Input location & job for searchfinder_location = driver.find_element_by_xpath('//*[@id="text-input-where"]')
finder_location.send_keys(search_location)finder_job = driver.find_element_by_xpath('//*[@id="text-input-what"]')
finder_job.send_keys(search_job)# Click Find Jobsinitial_search_button = driver.find_element_by_xpath(
    '//*[@id="whatWhereFormId"]/div[3]/button')
initial_search_button.click()

This will display the first page of the personalised job search, but you can also refine your search in the advanced search.

# Click Advanced Searchadvanced_serch = driver.find_element_by_xpath('//*[@id="jobsearch"]/table/tbody/tr[3]/td[4]/div/a')
advanced_serch.click()# set display limit of 30 results per pagedisplay_limit = driver.find_element_by_xpath('//*[@id="limit"]//option[@value="30"]')
display_limit.click()# sort jobs by date rather than relevancesort_option = driver.find_element_by_xpath('//*[@id="sort"]//option[@value="date"]')
sort_option.click()# Click Find Jobsfind_jobs = driver.find_element_by_xpath('//*[@id="fj"]')
find_jobs.click()# Click pop up asking for Email addresstry:
    driver.find_element_by_xpath('//*[@id="popover-x"]/button').click()
except:
    print('no need')

Once the data has been collected, there is a certain amount of data cleaning that is required before trying to create a model to predict high or low salary jobs. Some of the variables that required cleaning; converting the target variable into a numerical value to get a true understanding of the median salary etc, as well as removing postcodes (an entertaining and informative stackoverflow on this very subject kept me 😆 for an hour or so). Due to the fact that many job postings don’t have location information, removing postcodes and assigning the City variable mean that we have less localised data, or at least less options for defining location etc.

Once the data is cleaned and the observations that can’t be used to train or test my model have been removed (specifically data with no salary information), it is time to scrape the job descriptions. I stored the link from each job-card onto a dataframe and because I had removed the useless observations, this had reduced the amount of potential website requests by approximately 60%.

At this stage it was very important to put in a time-delay into the scraping using Selenium. Initially, I made the mistake of sending almost 2000 requests in a very short period of time. This caused my scraper to be flagged as a potential robot, and put a premature ending to my first attempt at scraping the data. I ended up implementing a wait-time based on a random number between 3 and 8 seconds before sending a new request. I then managed to collect the data with no further problems.

Exploratory Data Analysis

Now that all the data had been collected, it was important to do a little bit of exploratory data analysis. I looked at the salary distribution as well as some of the categories that were going to be most relevant; Location, Paid_by, City, Review, etc. The first thing I noticed is that the paid_by_day distribution was extremely high compared to other paid_by values. I ended up keeping the day salary information at the imputed values based on working 20 days a month. Arguably all jobs which were being paid by the day could’ve been removed or possibly reduced by the expected work amount. It is, however, not surprising that people getting paid a day rate are on average higher paid than people who are in permanent employment.

The next step was to set a binary classification problem using the median salary as the threshold.

Time to start working on some models 🤓!!!!!

The model that performed best was the Logistic Regression model. With just the location variable, it was already performing with over .65 mean cv scores, which is well above the baseline of 50% . This was to be expected due to the difference in the economies of cities around the UK and location has quite a large bearing on the expected salary income e.g. London has a much higher living cost and the South of England is generally more expensive.

After trying out all of the categorical variables, I noticed the model performance had not improved by much. In order to try and improve on my scores, I implemented Natural Language Processing on the job titles and descriptions. I collected the most used 100 words in the English language (according to Wikipedia) to remove them from my text data as they are unlikely to provide any real insight in high or low salary jobs. I also collected the 50 most used coding languages. My main reason for this was because some languages are short in length/ have characters which may have been skipped over or replaced. Here is the code:

title_splitting = df['Title'].str.split(r"[.\)-,/ (]")def replace_all(list_of_strings):
    wrk_title = []
    for word in list_of_strings:
        if word in coding_languages:
            wrk_title.append(word.upper())
        elif word == 'IT':
            wrk_title.append(word.upper())
        elif word.lower() in common_words:
            break
        elif word.isalpha()==False:
            break
        
        wrk_title.append(word.upper())
    return list(set(wrk_title))df['REC_TITLE_'] = ["|".join(replace_all(title)) for title in title_splitting]title_dummies = df['REC_TITLE_'].str.get_dummies().add_prefix('T_')result = pd.concat([title_dummies, df[['Location', 'City', 'Paid_by', 'Review']]], axis=1, sort=False)

The data was then put through many different models including Decision Tree classifiers. Most of them performed similarly, however Logistic Regression continued to score the best.

X=result
X = pd.get_dummies(X)
y= df[‘Salary’].apply(lambda x: 1 if x>=median_sal else 0)X_train, X_test, y_train, y_test = train_test_split(X, y,
 stratify=y, test_size=0.2, random_state=1)scaler = StandardScaler()X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)X_train = pd.DataFrame(X_train, columns=X.columns)
X_test = pd.DataFrame(X_test, columns=X.columns)model = LogisticRegression(solver=’liblinear’, multi_class=’ovr’)gs_params = {'C': np.logspace(-4, 1, 50),
          'penalty': ['l1', 'l2'],
          'fit_intercept': [True, False]}gs = GridSearchCV(estimator=model,
                  param_grid=gs_params,
                  cv=5,
                  scoring='accuracy',
                  return_train_score=True,
                  n_jobs=2,
                  verbose=2)gs.fit(X_train, y_train)# BEST RESULTSBest Parameters:
{'C': 0.18420699693267165, 'fit_intercept': False, 'penalty': 'l1'}
Best estimator C:
0.18420699693267165
Best estimator mean cross validated training score:
0.7784470911589556
Best estimator score on the full training set:
0.9031165311653117
Best estimator score on the test set:
0.7533875338753387

The confusion matrix shows a pretty even balance between the false positives and the false negatives. By looking at the ROC curve and the Precision-Recall Curve below we can be relatively happy with our model’s performance.

Whilst working on this project I learned how to work with Natural Language Processes using countvectorizers and tfidvectorisers, as well as how to use OneHotEncoder. I tried to implement this on the description data set and ended up using it on the Job-title as well. This Logistic Regression performed slightly better after some fine-tuning of the parameters in the vectoriser compared to the naïve natural language processing I had created before.

This was a little disappointing to not see a significant improvement in the model. It did, however, allow me to infer one vital point about the data set, in that the title is a lot more important than the description. In fact the description had very little impact on my scores at all. Job-Title, Location and all the other categorical variables collected during this project are way more valuable.

With regards to risks and limitations of this model, it would be interesting to see if it would work with other English language jobs posted on different job search sites. The other major limitation was that the data was only collected over a period of a few days in October 2020, and is therefore a relatively small sample size. With more information and more fine-tuning in the description processing, it may be possible to achieve a better model with a higher recall and precision score.

BOOM 🤩

Indeed Job Scraper and Analysis

INDEED JOB SEARCH

How to predict a data related jobs salary in the UK

Exploratory Data Analysis

Written by Finn Alexander