Predicting Salaries for Job Postings

Chenxu Wen, chenxu.wen.math@gmail.com

Table of contents

  1. Introduction and Import
    1. Data Import
    2. Sanity Checks
  2. Descriptive Analysis

  3. Predictive Model
    1. Linear Regression
    2. Random Forest Model
    3. Output Export
  4. Conclusions and Suggestions

Introduction and Import

In this report, I perform descriptive and predictive analyses on a dataset of job postings. The goal is to understand the factors that affect salaries and to predict salaries for new postings.

Data Import

The job posting datasets are imported into Python. The first five rows of data are shown.

# import libraries
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
%matplotlib inline
import seaborn as sns
# read in the training data set
train = pd.read_csv("train_features_2013-03-07.csv", low_memory=False)
# only drop rows that are all NA:
train = train.dropna(how='all')

# read the target variable - salary 
target = pd.read_csv("train_salaries_2013-03-07.csv")

# read the test data
test = pd.read_csv("test_features_2013-03-07.csv")
# take a look at the first 5 rows of train data
train.head()
              jobId companyId         jobType       degree      major industry  yearsExperience  milesFromMetropolis
0  JOB1362684407687    COMP37             CFO      MASTERS       MATH   HEALTH               10                   83
1  JOB1362684407688    COMP19             CEO  HIGH_SCHOOL       NONE      WEB                3                   73
2  JOB1362684407689    COMP52  VICE_PRESIDENT     DOCTORAL    PHYSICS   HEALTH               10                   38
3  JOB1362684407690    COMP38         MANAGER     DOCTORAL  CHEMISTRY     AUTO                8                   17
4  JOB1362684407691     COMP7  VICE_PRESIDENT    BACHELORS    PHYSICS  FINANCE                8                   16
# take a look at the first 5 rows of test data
test.head()
              jobId companyId      jobType       degree    major industry  yearsExperience  milesFromMetropolis
0  JOB1362685407687    COMP33      MANAGER  HIGH_SCHOOL     NONE   HEALTH               22                   73
1  JOB1362685407688    COMP13       JUNIOR         NONE     NONE     AUTO               20                   47
2  JOB1362685407689    COMP10          CTO      MASTERS  BIOLOGY   HEALTH               17                    9
3  JOB1362685407690    COMP21      MANAGER  HIGH_SCHOOL     NONE      OIL               14                   96
4  JOB1362685407691    COMP36       JUNIOR     DOCTORAL  BIOLOGY      OIL               10                   44
# take a look at the first 5 rows of target data
target.head()
              jobId  salary
0  JOB1362684407687     130
1  JOB1362684407688     101
2  JOB1362684407689     137
3  JOB1362684407690     142
4  JOB1362684407691     163

The first five rows of the train, test, and target data are shown above. The feature data has 8 columns: job ID, company ID, job type, degree required, major required, industry, years of experience required, and distance from a metropolis (in miles). The outcome, salary, is stored separately in the target table.

Sanity Checks

Once the data is imported, let’s perform a few sanity checks on the data, including:

  1. missingness - are there missing values in the data?
  2. duplicates - are there duplicates in the data?
  3. observation and feature order - do observations and features appear in the same order?
  4. outliers - are there outliers in both the predictors and the outcome?
  5. categorical levels - do train and test data have the same factor levels?

Missing Data Check

# check the number of entries and data type for each variable
train.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 1000000 entries, 0 to 999999
Data columns (total 8 columns):
jobId                  1000000 non-null object
companyId              1000000 non-null object
jobType                1000000 non-null object
degree                 1000000 non-null object
major                  1000000 non-null object
industry               1000000 non-null object
yearsExperience        1000000 non-null int64
milesFromMetropolis    1000000 non-null int64
dtypes: int64(2), object(6)
memory usage: 68.7+ MB
# check the number of entries and data type for each variable
test.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000000 entries, 0 to 999999
Data columns (total 8 columns):
jobId                  1000000 non-null object
companyId              1000000 non-null object
jobType                1000000 non-null object
degree                 1000000 non-null object
major                  1000000 non-null object
industry               1000000 non-null object
yearsExperience        1000000 non-null int64
milesFromMetropolis    1000000 non-null int64
dtypes: int64(2), object(6)
memory usage: 61.0+ MB
# check the number of entries and data type for each variable
target.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000000 entries, 0 to 999999
Data columns (total 2 columns):
jobId     1000000 non-null object
salary    1000000 non-null int64
dtypes: int64(1), object(1)
memory usage: 15.3+ MB

Information on each variable in the train, test, and target data is shown above. We can see that this is a relatively large dataset, with 1 million observations in each of the train and test sets, and no missing values in any column.
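
For completeness, here is a small sketch of a more explicit missingness check; it simply re-confirms what the non-null counts above already show:

# explicit missing-value counts per column (all should be zero, matching the info() output above)
print(train.isnull().sum())
print(test.isnull().sum())
print(target.isnull().sum())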

Duplicates Check

## check duplicates in train data
print('Are all observations in train unique?')
print(len(train.jobId.unique()) == train.shape[0]) 
Are all observations in train unique?
True
## check duplicates in test data
print('Are all observations in test unique?')
print(len(test.jobId.unique()) == test.shape[0]) 
Are all observations in test unique?
True
# now check whether any job postings appear in both the train and test datasets
print('The number of job IDs in the test dataset that also appear in the train dataset:')
print(len(set(test['jobId']).intersection(train['jobId'])))

print('The number of job IDs in the train dataset that also appear in the test dataset:')
print(len(set(train['jobId']).intersection(test['jobId'])))

The number of job IDs in the test dataset that also appear in the train dataset:
0
The number of job IDs in the train dataset that also appear in the test dataset:
0

It’s good news that there’s no overlap between train and test, because otherwise there’d be leakage in the model training.

Observation and Feature Order Check

Here let’s check if the observations appear in the same order in the train data and target data. If not, we’ll need to sort the target to match the train dataset.

# let's check if the jobID appear in the same order in train and in target
# if not, we'll need to sort the target to match the train dataset
sum(train.jobId == target.jobId) == train.shape[0]
True

Now, let’s check if the features appear in the same order in the train and test data. If not, we’ll need to sort the test data to match the train data.

sum(train.columns == test.columns) == train.shape[1]
True
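
Both checks return True. For completeness, here is a hedged sketch of the fix we would apply if they did not (reindexing the target by jobId and reordering the test columns):

# reference sketch: only needed if either order check above had returned False
if not (train.jobId == target.jobId).all():
    # reorder the target rows so the jobId order matches the train data
    target = target.set_index('jobId').loc[train['jobId']].reset_index()
if not (train.columns == test.columns).all():
    # reorder the test columns to match the train column order
    test = test[train.columns]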

Outliers Check

Now that we know there is no overlap between the train and test observations, and that the features appear in the same order in both, we can safely concatenate the two DataFrames for further analysis. Otherwise, we would need to merge the two datasets instead.

# to save some typing, let's concatenate the train and test data
data = pd.concat([train, test], keys=['train','test'])
# get a quick description of numeric values in the data
data.describe()
       yearsExperience  milesFromMetropolis
count     2.000000e+06         2.000000e+06
mean      1.199724e+01         4.952784e+01
std       7.212785e+00         2.888372e+01
min       0.000000e+00         0.000000e+00
25%       6.000000e+00         2.500000e+01
50%       1.200000e+01         5.000000e+01
75%       1.800000e+01         7.500000e+01
max       2.400000e+01         9.900000e+01
# histogram of years of experience
sns.distplot(data.yearsExperience, kde= False, bins=25)
<matplotlib.axes._subplots.AxesSubplot at 0x160243890>

[Figure: histogram of yearsExperience]

# histogram of miles from metropolis
sns.distplot(data.milesFromMetropolis, kde = False)
<matplotlib.axes._subplots.AxesSubplot at 0x1600df690>

[Figure: histogram of milesFromMetropolis]

The data on years of experience and miles from metropolis seem reasonable. Both the mean and median years of experience required are about 12 years, with a minimum of 0 and a maximum of 24. Both the mean and median miles from metropolis are about 50 miles, with a minimum of 0 and a maximum of 99. Both variables are almost uniformly distributed across their ranges, so there appear to be no outliers among the numeric features.

target.describe()
               salary
count  1000000.000000
mean       116.061818
std         38.717936
min          0.000000
25%         88.000000
50%        114.000000
75%        141.000000
max        301.000000
# count the number of 0s
print('The number of 0s in salary: '+ str( sum(target.salary == 0)))
print('The minimum non-zero salary: ' + str(min(target.salary[target.salary != 0])))
The number of 0s in salary: 5
The minimum non-zero salary: 17

We can see that the target, salary, is roughly normally distributed with a slight positive skew. Except for 5 observations with a salary of 0, all values are well above 0. These 5 records are likely invalid, so let's remove them for now.

# remove zero-salary observations
zero_index = target[target.salary == 0].index
train.drop(zero_index, inplace=True)
target.drop(zero_index, inplace=True)

# update data
data = pd.concat([train, test], keys=['train','test'])
# describe the target after removing the zero-salary rows
target.describe()
              salary
count  999995.000000
mean      116.062398
std        38.717163
min        17.000000
25%        88.000000
50%       114.000000
75%       141.000000
max       301.000000
data.select_dtypes(include=['object']).describe()
                   jobId companyId  jobType       degree    major industry
count            1999995   1999995  1999995      1999995  1999995  1999995
unique           1999995        63        8            5        9        7
top     JOB1362685306512    COMP39   SENIOR  HIGH_SCHOOL     NONE      WEB
freq                   1     32197   251088       475230  1066421   286217
# plot the distribution of the target
sns.distplot(target.salary)
<matplotlib.axes._subplots.AxesSubplot at 0x162acec10>

[Figure: distribution of salary]

The description of the categorical variables shows that we are working with a fairly clean dataset. Across all job postings, there are only 63 companies, 8 job types, 5 degree levels, 9 majors, and 7 industries involved. The jobs are fairly evenly distributed across companies, industries, degrees, and job types. The only exception is major, where a large share of postings do not specify a required major. The outcome variable is roughly normally distributed with a slight positive skew. Since the skew is only slight, we'll leave the outcome variable as is.
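
For reference only (a sketch, not applied here since the skew is mild), the usual remedy for a strongly skewed outcome is a log transform, with predictions mapped back afterwards via the inverse transform:

# sketch only: log-transform a strongly skewed outcome; np.log1p handles zeros safely,
# and np.expm1 would map predictions back to the original scale
log_salary = np.log1p(target['salary'])
print('Skewness of salary: %.3f; after log1p: %.3f' % (target['salary'].skew(), log_salary.skew()))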

Categorical Levels Check

Here let’s check if the categorical variables share the same levels in train and test data. If there are category levels in train but missing in test or vice versa, then we’d need to add a dummy level to test (or train) data to ensure consistent train and test data for the model.

# get categorical variable names
cat_vars = train.select_dtypes(include=['object']).columns
for i in cat_vars:
    print('The variable ' + i + ' has the same levels in the train and the test dataset:')
    # check if the intersection of levels is the same set as in train data
    print(len ( set(train[i]).intersection( test[i] ) ) == len( set(train[i])) )
The variable jobId has the same levels in the train and the test dataset:
False
The variable companyId has the same levels in the train and the test dataset:
True
The variable jobType has the same levels in the train and the test dataset:
True
The variable degree has the same levels in the train and the test dataset:
True
The variable major has the same levels in the train and the test dataset:
True
The variable industry has the same levels in the train and the test dataset:
True

The output above shows that the levels of all categorical variables are consistent between the train and test data, except for jobId (which is expected, since we already confirmed that no job IDs overlap between the two). This is good: we don't need to add extra dummy levels to either the train or test data, although a sketch of how we would do so follows.
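
For reference, here is a sketch of how one could force a shared set of levels had they differed, illustrated on the degree column (not actually needed here):

# sketch only (not needed here): cast the train and test columns to pandas Categorical
# with the union of levels; pd.get_dummies on these would then yield identical dummy
# columns for train and test
shared_levels = sorted(set(train['degree']) | set(test['degree']))
degree_train = pd.Categorical(train['degree'], categories=shared_levels)
degree_test = pd.Categorical(test['degree'], categories=shared_levels)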

Summary

In summary, we observed the following from the sanity checks:

  1. missing data - there are no missing values in the data.
  2. duplicates - there are no duplicates in the data.
  3. observation and feature order - the observations and features appear in the same order in the train, test, and target data.
  4. outliers - there are no apparent outliers in the features; 5 observations with a salary of 0 were removed. We'll also keep an eye out for outliers during the exploratory analysis.
  5. categorical levels - the train and test data have the same factor levels.

Descriptive Analysis

In this section, let’s analyze the correlations among the predictors and between the predictors and the outcome. One convenient way to visualize these relationships is a pairwise plot matrix, such as the one provided by seaborn’s pairplot/PairGrid functions. The challenge is that these functions do not handle categorical variables directly, and we have quite a few categorical predictors, so let’s map the categorical levels to integers with pandas’ factorize.

# first, for visualization purposes, create numeric representations of the categorical variables
train_cp = train.copy()
for i in train_cp.select_dtypes(include=['object']).columns.values:
    train_cp[i] = pd.factorize(train_cp[i])[0]
# add in the outcome var - salary 
train_cp['salary'] = target.salary
# remove job Id
train_cp.drop(['jobId'],axis=1, inplace=True)
# second, create a PairGrid instance
# since plotting all of the train data takes very long to generate,
# let's plot a random 10% subset of the data
grid = sns.PairGrid(data= train_cp.sample(frac= 0.1) )

# Map the plots to the locations
grid = grid.map_offdiag(sns.regplot, marker = '+', x_jitter=.1, y_jitter=0.1)
grid = grid.map_diag(sns.distplot, kde = False)

[Figure: pair plot of features and salary, 10% sample of the train data]

From the plots above, we can see that the data is fairly evenly distributed, both across the levels of the categorical features and across the range of the numeric features. The only exception is the major feature, where the NONE level is very frequent.

As to the relationship between the outcome and the features, we have the following observations:

  1. In general, salary is higher for positions requiring more experience and for positions closer to the metropolis, but there is a wide range of salaries at each experience and distance level. There are also some notable outliers well above the average at each level.
  2. There are differences in salary across degrees, majors, industries, and job types, but not across companies. As with the numeric features, there is a wide range of salaries within each level of the categorical features, except perhaps for the JANITOR job type. There are also some notable outliers well above the average at almost every level.

In the interest of space, I will not show individual bar plots, but a few observations stand out and are quite intuitive (see the sketch after this list for a quick numeric check):

  1. Degree shows no strong association with salary; for major, the main pattern is that postings with no specified major pay less.
  2. Job type matters: C-suite positions pay considerably more than the other job types.
  3. Finance, oil, and web are among the highest-paying industries, while service and education pay the least.
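
To back these observations up numerically, here is a quick sketch that compares mean salary across groups (the resulting group means are not reproduced in this report):

# sketch: mean salary by job type, degree, and industry on the train data
# (train and target share the same row index, so assign() aligns the salary column correctly)
with_salary = train.assign(salary=target['salary'])
print(with_salary.groupby('jobType')['salary'].mean().sort_values(ascending=False))
print(with_salary.groupby('degree')['salary'].mean().sort_values(ascending=False))
print(with_salary.groupby('industry')['salary'].mean().sort_values(ascending=False))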

Predictive Model

In this section, let’s predict salary using penalized linear models as a starting point.

We have five categorical features, and the goal is not only to build a model but also to estimate the contribution of each feature. With scikit-learn, we would need to one-hot encode these features, which disperses the contribution of a categorical feature across one indicator variable per level. To sidestep this, let’s use the h2o package, which handles categorical features natively as factors and keeps the per-level contributions grouped under a single feature.
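
To illustrate the dispersion issue, here is a small sketch (this encoding is not used in the model below):

# sketch: one-hot encoding a single categorical feature with pandas spreads it across
# one indicator column per level, so its overall contribution would be dispersed
onehot = pd.get_dummies(train[['jobType']])
print(onehot.columns.tolist())
# expected output along the lines of ['jobType_CEO', 'jobType_CFO', ..., 'jobType_VICE_PRESIDENT']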

import h2o
h2o.init()
Checking whether there is an H2O instance running at http://localhost:54321. connected.
H2O cluster uptime:         1 day 2 hours 16 mins
H2O cluster timezone:       America/Chicago
H2O data parsing timezone:  UTC
H2O cluster version:        3.22.1.6
H2O cluster version age:    22 days
H2O cluster name:           H2O_from_python_mshen_n00a2o
H2O cluster total nodes:    1
H2O cluster free memory:    2.624 Gb
H2O cluster total cores:    8
H2O cluster allowed cores:  8
H2O cluster status:         locked, healthy
H2O connection url:         http://localhost:54321
H2O connection proxy:       None
H2O internal security:      False
H2O API Extensions:         Amazon S3, XGBoost, Algos, AutoML, Core V3, Core V4
Python version:             2.7.14 final
# convert pandas DataFrame to H2OFrame 
dat = train.merge(target, on = 'jobId')
df = h2o.H2OFrame(dat)
# split the labeled data into training and holdout sets (80/20)
splits = df.split_frame(ratios=[0.8], seed=42)
dtrain = splits[0]
dtest = splits[1]
Parse progress: |█████████████████████████████████████████████████████████| 100%
y = 'salary'
x = list(df.columns)
x.remove(y)  #remove the response
x.remove('jobId') 
# The companyId feature doesn't seem to be related to salary, which confirms our intuition.
# Also, in production we may need to predict for companies not seen during training.
# Therefore, let's remove the companyId feature for now.
x.remove('companyId') 

Linear Regression

## grid search for best params with cross validation
from h2o.grid.grid_search import H2OGridSearch
from h2o.estimators.glm import H2OGeneralizedLinearEstimator
# grid search criteria; in the interest of time, cap the runtime and the number of models
gs_criteria = {'strategy': 'RandomDiscrete', 
              'max_models': 100,
              'max_runtime_secs': 6000, 
              'stopping_metric': 'mae',
              'stopping_tolerance': 3,
              'seed': 1234,
              'stopping_rounds': 15}

# linear model parameter space
lm_params = {'alpha': [0.1, 0.5, 0.9], 
             'lambda': [1e-4, 1e-2, 1e0, 1e2, 1e4]}

lm_gs = H2OGridSearch(model = H2OGeneralizedLinearEstimator(nfolds = 5), 
                      hyper_params = lm_params,
                     search_criteria = gs_criteria)
lm_gs.train(x=x,y=y, training_frame=dtrain)
# display model IDs, hyperparameters, and MSE
# lm_gs.show()
# Get the grid results, sorted by cross-validation MAE (ascending, so the best model comes first)
lm_gs_perf = lm_gs.get_grid(sort_by='mae', decreasing=False)
lm_gs_perf

# extract the top linear model, chosen by cross-validation MAE
lm_best = lm_gs_perf.models[0]

# Now let's evaluate the model performance on a test set
# so we get an honest estimate of top model performance
lm_best_perf = lm_best.model_performance(dtest)

print('The best test MAE is %.2f' % lm_best_perf.mae())
print('The best test RMSE is %.2f' % lm_best_perf.rmse())
print('The best test R2 is %.2f' % lm_best_perf.r2())
The best test MAE is 15.85
The best test RMSE is 19.62
The best test R2 is 0.75
# get the model coefficients
coef = lm_best.coef()
# put the coefficients into a DataFrame and rank them by absolute magnitude
coef_df = pd.DataFrame({'feature': coef.keys(), 'coefficient': coef.values()})

coef_df.reindex(coef_df.coefficient.abs().sort_values(ascending=False).index)
    coefficient                 feature
12   115.651554               Intercept
17   -34.583555         jobType.JANITOR
8     27.669385             jobType.CEO
15   -22.043900          jobType.JUNIOR
16    17.929292             jobType.CFO
28    17.925786             jobType.CTO
2    -16.191104      industry.EDUCATION
24    15.031780            industry.OIL
14    14.869542        industry.FINANCE
23   -12.049542          jobType.SENIOR
19   -11.159391        industry.SERVICE
25    10.121587         degree.DOCTORAL
21    -9.430685             degree.NONE
3      8.327428       major.ENGINEERING
30     7.794749  jobType.VICE_PRESIDENT
1     -7.222763              major.NONE
26    -6.227290           industry.AUTO
11    -5.954474        major.LITERATURE
29     5.882962            industry.WEB
27    -5.747840      degree.HIGH_SCHOOL
5      5.334937          major.BUSINESS
13     5.005238          degree.MASTERS
6      2.846280              major.MATH
9     -2.329465           major.BIOLOGY
22    -2.145721         jobType.MANAGER
18     2.009992         yearsExperience
10     1.619933           major.COMPSCI
7     -1.268621         major.CHEMISTRY
20    -0.399243     milesFromMetropolis
31     0.000000           major.PHYSICS
4      0.000000        degree.BACHELORS
0      0.000000         industry.HEALTH

Random Forest Model

# now let's use the random forest algorithm to see if we can improve the performance
# import 
from h2o.estimators.random_forest import H2ORandomForestEstimator
# rf parameters to be tuned
rf_params = {"ntrees" : [50,100], 
              "max_depth" : [5,10], 
              "sample_rate" : [0.3, 0.7], 
              "mtries":[-1, 3, 7],
              "min_rows" : [3,5] }
# parameters for rf that don't need to be tuned
rf_p = {"seed": 42 , 
         "score_tree_interval": 3,
         "nfolds": 5}
# set up the grid
rf_gs = H2OGridSearch(model = H2ORandomForestEstimator(**rf_p), 
                       hyper_params = rf_params,
                       search_criteria = gs_criteria)
# train the grid on the training split, keeping dtest aside for an honest evaluation
rf_gs.train(x=x, y=y, training_frame=dtrain)

# display model IDs, hyperparameters, and MSE
# rf_gs.show()
# Get the grid results, sorted by cross-validation MAE (ascending, so the best model comes first)
rf_gs_perf = rf_gs.get_grid(sort_by='mae', decreasing=False)
rf_gs_perf

# extract the top random forest model, chosen by cross-validation MAE
rf_best = rf_gs_perf.models[0]

# Now let's evaluate the model performance on a test set
# so we get an honest estimate of top model performance
rf_best_perf = rf_best.model_performance(dtest)

print('The best test MAE is %.2f' % rf_best_perf.mae())
print('The best test RMSE is %.2f' % rf_best_perf.rmse())
print('The best test R2 is %.2f' % rf_best_perf.r2())
The best test MAE is 15.69
The best test RMSE is 19.44
The best test R2 is 0.75
# plot feature importance
rf_best.varimp_plot()

[Figure: random forest variable importance]

Output Export

# write output
# convert test data to h2o frame
test_h2o = h2o.H2OFrame(test)
# determine the best model based on mae and use it for prediction
if rf_best_perf.mae() < lm_best_perf.mae():
    pred = rf_best.predict(test_h2o)
else:
    pred = lm_best.predict(test_h2o)
Parse progress: |█████████████████████████████████████████████████████████| 100%
drf prediction progress: |████████████████████████████████████████████████| 100%
# examine the basic stats of the salary predictions
pred.describe()
Rows: 1000000
Cols: 1

               predict
type              real
mins     29.2754525948
mean     116.050909815
maxs     216.278178101
sigma    32.3859060281
zeros                0
missing              0
0        106.993746109
1        100.914067307
2        170.002967072
3        108.953849945
4        111.065993118
5        149.859150085
6        96.3085207367
7        120.059133606
8         104.54789505
9        102.726793747
# concatenate jobId and the salary predictions
out = test_h2o['jobId'].concat(pred)
# rename the columns to jobId and salary, matching the target file
out.set_names(list(target.columns))
# write the test output to a csv file
h2o.export_file(out, path = "test_salaries.csv")    
Export File progress: |███████████████████████████████████████████████████| 100%

Conclusions and Suggestions

  1. In the data exploration, we observed that this dataset is quite clean and evenly distributed across feature levels, with only five zero-salary observations, which were removed.
  2. To predict the salary outcome, I built both a regularized linear model and a random forest model using the h2o package. The random forest model performed slightly better but took much longer to tune.
  3. Future directions - prediction explanations: if the rationale behind individual predictions is of interest, we can use LIME (Local Interpretable Model-Agnostic Explanations), which explains why a model makes a particular prediction by fitting a locally weighted linear model that approximates the model's behavior around that prediction. Another viable approach is SHAP values, which use a game-theoretic framework to attribute each feature's contribution to individual predictions as well as to the model's overall behavior. A rough sketch follows.
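
As a rough illustration of the SHAP idea, here is a minimal sketch that uses the shap package with a scikit-learn random forest as a stand-in for the h2o model; the sample size, model settings, and one-hot encoding are assumptions made only for this illustration.

# sketch only: SHAP values on a scikit-learn random forest used as a stand-in for the h2o model
# (assumes the shap and scikit-learn packages are available; settings here are illustrative)
import shap
from sklearn.ensemble import RandomForestRegressor

# one-hot encode a sample of the training features for the stand-in model
X = pd.get_dummies(train.drop(['jobId', 'companyId'], axis=1).sample(n=10000, random_state=42))
y_sample = target.loc[X.index, 'salary']

stand_in = RandomForestRegressor(n_estimators=50, max_depth=10, random_state=42).fit(X, y_sample)
explainer = shap.TreeExplainer(stand_in)
shap_values = explainer.shap_values(X.iloc[:100])  # per-feature contributions for 100 postings
shap.summary_plot(shap_values, X.iloc[:100])       # global view of feature contributions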