OLS Regression for Pay Gap Analysis

What OLS answers, and what it does not

Pay-gap analysis usually moves from reporting the gap, to estimating the residual, to decomposing what contributes to the difference.

Layer 1

Directive reporting - the actually observed pay gap

Mostly unadjusted descriptive metrics; 9.1(g) narrows the comparison to same/equal-value groups.

Layer 2

OLS regression - estimate the unexplained pay gap

One combined model tests whether gender is still associated with pay after adjusting for pay-related factors. OLS regression is the analysis method used by most model providers.

Layer 3

Oaxaca-Blinder - decompose the pay gap

Male and female models split the gap into a detailed explained and an unexplained component.

An OLS residual is not proof of discrimination, but it is a priority for investigation.
If a variable reduces the gender coefficient, the model has found a statistical explanation. That does not automatically make the explanation fair, objective, or gender-neutral.

The regression model explained

log(pay) = gender + grade + job family + location + tenure + performance + error

The gender coefficient is the main result for adjusted pay gap analysis as it would represent a pay decrease for women.

Observed pay Pay differs

Start with the pay outcome you want to explain.

Controls Compare like for like individuals

Adjust for pay-related factors.

Result -0.041

Gender coefficient: roughly a 4.1% adjusted pay gap.

log(pay)

The pay measure being explained. Log pay makes coefficients easier to read as approximate percentage differences.

gender

The variable of interest. In a common setup, Woman = 1 and Man = 0, so men are the reference group.

controls

Measured pay-related factors such as grade, job family, location, tenure, performance, and working time.

error

The part of pay not explained by the variables in the model.

Data and Excel setup

Start from the data preparation logic: one clean analysis dataset, one row per employee, and documented sources for pay, job, and people fields.

Field	Use in the model
Employee ID	One anonymised row per employee.
Gender	Consistent coding, 0 = Male, 1 = Female.
Pay measure	Base wage or salary as annual, hourly, and adjusted to 1 FTE. The same approach applies to variable and complementary pay.
Log pay	For Excel derived with =LN(pay).
Job data	Grade, job family, employee group, location, level, or job profile.
People factors	Tenure, qualifications, performance, ...

Run the model and read the output

Recommended route for Excel analysis

Analysis ToolPak

Link to Excel tutorial

Enable the Analysis ToolPak.
Open Data > Data Analysis > Regression.
Set Input Y Range to the log-pay column.
Set Input X Range to gender plus all of the control variables.
Tick Labels if the first row contains headers.
Run the model and preserve the output with the dataset version.

Advanced Excel users can also use =LINEST(Y_range, X_range, TRUE, TRUE), but the ToolPak output is easier to read and preserve for a first model.

Read the output as a set of signals. None of them is a verdict on its own, but together they tell you whether the model is usable and where to investigate. The individual outputs can be interpreted as follows:

R Square Model fit

Higher means the model explains more pay variation between genders.

Observations Population check

This should match the expected analysis population. If it does not, fix the dataset before reinterpreting results.

Coefficient Size and direction

The absolute size shows strength of association with pay; the sign shows direction. The gender coefficient is the adjusted gap estimate.

P-value Statistical signal

Lower p-values give stronger evidence that the estimate is not zero.

95% interval Uncertainty

Narrower intervals are easier to interpret. Wide intervals mean the estimate is uncertain.

Interpreting the gender coefficient

In a log-pay model, the gender coefficient is the adjusted pay gap estimate. If Woman = 1 and Man = 0, a negative coefficient means women are paid less than comparable men in the model.

=EXP(coefficient)-1

For -0.041, =EXP(-0.041)-1 gives approximately -4.0%.

The p-value suggests the estimated gender coefficient is unlikely to be zero under the model assumptions. That does not automatically prove discrimination, but it is a visible residual gap that requires explanation and investigation.

Checking the model

After reading the coefficient, stress-test the model before drawing conclusions. The result should not depend on a data artefact, a small-sample problem, or an unstable specification.

Data checks

Population, pay measure, and exclusions are documented.
Pay values are positive and plausible.
Missing values and zero values are separated.
Dummy variables omit one reference category.

Model checks

There are enough employees for the number of variables.
Key categories are not too small to interpret.
Outliers are reviewed before conclusions are drawn.
The gender coefficient does not move sharply when one variable is added or removed.
Identify whether the residual gap is concentrated in particular employee groups.