
Overview
As a Data Science Fellow at Insight, New York, I consulted for a tech company that receives more than 12,000 job applications per year across several domains (e.g., Data Science and Data Engineering). Like many companies, my client faces a large volume of job applications, which means significant time spent processing applications manually and slow responses to applicants, sometimes taking several weeks. I used natural language processing (NLP) and targeted feature engineering to develop a machine learning model that scores and ranks applications, helping my client fast-track top applicants to the interview stage. My product saves my client an average of 12,000 hours per year of manual processing time and shortens their response time from weeks to days.
Data Cleaning and Feature Engineering
I obtained 3,600 job applications that were labeled as either No (the application was rejected during the first round of review) or Yes+ (the application was passed on to the next stage). What made this task challenging was that applications consisted almost entirely of unstructured text: the applicants’ answers to a variety of application questions, in which they briefly described their education and professional background, domain knowledge, industry-specific skills, motivations, and so on.
To extract meaningful information from the unstructured (text-based) applications, I engineered features based on expert domain knowledge (i.e., from in-depth discussions with people who have first-hand experience in the hiring process). I considered various aspects of successful applications, such as the presence of specific keywords related to an applicant’s relevant skills (e.g., machine learning, statistics), knowledge of relevant tools (e.g., programming languages), education level and study area, as well as the length of responses to the application questions.
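The kind of features described above can be illustrated with a minimal sketch. The keyword lists, the PhD pattern, and the function name extract_features are hypothetical placeholders chosen for illustration; the actual features came from discussions with domain experts and are not reproduced here.

```python
import re

# Hypothetical keyword lists (assumptions for illustration only).
SKILL_KEYWORDS = ("machine learning", "statistics")
TOOL_KEYWORDS = ("python", "sql")


def extract_features(answers):
    """Turn a list of free-text answers into a flat dict of numeric features."""
    text = " ".join(answers).lower()
    features = {}
    # Binary keyword-presence features, matched on word boundaries.
    for kw in SKILL_KEYWORDS + TOOL_KEYWORDS:
        pattern = r"\b" + re.escape(kw) + r"\b"
        name = "has_" + kw.replace(" ", "_")
        features[name] = int(re.search(pattern, text) is not None)
    # A crude education-level indicator (assumed pattern).
    features["has_phd"] = int(re.search(r"\bph\.?\s?d\b", text) is not None)
    # Response length, measured here as the total word count.
    features["total_words"] = len(text.split())
    return features


example = extract_features([
    "I hold a PhD in Statistics.",
    "I use Python daily for machine learning projects.",
])
print(example)
```

Each application then becomes one row of numeric features that a model can consume.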

Logistic Regression Models to Score Applications
I employed logistic regression models to predict the suitability of job applications for five tech domains at my client company: Data Science, Health Data Science, Data Engineering, Artificial Intelligence, and DevOps. Logistic regression is well suited to this task because it is simple, easy to interpret, and, as a parametric model, suitable for making predictions on new data. In addition, logistic regression models the probability of each class (here No and Yes+), rather than strictly classifying data into binary classes. These classification probabilities can be directly interpreted as application scores and used to rank applicants.
For each domain, I started with a full model (all potentially relevant features) and reduced the feature space using univariate feature selection.
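A minimal sketch of this setup for a single domain, assuming scikit-learn, might look as follows. The data are synthetic, and the scoring function (f_classif), the number of retained features (k=8), and the scaling step are assumptions for illustration, not the settings used in the project.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for one domain's feature matrix (1 = Yes+, 0 = No).
X, y = make_classification(
    n_samples=600, n_features=20, n_informative=5, random_state=0
)

# Hold out 30% of the applications for testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

model = make_pipeline(
    StandardScaler(),
    SelectKBest(score_func=f_classif, k=8),  # univariate selection; k is assumed
    LogisticRegression(max_iter=1000),
)
model.fit(X_train, y_train)

# Column 1 of predict_proba is P(Yes+), used directly as the application score.
scores = model.predict_proba(X_test)[:, 1]

# Indices of the held-out applications, ordered from highest to lowest score.
ranking = scores.argsort()[::-1]
```

Placing the selection step inside the pipeline means it is fit on the training data only, so the held-out applications do not influence which features are kept.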

Validation
To assess model performance, I calculated the area under the Receiver Operating Characteristic (ROC) curve, or AUC. Unlike other commonly used validation metrics such as accuracy or F1 score, AUC is threshold independent (i.e., it is suitable for continuous classification probabilities rather than binary classifications), while also being appropriate for balanced classes, as is the case here.
To validate my models, I compared model predictions for the test data sets (i.e., the 30% of the applications that were not used to train the model) to the actual labels in the test data (i.e., No and Yes+). Visualizing model performance this way allowed me to quickly assess the usefulness of the models’ application scores in classifying applications into No and Yes+ categories along a continuous range of classification thresholds.
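To make the threshold independence concrete, AUC can be read as the probability that a randomly chosen Yes+ application receives a higher score than a randomly chosen No application. The sketch below computes it that way in plain Python on made-up labels and scores (both are illustrative assumptions, not project data); scikit-learn's roc_auc_score computes the same quantity.

```python
def auc(labels, scores):
    """AUC as P(score of a random Yes+ > score of a random No); ties count 0.5."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one example of each class")
    wins = 0.0
    # Compare every Yes+ score against every No score.
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Made-up held-out labels (1 = Yes+, 0 = No) and model scores.
labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.3, 0.1]
result = auc(labels, scores)
print(round(result, 3))
```

No classification threshold appears anywhere in the calculation: only the ordering of the scores matters.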

I delivered to my client a module consisting of the application processing pipeline and the domain models, as well as a user-friendly interactive platform that allows hiring managers in various departments (i.e., locations) to tailor ApplicantScore to their requirements and preferences. In the example below, a user at the BOS location (drop-down menu) specifies a classification threshold of 0.7 (slider). At this threshold, roughly 30% of applications will be fast-tracked, of which almost 95% would have been forwarded if applications were processed manually.
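The two numbers reported for a given threshold, the share of applications fast-tracked and the fraction of those that manual review would also have forwarded, can be computed as in the sketch below. The function name fast_track_summary and the example scores and labels are hypothetical; this is not the code behind the interactive platform.

```python
def fast_track_summary(scores, labels, threshold):
    """Return (share of applications fast-tracked, fraction of those labeled Yes+)."""
    # Labels (1 = Yes+, 0 = No) of applications scoring at or above the threshold.
    selected = [l for s, l in zip(scores, labels) if s >= threshold]
    share = len(selected) / len(scores)
    precision = sum(selected) / len(selected) if selected else float("nan")
    return share, precision


# Made-up scores and manual-review labels for ten applications.
scores = [0.95, 0.85, 0.75, 0.65, 0.55, 0.45, 0.35, 0.25, 0.15, 0.05]
labels = [1, 1, 1, 1, 0, 1, 0, 0, 0, 0]

share, precision = fast_track_summary(scores, labels, threshold=0.7)
print(share, precision)
```

Raising the threshold generally fast-tracks fewer applications with higher agreement with manual review, which is the trade-off the slider exposes.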
An Interactive Tool to Help my Client Fast Track High Quality Applications

More details on this project and the code (public version) can be found here.