Data Cleaning Case Study

80,000 Hours — Different formats and conventions

October, 2020


You have 80,000 hours in your career. How can you best use them to help solve the world’s most pressing problems?

Introducing 80,000 Hours & their challenge

80,000 Hours is a nonprofit founded in 2011 that helps graduates build more meaningful careers. They research the question “What is the most impactful career one can have?” and advise graduates on how to transition into positions of high social impact.

To study the world’s problems and the graduates with the skills to solve them, 80,000 Hours collects data from many different sources all over the globe, each using its own set of formats, conventions, and abbreviations. Many fields, even ones as common as name, country, state, city, phone number, email, and social media handle, didn’t work out of the box with their existing tools. For example, names sometimes arrived as a single Full Name when they needed to be split into First Name and Last Name, and locations were sometimes given only as City when they needed to be City, State, Country.

The manual parsing and merging of these data sets was very time-consuming and prone to errors. 80,000 Hours needed an efficient and robust solution to clean, standardize, and merge these data. This solution would ideally also plug right into their existing workflow, which was mostly in Google Sheets at the time.

“They are who I turn to anytime I have a tricky data problem, and I highly recommend them!”

— Niel Bowerman, AI Policy Specialist at 80,000 Hours

What we did

Together, 80,000 Hours and Aja Data Lab specified around 21 Transformations and how each would be performed using code automation, APIs, and the latest text mining techniques. A few of those Transformations were:

Standardized locations, cities, states, and countries, using trusted world map and cities databases
Separated full names into first and last names using advanced text mining and third-party APIs
Deduplicated records with the same ID or email address
Cleaned faulty characters, spaces, and numbers from values — e.g. letters and dashes from phone numbers
Classified data based on a given condition — e.g. if a field’s value is greater than X, label that record as Y
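
To make a few of these Transformations concrete, here is a minimal Python sketch. It is not the code delivered to 80,000 Hours: the function names (clean_phone, split_full_name, deduplicate, classify) are hypothetical, and the name split is a deliberately naive whitespace rule rather than the text mining and third-party APIs mentioned above.

```python
from __future__ import annotations

import re


def clean_phone(raw: str) -> str:
    """Keep digits and one leading plus sign; drop letters, dashes, and spaces."""
    raw = raw.strip()
    prefix = "+" if raw.startswith("+") else ""
    return prefix + re.sub(r"\D", "", raw)


def split_full_name(full_name: str) -> tuple[str, str]:
    """Naive split (assumption): the last whitespace-separated token is the last name."""
    parts = full_name.split()
    if not parts:
        return "", ""
    if len(parts) == 1:
        return parts[0], ""
    return " ".join(parts[:-1]), parts[-1]


def deduplicate(records: list[dict], key: str = "email") -> list[dict]:
    """Keep the first record for each key value, compared case-insensitively.

    Records with an empty or missing key are always kept, since there is
    nothing to match them on.
    """
    seen: set[str] = set()
    unique: list[dict] = []
    for record in records:
        value = str(record.get(key, "")).strip().lower()
        if value and value in seen:
            continue
        seen.add(value)
        unique.append(record)
    return unique


def classify(value: float, threshold: float, label: str, default: str = "") -> str:
    """Return label when value is greater than threshold, otherwise default."""
    return label if value > threshold else default
```

A rule this simple will mis-split multi-word surnames such as “van der Berg”, which is one reason a production pipeline would lean on richer techniques for names.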

With 80,000 Hours’ expectations well-documented, Aja Data Lab developed the data processing functions that now apply the necessary Transformations to each column of data quickly, reliably, and autonomously.
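
One way to picture per-column processing is a table that maps each column to an ordered list of functions. The sketch below is illustrative only: the column names, the COLUMN_RULES mapping, and the apply_rules helper are assumptions, not the actual implementation.

```python
from typing import Callable, Dict, List

Transformation = Callable[[str], str]


def strip_whitespace(value: str) -> str:
    """Trim the ends and collapse internal runs of whitespace to one space."""
    return " ".join(value.split())


def lowercase(value: str) -> str:
    return value.lower()


# Hypothetical rules: each column gets its own ordered list of Transformations.
COLUMN_RULES: Dict[str, List[Transformation]] = {
    "email": [strip_whitespace, lowercase],
    "city": [strip_whitespace, str.title],
}


def apply_rules(
    row: Dict[str, str],
    rules: Dict[str, List[Transformation]] = COLUMN_RULES,
) -> Dict[str, str]:
    """Return a cleaned copy of row; columns without rules pass through unchanged."""
    cleaned = dict(row)
    for column, transformations in rules.items():
        if column not in cleaned:
            continue
        for transform in transformations:
            cleaned[column] = transform(cleaned[column])
    return cleaned
```

Keeping the rules in data rather than in control flow means a new Transformation can be added by editing one mapping entry, without touching the loop that applies it.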

Outcomes

Before the agreed delivery date, 80,000 Hours had fully automated their data cleaning and enrichment efforts. We estimate this saved them ~10 hours per week of employee time, or ~$15,000 per month at market salaries, plus an estimated $3,000 per month in effort wasted on human error and faulty data. As agreed in the design stage, the solution integrated smoothly into 80,000 Hours’ workflow, requiring no training or extra steps from their employees.
