Summary of my personal advice regarding data science and data engineering etc

Intro

I ended up getting many similar questions and giving similar advice to students or junior people in my 1-1s regarding data science or engineering. So I thought I should collect them here for future reference and in case someone forgot to take notes on something they cared about.

Please feel free to critique this advice! I would love to learn if I am wrong or misguided or missed something obvious.

Disclaimer

These are just my experiences and opinions. Ask other people and you will get other answers (and you should! Ask more people to get a better overview, and read lots of articles etc). This is not a complete list of important things.

I have only worked in the private sector in non-EA areas. I have not worked in academia, public sector, non-profit or an EA org.

General advice applicable to many roles

  • Learn what you think is important in a job. Besides impact and salary there are other aspects such as work environment, freedom/autonomy, being able to work from home, benefits, vacation etc. Reading 80000hours.org is helpful.

    • If you don’t work in an impactful field you should make sure to get a proper salary—there is no reason to leave money on the table.

      • Look into negotiating: read articles and books (e.g. Never Split the Difference, a very American negotiation book; I’m no expert, so there are probably better books) and talk to senior people.

      • Learn what other people with similar experience and roles earn. Take into account things such as which city they work in, whether they have other compensation, and whether they are consultants or freelancers. You can learn this by asking people. You can also look up statistics, but keep in mind that the average or median will include people who do not know how to negotiate.

  • Learn what problem you are trying to solve /​ who are your users or your audience

    • If you solve the wrong problem, in the best case you waste a lot of time, but you might also deliver a worse or incorrect solution, possibly with bad consequences.

    • Who are the users/audience? Is it developers, data scientists, the public, a certain customer segment (e.g. young people interested in music), the CEO, business analysts, etc.?

    • If you know who the user/​audience is—try to find out what they need to solve

    • Ask them about the problem and LISTEN. Do not tell them the solution, ask them questions to discover the problem space.

      • For example: Someone needs a list of the profit for certain products each month. Why do they need this? Do they need it to solve some other problem? If they are going to use it for further calculations, can those be done beforehand to save them time? Could you create a dashboard to avoid having to send them the data every month? Is the data already available in some other report or dashboard?

        • Let’s say you find out they need this data to calculate the profit for a certain product category for every city the company operates in. Could you not just calculate that directly? Why not?

      • NOTE: I’m not very satisfied with the example above, I might improve it later. Hopefully it gives an idea of what I mean.

    • How would you feel using your product? How easy is it to understand the product/​analysis/​demo/​presentation/​email/​etc without your inside knowledge and without your expertise?

    • Reading about active listening and design thinking is a good idea

    • Try to communicate effectively: notice when people misunderstand you, think about why, and try to improve.

  • Domain vs tech knowledge:
    As a programmer / software engineer / data scientist / data engineer you can be very productive without knowing much about the domain (e.g. retail, finance, biorisk, global health etc), but then it becomes even more important to listen to the users/stakeholders/business/experts. If you have domain knowledge you will be even more productive and accurate.
    It very much depends on how complex the problems are and how tied they are to the domain (e.g. a website can be filled with content written by someone else with the knowledge, so you can create the “shell” and fill it with content, but creating a banking database requires that you know more about the domain so that you model the data correctly)
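To make the profit example above a bit more concrete, here is a minimal sketch in plain Python (standard library only, to keep it dependency-free; in practice a pandas groupby would be the natural tool). The records, city names and function names are all made up for illustration. It contrasts the report that was literally asked for with the one the user actually needed:

```python
from collections import defaultdict

# Hypothetical sales records: (month, city, product, category, profit).
sales = [
    ("2024-01", "Stockholm", "guitar", "instruments", 120.0),
    ("2024-01", "Stockholm", "piano", "instruments", 300.0),
    ("2024-01", "Gothenburg", "guitar", "instruments", 80.0),
    ("2024-01", "Gothenburg", "t-shirt", "merchandise", 15.0),
    ("2024-02", "Stockholm", "guitar", "instruments", 100.0),
]


def profit_per_product_and_month(rows):
    """What was literally asked for: profit per product and month."""
    totals = defaultdict(float)
    for month, _city, product, _category, profit in rows:
        totals[(product, month)] += profit
    return dict(totals)


def profit_per_category_and_city(rows):
    """What the user actually needed: profit per category and city."""
    totals = defaultdict(float)
    for _month, city, _product, category, profit in rows:
        totals[(category, city)] += profit
    return dict(totals)


requested = profit_per_product_and_month(sales)
needed = profit_per_category_and_city(sales)

# Print the report the user really wanted, one line per category and city.
for (category, city), profit in sorted(needed.items()):
    print(f"{category:12} {city:11} {profit:8.2f}")
```

If the second report is what the user really wants, delivering it directly saves them a manual step every month.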

Useful skills for data scientists

Not a complete list, kind of ranked first to last:

  • Python

    • pandas

    • ML: scikit-learn / PyTorch / TensorFlow

    • know some visualization tools, e.g. seaborn / matplotlib (pyplot) / plotly

  • statistics & probability (obviously)

  • machine learning skills—but at many orgs advanced knowledge is not needed

  • JupyterLab / Jupyter notebooks

  • Know about common types of analysis (very business centered list below), e.g.

    • Churn detection

    • A/​B testing

    • KPIs

  • SQL: it, or a dialect of it, is used by almost every database, and it has inspired many frameworks etc

  • ChatGPT / phind.com

  • git: learn the core concepts such as staging, merging, rebasing, tags, origin, committing, reverting, comparing versions, stashing.

  • Know about BI (Business Intelligence) tools, e.g. Power BI, Tableau, Looker, but there are literally hundreds of other products...

  • Know how to use one of the big cloud providers (Azure, AWS, GCP) - but you do not need to be an expert at all

  • Docker—learn the basics, no need to be an expert

  • Learn what Continuous Integration / Continuous Deployment (CI/CD) is; just knowing the concept is enough

  • Book tip: How to Measure Anything. It is a bit old, so it lacks machine learning, A/B testing etc, but it gives a solid understanding of how to think about solving problems for organizations using data and statistics. If you read it, you should complement it with newer material such as blog posts. Also keep in mind that the book does not adequately address problems such as https://en.wikipedia.org/wiki/Goodhart’s_law
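As an illustration of the A/B testing item in the list above, here is a minimal sketch of one common way to compare two conversion rates: a two-sided two-proportion z-test, using only the standard library. The function name and the numbers in the usage example are made up, and real experiments need more care (sample size planning, multiple comparisons, peeking at results early), so treat this as a starting point rather than a recipe.

```python
import math


def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Two-sided z-test for a difference between two conversion rates.

    conv_a / conv_b are the number of conversions, n_a / n_b the number
    of users in variant A and variant B. Returns (z, p_value).
    """
    p_a = conv_a / n_a
    p_b = conv_b / n_b
    # Pooled conversion rate under the null hypothesis "no difference".
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Hypothetical experiment: 200 of 2000 users convert on A, 250 of 2000 on B.
z, p_value = two_proportion_z_test(200, 2000, 250, 2000)
print(f"z = {z:.2f}, p = {p_value:.4f}")
```

In practice you would more likely reach for scipy or statsmodels, but writing it out once makes it clearer what the test actually measures.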

What is a data engineer?

Data engineering is basically bringing software engineering tools and practices into the “data world”.

However, it is a new area, so being a data engineer at different companies can be very different. In some places it is like being a software engineer; in other places it means using no-code tools to create data pipelines. In my case it means building data tools using software engineering skills, or deciding which ready-made tools/libraries/frameworks to add to the platform, and helping the people doing the analysis learn to use the data platform. This includes writing documentation, teaching people, trying out new technology to solve problems, and creating data pipelines in the tools we have chosen or built.

There is also the concept of MLOps (Machine Learning Operations), which overlaps a lot with data engineering but is much more focused on making ML systems work. Thus you need more ML knowledge. Terms such as model drift, MLOps platforms (such as Kubeflow) and feature stores become more important.

Who becomes a data engineer?

The majority are probably software developers interested in data science and data scientists interested in software development. But of course there are other paths.

Useful skills for data engineers

Not a complete list, kind of ranked first to last:

  • Python or Java: Python is better to know but knowing both is even better

  • Know the basics of data science /​ machine learning libraries

  • git: learn the core concepts such as staging, merging, rebasing, tags, origin, committing, reverting, comparing versions, stashing.

  • Concepts to understand:

    • Data pipelines

    • ETL

    • DevOps

    • Data Warehouse

    • Data Lake

    • Kafka /​ log based message brokers

    • KPIs

    • Event driven systems

    • Orchestration and scheduling

    • Continuous Integration /​ Continuous Deployment

    • Relational databases

    • Document Databases

    • Data Catalog

  • ChatGPT / phind.com

  • SQL: it, or a dialect of it, is used by almost every database, and it has inspired many frameworks etc

  • Know how to use one of the big cloud providers (Azure, AWS, GCP)

  • Know how to create and use a REST API

  • Know about BI (Business Intelligence) tools, e.g. Power BI, Tableau, Looker, but there are literally hundreds of other products...

  • New buzzwords to read about, though they can still be a bit unclear:

    • Data Mesh—including data as a product and data products

    • Data monitoring

    • Data observability

    • Reverse-ETL

  • JupyterLab / Jupyter notebooks: learn the basics

  • Kubernetes: learn the basics if your org requires it. You might need to learn advanced skills, depending on how your org is structured

  • Docker: learn at least the basics, but you don’t need to be an expert

  • Terraform—know at least the basics
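Several of the concepts in the list above (data pipelines, ETL, relational databases, SQL) fit into one small example. The following is a minimal sketch of an extract-transform-load step using only the Python standard library and an in-memory SQLite database. The CSV content, table name and column names are made up for illustration; a real pipeline would read from files, APIs or message brokers and would usually be run by an orchestration tool.

```python
import csv
import io
import sqlite3

# Hypothetical raw export; in practice this would come from a file or an API.
RAW_CSV = """order_id,city,amount
1,stockholm,100.50
2,Gothenburg,20.00
3,STOCKHOLM,
4,gothenburg,30.25
"""


def extract(text):
    """Extract: parse the raw CSV text into a list of dictionaries."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Transform: drop rows without an amount and normalise city names."""
    cleaned = []
    for row in rows:
        if not row["amount"]:
            continue  # skip incomplete records
        cleaned.append(
            (int(row["order_id"]), row["city"].strip().title(), float(row["amount"]))
        )
    return cleaned


def load(conn, rows):
    """Load: write the cleaned rows into a relational table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS orders "
        "(order_id INTEGER PRIMARY KEY, city TEXT, amount REAL)"
    )
    conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", rows)
    conn.commit()


conn = sqlite3.connect(":memory:")
load(conn, transform(extract(RAW_CSV)))

# Once the data is loaded, analysts can answer questions with plain SQL.
totals = conn.execute(
    "SELECT city, SUM(amount) FROM orders GROUP BY city ORDER BY city"
).fetchall()
print(totals)
```

Keeping extract, transform and load as separate functions makes each step easy to test on its own, which is one of the software engineering habits data engineering brings into the data world.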

Random useful resources:

Book tip for software engineering: The Pragmatic Programmer. It contains lots of general advice on how to work, solve problems, code and design systems. Useful for anyone who codes but aimed at software engineers.

Favorite EA Books: Doing Good Better & The Precipice

Spaced repetition and flash cards:

Podcasts

  • EA:

    • 80000hours

    • Hear This Idea

    • Future of Life Institute Podcast

  • Misc: Clearer Thinking podcast

  • Data:

    • Data Skeptic

    • Podcast.__init__

  • Python:

    • Talk Python to Me—many data science related episodes

    • Python Bytes

    • Podcast.__init__
