Please Stop Asking Data Scientists to Write Production Code

Seth Clark
3 min read · Feb 19, 2021

Few people need further convincing that data science is hard. The well-known and oft-quoted Sean J. Taylor says it well:

One more reason why data science (and most other STEM fields) is tough to master

For the sake of argument, however, let's try creating a list of what you might need to master in order to become a world-class data scientist:

  • Mathy stuff: Linear algebra, basic probability & statistics, Bayesian statistics, calculus, maybe discrete math, likely graph theory (at least at some point)
  • Programming expertise: Python, R, possibly some Julia, perhaps Scala, maybe C or C++ for embedded systems, probably SQL, and possibly MATLAB (Side note: How did software named for a portmanteau of “Matrix Laboratory” not become the default way to build deep neural networks?)
  • Libraries and frameworks: pandas, scikit-learn, NumPy, Matplotlib, TensorFlow, Keras, CNTK, PyTorch, OpenCV, CUDA, Spark (we could go on forever here…)
  • Software potpourri: Data wrangling, bash scripting, Linux prowess, exceptional debugging skills, strong Google-fu

This list is starting to get pretty long, but wait, there’s more…

  • Hardware hacking: Can run models on CPUs, GPUs, and maybe even FPGAs, good at managing dozens of poorly-documented dependencies for multi-year projects, able to install and run deprecated versions of OpenCV and TensorFlow on a hilariously small GPU for a demo your boss asked for (I’m not proud to say that I was that boss :-/ )
  • Intangibles: Able to speak “Executive”, a good sense of which problems are solvable, comfortable with open-ended problems

After all of that, it’s almost comical to ask data scientists to turn their amazing creations into production-caliber services. To build, deploy, and manage high-availability services you really need a strong grasp of containerization, microservices, Kubernetes, build servers, integration testing, API monitoring, and a whole lot more. Becoming an expert in these technologies takes about as much time as mastering the skills required to become a data scientist. It’s a bit unfair to ask data scientists to do both things well. It would be like expecting an architect to don a hardhat, hop into the nearest crane, and start swinging beams around for the building she just got done designing.

Photo by Artem Labunsky on Unsplash

So please, do your friendly neighborhood data scientist a favor and stop asking them to build production code (either directly or indirectly). Allowing them to focus their energy and excitement on data science will do wonders for your entire team, project, or product. Here are a few ideas for how you can help:

  1. Hire some machine learning engineers to help make the models your team develops both scalable and efficient
  2. Take advantage of CI/CD pipelines to increase automation, improve quality, and speed up delivery
  3. Invest in tools and software that will automatically deploy and monitor machine learning models so that data scientists can get back to the myriad other things they do
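
To make idea #2 a little more concrete, here is a minimal sketch of the kind of automated check a CI/CD pipeline could run before a model ships. Everything in it is an assumption for illustration: `toy_predict` is a hypothetical stand-in for a real model, `smoke_test` is a made-up helper rather than part of any library, and the test cases and thresholds are invented. The point is only that someone other than the data scientist can own a gate like this.

```python
import time
from typing import Callable, List, Sequence, Tuple


def toy_predict(features: Sequence[float]) -> float:
    """Hypothetical stand-in for a trained model: returns a score in [0, 1]."""
    return max(0.0, min(1.0, sum(features) / 10.0))


def smoke_test(
    predict: Callable[[Sequence[float]], float],
    cases: Sequence[Tuple[Sequence[float], float]],
    tolerance: float = 1e-6,
    max_seconds_per_call: float = 1.0,
) -> List[str]:
    """Run each (input, expected output) case and collect failures.

    An empty list means the gate passes. The tolerance and latency
    budget are illustrative defaults, not recommendations.
    """
    failures = []
    for features, expected in cases:
        start = time.perf_counter()
        try:
            actual = predict(features)
        except Exception as exc:  # a crash on a known input should block the release
            failures.append(f"{features!r}: raised {exc!r}")
            continue
        elapsed = time.perf_counter() - start
        if abs(actual - expected) > tolerance:
            failures.append(f"{features!r}: expected {expected}, got {actual}")
        if elapsed > max_seconds_per_call:
            failures.append(f"{features!r}: took {elapsed:.3f}s")
    return failures


# Invented reference cases; a real team would pull these from held-out data.
CASES = [([1.0, 2.0], 0.3), ([20.0], 1.0), ([], 0.0)]

problems = smoke_test(toy_predict, CASES)
for problem in problems:
    print("FAIL:", problem)
print("gate passed" if not problems else "gate failed")
```

In a real pipeline the build step would exit with a non-zero status whenever `problems` is non-empty, so a broken model never reaches deployment.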

One of the reasons I helped start modzy.com is that I saw how hard it is for data scientists and developers to collaborate effectively on machine learning projects. I hope these ideas help you and your team find new and more productive ways to work together.

Seth Clark

Co-founder and Head of Product at Modzy, product enthusiast, and serial hobbyist.