Data Science Pathways

Data science is a synthesis of several well-established fields, so the skills that make a successful data scientist come from a variety of disciplines, including statistics, mathematics, and computer science. Navigating a pathway through all of these fields can be challenging. To help, I've compiled a list of resources for building and expanding data science knowledge:


Computational Tools


A vast array of tools can be used to solve problems in data science. Some are programming languages or environments; others are packages for solving specific problems or for communicating and visualizing results.

Programming Languages

Almost any programming language can be used to solve computational problems, although a few stand out for their built-in packages and user support communities. Most notably, Python and R excel in these respects and are also freely available. MATLAB may have the most detailed documentation of any of the options available, but it is commercial software.

Python. Python is a powerful, general-purpose, dynamic programming language that has extensive packages for scientific computation (NumPy, SciPy, Pandas), advanced plotting (matplotlib), and machine learning (scikit-learn). For this sort of scientific computing, an IDE such as Rodeo or Spyder may speed up the development of analyses.

R. R is a free software environment for statistical computing and graphics. It compiles and runs on a wide variety of UNIX platforms, Windows, and macOS. With the RStudio integrated development environment (IDE), the language can be used for rapid analyses. Additionally, R Shiny can turn R analyses into interactive web applications.

MATLAB. A numerical computing environment and programming language with a wide set of standard toolboxes including those for statistics and machine learning.

Julia. A newer programming language designed to meet the needs of mathematical computing.

Version Control

Almost any data science project worth doing requires many revisions and a good deal of collaboration. The tools below allow for comprehensive Git-based version control with a web-based repository. GitHub is the most popular hosting service, but others offer similar web-based repository services.

Git. An open source distributed version control system. Git is often used with a web-based Git repository hosting service such as GitHub.

Code Sharing and Dissemination

Jupyter Notebook. This web application allows you to create and share documents that contain live code, equations, visualizations and explanatory text.

GitHub Pages / GitHub.io. GitHub Pages allows you to create a web page from a GitHub repository and convert plain text into a formatted web document.

Visualization

D3.js. D3 (or Data Driven Documents) is an open-source JavaScript library for producing dynamic, interactive data visualizations in web browsers. Because it is based on JavaScript, visualizations are fully customizable, but the library requires significant skill to use effectively.

Tableau. Proprietary desktop and web-based visualization tools that include many data visualization techniques for the rapid development of professional visualizations.

Database Management (for big and little data)

MySQL. An open source relational database management system using SQL.
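As a rough illustration of what "relational" and "using SQL" mean in practice, the sketch below stores a few rows in a table and queries them. It uses Python's built-in sqlite3 module as a stand-in so that it runs without a database server; it is not MySQL itself, and a MySQL client library would need a server connection and may use a different placeholder style. The table name and columns are hypothetical, and the rows are taken from the course list on this page.

```python
import sqlite3

# Assumption: SQLite (in memory) stands in for a MySQL server here.
conn = sqlite3.connect(":memory:")

# A relational table: every row has the same named, typed columns.
conn.execute("CREATE TABLE courses (title TEXT, instructor TEXT, year INTEGER)")
conn.executemany(
    "INSERT INTO courses (title, instructor, year) VALUES (?, ?, ?)",
    [
        ("Data Science", "Irizarry", 2014),
        ("Introduction to Data Science", "Lex", 2016),
        ("Reinforcement Learning", "Silver", 2015),
        ("Learning From Data", "Abu-Mostafa", 2010),
    ],
)

# A SQL query: filter rows by a condition and sort the result.
cursor = conn.execute(
    "SELECT title FROM courses WHERE year >= ? ORDER BY title", (2015,)
)
recent = [row[0] for row in cursor]
print(recent)

conn.close()
```

The same SELECT statement should carry over to other SQL databases largely unchanged, which is much of the appeal of learning SQL once.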

Apache Hadoop. An open source framework for distributed file storage and processing (often associated with “big data”) that uses the Hadoop Distributed File System (HDFS) for storage and the MapReduce programming model for data processing.
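The MapReduce idea can be sketched in a few lines of plain Python. This is a single-machine toy, not Hadoop's API: the function names are hypothetical, and a real cluster would run the map and reduce steps in parallel across many nodes reading from HDFS. It shows the three stages of a word count: map each record to key-value pairs, group the pairs by key, and reduce each group to one result.

```python
from collections import defaultdict


def map_phase(document):
    """Emit a (word, 1) pair for every word in one input record."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle_phase(pairs):
    """Group emitted values by key, the step that sits between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Combine all values collected for one key into a single result."""
    return key, sum(values)


def word_count(documents):
    """Run the three phases in sequence on a list of text records."""
    mapped = [pair for doc in documents for pair in map_phase(doc)]
    grouped = shuffle_phase(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


counts = word_count(["big data is big", "data about data"])
print(counts)
```

Because each call to map_phase depends only on its own record, and each call to reduce_phase only on its own key, both stages can be spread across machines, which is what makes the model suitable for large datasets.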

MongoDB. A document-oriented NoSQL database (non-relational database, which does not rely on tables for storing data) capable of handling a wider variety of data types than traditional SQL relational databases.
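To make "document-oriented" concrete, the sketch below models a small collection as plain Python dictionaries. It does not use a MongoDB driver; the field names and the find helper are hypothetical and only mimic the idea of querying by field value. The point is that each document can carry its own fields, including lists and nested objects, with no shared table schema.

```python
import json

# Each document has its own shape: no fixed set of columns is required.
resources = [
    {"name": "Git", "topic": "Version Control", "open_source": True},
    {"name": "Tableau", "topic": "Data visualization", "platforms": ["desktop", "web"]},
    {"name": "D3.js", "topic": "Interactive visualization", "language": {"name": "JavaScript"}},
]


def find(collection, **criteria):
    """Return documents whose top-level fields equal every given criterion."""
    return [
        doc
        for doc in collection
        if all(doc.get(field) == value for field, value in criteria.items())
    ]


matches = find(resources, topic="Version Control")
print(json.dumps(matches, indent=2))
```

Documents that lack a queried field are simply skipped rather than causing an error, which is roughly how a schemaless store tolerates records of different shapes.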


References

Recommendations are indicated with a star.


Name | Author | Topic | Year
Computational and Inferential Thinking | Adhikari, Ani and John DeNero | Machine Learning | Unknown
Basic Probability Theory | Ash, Robert B. | Probability and Statistics | 1970
Git Tutorial | Atlassian | Programming - Version Control | Unknown
Bayesian Reasoning and Machine Learning | Barber, David | Machine Learning - Bayesian Methods | 2012
A First Course in Linear Algebra | Beezer, Robert Arnold | Mathematics - Linear Algebra | 2008
Pattern Recognition and Machine Learning | Bishop, Christopher | Machine Learning | 2006
Introduction to Probability | Blitzstein, Joseph and Jessica Hwang | Probability and Statistics | 2019
The Functional Art: An Introduction to Information Graphics and Visualization | Cairo, Alberto | Visualization - Design | 2012
Open Data Science Masters Curriculum | Corthell, Clare | Data Science as a Field | 2015
Git Internals | Chacon, Scott | Programming - Version Control | 2008
A Course in Machine Learning | Daumé III, Hal | Machine Learning | 2015
Bayesian Methods for Hackers: Probabilistic Programming and Bayesian Inference | Davidson-Pilon, Cameron | Machine Learning - Bayesian Methods | 2015
Mathematics for Machine Learning | Deisenroth, Marc Peter, A. Aldo Faisal, and Cheng Soon Ong | Mathematics | 2019
Composing Programs | DeNero, John | Programming - Python | Unknown
OpenIntro Statistics | Diez, David, Christopher Barr, and Mine Cetinkaya-Rundel | Probability and Statistics | 2015
50 Years of Data Science | Donoho, David | Data Science as a Field | 2015
Think Bayes: Bayesian Statistics Made Simple | Downey, Allen | Machine Learning - Bayesian Methods | 2013
Think Complexity: Complexity Science and Computational Modeling | Downey, Allen | Programming - Python | 2012
Think Python: How to Think Like a Computer Scientist | Downey, Allen | Programming - Python | 2015
Think Stats: Probability and Statistics for Programmers | Downey, Allen | Probability and Statistics | 2014
Pattern Classification | Duda, Richard O., Peter E. Hart, and David G. Stork | Machine Learning | 2012
Now You See It: Simple Visualization Techniques for Quantitative Analysis | Few, Stephen | Visualization - Design | 2009
The Elements of Statistical Learning (2nd Edition) | Friedman, Jerome, Trevor Hastie, and Robert Tibshirani | Machine Learning | 2009
Hands-On Machine Learning with Scikit-Learn and TensorFlow | Géron, Aurélien | Machine Learning | 2017
Oracle/SQL Tutorial | Gertz, M. | Programming - SQL | 2000
Deep Learning | Goodfellow, Ian, Yoshua Bengio, and Aaron Courville | Machine Learning - Deep Learning and Neural Networks | 2016
Grinstead and Snell’s Introduction to Probability | Grinstead, Charles and James Snell | Probability and Statistics | 2006
R for Data Science | Grolemund, Garrett, and Hadley Wickham | Programming - R | 2017
Community Calculus | Guichard, David | Mathematics - Calculus | 2016
Calculus 1, 2, and 3. 3rd Edition | Hartman, Gregory | Mathematics - Calculus | 2015
Data Visualization: A Practical Introduction | Healy, Kieran | Visualization - Design | 2018
Linear Algebra | Hefferon, Jim | Mathematics - Linear Algebra | 2006
Causal Inference: What If | Hernan, Miguel and James Robins | Other | 2019
An Introduction to Statistical Learning | James, Gareth, Daniela Witten, Trevor Hastie, and Robert Tibshirani | Machine Learning | 2013
Data Science at the Command Line | Janssens, Jeroen | Other | 2019
A Brief Introduction to Neural Networks | Kriesel, David | Machine Learning - Deep Learning and Neural Networks | 2007
Principles and Techniques of Data Science | Lau, Sam, Joey Gonzalez, and Deb Nolan | Machine Learning | Unknown
Linear Algebra and Its Applications | Lay, David | Mathematics - Linear Algebra | 2006
Notes on Diffy Qs: Differential Equations for Engineering | Lebl, Jiří | Mathematics - Differential Equations | 2014
Mining of Massive Datasets | Leskovec, Jure, Anand Rajaraman, and Jeffrey Ullman | Programming - MapReduce | 2014
Data-Intensive Text Processing with MapReduce | Lin, Jimmy, and Chris Dyer | Programming - MapReduce | 2010
Information Theory, Inference and Learning Algorithms | MacKay, David | Probability and Statistics - Information Theory | 2003
D3 Tips and Tricks | Maclean, Malcolm | Visualization - D3 | 2013
Calculus 1, 2, and 3. 2nd Edition | Marsden, Jerrold and Alan Weinstein | Mathematics - Calculus | 1985
Foundations of Machine Learning | Mohri, Mehryar, Afshin Rostamizadeh, and Ameet Talwalkar | Machine Learning | 2018
Interpretable Machine Learning | Molnar, Christoph | Machine Learning | 2019
Interactive Data Visualization for the Web | Murray, Scott | Visualization - D3 | 2013
MySQL Tutorial | MySQL | Programming - SQL | 1997
Learning Statistics with R: A Tutorial for Psychology Students and Other Beginners (version 0.5) | Navarro, Daniel | Programming - R | 2015
Deep Learning Tutorial | Ng, Andrew | Machine Learning - Deep Learning and Neural Networks | Unknown
Neural Networks and Deep Learning | Nielsen, Michael | Machine Learning - Deep Learning and Neural Networks | 2016
The Quest for Artificial Intelligence: A History of Ideas and Achievements | Nilsson, Nils | Data Science as a Field | 2010
The Not So Short Introduction to LaTeX 2ε | Oetiker, Tobias | Programming - Typesetting | 2016
The Python Tutorial | Python Software Foundation | Programming - Python | 2017
Python Machine Learning, Second Edition | Raschka, Sebastian | Machine Learning | 2017
Gaussian Processes for Machine Learning | Rasmussen, Carl Edward, and Christopher Williams | Probability and Statistics - Gaussian Processes | 2006
The Hitchhiker’s Guide to Python | Reitz, Kenneth and Tanya Schlusser | Programming - Python | 2016
A First Course in Probability | Ross, Sheldon | Probability and Statistics | 2014
From Python to Numpy | Rougier, Nicolas | Programming - Python | 2017
Python for Informatics | Severance, Charles | Programming - Python | Unknown
Understanding Machine Learning: From Theory to Algorithms | Shalev-Shwartz, Shai and Shai Ben-David | Machine Learning | 2014
Advanced Data Analysis from an Elementary Point of View | Shalizi, Cosma | Machine Learning | Unknown
Learn Python the Hard Way: A Very Simple Introduction to the Terrifyingly Beautiful World of Computers and Code | Shaw, Zed | Programming - Python | 2013
RegExr | Skinner, Grant | Programming - Regular Expressions | Unknown
An Interactive Neural Network Playground | Smilkov, Daniel and Shan Carter | Machine Learning - Deep Learning and Neural Networks | Unknown
Calculus: Early Transcendentals. 8th Edition | Stewart, James | Mathematics - Calculus | 2015
The State of Data Science | Stitch | Data Science as a Field | 2016
Calculus: MIT OpenCourseWare | Strang, Gilbert | Mathematics - Calculus | 1991
Immersive Linear Algebra | Ström, Jacob, Kalle Åström, and Tomas Akenine-Möller | Mathematics - Linear Algebra | 2019
Reinforcement Learning: An Introduction | Sutton, Richard, and Andrew Barto | Machine Learning - Reinforcement Learning | 2018
Data Science University Programs | Swanstrom, Ryan | Data Science as a Field | 2015
Automate the Boring Stuff with Python: Practical Programming for Total Beginners | Sweigart, Al | Programming - Python | 2016
Linear Algebra Done Wrong | Treil, Sergei | Mathematics - Linear Algebra | 2004
Elementary Differential Equations | Trench, William | Mathematics - Differential Equations | 2013
The Visual Display of Quantitative Information | Tufte, Edward | Visualization - Design | 2001
Introduction to Applied Linear Algebra | Vandenberghe, L. | Mathematics - Linear Algebra | 2017
Python Data Science Handbook | VanderPlas, Jake | Programming - Python | 2016
Scipy Lecture Notes | Varoquaux et al. | Programming - Python | 2017
An Introduction to R | Venables, W., and D. Smith | Programming - R | 2017
Advanced R | Wickham, Hadley | Programming - R | Unknown
A Visual Introduction to Machine Learning | Yee, Stephanie, and Tony Chu | Machine Learning | Unknown

Courses with Online Materials

Recommendations are indicated with a star.

Title | Instructor | Designation | University | Year
A Mathematics Course for Political and Social Researchers | Siegel | None | Duke University | 2014
Advanced Machine Learning | Adams | CS 281 | Harvard University | 2013
Advanced Machine Learning | Gogate | CS 7301 | University of Texas at Dallas | 2017
Advanced Topics in Machine Learning | Krause | CS 253 | California Institute of Technology | 2010
Applied Machine Learning | Mueller | COMS W4995 | Columbia | 2017
Artificial Intelligence | Berenson | CS 534 | Worcester Polytechnic Institute | 2015
Artificial Intelligence | Winston | OpenCourseWare | Massachusetts Institute of Technology | 2010
Computational Linear Algebra for Coders | Thomas | None | University of San Francisco | 2017
Computational Statistics in Python | Chan | STA 663 | Duke University | 2015
Computational Statistics in Python | Chan | STA 663 | Duke University | 2017
Convolutional Neural Networks for Visual Recognition | Li | CS 231n | Stanford University | 2017
Computer Vision | Tomasi | COMPSCI 527 | Duke University | 2019
Machine Learning | Raschka | STAT 479 | University of Wisconsin-Madison | 2018
Data Mining and Analysis | Walther | Stats 202 | Stanford University | 2017
Data Mining and Machine Learning | Arnold | STAT 365/665 | Yale University | 2016
Data Science | Irizarry | CS 109 | Harvard University | 2014
Deep Unsupervised Learning | Abbeel | CS 294-158 | University of California, Berkeley | 2019
Fast.ai NLP | Thomas | Fast.ai NLP | University of San Francisco | 2019
Introduction to Artificial Intelligence | Klein | CS 188 | University of California, Berkeley | 2014
Introduction to Artificial Intelligence | Konidaris | CPS 270 | Duke University | 2016
Introduction to Data Science | Lex | CS 5963 | University of Utah | 2016
Introduction to Data Science for Public Policy | Chen | PPOL 670 | Georgetown University | 2018
Introduction to Machine Learning | Shewchuk | CS 189 | University of California, Berkeley | 2017
Introduction to Machine Learning | Srihari | CSE 574 | University at Buffalo | 2017
Introduction to Machine Learning | Vishwanathan | CS 590 | Purdue University | 2010
Introduction to Machine Learning and Data Mining | Harrington | COMP 135 | Tufts University | 2016
Learning From Data | Abu-Mostafa | MOOC | California Institute of Technology | 2010
Machine Learning | Dietterich | CS 534 | Oregon State University | 2005
Machine Learning | Domingos | CSE 446 | University of Washington | 2014
Machine Learning | Domingos | CSE 546 | University of Washington | 2014
Machine Learning | Fern | CS 534 | Oregon State University | 2015
Machine Learning | Guestrin | 10-601 | Carnegie Mellon University | 2007
Machine Learning | Jamieson | CSE 546 | University of Washington | 2017
Machine Learning | Kakade | CSE 546 | University of Washington | 2016
Machine Learning | Mitchell | 10-601 | Carnegie Mellon University | 2015
Machine Learning | Ng | CS 229 | Stanford University | Unknown
Machine Learning | Shavlik | CS 760 | University of Wisconsin | 2010
Machine Learning | Wang | CS 6140 | Northeastern University | 2017
Machine Learning | Weinberger | CS 4780 | Cornell University | 2017
Machine Learning | Zisserman | C19 | Oxford University | 2015
Machine Learning and Data Mining | Ihler | CS 178 | University of California, Irvine | 2011
Machine Learning for Data Science | Paisley | COMS W4721 | Columbia | 2017
Machine Learning for Data Science | Salleb-Aouissi | COMS 4721 | Columbia | 2014
Mining Massive Data Sets | Leskovec | CS 246 | Stanford University | 2019
Mining of Massive Data Sets | Ullman | CS 246 | Stanford University | 2017
Natural Language Processing with Deep Learning | Manning | CS 224n | Stanford University | 2019
Practical Deep Learning for Coders, v3 | Howard | Fast.ai v3 | Fast.ai | 2019
Principles of Machine Learning | Bradbury | IDS 705 | Duke University | 2019
Reinforcement Learning | Silver | COMPM 050 | University College London | 2015
Statistical Computing for Scientists and Engineers | Zabaras | None | University of Notre Dame | 2018
Structure and Interpretation of Computer Programs | DeNero | CS 61A | University of California, Berkeley | 2019
Tensorflow for Deep Learning Research | Huyen | CS 20SI | Stanford University | 2017
Theories of Deep Learning | Donoho | STATS 385 | Stanford University | 2017

Tools

Name | Topic | Description
Anaconda Python Distribution | Python | Distribution for Python with package manager
Arxiv Sanity Preserver | Research Searching | Time-saving tool for searching arxiv.org
arXiv.org | Research Searching | Open access scholarly e-prints
Authorea | Collaborative Writing | Online scientific document collaboration
Bokeh | Python | Interactive plotting tools
cmder | Command Line | Console emulator for Windows
Colorgorical | Color Palette Generator | Online color palette generator
CommonMark | Markdown Language | Markdown Language
Convnet.js | Machine Learning | Simple deep learning package for JavaScript
D3.js | Interactive visualization | D3 (or Data Driven Documents) is an open-source JavaScript library for producing dynamic, interactive data visualizations in web browsers. Because it is based on JavaScript, visualizations are fully customizable, but the library requires significant skill to use effectively
Dillinger.io | Markdown Language | Markdown editor
Draw.io | Graphics | Online graphics platform
Explain Shell | Command Line | Search command lines to see the help text that matches each argument
Fabric | Python | Command line automation tool
Fast.ai | Machine Learning | The fastai library simplifies training fast and accurate neural nets using modern best practices
Git | Version Control | Open source distributed version control system - the de facto standard
GitHub | Version Control | Web hosting for Git repositories
GitHub Pages | Web Publishing | Host web pages from GitHub repositories
Google Colab | Compute | Colab allows you to write and execute Python code in your browser with free access to GPUs
Google Earth Engine | Geospatial data | Geospatial data analysis tool
Google Scholar | Research Searching | Search tool for scholarly literature
Google Scholar Top Publications | Publication Rankings | Searchable list of conference and journal rankings by field
Google Seedbank | Machine Learning | Machine learning idea generation tool
Google Style Guide | Programming | Style guide for Python, R, Shell, HTML, CSS, JavaScript, Java, and C++
Guild.ai | Experiment tracking | Run, track, and compare machine learning experiments
Hackmd.io | Markdown Language | Markdown editor
Jupyter Notebook | Programming | This application allows you to create and share documents that contain live code, equations, visualizations and explanatory text
Keras | Machine Learning | High-level neural network Python library enabling fast experimentation
LaTeX for PowerPoint | Math Editor | Use LaTeX to write and edit math in PowerPoint
Mathcha.io | Math Editor | Online what-you-see-is-what-you-get math editor; also allows LaTeX syntax
Microsoft Academic | Research Searching | Search tool for scholarly literature
Mode | SQL | Simple SQL query and visualization tool
Oh my zsh | Command Line | Command line themes and plugins, better tab completion and more
Open source license guide | License | A guide to choosing an open source license
OpenAI Gym | Reinforcement Learning | A toolkit for developing and comparing reinforcement learning algorithms
OpenAI Universe | Reinforcement Learning | A toolkit for developing and comparing reinforcement learning algorithms, particularly video games
Our World in Data Grapher | Visualization | Open source tools for data visualization
Overleaf | Collaborative Writing | Online LaTeX collaboration
Plot.ly for Python | Python | Interactive plotting tools
PyFormat | Python | Explanation of formatting in Python
PyTorch | Machine Learning | Open source deep machine learning framework
RegExr | Regular Expressions | Interactive regular expression playground
Reinforce JS | Machine Learning | Reinforcement learning package for JavaScript
Rodeo | Python | A Python integrated development environment
SACRED | Experiment tracking | Sacred is a tool to help you configure, organize, log and reproduce experiments
Scimago Journal Rankings | Publication Rankings | Searchable list of conference and journal rankings by field
Scrapy | Web Scraping | Scrape data from the web
Scrollama | Interactive visualization | Scrollers for interactive web visualizations
Semantic Sanity | Research Searching | Time-saving tool for searching arxiv.org
Semantic Scholar | Research Searching | AI-backed search engine for scientific journals
Snorkel | Machine Learning | Programmatically building and managing training data
So You Want to Build A Scroller | Interactive visualization | Scrollers for interactive web visualizations
Stackedit.io | Markdown Language | Markdown editor
Style Guide for Python Code | Python | Programming style guide
Tableau | Data visualization | Graphical user interface-based data visualization tool
Tabula | Data Scraping | Extract data from tables
TensorFlow | Machine Learning | Open source deep machine learning framework
TensorFlow JS | Machine Learning | Deep learning package for JavaScript
TensorFlow Playground | Neural Networks | Interactive neural network playground
Tensorwatch | Debugging | Deep learning debugging tool from Microsoft
The Neural Network Zoo | Neural Networks | A graphical cheat sheet for neural network architectures and acronyms
Tmux | Programming | Terminal multiplexer
Tuna | Python | Python profiler and performance analysis tool
Weights and Biases | Experiment tracking | Record metrics, visualize training and share findings
Zotero | Reference Management | Reference and citation management system for research

Videos

Name | Author | Organization | Description
Kernel Functions | Abu-Mostafa, Yaser | Caltech | Description of kernel functions and how they are used
Gradient Boosting and Machine Learning | Hastie, Trevor | H2O.ai | Discussion of ensemble learning including random forests and gradient boosting
Natural Language Processing | Jurafsky, Dan | Stanford | Video series on natural language processing (text analysis)
Machine Learning | Klein, Dan and Pieter Abbeel | Berkeley | Artificial Intelligence and Reinforcement Learning lectures
Nuts and Bolts of Applying Deep Learning | Ng, Andrew | Deep Learning School | Andrew Ng offers advice for those looking to enter the field of machine learning
Essence of Linear Algebra | Sanderson, Grant | 3Blue1Brown | Video series on a geometric interpretation of linear algebra concepts
Neural Networks | Sanderson, Grant | 3Blue1Brown | Introductory video series on neural networks
Taylor Series | Sanderson, Grant | 3Blue1Brown | Clear description of Taylor Series
Learning to See | Welch, Stephen | Welch Labs | Intuitive, visual explanation of machine learning
Neural Networks Demystified | Welch, Stephen | Welch Labs | Visual introduction to neural networks
Support Vector Machines | Winston, Patrick | MIT OpenCourseWare | An exceedingly lucid explanation of support vector machines - intuitively and mathematically