Data science is a synthesis of several well-established fields, so the skills that make for a successful data scientist come from a variety of disciplines, including statistics, mathematics, and computer science. Navigating a pathway through developing skills in all of these fields can be challenging. To help, I've compiled a list of resources for building and expanding data science knowledge:
There is a vast array of tools that can be used for solving problems in data science. Some are programming languages or environments; others are useful packages for solving specific problems or for communicating and visualizing your results.
Almost any programming language can be used to solve computational problems, although a few stand out for their built-in packages and user support communities. Most notably, Python and R excel in these respects and are also freely available. MATLAB may have the most detailed documentation of any of the options available, but it is commercial software.
Python. Python is a powerful, general-purpose, dynamic programming language that has extensive packages for scientific computation (NumPy, SciPy, Pandas), advanced plotting (matplotlib), and machine learning (scikit-learn). For this sort of scientific computing, using an IDE such as Rodeo or Spyder may speed up the development of analyses.
R. R is a free software environment for statistical computing and graphics. It compiles and runs on a wide variety of UNIX platforms, Windows, and OSX. With the RStudio integrated development environment (IDE), the language can be powerfully wielded for rapid analyses. Additionally, R Shiny can turn R analyses into interactive web applications.
MATLAB. A numerical computing environment and programming language with a wide set of standard toolboxes including those for statistics and machine learning.
Julia. A newer programming language designed to meet the needs of mathematical computing.
Almost any data science project worth doing requires many revisions and substantial collaboration. These tools allow for comprehensive Git-based version control with a web-based repository. Github is the most popular hosting service, but the alternatives offer similar web-based repository services.
Git. Open source distributed version control system. Git is often used with a web-based Git repository hosting service such as Github.
Jupyter Notebook. This web application allows you to create and share documents that contain live code, equations, visualizations and explanatory text.
Github Pages / Github.io. Github Pages allows you to create a web page from a Github repository and convert plain text into a formatted web document.
D3.js. D3 (or Data Driven Documents) is an open-source JavaScript library for producing dynamic, interactive data visualizations in web browsers. Since this is based in JavaScript, visualizations are entirely customizable, but do require significant skill to use effectively.
Tableau. Proprietary desktop and web-based visualization tools that include many data visualization techniques for the rapid development of professional visualizations.
MySQL. An open source relational database management system using SQL.
Apache Hadoop. An open source framework for distributed file storage and processing (often associated with “big data”) that uses the Hadoop Distributed File System (HDFS) for storage and the MapReduce algorithm for data processing.
MongoDB. A document-oriented NoSQL database (non-relational database, which does not rely on tables for storing data) capable of handling a wider variety of data types than traditional SQL relational databases.
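To make the MapReduce idea mentioned under Apache Hadoop more concrete, here is a minimal single-machine sketch of a word count in plain Python. It is not Hadoop's API; the function names (`map_phase`, `shuffle_phase`, `reduce_phase`, `word_count`) are hypothetical and only illustrate the three stages. A real cluster would run the map and reduce steps in parallel across many nodes.

```python
from collections import defaultdict


def map_phase(document):
    """Map step: emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle_phase(pairs):
    """Shuffle step: group all emitted values by their key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce step: combine all values for one key into a single result."""
    return key, sum(values)


def word_count(documents):
    """Run map, shuffle, and reduce in sequence (illustrative only)."""
    pairs = [pair for doc in documents for pair in map_phase(doc)]
    grouped = shuffle_phase(pairs)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    docs = ["big data big ideas", "data science"]
    print(word_count(docs))
```

Because each document is mapped independently and each key is reduced independently, both steps can in principle be spread across machines, which is the property frameworks like Hadoop exploit.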
Recommendations are indicated with a star.
Title | Author | Topic | Year | |
---|---|---|---|---|
Computational and Inferential Thinking | Adhikari, Ani and John DeNero | Machine Learning | Unknown | |
Basic Probability Theory | Ash, Robert B. | Probability and Statistics | 1970 | |
Git Tutorial | Atlassian | Programming - Version Control | Unknown | |
Bayesian Reasoning and Machine Learning | Barber, David | Machine Learning - Bayesian Methods | 2012 | |
A First Course in Linear Algebra | Beezer, Robert Arnold | Mathematics - Linear Algebra | 2008 | |
Pattern Recognition and Machine Learning | Bishop, Christopher | Machine Learning | 2006 | |
Introduction to Probability | Blitzstein, Joseph and Jessica Hwang | Probability and Statistics | 2019 | |
The Functional Art: An Introduction to Information Graphics and Visualization | Cairo, Alberto | Visualization - Design | 2012 | |
Open Data Science Masters Curriculum | Corthell, Clare | Data Science as a Field | 2015 | |
Git Internals | Chacon, Scott | Programming - Version Control | 2008 | |
A Course in Machine Learning | Daumé III, Hal | Machine Learning | 2015 | |
Bayesian Methods for Hackers: Probabilistic Programming and Bayesian Inference | Davidson-Pilon, Cameron | Machine Learning - Bayesian Methods | 2015 | |
Mathematics for Machine Learning | Deisenroth, Marc Peter, A Aldo Faisal, and Cheng Soon Ong | Mathematics | 2019 | |
Composing Programs | DeNero, John | Programming - Python | Unknown | |
OpenIntro Statistics | Diez, David, Christopher Barr, and Mine Cetinkaya-Rundel | Probability and Statistics | 2015 | |
50 Years of Data Science | Donoho, David | Data Science as a Field | 2015 | |
Think Bayes: Bayesian Statistics Made Simple | Downey, Allen | Machine Learning - Bayesian Methods | 2013 | |
Think Complexity: Complexity Science and Computational Modeling | Downey, Allen | Programming - Python | 2012 | |
Think Python: How to Think Like a Computer Scientist | Downey, Allen | Programming - Python | 2015 | |
Think Stats: Probability and Statistics for Programmers | Downey, Allen | Probability and Statistics | 2014 | |
Pattern Classification | Duda, Richard O., Peter E. Hart, and David G. Stork | Machine Learning | 2012 | |
Now You See It: Simple Visualization Techniques for Quantitative Analysis | Few, Stephen | Visualization - Design | 2009 | |
The Elements of Statistical Learning (2nd Edition) | Friedman, Jerome, Trevor Hastie, and Robert Tibshirani | Machine Learning | 2009 | |
Hands-On Machine Learning with Scikit-Learn and TensorFlow | Géron, Aurélien | Machine Learning | 2017 | |
Oracle/SQL Tutorial | Gertz, M. | Programming - SQL | 2000 | |
Deep Learning | Goodfellow, Ian, Yoshua Bengio, and Aaron Courville | Machine Learning - Deep Learning and Neural Networks | 2016 | |
Grinstead and Snell’s Introduction to Probability | Grinstead, Charles and James Snell | Probability and Statistics | 2006 | |
R for Data Science | Grolemund, Garrett, and Hadley Wickham | Programming - R | 2017 | |
Community Calculus | Guichard, David | Mathematics - Calculus | 2016 | |
Calculus 1, 2, and 3. 3rd Edition | Hartman, Gregory | Mathematics - Calculus | 2015 | |
Data Visualization: A Practical Introduction | Healy, Kieran | Visualization - Design | 2018 | |
Linear Algebra | Hefferon, Jim | Mathematics - Linear Algebra | 2006 | |
Causal Inference: What If | Hernan, Miguel and James Robins | Other | 2019 | |
An Introduction to Statistical Learning | James, Gareth, Daniela Witten, Trevor Hastie, and Robert Tibshirani | Machine Learning | 2013 | |
Data Science at the Command Line | Janssens, Jeroen | Other | 2019 | |
A Brief Introduction to Neural Networks | Kriesel, David | Machine Learning - Deep Learning and Neural Networks | 2007 | |
Principles and Techniques of Data Science | Lau, Sam, Joey Gonzalez, and Deb Nolan | Machine Learning | Unknown | |
Linear Algebra and Its Applications | Lay, David | Mathematics - Linear Algebra | 2006 | |
Notes on Diffy Qs: Differential Equations for Engineering | Lebl, Jiří | Mathematics - Differential Equations | 2014 | |
Mining of Massive Datasets | Leskovec, Jure, Anand Rajaraman, and Jeffrey Ullman | Programming - MapReduce | 2014 | |
Data-Intensive Text Processing with MapReduce | Lin, Jimmy, and Chris Dyer | Programming - MapReduce | 2010 | |
Information Theory, Inference and Learning Algorithms | MacKay, David | Probability and Statistics - Information Theory | 2003 | |
D3 Tips and Tricks | Maclean, Malcolm | Visualization - D3 | 2013 | |
Calculus 1, 2, and 3. 2nd Edition | Marsden, Jerrold and Alan Weinstein | Mathematics - Calculus | 1985 | |
Foundations of Machine Learning | Mohri, Mehryar, Afshin Rostamizadeh, and Ameet Talwalkar | Machine Learning | 2018 | |
Interpretable Machine Learning | Molnar, Christoph | Machine Learning | 2019 | |
Interactive Data Visualization for the Web | Murray, Scott | Visualization - D3 | 2013 | |
MySQL Tutorial | MySQL | Programming - SQL | 1997 | |
Learning Statistics with R: A Tutorial for Psychology Students and Other Beginners (version 0.5) | Navarro, Daniel | Programming - R | 2015 | |
Deep Learning Tutorial | Ng, Andrew | Machine Learning - Deep Learning and Neural Networks | Unknown | |
Neural Networks and Deep Learning | Nielsen, Michael | Machine Learning - Deep Learning and Neural Networks | 2016 | |
The Quest for Artificial Intelligence: A History of Ideas and Achievements | Nilsson, Nils | Data Science as a Field | 2010 | |
The Not So Short Introduction to LaTeX 2ε | Oetiker, Tobias | Programming - Typesetting | 2016 | |
The Python Tutorial | Python Software Foundation | Programming - Python | 2017 | |
Python Machine Learning, Second Edition | Raschka, Sebastian | Machine Learning | 2017 | |
Gaussian Processes for Machine Learning | Rasmussen, Carl Edward, and Christopher Williams | Probability and Statistics - Gaussian Processes | 2006 | |
The Hitchhiker’s Guide to Python | Reitz, Kenneth and Tanya Schlusser | Programming - Python | 2016 | |
A First Course in Probability | Ross, Sheldon | Probability and Statistics | 2014 | |
From Python to Numpy | Rougier, Nicholas | Programming - Python | 2017 | |
Python for Informatics | Severance, Charles | Programming - Python | Unknown | |
Understanding Machine Learning: From Theory to Algorithms | Shalev-Shwartz, Shai and Shai Ben-David | Machine Learning | 2014 | |
Advanced Data Analysis from an Elementary Point of View | Shalizi, Cosma | Machine Learning | Unknown | |
Learn Python the Hard Way: A Very Simple Introduction to the Terrifyingly Beautiful World of Computers and Code | Shaw, Zed | Programming - Python | 2013 | |
RegExr | Skinner, Grant | Programming - Regular Expressions | Unknown | |
An Interactive Neural Network Playground | Smilkov, Daniel and Shan Carter | Machine Learning - Deep Learning and Neural Networks | Unknown | |
Calculus: Early Transcendentals. 8th Edition | Stewart, James | Mathematics - Calculus | 2015 | |
The State of Data Science | Stitch | Data Science as a Field | 2016 | |
Calculus: MIT Open Courseware | Strang, Gilbert | Mathematics - Calculus | 1991 | |
Immersive Linear Algebra | Ström, Jacob, Kalle Åström, and Tomas Akenine-Möller | Mathematics - Linear Algebra | 2019 | |
Reinforcement Learning: An Introduction | Sutton, Richard, and Andrew Barto | Machine Learning - Reinforcement Learning | 2018 | |
Data Science University Programs | Swanstrom, Ryan | Data Science as a Field | 2015 | |
Automate the Boring Stuff with Python: Practical Programming for Total Beginners | Sweigart, Al | Programming - Python | 2016 | |
Linear Algebra Done Wrong | Treil, Sergei | Mathematics - Linear Algebra | 2004 | |
Elementary Differential Equations | Trench, William | Mathematics - Differential Equations | 2013 | |
The Visual Display of Quantitative Information | Tufte, Edward | Visualization - Design | 2001 | |
Introduction to Applied Linear Algebra | Vandenberghe, L. | Mathematics - Linear Algebra | 2017 | |
Python Data Science Handbook | VanderPlas, Jake | Programming - Python | 2016 | |
Scipy Lecture Notes | Varoquaux et al. | Programming - Python | 2017 | |
An Introduction to R | Venables, W., and D. Smith | Programming - R | 2017 | |
Advanced R | Wickham, Hadley | Programming - R | Unknown | |
A Visual Introduction to Machine Learning | Yee, Stephanie, and Tony Chu | Machine Learning | Unknown |
Recommendations are indicated with a star.
Title | Instructor | Designation | University | Year | |
---|---|---|---|---|---|
A Mathematics Course for Political and Social Researchers | Siegel | None | Duke University | 2014 | |
Advanced Machine Learning | Adams | CS 281 | Harvard University | 2013 | |
Advanced Machine Learning | Gogate | CS 7301 | University of Texas at Dallas | 2017 | |
Advanced Topics in Machine Learning | Krause | CS 253 | California Institute of Technology | 2010 | |
Applied Machine Learning | Mueller | COMS W4995 | Columbia | 2017 | |
Artificial Intelligence | Berenson | CS 534 | Worcester Polytechnic Institute | 2015 | |
Artificial Intelligence | Winston | OpenCourseware | Massachusetts Institute of Technology | 2010 | |
Computational Linear Algebra for Coders | Thomas | None | University of San Francisco | 2017 | |
Computational Statistics in Python | Chan | STA 663 | Duke University | 2015 | |
Computational Statistics in Python | Chan | STA 663 | Duke University | 2017 | |
Convolutional Neural Networks for Visual Recognition | Li | CS 231n | Stanford University | 2017 | |
Computer Vision | Tomasi | COMPSCI 527 | Duke University | 2019 | |
Machine Learning | Raschka | STAT 479 | University of Wisconsin-Madison | 2018 | |
Data Mining and Analysis | Walther | Stats 202 | Stanford University | 2017 | |
Data Mining and Machine Learning | Arnold | STAT 365/665 | Yale University | 2016 | |
Data Science | Irizarry | CS 109 | Harvard University | 2014 | |
Deep Unsupervised Learning | Abbeel | CS 294-158 | University of California, Berkeley | 2019 | |
Fast.ai NLP | Thomas | Fast.ai NLP | University of San Francisco | 2019 | |
Introduction to Artificial Intelligence | Klein | CS 188 | University of California, Berkeley | 2014 | |
Introduction to Artificial Intelligence | Konidaris | CPS 270 | Duke University | 2016 | |
Introduction to Data Science | Lex | CS 5963 | University of Utah | 2016 | |
Introduction to Data Science for Public Policy | Chen | PPOL 670 | Georgetown University | 2018 | |
Introduction to Machine Learning | Shewchuk | CS 189 | University of California, Berkeley | 2017 | |
Introduction to Machine Learning | Srihari | CSE 574 | University at Buffalo | 2017 | |
Introduction to Machine Learning | Vishwanathan | CS 590 | Purdue University | 2010 | |
Introduction to Machine Learning and Data Mining | Harrington | COMP 135 | Tufts University | 2016 | |
Learning From Data | Abu-Mostafa | MOOC | California Institute of Technology | 2010 | |
Machine Learning | Dietterich | CS 534 | Oregon State University | 2005 | |
Machine Learning | Domingos | CSE 446 | University of Washington | 2014 | |
Machine Learning | Domingos | CSE 546 | University of Washington | 2014 | |
Machine Learning | Fern | CS 534 | Oregon State University | 2015 | |
Machine Learning | Guestrin | 10-601 | Carnegie Mellon University | 2007 | |
Machine Learning | Jamieson | CSE 546 | University of Washington | 2017 | |
Machine Learning | Kakade | CSE 546 | University of Washington | 2016 | |
Machine Learning | Mitchell | 10-601 | Carnegie Mellon University | 2015 | |
Machine Learning | Ng | CS 229 | Stanford University | Unknown | |
Machine Learning | Shavlik | CS 760 | University of Wisconsin | 2010 | |
Machine Learning | Wang | CS 6140 | Northeastern University | 2017 | |
Machine Learning | Weinberger | CS 4780 | Cornell University | 2017 | |
Machine Learning | Zisserman | C19 | Oxford University | 2015 | |
Machine Learning and Data Mining | Ihler | CS 178 | University of California, Irvine | 2011 | |
Machine Learning for Data Science | Paisley | COMS W4721 | Columbia | 2017 | |
Machine Learning for Data Science | Salleb-Aouissi | COMS 4721 | Columbia | 2014 | |
Mining Massive Data Sets | Leskovec | CS 246 | Stanford University | 2019 | |
Mining of Massive Data Sets | Ullman | CS 246 | Stanford University | 2017 | |
Natural Language Processing with Deep Learning | Manning | CS 224n | Stanford University | 2019 | |
Practical Deep Learning for Coders, v3 | Howard | Fast.ai v3 | Fast.ai | 2019 | |
Principles of Machine Learning | Bradbury | IDS 705 | Duke University | 2019 | |
Reinforcement Learning | Silver | COMPM 050 | University College London | 2015 | |
Statistical Computing for Scientists and Engineers | Zabaras | None | University of Notre Dame | 2018 | |
Structure and Interpretation of Computer Programs | DeNero | CS 61A | University of California, Berkeley | 2019 | |
Tensorflow for Deep Learning Research | Huyen | CS 20SI | Stanford University | 2017 | |
Theories of Deep Learning | Donoho | STATS 385 | Stanford University | 2017 |
Name | Topic | Description |
---|---|---|
Anaconda Python Distribution | Python | Distribution for Python with package manager |
Arxiv Sanity Preserver | Research Searching | Time-saving tool for searching arxiv.org |
arXiv.org | Research Searching | Open access scholarly e-prints |
Authorea | Collaborative Writing | Online scientific document collaboration |
Bokeh | Python | Interactive plotting tools |
cmder | Command Line | Console emulator for Windows |
Colorgorical | Color Palette Generator | Online color palette generator |
CommonMark | Markdown Language | Markdown specification |
Convnet.js | Machine Learning | Simple deep learning package for javascript |
D3.js | Interactive visualization | D3 (or Data Driven Documents) is an open-source JavaScript library for producing dynamic, interactive data visualizations in web browsers. Since this is based in JavaScript, visualizations are entirely customizable, but do require significant skill to use effectively |
Dillinger.io | Markdown Language | Markdown editor |
Draw.io | Graphics | Online graphics platform |
Explain Shell | Command Line | Search command lines to see the help text that matches each argument |
Fabric | Python | Command line automation tool |
Fast.ai | Machine Learning | The fastai library simplifies training fast and accurate neural nets using modern best practices |
Git | Version Control | Open source distributed version control system - the de facto standard |
Github | Version Control | Web hosting for git repositories |
Github Pages | Web Publishing | Host web pages from Github repositories |
Google Colab | Compute | Colab allows you to write and execute Python code in your browser with free access to GPUs |
Google Earth Engine | Geospatial data | Geospatial data analysis tool |
Google Scholar | Research Searching | Search tool for scholarly literature |
Google Scholar Top Publications | Publication Rankings | Searchable list of conference and journal rankings by field |
Google Seedbank | Machine Learning | Machine learning idea generation tool |
Google Style Guide | Programming | Style guide for Python, R, Shell, HTML, CSS, Javascript, Java, and C++ |
Guild.ai | Experiment tracking | Run, track, and compare machine learning experiments |
Hackmd.io | Markdown Language | Markdown editor |
Jupyter Notebook | Programming | This application allows you to create and share documents that contain live code, equations, visualizations and explanatory text |
Keras | Machine Learning | High-level neural network Python library enabling fast experimentation |
Latex for PowerPoint | Math Editor | Use Latex to write and edit math in PowerPoint |
Mathcha.io | Math Editor | Online math editor that is what-you-see-is-what-you-get; also allows latex syntax |
Microsoft Academic | Research Searching | Search tool for scholarly literature |
Mode | SQL | Simple SQL query and visualization tool |
Oh my zsh | Command Line | Command line themes and plugins, better tab completion and more |
Open source license guide | License | A guide to choosing an open source license |
OpenAI Gym | Reinforcement Learning | A toolkit for developing and comparing reinforcement learning algorithms |
OpenAI Universe | Reinforcement Learning | A toolkit for developing and comparing reinforcement learning algorithms, particularly video games |
Our World in Data Grapher | Visualization | Open source tools for data visualization |
Overleaf | Collaborative Writing | Online LaTeX collaboration |
Plot.ly for Python | Python | Interactive plotting tools |
PyFormat | Python | Explanation of formatting in Python |
Pytorch | Machine Learning | Open source deep machine learning framework |
RegExr | Regular Expressions | Interactive regular expression playground |
Reinforce JS | Machine Learning | Reinforcement learning package for Javascript |
Rodeo | Python | A Python integrated development environment |
SACRED | Experiment tracking | Sacred is a tool to help you configure, organize, log and reproduce experiments |
Scimago Journal Rankings | Publication Rankings | Searchable list of conference and journal rankings by field |
Scrapy | Web Scraping | Scrape data from the web |
Scrollama | Interactive visualization | Scrollers for interactive web visualizations |
Semantic Sanity | Research Searching | Time-saving tool for searching arxiv.org |
Semantic Scholar | Research Searching | AI-backed search engine for scientific journals |
Snorkel | Machine Learning | Programmatically building and managing training data |
So You Want to Build A Scroller | Interactive visualization | Scrollers for interactive web visualizations |
Stackedit.io | Markdown Language | Markdown editor |
Style Guide for Python Code | Python | Programming style guide |
Tableau | Data visualization | Graphical user interface-based data visualization tool |
Tabula | Data Scraping | Extract data from tables |
Tensorflow | Machine Learning | Open source deep machine learning framework |
Tensorflow JS | Machine Learning | Deep learning package for Javascript |
Tensorflow Playground | Neural Networks | Interactive neural network playground |
Tensorwatch | Debugging | Deep learning debugging tool from Microsoft |
The Neural Network Zoo | Neural Networks | A graphical cheat sheet for neural network architectures and acronyms |
Tmux | Programming | Terminal multiplexer |
Tuna | Python | Python profiler and performance analysis tool |
Weights and biases | Experiment tracking | Record metrics, visualize training and share findings |
Zotero | Reference Management | Reference and citation management system for research |
Name | Author | Organization | Description |
---|---|---|---|
Kernel Functions | Abu-Mostafa, Yaser | Caltech | Description of kernel functions and how they are used |
Gradient Boosting and Machine Learning | Hastie, Trevor | H2O.ai | Discussion of ensemble learning including random forests and gradient boosting |
Natural Language Processing | Jurafsky, Dan | Stanford | Video series on natural language processing (text analysis) |
Machine Learning | Klein, Dan and Pieter Abbeel | Berkeley | Artificial Intelligence and Reinforcement Learning lectures |
Nuts and Bolts of Applying Deep Learning | Ng, Andrew | Deep Learning School | Andrew Ng speaks on advice for those looking to enter the field of machine learning |
Essence of Linear Algebra | Sanderson, Grant | 3Blue1Brown | Video series on a geometric interpretation of linear algebra concepts |
Neural Networks | Sanderson, Grant | 3Blue1Brown | Introductory video series on neural networks |
Taylor Series | Sanderson, Grant | 3Blue1Brown | Clear description of Taylor Series |
Learning to See | Welch, Stephen | Welch Labs | Intuitive, visual explanation of machine learning |
Neural Networks Demystified | Welch, Stephen | Welch Labs | Visual introduction to neural networks |
Support Vector Machines | Winston, Patrick | MIT Open Courseware | An exceedingly lucid explanation of support vector machines - intuitively and mathematically |