# How do I get started?

I’m frequently asked by students, especially neuroscience students, how they should go about improving their {programming, computing, statistics} skills. This page is partly an answer to that. It’s mostly my opinions, with no claim to being comprehensive. The wonderful upside of learning to program in the internet age is that there is so much information and so many options that you don’t have to go with my recommendations.

# Learning to program

• My advice here pertains to scientific programming. If you want to learn web development or build device drivers, this may not be for you.
• StackOverflow. If you have ever used a search engine to look up a programming question, you have probably run across StackOverflow. The site uses a question-and-answer format, with accepted answers clearly marked and the best answers upvoted. The site can be a bit intimidating to use (there are a lot of guidelines for posting a good question), but it’s probably the best programming resource on the internet for passive search. If you’re completely new to programming, it won’t teach you, but for fixing well-defined problems, there’s no substitute.[1]

• Use whatever the people around you are using. It’s frustrating enough to learn programming; take advantage of local expertise to help you. If you’re struggling to learn functions and if statements, that can be done in pretty much any modern language, and the concepts will carry over to most others.
• That said, here’s my order of preference:
1. Python: Because everything. Python is used for scripting, building and scraping websites, and pretty much anything else where performance isn’t critical. It is also the de facto standard in data science and machine learning, and it is comparatively easy to learn. Python is the new BASIC. What’s more, Python skills actually help on a resume. I’ll talk more about recommended packages/setup below.
2. Julia: This is mostly idiosyncratic to me. As I’ve written elsewhere, I think Julia has a very bright future in scientific computing, though at the time of this writing (June 2018), it’s still in development. Why Julia? Because it combines the ease of use of interpreted languages like Python with performance closer to C or FORTRAN. The downside is that the ecosystem of packages and other niceties surrounding the language is newer and thus comparatively weaker than what you might find with Python or R.
3. R: I use R a lot for data analysis. I use R the language only in passing. The R ecosystem is fantastic, and statisticians code, think, and publish in R. If you’re using anything else, the statistical methods available to you take a big hit. Plotting and data wrangling are also top-notch. I recommend RStudio plus everything by Hadley Wickham.
4. Matlab: If you must. Matlab is pervasive in neuroscience and engineering, and it provides a decent ecosystem (professionally supported toolboxes, a decent IDE and debugger) out of the box. Provided, that is, your institution pays the substantial price tag. My complaints about Matlab mostly center on: (a) its painful ergonomics as a programming language[2] (I just don’t find it fun to use); and (b) its absence in the software and data science industries (Matlab skills don’t mean much when applying to those jobs).

I’ll be vague here for one reason: there are too many choices, and none is a clear winner. All you really want at this initial phase is an acquaintance with basic programming: variables, control flow, functions, etc.
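To make "variables, control flow, functions" concrete, here is a minimal Python sketch that touches all three. The function name, the spike counts, and the threshold of 5 spikes/s are made up for illustration; any introductory course will cover the same ideas with its own examples.

```python
def mean_firing_rate(spike_counts, duration_s):
    """Return the mean firing rate, in spikes per second, across trials."""
    if not spike_counts or duration_s <= 0:
        raise ValueError("need at least one trial and a positive duration")
    total = 0
    for count in spike_counts:  # control flow: loop over the trials
        total += count
    return total / (len(spike_counts) * duration_s)


# Variables: hypothetical spike counts, one entry per 2-second trial.
counts = [12, 15, 9, 20]
rate = mean_firing_rate(counts, duration_s=2.0)

# Control flow: branch on the result (the threshold is arbitrary).
label = "high" if rate > 5 else "low"
print(f"{rate:.1f} spikes/s ({label})")
```

If you can read this and predict what it prints, you have roughly the level of acquaintance this phase is aiming for.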

Some people prefer books here, but in the cases of Python and R there are also lots of free video series and online courses. Which you choose doesn’t matter so long as:

• You devote serious time to learning. Programming is a skill and cannot be crammed.
• You actually write code. This is a bit like learning a foreign language: you have to speak to get better. No passive learning. It really helps to have a project here, even a side project, so you have something to work toward.

If you’re coming to Python from a different language and want a quick overview, I highly recommend Jake Vanderplas’s Whirlwind Tour of Python. It’s perhaps a little more than what many scientists need to know to get started, but it’s free and excellent.

• Once you’ve gotten a basic acquaintance with Python and have worked on your scientific programming skills, it’s worth going back to invest in more advanced aspects of the language. This pays dividends both in understanding others’ code and in writing reusable libraries of your own. I particularly recommend Fluent Python.

# Python for Data Science

Most programming material online is targeted either at students learning their first programming language or professionals learning a new tool for software development. However, programming for science (writing code that runs, simulates, or analyzes experiments) carries its own set of challenges and is distinct from general-purpose programming. That’s why learning to program in Python is distinct from learning “scientific Python,” the suite of packages, tools, and practices that surround Python as used in (data) science.

This is why I make every new student in my lab read (cover-to-cover) Jake Vanderplas’s Python Data Science Handbook. The book covers exactly the toolset we use: IPython, Jupyter, NumPy, SciPy, Pandas, Matplotlib, and Scikit-Learn. I don’t know of a better, more comprehensive introduction to modern scientific Python.
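As a rough illustration of what "code that simulates and analyzes an experiment" can look like, here is a small sketch. It deliberately uses only the standard library so it runs anywhere; in practice the NumPy and Pandas tools named above would replace the hand-written lists and loops. The reaction-time experiment, its parameters, and the function name are all invented for the example.

```python
import random
import statistics


def simulate_reaction_times(n_trials, mean_ms, sd_ms, seed):
    """Draw n_trials Gaussian reaction times (ms) from a seeded generator."""
    rng = random.Random(seed)  # seeding makes the simulation reproducible
    return [rng.gauss(mean_ms, sd_ms) for _ in range(n_trials)]


# Hypothetical experiment: the treatment is assumed to speed responses by 30 ms.
control = simulate_reaction_times(200, mean_ms=350.0, sd_ms=40.0, seed=0)
treatment = simulate_reaction_times(200, mean_ms=320.0, sd_ms=40.0, seed=1)

# Analysis: difference in means and its standard error.
diff = statistics.mean(control) - statistics.mean(treatment)
se = (
    statistics.variance(control) / len(control)
    + statistics.variance(treatment) / len(treatment)
) ** 0.5

print(f"estimated speed-up: {diff:.1f} ms (standard error {se:.1f} ms)")
```

Simulating data with a known effect and checking that the analysis recovers it is a useful habit before touching real data.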

# Statistics

Professional disclaimer: I recommend a good grounding in statistical theory. It’s worth the investment.

But we’re all busy people. What I usually end up recommending to students:

• Data Analysis Using Regression and Multilevel/Hierarchical Models. This was my first introduction to applied Bayesian analysis. Surprisingly readable for students without much statistical background and teaches an approach to modeling data that I like and advocate. As a bonus, covers Markov Chain Monte Carlo sampling tools like Stan that are necessary in practice.
• A First Course in Bayesian Statistical Methods. This is the book they use for the intro Bayesian class at Duke. This is really for students who are investing in serious stats education. Finishing this one may not leave you quite ready to tackle your real data, but you will have a solid foundation to build on.
• All of Statistics. A really nice single-volume introduction to statistics. A bit of a steep learning curve for the less mathematically inclined, but worth a mention.
• For Duke students interested in the problem of actually implementing statistical models and methods in code, I highly recommend Cliburn Chan’s STA 663, typically offered each spring. Teaches all the same software tools my lab uses.
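For a feel of what Markov Chain Monte Carlo sampling does, here is a minimal sketch of a random-walk Metropolis sampler. It is not Stan and is not taken from any of the books above; it is a hand-written toy that assumes coin-flip data (7 heads in 10 flips) and a flat prior on the coin's bias, and every name in it is hypothetical. With a flat prior the exact posterior is Beta(8, 4), whose mean is 2/3, so the sampler's answer can be checked.

```python
import math
import random


def log_posterior(theta, heads, flips):
    """Unnormalized log posterior for a coin's bias under a flat prior."""
    if not 0.0 < theta < 1.0:
        return -math.inf  # zero probability outside (0, 1)
    return heads * math.log(theta) + (flips - heads) * math.log(1.0 - theta)


def metropolis(heads, flips, n_samples, step=0.2, seed=0):
    """Random-walk Metropolis: propose a nearby value, accept or reject it."""
    rng = random.Random(seed)
    theta = 0.5
    current = log_posterior(theta, heads, flips)
    samples = []
    for _ in range(n_samples):
        proposal = theta + rng.gauss(0.0, step)
        candidate = log_posterior(proposal, heads, flips)
        # Accept with probability min(1, posterior ratio).
        if rng.random() < math.exp(min(0.0, candidate - current)):
            theta, current = proposal, candidate
        samples.append(theta)  # a rejected proposal repeats the old value
    return samples


samples = metropolis(heads=7, flips=10, n_samples=20000)
kept = samples[2000:]  # discard early samples as burn-in
posterior_mean = sum(kept) / len(kept)
print(f"posterior mean of the bias: {posterior_mean:.3f}")
```

Tools like Stan use far more sophisticated samplers and handle models with many parameters, but the accept-or-reject idea is the same.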

# Machine Learning: Classic

There are lots of great references. The current deep learning phase notwithstanding, machine learning is actually a very broad field, and what is old now will eventually be new again. Some references worth checking out:

# Machine Learning: Deep Learning

So Deep Learning (aka neural networks) is eating the world. Briefly:

• Read the Deep Learning Book. It’s even free online from the website. The field is moving incredibly rapidly, but this is now the standard introduction.
• For online classes, we’ve had students take the Stanford convnets class and Coursera’s Deep Learning Specialization. These are pretty basic but nice for people getting started.[3]
• We use TensorFlow in house for a few reasons:
  • TensorBoard
  • Support for compiling and deploying models to production. (This is a minor but important consideration for us, as we are not doing pure machine learning research. We have models we might want to deploy to others who aren’t going to install TF.)
  • Educational support. TF is used as the tool of choice by Coursera’s Deep Learning Specialization, among others.
  • We had legacy Theano code to port, and Theano and TF have the same design philosophy and vocabulary.
  • In the future, TF will better support imperative programming with eager mode. Swift for TensorFlow is also interesting but early-stage.
• If I were a young graduate student doing ML research, I would probably opt for PyTorch, which allows imperative programming and is thus much easier to use and debug than TensorFlow. However, TF is rapidly catching up in this area (see last point above).
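The difference between imperative ("eager") and graph-style execution is easier to see in a toy example. The sketch below uses plain Python only: the `Node` class and `placeholder` function are invented for illustration and are not TensorFlow or PyTorch APIs. It only mimics the general idea of building a description of a computation first and running it later.

```python
# Imperative style: each line runs immediately, so intermediate values
# can be printed or inspected in a debugger as soon as they are computed.
x_val, w_val, b_val = 2.0, 3.0, 1.0
hidden = x_val * w_val        # already a number: 6.0
eager_result = hidden + b_val


class Node:
    """One operation in a toy deferred-execution graph (hypothetical class)."""

    def __init__(self, op, inputs=(), name=None):
        self.op = op
        self.inputs = inputs
        self.name = name

    def __mul__(self, other):
        return Node("mul", (self, other))

    def __add__(self, other):
        return Node("add", (self, other))

    def run(self, feed):
        """Evaluate this node, given a dict of values for the placeholders."""
        if self.op == "placeholder":
            return feed[self.name]
        left, right = (node.run(feed) for node in self.inputs)
        return left * right if self.op == "mul" else left + right


def placeholder(name):
    return Node("placeholder", name=name)


# Graph style: these lines compute nothing yet; they only record structure.
x, w, b = placeholder("x"), placeholder("w"), placeholder("b")
graph_output = x * w + b      # a Node, not a number

# Numbers appear only when the graph is run with concrete inputs.
deferred_result = graph_output.run({"x": 2.0, "w": 3.0, "b": 1.0})
print(eager_result, deferred_result)
```

In the graph style, an error or a surprising value surfaces only at `run` time, far from the line that defined the operation, which is one reason imperative code tends to be easier to debug.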

# Notes

1. Note that information on StackOverflow tends to be proportional to the popularity of a given tool. So information on R and Python is extensive, while Matlab has comparatively less support.

2. To be fair, Matlab is now an old language and was designed to ease the burden of engineers who were coding C and FORTRAN for a living. By those standards, it is highly successful, and new features are being added to the language all the time.

3. Keep in mind that these classes are great at introducing the material, but they tend to be very light on theory and more focused on simple applications. While they’re a great starting point for high school students, undergraduates, or graduate students in other fields, students interested in machine learning research will be expected to engage with these ideas at a much higher mathematical level.