New to data science?

We're still working out what might be most useful here for people who are new to working with data. Let us know what you think of these materials and what else you'd like to see! Please add a support ticket or email Arcus Education to let us know what we can improve or suggest additional topics.

First, here are a few things that might interest you, beginning with some introductory modules from the DART (Data and Analytics for Research Training) program.

DART includes dozens of data science modules, each 1 hour or less in duration, with a narrow focus and clear learning objectives. They are asynchronous, so you can take them at any time!

Arcus Education's DART program is the result of an NIH grant aimed at educating biomedical researchers. If you'd like to learn more about DART, fill out our interest form or email us at dart@chop.edu.

Training Modules:

To begin learning about data science, we recommend starting with these modules:

Reproducibility, Generalizability, and Reuse
How to Troubleshoot
Learning to Learn Data Science

For a deeper introduction to biomedical data science, see the full 15-module pathway:

Order	Module	Description	Estimated Time
1	Reproducibility, Generalizability, and Reuse	This module provides learners with an approachable introduction to the concepts and impact of research reproducibility, generalizability, and data reuse, and how technical approaches can help make these goals more attainable.	60 min
2	How to Troubleshoot	Learning to use technical methods like coding and version control in your research inevitably means running into problems. Learn practical methods for troubleshooting and moving past error codes and other difficulties.	30 min
3	Learning to Learn Data Science	Discover how learning data science is different than learning other subjects.	20 min
4	Demystifying Geospatial Data	This module is a brief introduction to geospatial (location) data.	15 min
5	Omics Orientation	This module provides a brief introduction to omics and its associated fields.	15 min
6	Demystifying SQL	SQL is a relational database solution that has been around for decades. Learn more about this technology at a high level, without having to write code.	40 min
7	Demystifying Machine Learning	An approachable and practical introduction to machine learning for biomedical researchers.	60 min
8	Demystifying Large Language Models	Learn about large language models (LLM) like ChatGPT.	60 min
9	Demystifying Python	This module introduces the Python programming language, explores why Python is useful in research, and describes how to download Python and Jupyter.	20 min
10	Demystifying Regular Expressions	Learn about pattern matching using regular expressions, or regex.	30 min
11	Citizen Science	This is an overview of citizen science for biomedical researchers.	45 min
12	Demystifying Containers	Containers can be a useful tool for reproducible workflows and collaboration. This module describes what containers are, why a researcher might want to use them, and what your options are for implementation.	20 min
13	Intro to Version Control	An introduction to what version control systems do and why you might want to use one.	15 min
14	Directories and File Paths	In this module, learners will explore what a directory is and how to describe the location of a file using its file path.	15 min
15	Research Data Management Basics	Learn the basics about research data management.	40 min

Beyond the materials from the NIH grant, we also suggest other articles and resources, both ones we've created in Arcus and ones we recommend from the larger data science community.

Other Resources:

If you're new to working with data, Arcus Education provides a quickstart guide for data science.
How open is your science? Take the quiz.
What is p-hacking and why does it matter? Check out this p-hacker app with fake data and another p-hacker app using real data to see a demonstration.

Pivoting to data science

Pivoting to a data science methodology, one that prioritizes writing code and using version control, can be challenging! Our Reproducibility training module gives an overview of why this approach is helpful. That can give you additional motivation when you're struggling with common challenges like:

Typing computer commands when you're used to point and click
Programming when you're used to spreadsheet data analysis
Version control and text files when you're used to MS Office
Learning a lot of new things at once when you aren't certain how to prioritize and schedule learning tasks
New rewards and new frustrations

Literate statistical programming resources you might find useful include:

Literate Statistical Programming
A video tutorial on R-Markdown and our R-Markdown 101 guide
Code Readability

Problems in the data

If you aren't used to working with raw data that hasn't been pre-groomed, you might feel overwhelmed. Here are a few things to consider that can help!

Clinical data at CHOP: What is where?
Data dictionaries and variable names: make sure you watch the SQL training videos in your lab, where we go over the use of dd_field and dd_table, to help you work with data dictionaries.
Errors, outliers, and typos do exist, because people make mistakes in charting, and the medical record system isn't foolproof. What will you do with unexpected or unlikely data?
Repeated versions of a variable: here's an article that looks at repeated variable versions in R.
Missing values: None, NULL, NaN, etc. may also be a challenge. Check out this article on missing data.
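
As a rough illustration of the last two points, here is a minimal Python sketch that counts missing values and flags implausible ones. Everything in it is an assumption made for the example: the `MISSING_STRINGS` set, the `is_missing` helper, the made-up heights, and the plausibility bounds are hypothetical and do not come from any Arcus dataset. Your own data may mark missing values differently.

```python
import math

# Hypothetical strings that sometimes stand in for "missing" in exported data.
MISSING_STRINGS = {"", "NULL", "NA", "NaN", "None"}


def is_missing(value):
    """Return True if value looks like a missing-data marker."""
    if value is None:
        return True
    if isinstance(value, float) and math.isnan(value):
        return True
    if isinstance(value, str) and value.strip() in MISSING_STRINGS:
        return True
    return False


# Made-up heights in centimeters, for illustration only.
heights_cm = [102.5, None, float("nan"), "NULL", 98.0, 1020.0]

# Keep the values that are actually present and count the rest.
present = [float(v) for v in heights_cm if not is_missing(v)]
n_missing = len(heights_cm) - len(present)

# Flag values outside a plausible range instead of silently dropping them.
# The bounds are an arbitrary assumption for this example.
implausible = [v for v in present if not 30.0 <= v <= 250.0]
```

Flagging rather than deleting keeps the decision about unexpected values visible, so you can document what you did with them.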

Framing your questions so the computer will understand

Sometimes, facing an empty screen without any code yet written can feel like an unwieldy problem without any handholds. How do you even get started? We encourage you to think about:

Breaking ideas down into steps (for example, use pseudocode, a fancy word for "steps written out in fake code", to plan your data selection and data analysis)
Finding where to start (what fields already exist in the data vs. what will you need to refine/calculate/reformat)
Learning to work with your lab directory structure. Resources that might help include our short Directories and File Paths course and the article File Paths for Data Scientists.
Conventions for naming files. You want to be on the same page as the rest of the team and watch out for things to avoid (like spaces in file names, which can be annoying in some environments).
Translating ideas into questions, and questions into tests: what is your algorithm to get from data to publication?
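
One way to picture these ideas together is to write the pseudocode steps as comments first, then fill in real code under each one. The sketch below does that in Python with made-up visit records. The field names, the `safe_file_name` helper, and the `results` folder are all hypothetical, chosen only for illustration.

```python
from pathlib import Path
from statistics import mean

# Made-up records for illustration only.
visits = [
    {"clinic": "north", "age_years": 4.0},
    {"clinic": "north", "age_years": 6.0},
    {"clinic": "south", "age_years": 10.0},
    {"clinic": "south", "age_years": None},
]

# Step 1: keep only the rows where age is recorded
complete = [row for row in visits if row["age_years"] is not None]

# Step 2: group the ages by clinic
ages_by_clinic = {}
for row in complete:
    ages_by_clinic.setdefault(row["clinic"], []).append(row["age_years"])

# Step 3: calculate the mean age for each clinic
mean_age = {clinic: mean(ages) for clinic, ages in ages_by_clinic.items()}


# Step 4: choose an output file name with no spaces
def safe_file_name(name):
    """Lowercase a name and replace runs of whitespace with underscores."""
    return "_".join(name.lower().split())


# Build the path from parts; nothing is written to disk in this sketch.
output_path = Path("results") / safe_file_name("Mean Age By Clinic.csv")
```

Writing the comments before the code gives you a place to start on an empty screen, and each step can be checked on its own before moving to the next.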

Getting help and troubleshooting

How much troubleshooting is normal? It's tricky. When you're first getting started, some of your questions and challenges are the kinds of things that a more experienced user could sort out in a moment or two. That's why it makes sense to reach out and ask for help in your very early steps, before you get too frustrated and exhausted. But as your research advances, you'll be doing novel things that few others do in quite the same way, so you can expect to do much more of the heavy lifting yourself. It could be that no one has yet answered the question "How do I do this thing?", so asking for and getting help will require a lot of work on your part.

There's a great How to Troubleshoot course that might be helpful, and don't forget about the Arcus Forum. For questions in R, Python, and SQL, consider checking out the educational support available through the Arcus Help Center.
