We're still working out what is most useful here for people who are new to working with data. Let us know what you think of these materials and what else you'd like to see: submit a support ticket or email Arcus Education with improvements or additional topics to suggest.
First, a few things that might interest you, beginning with some introductory modules from the DART (Data and Analytics for Research Training) program.
DART includes dozens of data science modules, each an hour or less long, with a narrow focus and clear learning objectives. They are asynchronous, so you can take them at any time!
Arcus Education's DART program is the result of an NIH grant aimed at educating biomedical researchers. If you'd like to learn more about DART, fill out our interest form or email us at dart@chop.edu.
Training Modules:
To begin learning about data science, we recommend starting with these modules:
For a deeper introduction to biomedical data science, see the full 15-module pathway.
Order | Module | Description | Estimated Time |
---|---|---|---|
1 | Reproducibility, Generalizability, and Reuse | This module provides learners with an approachable introduction to the concepts and impact of research reproducibility, generalizability, and data reuse, and how technical approaches can help make these goals more attainable. | 60 min |
2 | How to Troubleshoot | Learning to use technical methods like coding and version control in your research inevitably means running into problems. Learn practical methods for troubleshooting and moving past error codes and other difficulties. | 30 min |
3 | Learning to Learn Data Science | Discover how learning data science is different from learning other subjects. | 20 min |
4 | Demystifying Geospatial Data | This module is a brief introduction to geospatial (location) data. | 15 min |
5 | Omics Orientation | This module provides a brief introduction to omics and its associated fields. | 15 min |
6 | Demystifying SQL | SQL is a relational database solution that has been around for decades. Learn more about this technology at a high level, without having to write code. | 40 min |
7 | Demystifying Machine Learning | An approachable and practical introduction to machine learning for biomedical researchers. | 60 min |
8 | Demystifying Large Language Models | Learn about large language models (LLMs) like ChatGPT. | 60 min |
9 | Demystifying Python | This module introduces the Python programming language, explores why Python is useful in research, and describes how to download Python and Jupyter. | 20 min |
10 | Demystifying Regular Expressions | Learn about pattern matching using regular expressions, or regex. | 30 min |
11 | Citizen Science | This is an overview of citizen science for biomedical researchers. | 45 min |
12 | Demystifying Containers | Containers can be a useful tool for reproducible workflows and collaboration. This module describes what containers are, why a researcher might want to use them, and what your options are for implementation. | 20 min |
13 | Intro to Version Control | An introduction to what version control systems do and why you might want to use one. | 15 min |
14 | Directories and File Paths | In this module, learners will explore what a directory is and how to describe the location of a file using its file path. | 15 min |
15 | Research Data Management Basics | Learn the basics about research data management. | 40 min |
Beyond the NIH-funded DART modules, we suggest other articles and resources, some created in Arcus and some from the larger data science community.
Other Resources:
- If you're new to working with data, Arcus Education provides a quickstart guide for data science.
- How open is your science? Take the quiz.
- What is p-hacking and why does it matter? Check out this p-hacker app with fake data and another p-hacker app using real data to see a demonstration.
Pivoting to a data science methodology, one that prioritizes writing code and using version control, can be challenging! For an overview of why this approach is helpful, see our Reproducibility training module. It can give you additional motivation when you're struggling with common challenges like:
- Typing computer commands when you're used to point and click
- Programming when you're used to spreadsheet data analysis
- Version control and text files when you're used to MS Office
- Learning a lot of new things at once when you aren't certain how to prioritize and schedule learning tasks
- New rewards and new frustrations
Literate statistical programming resources you might find useful include:
- Literate Statistical Programming
- A video tutorial on R-Markdown and our R-Markdown 101 guide
- Code Readability
If you aren't used to working with raw data that hasn't been pre-groomed, you might feel overwhelmed. Here are a few things to consider that can help!
- Clinical data at CHOP: What is where?
- Data dictionaries and variable names: make sure you watch the SQL training videos in your lab, where we go over the use of `dd_field` and `dd_table` to help you work with data dictionaries.
- Errors, outliers, and typos do exist, because people make mistakes in charting, and the medical record system isn't foolproof. What will you do with unexpected or unlikely data?
- Repeated versions of a variable: here's an article that looks at repeated variable versions in R.
- Missing values: None, NULL, NaN, etc. may also be a challenge. Check out this article on missing data.
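As a rough illustration of the missing-values point, here is a minimal Python sketch using only the standard library. The `is_missing` helper, the list of placeholder strings, and the height values are all hypothetical, invented for this example; your data may encode missingness differently, so check its data dictionary first.

```python
import math

# Hypothetical set of text placeholders that mean "no value" in this example.
MISSING_STRINGS = {"", "NA", "N/A", "NULL", "NAN"}

def is_missing(value):
    """Return True for the common stand-ins for 'no value'."""
    if value is None:
        return True
    if isinstance(value, float) and math.isnan(value):
        return True
    if isinstance(value, str) and value.strip().upper() in MISSING_STRINGS:
        return True
    return False

# NaN is never equal to itself, so "== NaN" comparisons silently fail.
nan_equals_itself = float("nan") == float("nan")  # False

# Made-up heights in cm, mixing real values with several kinds of "missing".
raw_heights = [102.5, None, float("nan"), "NULL", "98.0", "", 110.0]

present = [float(v) for v in raw_heights if not is_missing(v)]
n_missing = sum(is_missing(v) for v in raw_heights)
mean_height = sum(present) / len(present)

print(f"{n_missing} missing, mean of the rest: {mean_height}")
```

Dropping missing values before averaging is only one option; whether to drop, impute, or flag them depends on your research question.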
Sometimes, facing an empty screen without any code yet written can feel like an unwieldy problem without any handholds. How do you even get started? We encourage you to think about:
- Breaking ideas down into steps (for example, use pseudocode, a fancy word for "steps written out in fake code", to plan your data selection and data analysis)
- Finding where to start (what fields already exist in the data vs. what will you need to refine/calculate/reformat)
- Learning to work with your lab directory structure. A resource that might help is our short Directories and File Paths course or the article File Paths for Data Scientists
- Conventions for naming files. You want to be on the same page as the rest of the team and watch out for things to avoid (like spaces in file names, which can be annoying in some environments).
- Translating ideas into questions, and questions into tests: what is your algorithm to get from data to publication?
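The pseudocode suggestion above might look like the following in Python: the plan is written first as plain-language comments, then each step is filled in with real code. This is only a sketch; the field names (`age_years`, `weight_kg`, `height_m`) and the records are invented for illustration and do not come from any Arcus dataset.

```python
# Pseudocode plan, written before any real code:
#   1. keep only visits for patients under 18
#   2. compute BMI from weight and height for each remaining visit
#   3. report the average BMI

# Made-up example records; a real analysis would load these from your lab's data.
visits = [
    {"patient_id": "A", "age_years": 7, "weight_kg": 25.0, "height_m": 1.25},
    {"patient_id": "B", "age_years": 34, "weight_kg": 70.0, "height_m": 1.75},
    {"patient_id": "C", "age_years": 12, "weight_kg": 45.0, "height_m": 1.5},
]

# Step 1: keep only visits for patients under 18.
pediatric = [v for v in visits if v["age_years"] < 18]

# Step 2: BMI = weight in kg divided by height in metres squared.
bmis = [v["weight_kg"] / v["height_m"] ** 2 for v in pediatric]

# Step 3: report the average BMI.
average_bmi = sum(bmis) / len(bmis)
print(f"Average BMI across {len(bmis)} pediatric visits: {average_bmi}")
```

Keeping the plan as comments above each step makes it easier to see which fields already exist in the data and which you had to calculate.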
How much troubleshooting is normal? It's tricky. When you're first getting started, some of your questions and challenges are the kinds of things that a more experienced user could sort out in a moment or two. That's why it makes sense to reach out before you get too frustrated and exhausted and ask for help in your very early steps. But as your research advances, you'll be doing novel things that few others do in quite the same way, so you can expect to do much more of the heavy lifting yourself. It could be that no one has yet answered the question of "how do I do this thing", so that asking for and getting help will require a lot of work on your part.
There's a great How to Troubleshoot course that might be helpful, and don't forget about the Arcus Forum. For questions in R, Python, and SQL, consider checking out the educational support available through the Arcus Help Center.