Virtual Applied Data Science Training Institute (VADSTI)

Data Science Approaches to Better Understand Clinical and Genomic Informatics
FALL TRAINING SERIES: September 1, 2022 – October 28, 2022

About VADSTI

Technological advancements and efficient use of computational tools have made it possible to
generate and store large amounts of heterogeneous and complex datasets in many disciplines,
including public health, clinical, biomedical, and genomics. There is therefore increased demand
for data analytics capabilities to look at trends, predict outcomes, and make better clinical and
health policy decisions. Skill sets in data science are particularly critical for advancing the
science of minority health and health disparities. The Howard University Research Centers in
Minority Institutions (RCMI) Program, funded by NIMHD and supported by the AIM-AHEAD
program, is pleased to announce VADSTI 2.0, the Fall Training Series, to the Howard
University community of researchers and beyond. We aim to enhance data science capability
and application by providing training in the foundations of programming and critical data
analytic skills for planning and conducting research involving big data pertinent to minority
health and health equity. The Fall Training Series is project-based and will cover topics
including Data Preparation, Exploration and Visualization, Classification Models, Data
Clustering and Dimension Reduction, and Machine Learning and Predictive Analytics.

To register, click the following link: https://vadsti_fall22.eventbrite.com

For questions, contact VADSTI at vadsti@howard.edu or John Kwagyan, PhD at jkwagyan@howard.edu

Program Objectives & Competencies

The primary objective of the 2022 VADSTI program is to provide training in the foundations of
data science, advance analytic skills, and introduce tools for clinical and genomic research.
Over the course of the 8-week training program you will:

Be introduced to the foundations of data science.
Be introduced to Python programming skills.
Gain practical, hands-on experience with Python and related libraries for accessing data.
Learn about the underlying concepts of probability and statistics for data analytics.
Be introduced to advanced analysis techniques utilized in biomedical, clinical, and genomic
research.
Understand the concepts and practice behind data partitioning and supervised and unsupervised
learning.
Be introduced to algorithmic methods, including machine learning and deep learning.
Complete and submit a health-related data science project.

Certificate of Completion: Participants who complete and submit their projects in the VADSTI
GitHub Data Science Project Portfolio will receive a verified digital certificate of completion.

Evaluation: At the end of each training module, you will be requested to complete electronic
feedback forms on the extent to which expectations and objectives were met.

Registration & Fees: No fees for participation, but registration is required to attend.

VADSTI Training Program Schedule

There are no prerequisites for research knowledge topics. Basic undergraduate knowledge of algebra and
probability is recommended for content knowledge topics. The training series consists of the
following modules.

Pre-Training Sessions

Pre-Training Session I
Foundations of Data Science

Thursday, September 1, & Friday, September 2, 2022
11:00 AM – 2:00 PM EST

Pre-Training Session II
Introduction to Python

Thursday, September 8, & Friday, September 9, 2022
11:00 AM – 2:00 PM EST

Pre-Training Session III
Probability, Statistical Inference and Regression Models

Thursday, September 15, & Friday, September 16, 2022
11:00 AM – 2:00 PM EST

Training Sessions

Module 1
Data Preparation, Exploration, and Visualization

Thursday, September 22, & Friday, September 23, 2022
11:00 AM – 2:00 PM EST

Module 2
Classification Models

Thursday, September 29, & Friday, September 30, 2022
11:00 AM – 2:00 PM EST

Module 3
Data Clustering and Dimensionality Reduction

Thursday, October 6, & Friday, October 7, 2022
11:00 AM – 2:00 PM EST

Module 4
Seminar Presentations on Current Research Topics

Thursday, October 13, & Friday, October 14, 2022
11:00 AM – 2:00 PM EST

Module 5
Machine Learning and Predictive Models I

Thursday, October 20, & Friday, October 21, 2022
11:00 AM – 2:00 PM EST

Module 6
Machine Learning and Predictive Models II

Thursday, October 27, & Friday, October 28, 2022
11:00 AM – 2:00 PM EST

VADSTI Training Program Curriculum

Pre-Training Sessions

The 3-week pre-training sessions, held before the main Fall Training Series begins, are live, faculty-facilitated discussions of lecture recordings from the Spring Training Series. Participants who are new to data science or who lack its critical foundations, Python programming skills, or statistical concepts are encouraged to review all the recordings and are required to attend the live discussion sessions. Topics for these discussion sessions are:

Week 1) Pre-Training Session 1 | Foundations of Data Science

Thursday, September 1, & Friday, September 2, 2022
11:00 AM – 2:00 PM EST

FACILITATORS – Moussa Doumbia, Ph.D., William Ampey, MS, PhD_c, Kwasi Yeboah-Afihene, Ph.D.

Week 2) Pre-Training Session 2 | Introduction to Python

Thursday, September 8, & Friday, September 9, 2022
11:00 AM – 2:00 PM EST

FACILITATORS – Moussa Doumbia, Ph.D., Ebelechukwu Nwafor, Ph.D., William Ampey, MS, PhD_c, Kwasi Yeboah-Afihene, Ph.D.

Week 3) Pre-Training Session 3 | Probability, Statistical Inference and Regression Models

Thursday, September 15, & Friday, September 16, 2022
11:00 AM – 2:00 PM EST

FACILITATORS – John Kwagyan, Ph.D., William Ampey, MS, PhD_c

Training Sessions

Week 4) Module 1 | Data Preparation, Exploration, and Visualization

Thursday, September 22, & Friday, September 23, 2022
11:00 AM – 2:00 PM EST

INSTRUCTOR – Ebelechukwu Nwafor, PhD

This module provides recipes for data preparation, exploration, and visualization, which are critical steps in any data science project. The goal is for participants to learn how to visualize data and perform initial investigations to discover patterns, spot anomalies, test hypotheses, and check assumptions with the help of summary statistics and graphical representations. We will use Python to explore, filter, and manipulate various datasets; identify data anomalies and missingness; impute missing data; and identify highly correlated variables. We will explore the Johns Hopkins University COVID-19 data repository: import the data, wrangle it to look at the number of confirmed cases by country and region, and plot the number of reported confirmed cases and deaths by country. In addition, we will use the COVID-19 tracking project dataset to explore racial disparities in COVID-19 mortality and infections in the US.
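As a rough illustration of the steps named above (counting missing values, imputing them, and flagging highly correlated variables), here is a minimal sketch. It is not course material: the toy values, the column names, the helper function names, and the 0.8 threshold are all assumptions made for this example, and it uses only the Python standard library so that it runs without extra installs.

```python
from itertools import combinations
from statistics import mean

# Made-up toy values for illustration only; None marks a missing entry.
records = {
    "age": [34, 51, None, 47, 29, 62],
    "systolic_bp": [118, 135, 128, None, 110, 150],
    "visits": [3, 1, 4, 1, 5, 2],
}


def missing_counts(columns):
    """Count the missing (None) entries in each column."""
    return {name: sum(v is None for v in values) for name, values in columns.items()}


def impute_mean(values):
    """Replace each None with the mean of the observed values."""
    fill = mean(v for v in values if v is not None)
    return [fill if v is None else v for v in values]


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    spread_x = sum((a - mx) ** 2 for a in x) ** 0.5
    spread_y = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (spread_x * spread_y)


def highly_correlated(columns, threshold=0.8):
    """Return column pairs whose absolute correlation exceeds the threshold."""
    pairs = []
    for (name_a, a), (name_b, b) in combinations(columns.items(), 2):
        r = pearson(a, b)
        if abs(r) > threshold:
            pairs.append((name_a, name_b, round(r, 2)))
    return pairs


print(missing_counts(records))
imputed = {name: impute_mean(values) for name, values in records.items()}
print(highly_correlated(imputed))
```

Mean imputation is only one of several possible choices; it is used here because it is the simplest to read, not because it is necessarily the method taught in the module.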

Week 5) Module 2 | Classification Models

Thursday, September 29, & Friday, September 30, 2022
11:00 AM – 2:00 PM EST

INSTRUCTOR – Moussa Doumbia, PhD

In this module, you will first learn the criteria for choosing a classification model for a given data science project. Participants will learn the concepts behind Logistic Regression, K-Nearest Neighbors, Decision Trees, Random Forests, and Support Vector Machines. Using Python, participants will learn how to apply and interpret these classification models on real-world health disparities and equity datasets.
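To give a flavor of one of the methods listed above, the following is a minimal sketch of K-Nearest Neighbors classification evaluated on held-out points. The data points, labels, function names, and the choice of k = 3 are assumptions made for this example; it uses only the Python standard library and is not taken from the module itself.

```python
from collections import Counter
from math import dist

# Made-up two-feature training points with binary labels, for illustration only.
train_X = [(1.0, 1.2), (0.8, 0.9), (1.3, 0.7), (5.1, 4.8), (4.7, 5.3), (5.4, 5.0)]
train_y = [0, 0, 0, 1, 1, 1]

# Held-out points that play no part in the neighbor search set.
test_X = [(1.1, 1.0), (4.9, 5.1)]
test_y = [0, 1]


def knn_predict(X, y, query, k=3):
    """Predict the label of `query` by majority vote among its k nearest points."""
    nearest = sorted(zip(X, y), key=lambda pair: dist(pair[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


def accuracy(X, y, X_new, y_new, k=3):
    """Fraction of held-out points whose predicted label matches the true label."""
    hits = sum(knn_predict(X, y, q, k) == label for q, label in zip(X_new, y_new))
    return hits / len(y_new)


print(accuracy(train_X, train_y, test_X, test_y))
```

Keeping the test points separate from the training points is the data partitioning idea from the program objectives: accuracy is measured only on data the classifier did not use.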

Week 6) Module 3 | Data Clustering and Dimensionality Reduction

Thursday, October 6, & Friday, October 7, 2022
11:00 AM – 2:00 PM EST

INSTRUCTOR – Martin Skarzynski, PhD

This module covers data description and clustering, including similarity measures and dimensionality reduction. Participants will learn about the k-means algorithm and hierarchical clustering. Unsupervised learning on clinical and genomic datasets will be used for illustration.
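The k-means algorithm mentioned above can be sketched in a few lines. This is a minimal version of the standard alternating procedure (often called Lloyd's algorithm), written with the Python standard library only. The points, the function name, and the hand-picked starting centroids are assumptions for this example; practical implementations usually choose starting centroids randomly or with a seeding heuristic.

```python
from math import dist


def kmeans(points, initial_centroids, max_iter=100):
    """Alternate assignment and centroid update until assignments are stable."""
    centroids = [tuple(c) for c in initial_centroids]
    labels = None
    for _ in range(max_iter):
        # Assignment step: each point joins its nearest centroid.
        new_labels = [
            min(range(len(centroids)), key=lambda j: dist(p, centroids[j]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing, so the procedure has converged
        labels = new_labels
        # Update step: move each centroid to the mean of its assigned points.
        for j in range(len(centroids)):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # an empty cluster keeps its previous centroid
                centroids[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return labels, centroids


# Made-up points forming two visibly separate groups, for illustration only.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.5), (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]
labels, centroids = kmeans(points, initial_centroids=[points[0], points[3]])
print(labels)
print(centroids)
```

Euclidean distance serves as the similarity measure here; swapping in a different distance function would change which points are grouped together.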

Week 7) Module 4 | Seminar Presentations on Current Research Topics

Thursday, October 13, & Friday, October 14, 2022
11:00 AM – 2:00 PM EST

4.1: Ethical Data Science | Thursday, 11:00 AM – 12:15 PM

Presenter: Rochelle Tractenberg, PhD

4.2: Data Science and Health Environment | Thursday, 12:30 – 2:00 PM

Presenter:

4.3: Data Science and Social Justice | Friday, 11:00 AM – 12:15 PM

Presenter: Johanna Hardin, PhD

4.4: Elements and Use of EHR Discovery | Friday, 12:30 – 2:00 PM

Presenter: TBD

Week 8) Module 5 | Machine Learning and Predictive Models I

Thursday, October 20, & Friday, October 21, 2022
11:00 AM – 2:00 PM EST

INSTRUCTOR – Zhe Fei, PhD

This module focuses on predictive models for high-dimensional data, for example, genetics, epigenetics, and other biomedical data with large numbers of predictors. Several model selection techniques will be introduced, including Subset Selection, Shrinkage Methods, and Methods Using Derived Input. We will also introduce Resampling Methods for model assessment and model inference, including Cross Validation, Bootstrap, and Bagging. Real-world datasets will be used as examples.

Week 9) Module 6 | Machine Learning and Predictive Models II

Thursday, October 27, & Friday, October 28, 2022
11:00 AM – 2:00 PM EST

INSTRUCTOR – Zhe Fei, PhD

This module focuses on non-linear predictive models for complex biomedical data, for example, time series data, medical imaging, and image segmentation. We will start with two non-linear univariate tools, Kernel Smoothing and Splines. Then we will introduce Neural Networks, covering the basics and extensions. We will also compare neural networks with two popular machine learning methods for classification: Random Forest and Support Vector Machines. Examples from real datasets will be used for illustration.
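As one possible illustration of kernel smoothing, the sketch below computes a Nadaraya-Watson estimate with a Gaussian kernel and picks the bandwidth by leave-one-out cross validation, a resampling idea from the previous module. The data, the function names, and the candidate bandwidths are assumptions made for this example, and only the Python standard library is used; the module itself may present these tools differently.

```python
from math import exp, sin
from statistics import mean

# Made-up one-dimensional data: a sine curve plus a small deterministic wiggle.
xs = [0.5 * i for i in range(13)]
ys = [sin(x) + 0.1 * (-1) ** i for i, x in enumerate(xs)]


def kernel_smooth(x_train, y_train, x_query, bandwidth):
    """Nadaraya-Watson estimate at x_query using a Gaussian kernel."""
    # The Gaussian normalizing constant cancels in the ratio, so it is omitted.
    weights = [exp(-0.5 * ((x_query - x) / bandwidth) ** 2) for x in x_train]
    return sum(w * y for w, y in zip(weights, y_train)) / sum(weights)


def loo_cv_error(x, y, bandwidth):
    """Leave-one-out cross-validated mean squared error for one bandwidth."""
    errors = []
    for i in range(len(x)):
        # Hold out point i and predict it from all the others.
        x_rest, y_rest = x[:i] + x[i + 1:], y[:i] + y[i + 1:]
        prediction = kernel_smooth(x_rest, y_rest, x[i], bandwidth)
        errors.append((prediction - y[i]) ** 2)
    return mean(errors)


candidates = [0.25, 0.5, 1.0, 2.0]
best = min(candidates, key=lambda h: loo_cv_error(xs, ys, h))
print(best, kernel_smooth(xs, ys, 1.25, best))
```

A small bandwidth follows the data closely and a large one flattens it toward the overall mean; cross validation is one way to choose between those extremes. Note that a query point very far from all training points can make every weight underflow to zero, which this sketch does not guard against.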