2022 Speakers

Andrew Pham and Winnie Gong

San Diego County, CA

Session: Data Science for Good

Use of R to Educate Underserved Communities Nationwide

San Diego, as a border city, is home to an ethnically diverse population, within which Asian and Hispanic groups show a higher incidence of silent diseases such as hepatitis B and diabetes. As the name implies, silent diseases rarely present symptoms until a late stage. By making the public aware of the importance of vigilance, screening, and treatment for such diseases, education is a vital public health intervention to mitigate disease transmission.

The approach we, the Asian Pacific Health Foundation (APHF), have taken in educating communities is the train-the-trainer model, where individuals within a community can be educated on any health topic, and in turn become trainers that can educate others within a community or social network. By treating information like a transmissible disease, everyone has a role in decreasing transmission of silent diseases.

In this model, a recurring issue we have faced is motivating trainers to continue educating others about what they have learned. Even when trainers are motivated, visibility of their efforts has been limited to members of our organization. To address these issues and expand our reach, we have been using the Shiny R package to develop a web app, currently dubbed the “APHF Community Education Tracking App” (CETA). This application is meant to complement the train-the-trainer model and serve as a publicly accessible database of education statistics on various health topics. By tracking education efforts at both the individual and community level, we aim to drive the spread of health information through communities across the nation.

At the individual level, CETA tracks personal education efforts to motivate trainers in their teaching. Through the administration of a knowledge assessment for any health topic, such as viral hepatitis, educated individuals can be added to the database of those knowledgeable about viral hepatitis and given their own ID. If these individuals continue on to be trainers, they can educate others on the topic and then direct them to CETA. These trainees can give credit to their trainers by listing the trainer ID. Trainers can track their profile statistics to see how many individuals they have directly educated. In addition, trainers can see how many people they have indirectly educated through their trainees’ education efforts, and so on. Ultimately, we plan to allow trainers to see the geographical reach of their efforts, a visual representation of their individual impact within communities nationwide.
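The direct and indirect counts described above can be sketched in a few lines of base R. This is only an illustration under stated assumptions: the `referrals` table, its column names, and the IDs are hypothetical, and nothing here is taken from CETA's actual implementation.

```r
# Hypothetical referral table: one row per educated individual, with the
# trainer ID they credited (NA when no trainer was listed).
referrals <- data.frame(
  id         = c("T1", "T2", "T3", "T4", "T5", "T6"),
  trainer_id = c(NA,   "T1", "T1", "T2", "T4", NA),
  stringsAsFactors = FALSE
)

# People who credited `trainer` directly.
direct_trainees <- function(trainer, data) {
  data$id[!is.na(data$trainer_id) & data$trainer_id == trainer]
}

# Walk down the referral chain level by level to count everyone reached.
count_reach <- function(trainer, data) {
  direct   <- direct_trainees(trainer, data)
  seen     <- character(0)
  frontier <- direct
  while (length(frontier) > 0) {
    seen       <- union(seen, frontier)
    next_level <- unlist(lapply(frontier, direct_trainees, data = data))
    frontier   <- setdiff(next_level, seen)  # guards against cycles
  }
  c(direct = length(direct), indirect = length(seen) - length(direct))
}

reach_t1 <- count_reach("T1", referrals)
reach_t6 <- count_reach("T6", referrals)
```

Here `reach_t1` separates the people T1 taught personally from those reached through T1's trainees, which is the distinction a trainer profile would display.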

At the community level, CETA serves as a platform in which large-scale education efforts can be tracked. As a visual database, CETA can be used to analyze how well health information is transmitted through a diverse range of communities. Through a zip code search, CETA allows any individual to see how many community members have been educated in a geographic area, filtered across various health topics. With this information, deficiencies and disparities in health education and resources can be identified. Similar to a disease transmission map, data from CETA can be used to better inform resource allocation.

CETA is meant to be easily expandable in terms of content and reach. Its modular design allows virtually any health topic to be added, and it can be made to accommodate sociocultural topics, provided that the appropriate resources are available. As a virtual platform, CETA opens the door to partnerships with organizations across the nation that share the goal of community education and health outreach. As the reach of CETA expands, so would the ability of public health research to close education disparities and create tangible community change.

At its core, CETA is a tool accessible to volunteer trainers, resource centers, and public offices alike to understand where and how educational inequities present within a community. With its development, we hope to not only increase individual awareness of health topics, but to start conversations in which communities can begin to take control of their own education.

Anthony Lusardi

Oregon State University Bonneville Power Administration, Corvallis, OR

Session: Shiny and R

Improving Energy Imbalances in PNW Using R

My name is Anthony Lusardi, and I work in power generation support at Bonneville Power Administration. My job is to give decision makers the critical information they need to manage the entire Columbia River basin system so that it can produce power and function efficiently. Day to day, I use R to build Shiny applications that supply critical information for planning and forecasting on the Pacific Northwest power grid. I’m not alone, though! Our Data Analytics Community at Bonneville Power Administration is centered on sharing knowledge and resources, and we welcome analysts at every level of experience to use and demonstrate their skills in delivering critical information.

What kind of critical information, you may ask? Every second, thousands of megawatts are sent and received across the Pacific Northwest. For reference, one megawatt alone can supply 400 to 1,000 homes! Energy demand (or load) must equal generation at all times. This fundamental balance has to be held exactly; if it is not, the consequences can be catastrophic. While the Pacific Northwest has always been a leader in hydroelectric power and carbon-free energy, it has joined neighboring states in a centralized market called the Western Energy Imbalance Market. This market allows energy transactions on a 5-minute basis. The short interval allows renewable energy to be incorporated onto the grid while algorithms and specialists solve for the least-cost way to integrate variable generation and sell it to customers for the best price.

So how does R come in handy? While computers do most of the brute-force work of solving for the least-cost way for the grid to meet load, analysts need to understand fluctuations in load and demand to fix the system when it is broken or improve it when conditions change. Each transaction can involve hundreds of megawatts being dispatched and rerouted based on demand and generation. Decision makers need to resolve important discrepancies in the market, such as the Resource Sufficiency test, or testing how much energy we as BPA can provide to the market, before the computer solves for them. We also need to consider how our water is passing through the dams and being used to respond to unforeseen activity in the market. While some dams provide generation, others can be used in emergencies or contingency events. For each of these needs, a Shiny application was built to give operators in the Pacific Northwest the information they need to use the entire Columbia Basin to deliver the energy you had yesterday, have today, and will have tomorrow. Let’s take a look at each of these applications.

Shiny app 1: Lower Snake Reports

This app pulls hourly data from a database called Plant Information (PI) into an interactive Shiny dashboard that summarizes vital operational information about the dams on the Lower Snake River for stakeholders. Each tab shows hourly measurements of water levels and generation, and where the overall system is allocating the Lower Snake dams’ contributions in actual or potential energy. This Shiny application is a one-stop shop for information about the dams, with no need to run extensive queries each day or summarize information manually. With a dashboard of automated, easy-to-access information, BPA can quickly report the current usage of these dams to public interest groups, or use it internally to make better decisions about future plans for the Lower Snake and its relevance to the EIM.

Shiny app 2: Automated Non-Scheduled Load Updates

Forecasting and planning for the market is a tricky but exciting job. Thankfully, this R application helps visualize how one part of load has behaved historically. On a daily basis, some customers, called slice customers, can decide up to 40 minutes before the hour how much energy they want to purchase. This part of the forecast needs to be modeled as Non-Scheduled Load (NSL). At any time up to 40 minutes beforehand, any slice customer can change their request for power, which in turn changes our plan to deliver that power. Once submitted, the request is uploaded to a database but not passed on to operators. This database was particularly hard to extract data from because we need not only the current information but also past information at certain ‘as-of’ times indicated by the T-# tags. Without a record of the points in time at which the most changes occurred, it was difficult to determine when NSL was changing the most. Now, with this application, we have not only automated updates but also a growing pool of information that will be useful for future analysis.

Without R Shiny and the R Connect environment, the ability to quickly access this information would be completely lost. It is an invaluable tool used by many analysts at Bonneville Power Administration and continues to be one of the most valuable resources in the data analytics community. If you are interested in learning more about the Pacific Northwest power grid or Bonneville, or in joining the Data Analytics Community in the energy industry, feel free to contact me.

Ben Matheson

Session: Reporting & Sharing of R!

How Anchorage Built Alaska’s Vaccine Finder with R

In January 2021, Alaska residents seeking a COVID-19 vaccine appointment faced a convoluted maze of websites. The software was made for providers, not for residents. The Anchorage Innovation Team built a fast, mobile-friendly vaccine finder website for Alaska using R. What started as a web scraping prototype launched statewide one week later and ultimately connected tens of thousands of Alaskans to a vaccine. This talk will cover how we used R to build Alaska's vaccine finder and how R is a powerful prototyping tool that can supercharge projects. It will cover:

  • Prototyping and rapid product development with R
  • Scraping and HTTP packages (rvest & httr)
  • Using Heroku and S3 to run R jobs 24/7
  • Creating a flexible data service with R

Dan McGuire

NanoString Technologies, Seattle, WA

Model-based cell typing for single cell spatial transcriptomics

A new generation of spatial transcriptomics platforms measures single-cell gene expression while retaining cells’ locations within tissues. Accurate cell typing, which uses cells’ gene expression profiles to assign them to a cluster or cell type, is fundamental to analyzing this data. We introduce the Insitutype cell typing algorithm, designed to maximize statistical power in spatial transcriptomics data. Insitutype can perform unsupervised clustering via an EM algorithm, supervised cell classification via a Bayes classifier, or semi-supervised detection of unknown clusters alongside reference cell types. Cells’ images and spatial contexts can be harnessed for further information.

Dror Berel

AWS, Seattle, WA

Session: Developing in R

Benchmarking AWS SageMaker ML and Forecasting Algorithm Performance with Traditional R Packages

The AWS SageMaker SDK provides a fully managed service for various ML and forecasting containers, which can be deployed as a SageMaker endpoint. Setting aside the production aspect and integration with other AWS tools, which are not the focus of this talk, these containers can be seen as a black-box mechanism capable of high scalability and sufficient model performance with minimal settings required from the end user (termed autoML here). In this talk I will compare the model performance of SageMaker autoML containers with that of traditional R packages for ML and forecasting (mlr3, tidyverts, …) on a couple of case studies. I will then try to stretch each of these tools to its limits and benchmark its performance. This talk will also demonstrate how one can integrate one's own code into a SageMaker pipeline.

Jeff Rothschild

Sports Performance Research Institute New Zealand, Auckland, NZ

Session: Shiny and R

A Dashboard for Calculating Hourly Energy Balance

This talk shares a Shiny dashboard I created using flexdashboard: https://rothschild.shinyapps.io/Energy-balance/

This app calculates hourly energy balance and allows an athlete, coach, or dietitian to identify whether someone is spending too much time in too large an energy surplus or deficit, and to plan meal sizes around their exercise accordingly. As a dietitian, this is something I have thought about and wanted to build for quite a while, as many people focus too much on daily energy balance and underestimate the importance of within-day balance. Put simply, if someone exercises in the morning and eats all of their food at dinner time, it is not the healthiest way of eating, even if they are in calorie balance. The talk will explain the rationale for the app and walk through the steps I took to build it: starting with the thought process and a static R script, adding the interactive features, and customizing the design. I will also talk about a few of the challenges I faced along the way, such as customizing the CSS and needing to combine characters with a continuous ggplot scale.
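The difference between daily and within-day balance can be illustrated with a short base R sketch. The abstract does not describe the app's calculation, so the running-sum approach, the kcal figures, and the 400 kcal threshold below are all assumptions chosen purely for illustration.

```r
# Hypothetical day: constant resting expenditure plus one morning session.
hours       <- 0:23
expenditure <- rep(90, 24)                        # kcal per hour (assumed)
expenditure[hours == 6] <- expenditure[hours == 6] + 600  # exercise at 06:00

# Hypothetical meals at 08:00, 13:00 and 19:00 (kcal).
intake <- numeric(24)
intake[hours %in% c(8, 13, 19)] <- c(600, 800, 1360)

hourly_net      <- intake - expenditure
running_balance <- cumsum(hourly_net)  # within-day balance at the end of each hour
daily_balance   <- sum(hourly_net)     # what a daily total alone would report

threshold <- 400                       # assumed cutoff for a "large" imbalance
hours_in_large_deficit <- hours[running_balance < -threshold]
hours_in_large_surplus <- hours[running_balance > threshold]
```

In this made-up day the daily total balances out exactly, yet `hours_in_large_deficit` is not empty, which is the kind of pattern a daily figure hides.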

Josh Roll

Oregon Department of Transportation, Portland, OR

Session: Data Science for Good

Using R to Understand Pedestrian Traffic Injury Inequities in Oregon

The Oregon Department of Transportation recently completed research to better understand the magnitude of disparities among injured pedestrians in the state. Using R for all of the data wrangling, analysis, and visualization, researchers documented the existing inequities in pedestrian injuries in Oregon. Multiple analyses were performed that relied on a number of tools and packages within R. Many of these analyses have been posted to GitHub to ease the burden on analysts in other states and regions who want to measure pedestrian injury inequity. This presentation will highlight the various elements of R used in this research and the high-level outcomes it documented.

Published report links:
https://www.oregon.gov/odot/Programs/ResearchDocuments/SPR%20841Injuries-Equity.pdf
https://linkinghub.elsevier.com/retrieve/pii/S1361920922001225

GitHub resources:
https://github.com/JoshRoll/Pedestrian-Fatal_Injury_Rate
https://github.com/JoshRoll/Pedestrian-Injury-Disparity-Index-Analysis

Nick Sun

Natera, Inc., Bellevue, WA

Session: Shiny and R

Making Power Calculations Easier with Shiny

Hypothesis tests in the medical device industry are often concerned with probabilities, e.g., sensitivity and specificity. Because power curves for these studies exhibit a non-monotonic, sawtooth pattern, careful power calculations are required for effective study planning. These calculations can be done relatively quickly in R, but that can be a barrier for statisticians and scientists with limited R programming experience. At Natera, we have made many of our proportion-based power calculations available in a single accessible Shiny dashboard so that users can investigate the relationship between sample size and power, regardless of technical acumen. This app has allowed our statisticians to spend more time in meaningful cross-team discussion and less time coding repetitive calculations, and it gives other technical teams a straightforward starting point for sample size discussions.
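The sawtooth behavior mentioned above is easy to reproduce in base R. The sketch below uses a one-sided exact binomial test; the function name `exact_binomial_power` and the values 0.80, 0.90, and 0.05 are assumptions for illustration and are not taken from Natera's dashboard.

```r
# Power of a one-sided exact binomial test of H0: p <= p0 against p = p1.
exact_binomial_power <- function(n, p0, p1, alpha = 0.05) {
  k <- 0:n
  # P(X >= k) under the null for every possible success count k.
  tail_under_null <- pbinom(k - 1, n, p0, lower.tail = FALSE)
  rejecting <- k[tail_under_null <= alpha]
  if (length(rejecting) == 0) return(0)  # no rejection region at this n
  critical <- min(rejecting)             # smallest count that rejects H0
  # Probability of reaching the rejection region under the alternative.
  pbinom(critical - 1, n, p1, lower.tail = FALSE)
}

sample_sizes <- 20:100
power <- vapply(sample_sizes, exact_binomial_power, numeric(1),
                p0 = 0.80, p1 = 0.90)

# TRUE when adding subjects sometimes lowers power (the sawtooth).
is_sawtooth <- any(diff(power) < 0)

# plot(sample_sizes, power, type = "l")  # uncomment to see the pattern
```

Because the critical count can only move in whole steps, the attained significance level jumps as the sample size grows, so a sample size that reaches a target power does not guarantee that every larger one does.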

Oluwafemi Oyedele

International Institute of Tropical Agriculture, Nigeria

Session: Reporting & Sharing of R!

Reproducible Research with R

Reproducible research is the process whereby future you, and anyone else, can pick up your analysis and reproduce the same results, including figures and tables. Reproducible research also implies well-documented research: your code should be well commented, and the reasons behind functions and methods should be explained thoroughly throughout the analysis. Communication should not be an afterthought; it should be recorded alongside your analysis as you work through it. The main aim of this talk is to set you on the right path to making your research more reproducible and shareable. Quarto is the next generation of R Markdown for publishing, with support for dynamic and static documents and for multiple programming languages. The ability to easily collaborate and share your analysis goes hand in hand with good record-keeping and reproducibility. We are going to use the Git version control tool and the GitHub remote hosting provider to manage and share our work. Git + GitHub provide a very powerful resource for global collaboration and exposure of your work. In this talk, we will put our work under version control and push it to GitHub, where it can be accessed by collaborators and supervisors. Git + GitHub should become an integral part of your workflow. This talk is for people with little or no prior knowledge of R, R Markdown, and Quarto.

Priyanka Gagneja


Session: Developing in R

Going Beyond Tidyverse 101

The key takeaway from this session is how to write shorter, cleaner, more readable code, making it:

  • Easier to debug when issues arise.
  • Easier to hand off to, and take over from, others, including your future self.

I want to highlight a few hidden gems I have recently come across while trying to achieve something very specific. They are hidden in plain sight: they have always been in the documentation, but we tend to overlook them as long as the function runs in its basic form with the default settings. Those non-default settings are where we will dig in.

We will look at the functions, some often-unnoticed options, and their use cases. For example:

  1. group_by(): .drop = FALSE
  2. .before / .after in mutate()
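A minimal sketch of the two options listed above, assuming dplyr 1.0.0 or later is installed. The `visits` data and its column names are hypothetical and only serve to make the effect visible.

```r
library(dplyr)

# Hypothetical data: the factor has a level ("C") that never occurs.
visits <- tibble(
  site  = factor(c("A", "A", "B"), levels = c("A", "B", "C")),
  count = c(3, 5, 2)
)

# Default behaviour: the empty level "C" is silently dropped.
dropped <- visits %>%
  group_by(site) %>%
  summarise(total = sum(count))

# .drop = FALSE keeps empty factor levels as groups, so "C" gets a total of 0.
kept <- visits %>%
  group_by(site, .drop = FALSE) %>%
  summarise(total = sum(count))

# .before places the new column ahead of `count` instead of at the end.
with_flag <- visits %>%
  mutate(high = count > 2, .before = count)
```

Keeping empty groups is useful when a report needs a row for every category, and `.before` / `.after` save a separate `relocate()` or `select()` step after `mutate()`.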

Radhika Etikala and Emily Zhang

Fred Hutch, Seattle, WA

Session: Creating an R Environment for a Team

Wrangling the Stakeholder Community – Data Specifications

A specification is a document describing the creation of a derived dataset. A specification document contains, but is not limited to, data sources (input data), study-specific processing requirements, an overview of processing steps, a list of included variables, a brief description of variable content, detailed derivation notes, and designation of key variables. It also records the versions of the specifications or other documents used, the decisions and changes made, and all related communication. This presentation will discuss the importance of data specifications and how data FAIRness (findable, accessible, interoperable, reusable) affects overall productivity and quality.

Findability is important when different teams and departments return to the data after some time to gain further insights or perform further analysis. It is often very difficult to find information (documentation, programs, data), so data specifications fill the gaps and make that information findable even years later. Accessibility is key as well, as our data comes from various sources and carries different levels of restriction. Programmers often struggle to access data from different locations due to permission issues. Data without the necessary specifications and documentation is useless: any data sharing needs to include the data together with the necessary documentation and metadata. Data FAIRness should bring structure and make data reusable. The presentation also discusses why good stakeholder relationships are important and how to lift statistical programming to the next level so that it becomes an equal partner with clinical teams and stakeholders.

Rodger Zou and Radhika Etikala

Fred Hutch, Seattle, WA

Session: Creating an R Environment for a Team

Towards a Collaborative R Prog Group: Verification

SCHARP (Statistical Center for HIV/AIDS Research and Prevention) is a part of Fred Hutch committed to supporting clinical research with high-quality clinical and laboratory data management and statistical services. Assuring the quality and correctness of the data is paramount to our goals and objectives. An important part of our day-to-day work is therefore verifying the processed data that we return to collaborators and researchers. But what is verification? It is hard to define: verification of processed experimental data is an important, ever-evolving, and widely discussed topic at SCHARP, and the idea of verification can vary from industry to academia.

This presentation details the framework of our verification process, the reasoning behind it, and how we use R in pursuit of it. We start by outlining the various risk levels that need to be addressed when processing data at SCHARP, and the types of verification that can be used to address these risks. Processing data in R can be straightforward, but verification might require a second, independent processing script using different packages in order to confirm the correctness of a dataset. We have examples where the {arsenal} and {testthat} packages prove useful in determining whether two datasets match or in locating the points at which they differ. We also utilize code review and data acceptance checks for lower levels of risk. How does the R environment support these types of verification?
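As a rough illustration of comparing a production dataset against an independently derived verification dataset, the sketch below uses `comparedf()` from {arsenal}, assuming that package is installed. The two data frames, their columns, and the deliberate discrepancy are hypothetical and do not represent SCHARP's actual process.

```r
library(arsenal)

# Hypothetical output of the production script.
production <- data.frame(
  id    = 1:4,
  visit = c("V1", "V1", "V2", "V2"),
  titer = c(10.2, 8.7, 15.1, 12.0)
)

# Hypothetical output of the independent verification script,
# with one deliberately introduced discrepancy.
verification <- production
verification$titer[3] <- 15.4

# Match rows on `id` and compare every shared column.
cmp <- comparedf(production, verification, by = "id")

n_mismatches    <- n.diffs(cmp)  # how many individual values differ
mismatch_detail <- diffs(cmp)    # one row per differing value

# summary(cmp) prints a fuller report for the verification record.
```

A count of zero would indicate that the two scripts agree on the matched rows and shared columns; anything else gives the two programmers a concrete list of values to reconcile.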

Finally, teamwork and documentation also lie at the heart of verification. Which parts of R require teamwork and discussion in order to settle differences between production and verification data? Parts of verification can be communicated through email or messaging programs, but ultimately differences need resolution, and we present stories of the teamwork that settled them.

Tim Anderson

Portland State University, Portland, OR

Session: Reporting & Sharing of R!

Publishing a Book using Only R Ecosystem Tools

After teaching linear and integer programming optimization for years using a variety of proprietary platforms, I decided to adopt R in the course to help introduce students to the world of analytics. A series of lecture notes grew into a book-length collection that was organized into a book using the bookdown and tufte packages. Goals of the project included making a useful resource for students, demonstrating R documentation and reproducibility practices, and retaining open access.

At a 2019 conference, CRC Press asked to publish the revised book. While progress slowed due to the pandemic, the book was officially published in July 2022. In this talk I will discuss the lessons learned, including mistakes made. The book's GitHub repository is at https://github.com/prof-anderson/OR_Using_R, and the book is available from Amazon in various formats as “Optimization Modeling using R”.

A feature that I am proud of is that the book was entirely written using the R and RStudio ecosystem of tools. Over the years, the book (in its various stages) has been used by over 600 analytics and engineering management students and many have made contributions along the way that are acknowledged.

Truzaar Dordi

Victoria, BC

Session: Data Science for Good

Ten Financial Actors Can Accelerate a Transition Away from Fossil Fuels

Our research uses the R igraph package to map ownership dynamics between shareholders and the fossil fuel industry. Outputs include bipartite network visualizations and an interactive network graph. In line with this year's theme, this research highlights how network models can inform policy and practice through storytelling on critical climate action. The study finds that just ten financial actors are disproportionately responsible for climate instability through their immense investments in fossil fuel firms.

Investors have a central role to play in sustainability transitions, due to their inordinate influence on the governance of the fossil fuel extraction industry. Using network analysis, this paper links fossil fuel firms to equity owners by distinguishing the ownership characteristics of top shareholders and establishing a ranked list of the most prevalent shareholders based on emissions potential and network centrality. Our study reveals that among the most prevalent owners are government signatories of the Paris Accord and prominent American investment managers. We conclude that a concentrated number of investors have the potential to influence the strategic direction and governance of these firms and should consequently be held accountable for financing the economic activities that contribute to climate instability. This paper directly contributes to the fragmented body of academic research on financial systems and sustainability transitions.
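A shareholder-to-firm network of this kind can be sketched with igraph as follows, assuming the package is installed. The investors, firms, and weights are invented placeholders, and the two measures shown (degree and weighted strength) are simple stand-ins, not the ranking method used in the study.

```r
library(igraph)

# Hypothetical holdings: one row per investor-firm stake.
# `weight` is a placeholder for any stake size or emissions-based measure.
holdings <- data.frame(
  investor = c("Investor A", "Investor A", "Investor A",
               "Investor B", "Investor B", "Investor C"),
  firm     = c("Firm X", "Firm Y", "Firm Z",
               "Firm X", "Firm Y", "Firm Z"),
  weight   = c(5, 3, 2, 4, 1, 1)
)

# Build an undirected graph; the third column becomes the edge weight.
g <- graph_from_data_frame(holdings, directed = FALSE)

# Mark the two node sets so the graph can be treated as bipartite.
V(g)$type <- V(g)$name %in% holdings$firm  # TRUE = firm, FALSE = investor

investors <- V(g)[!V(g)$type]
ranking <- data.frame(
  investor    = investors$name,
  firms_held  = degree(g, v = investors),       # number of firms owned
  total_stake = strength(g, vids = investors)   # sum of edge weights
)
ranking <- ranking[order(-ranking$total_stake), ]
```

Sorting by a weighted measure rather than a plain count of holdings is one way to surface a small set of investors whose combined stakes dominate the network.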

Valeria Duran and Emily Zhang

Fred Hutch, Seattle, WA

Session: Creating an R Environment for a Team

The Challenges of Legacy Code

Legacy code is code written by previous programmers, and nearly every programmer comes across it when starting a new task. Working with unfamiliar code, especially in the absence of its creator, can be tedious and stressful, and without the proper steps and precautions it can often add to the pile of legacy code. There are multiple aspects to consider, such as verifying that the code runs, understanding the ins and outs of the code well enough to make effective updates or enhancements for new use cases, and finding ways to avoid creating more legacy code. To address this issue, many steps can be put in place to make legacy code easier to handle, such as running the code on test data, reviewing code, comparing code with specifications, and using validated tools and packages for consistency and version control. In this talk, we will present an overview of how SCHARP has implemented protocols for working with legacy code.