Students often face significant technical challenges in their research and coursework. Two new initiatives have provided an opportunity to share needed know-how.
Photo: tec_estromberg, Flickr Creative Commons.
Course, Boot Camp Speed Students’ Know-How in Research Computing and Data Management
It’s 9 am on a Friday, and a group of 16 faculty, staff, and students is logged into computers in the Health Sciences Library, watching Systems Administrator Brian High demonstrate uses of the data analysis software R, such as plotting the exact route a cyclist took around the city using GPS data from his or her mobile phone.
The R class is day five of a week-long workshop called “Computing Boot Camp,” developed and taught September 12-16, 2016, by the Information Technology (IT) team in our department. The workshops are an adaptation of a one-credit course called Research Computing and Data Management, offered in 2015 and 2016 and taught by Brian High and others in partnership with Professor Lianne Sheppard.
“We as a department are making sure our students are data-savvy.”
“We as a department are making sure our students are data-savvy,” Sheppard said. “We have all the infrastructure and capability but we are missing the collective knowledge. The Boot Camp [and before that the Research Computing class] provides—in a structured way—the opportunity to share that knowledge.”
The amount of data available is dizzying. A Fitbit counts the steps its owner takes each minute of every day. Local air quality monitors measure levels of pollen and particulate matter in the air. Weather conditions are tracked and recorded online. The U.S. Census details population density, while the U.S. Department of Transportation maps highways and transportation infrastructure.
As a result, students often face significant technical challenges in their research and coursework; they find that computing and programming are necessary for accessing, managing, and analyzing large amounts of data to answer an important research question.
A graduate student, for example, wanted to download and save all the real-time vehicle and pedestrian wait times at the U.S.-Mexico border crossing at San Ysidro, California, which were posted on the U.S. Customs and Border Protection’s website. The data could help her estimate how wait times influenced a person’s exposure to traffic air pollution. Brian High helped her with a solution. He wrote a script, in a programming language called Python, that “scraped” the web content, parsed the relevant information, and converted it to a CSV format; he then set up the “cron” utility to run the script at regular intervals and save the data to a file. From there, the data could be read into a data analysis application, such as Microsoft Excel or R.
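A pipeline like the one High built can be sketched in a few lines of Python. The page layout, field names, and sample HTML below are hypothetical stand-ins, not the actual Customs and Border Protection feed; the real script would fetch the live page before parsing it.

```python
import csv
import io
from html.parser import HTMLParser

# Hypothetical sample of the kind of HTML a wait-time page might serve;
# the real page's structure and fields would differ.
SAMPLE_HTML = """
<table>
  <tr><td>2016-09-12 08:00</td><td>Vehicles</td><td>45</td></tr>
  <tr><td>2016-09-12 08:00</td><td>Pedestrians</td><td>20</td></tr>
</table>
"""

class WaitTimeParser(HTMLParser):
    """Collects the text of each <td> cell, grouped by table row."""
    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []          # start collecting cells for this row

    def handle_endtag(self, tag):
        if tag == "tr" and self._row:
            self.rows.append(self._row)
            self._row = None

    def handle_data(self, data):
        if self._row is not None and data.strip():
            self._row.append(data.strip())

def scrape_to_csv(html_text):
    """Parse wait-time rows out of the HTML and return them as CSV text."""
    parser = WaitTimeParser()
    parser.feed(html_text)
    buf = io.StringIO()
    writer = csv.writer(buf)
    writer.writerow(["timestamp", "lane_type", "wait_minutes"])
    writer.writerows(parser.rows)
    return buf.getvalue()
```

In a real deployment, the script would download the page (for example with `urllib.request`) and a crontab entry would append each run’s output to a growing file, giving a regular record of wait times over time.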
Even when data are available and accessible, some measurements have to be “cleaned up,” explained PhD student Graeme Carvlin, who was enrolled in the Research Computing class in Spring 2015. There are mistakes in the data: an instrument glitch, an accidental misspelling, or, when comparing two data sets, time spans that are slightly off. These differences pile up, he said. Graduate students must learn how to manage the data in order to use it, which is difficult for those without a computing background.
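The kinds of cleanup Carvlin describes can be illustrated with a short Python sketch. The readings, site names, and the -999 error code below are invented for illustration; real instruments and datasets would have their own conventions.

```python
from datetime import datetime, timedelta

# Hypothetical raw readings: (timestamp, site name, measured value).
# One site name is misspelled; one value is an instrument error code.
raw = [
    ("2016-09-12 08:00:03", "Seattle", 12.1),
    ("2016-09-12 08:59:58", "seatle", 14.7),   # misspelled site name
    ("2016-09-12 10:00:02", "Seattle", -999),  # assumed instrument error code
]

SITE_FIXES = {"seatle": "Seattle"}  # known misspellings and their corrections

def clean(rows):
    """Drop error codes, fix site names, and round timestamps to the hour
    so that two datasets recorded on slightly different clocks can be joined."""
    out = []
    for ts, site, value in rows:
        if value == -999:           # sentinel the instrument emits on failure
            continue
        site = SITE_FIXES.get(site.lower(), site)
        t = datetime.strptime(ts, "%Y-%m-%d %H:%M:%S")
        hour = t.replace(minute=0, second=0, microsecond=0)
        if t.minute >= 30:          # round to the *nearest* hour
            hour += timedelta(hours=1)
        out.append((hour.strftime("%Y-%m-%d %H:00"), site, value))
    return out
```

The point is not any one fix but the pattern: each known flaw in the data gets an explicit, repeatable rule, so the cleanup can be rerun whenever new data arrive.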
“If you’re given this problem with no idea how to approach it, and no idea where to go for resources, that’s where Brian’s class comes in,” Carvlin explained. “It’s been an important tool, getting general awareness out there about what to do in these situations.”
He has plenty of experience managing data. Working with Associate Professor Edmund Seto, Carvlin was investigating variations in air pollution across Seattle using a Dylos laser air particle monitor that he set up in 10 different residential locations. He compared data from these “microclimates” to the data available from the Puget Sound Clean Air Agency’s fixed air pollution monitors on Beacon Hill, in South Park, and in the International District and Duwamish Valley.
The Dylos monitor captured particulate matter, along with air temperature and humidity levels, at 10-second intervals; the Clean Air Agency’s data was captured hourly. Even though Carvlin already had some programming knowledge, he learned a lot from the class, such as when to use Python and when R is preferable: he found Python better for cleaning the data and R better for analyzing it. He also used OpenRefine, which he called a “beefed-up version of Excel,” to visualize and sort through the thousands of measurements.
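Reconciling 10-second monitor readings with hourly agency data comes down to binning the fine-grained readings by hour and averaging each bin. A minimal Python sketch, with made-up readings standing in for the Dylos output:

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical 10-second readings: (timestamp, particle count).
readings = [
    ("2016-09-12 08:00:00", 310),
    ("2016-09-12 08:00:10", 330),
    ("2016-09-12 09:00:00", 500),
]

def hourly_means(rows):
    """Average fine-grained readings into hourly bins so they can be
    compared directly with an hourly dataset."""
    bins = defaultdict(list)
    for ts, value in rows:
        t = datetime.strptime(ts, "%Y-%m-%d %H:%M:%S")
        bins[t.strftime("%Y-%m-%d %H:00")].append(value)
    return {hour: sum(vals) / len(vals) for hour, vals in sorted(bins.items())}
```

Once both datasets share the same hourly timestamps, they can be joined on time and compared site by site.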
The Boot Camp was a condensed version of the Research Computing course. In addition to Brian High, IT Director Jim Hogan, Systems Administrator John Yocum, and Senior Computing Specialist Elliot Norwood also taught sessions through the week.
The first day covered the basics of troubleshooting computer problems and how to protect against malware and viruses. The second day covered tools to use for data storage and simple scripting and automation. The third day focused on connecting to servers and working remotely as well as identified security concerns with storing data in the cloud. The fourth day delved more deeply into computing and data manipulation, and the last day was a how-to workshop on using R.
Sessions of the Boot Camp were video recorded using Panopto and are available for department students to view on Portal (department log-in required).