Organizing and running bioinformatics hackathons within Africa: The H3ABioNet cloud computing experience

The need for portable and reproducible genomics analysis pipelines is growing globally as well as in Africa, especially with the growth of collaborative projects like the Human Health and Heredity in Africa Consortium (H3Africa). The Pan-African H3Africa Bioinformatics Network (H3ABioNet) recognized the need for portable, reproducible pipelines adapted to heterogeneous computing environments, and for the nurturing of technical expertise in workflow languages and containerization technologies. Building on the network’s Standard Operating Procedures (SOPs) for common genomic analyses, H3ABioNet arranged its first Cloud Computing and Reproducible Workflows Hackathon in 2016, with the purpose of translating those SOPs into analysis pipelines able to run on heterogeneous computing environments and meeting the needs of H3Africa research projects. This paper describes the preparations for this hackathon and reflects upon the lessons learned about its impact on building the technical and scientific expertise of African researchers. The workflows developed were made publicly available in GitHub repositories and deposited as container images on Quay.io.



Introduction
As an inherently interdisciplinary science, bioinformatics depends upon complementary expertise from biomedical scientists, statisticians and computer scientists 1 . This opportunity for collaborative projects also creates a need for avenues to exchange knowledge 1 . Hackathons, along with codefests and sprints, are emerging as an efficient means for driving successful projects 2 . They can be in the form of science hackathons that aim to derive research plans and scientific write up 3 , community-driven software development 4 , and data hackathons or datathons 5 . In addition to the scientific and technical outcomes, these intensive and focused activities offer necessary skills development and networking opportunities to young and early career scientists.
On the African continent, there is generally limited access to such events. However, with the growing capacity for Africans to generate genomic data, the need for African scientists to analyze these data locally is also growing. H3ABioNet 6 , the Bioinformatics Network within the H3Africa initiative 7 , has invested in capacity building via different approaches 8 . The H3ABioNet Cloud Computing hackathon was a natural extension of the network's efforts in developing Standard Operating Procedures (SOPs) via its Network Accreditation Task Force (NATF) 9 , aimed at building and assessing capacity in genomic analysis. It also follows other efforts by the H3ABioNet Infrastructure Working Group (ISWG) towards setting up infrastructure at various H3ABioNet Nodes at the hardware, software, networking, and staff level. The H3ABioNet Cloud Computing hackathon, therefore, provided an excellent opportunity to assess the computational skills capacity development of the network through training, learning and adoption of novel technologies (Figure 1). These technologies included workflow languages for reproducible science, containerization of software, and creation of computational products that can be used in the heterogeneous computing environments encountered by African and international scientists in the form of standalone servers, cloud allocations and High-Performance Computing (HPC) resources.
In this paper, we discuss the organization of the H3ABioNet Cloud Computing hackathon, the interactions between the participants, and the lessons learnt. Baichoo et al. 10 describe the technical aspects of the pipelines, whereas the code and pipelines themselves have been made publicly available via H3ABioNet's GitHub page in the following repositories: (h3agatk, h3abionet16S, h3agwas and chipimputation) as well as container images hosted on Quay.io.

Context, rationale and impact
For a healthy and strong scientific community, knowledge sharing activities, such as hackathons, are paramount. While instrumental to collaboration and efficient in developing solutions to shared problems, such activities are limited within Africa.
The H3ABioNet consortium aims to build a coherent and strong bioinformatics community within Africa that can technically support H3Africa projects for within-Africa analysis of African data. A network of more than 27 nodes, H3ABioNet unites researchers from 15 African countries, in addition to a node in the US. Establishing a baseline where each node had sufficient computational infrastructure to carry out genomics analyses was (and still is) one of the key deliverables of the consortium. Consortium projects like Netmap helped to achieve this goal by evaluating network connectivity between the participating nodes, and also led to infrastructure upgrades where warranted 11 .
Consequently, the primary value of the H3ABioNet cloud computing hackathon was to expose African scientists to the practical aspects of community development of computer code and to try to create a community around the maintenance of a set of workflows that implement methods that are useful to the H3Africa research community and beyond.
More pragmatically, the workflows developed in the hackathon serve as practical implementations of the Standard Operating Procedures for the H3ABioNet Accreditation Exercises, which are used to evaluate the capacity of African research groups in analyzing complex genomic datasets, like those being produced by various H3Africa research projects 9 . Success in taking one of the exercises is considered a landmark for African groups who are preparing to step into the existing gap between data production and data analysis, where the analysis is typically undertaken by First World groups.
Today, those implemented pipelines have been used for data analysis within the context of H3Africa projects, and/or incorporated into H3ABioNet training materials. Table 1 below highlights the significance of each developed pipeline, along with some technical notes about its implementation and availability. An extensive technical evaluation and trajectory of development is found in 10 .

Amendments from Version 1
We would like to extend our sincerest gratitude to the reviewers for their comments and constructive criticism on our article entitled "Organizing and running bioinformatics hackathons within Africa: The H3ABioNet cloud computing experience". We have carefully and thoroughly evaluated all the comments and addressed them as necessary in the current version of our revised article. We do hope that we have tackled the issues raised in the comments to standards that meet your approval. Below is a brief summary of the main revisions to our article: We would like to once again express our gratitude and appreciation to the reviewers for their comments on our article. Please feel free to contact us for any further queries.

H3ABioNet Cloud Computing Hackathon Activities
Prior to the H3ABioNet Cloud Computing hackathon, H3ABioNet, via its Infrastructure Working Group (ISWG), formed a Cloud Computing task force to investigate cloud computing technologies, familiarize H3ABioNet members with current cloud implementations and gauge their suitability for H3Africa data analyses. The H3ABioNet Cloud Computing hackathon was one of the first deliverables of this task force, with the specific objective to test and implement four analysis workflows that can be ported to multiple computing platforms. Figure 1 shows this hackathon within the broader H3Africa context and provides a broad overview of the planning and execution of this activity, with details in the following subsections.

Pre-hackathon preparations
The computational pipelines put forward for development during the H3ABioNet Cloud Computing hackathon were identified based on the data being generated by different H3Africa projects and the SOPs used for the H3ABioNet Node Accreditation exercises. Reproducibility and portability were also identified as key features for the workflows, due to the heterogeneous computational platforms available in Africa. H3ABioNet Nodes that used or helped develop current H3ABioNet workflows and SOPs were part of the planning team, as well as other nodes that had technically strong scientists who were willing to extend their skills.
In the course of planning for the H3ABioNet Cloud Computing hackathon, two technical areas were identified where additional expertise was required. These were containerization technology such as Docker, and the writing of genomic pipelines in popularly used workflow languages and newly emerging community standards like Nextflow 12 and the Common Workflow Language (CWL) 13 , respectively. While expertise for Nextflow already existed within the network, two collaborators from outside Africa were interested in joining the project given their expertise in cloud environments, containerization of code 14 and developing CWL 13 . They subsequently joined the planning and participated in the hackathon. They were also invited as guest speakers in the network's monthly webinar series, where they shared some of their experiences in these areas with the broader H3ABioNet consortium.
The H3ABioNet Cloud Computing hackathon was announced on the internal H3ABioNet consortium mailing list as a call for interested applicants and, in some cases, individuals were invited based on their specific expertise. Most of the participants selected were early career scientists with strong computational skills, an understanding of genomic pipelines and willingness to work in teams. The pipelines for the Cloud Hackathon were divided into four "streams": 1) Stream A: variant calling from whole genome sequencing (WGS) and whole exome sequencing (WES) data (https://github.com/h3abionet/h3agatk), 2) Stream B: 16S rDNA diversity analysis (https://github.com/h3abionet/h3abionet16S), 3) Stream C: genome-wide association studies (Illumina array data) (https://github.com/h3abionet/h3agwas) and 4) Stream D: SNP imputation and phasing using different reference panels (https://github.com/h3abionet/chipimputation). Successful applicants were given a choice to select a project stream based on their skills and interest or, if unsure, were assigned to a specific stream. Streams A and B decided to use CWL for their pipeline development, whereas Streams C and D opted to use Nextflow due to their prior experience with it.
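The stream assignments described above can be summarized as a small lookup structure. The following is only an illustrative sketch: the `Stream` class and the `streams_using` helper are hypothetical names introduced here, while the stream labels, repository URLs and workflow languages are taken from the text.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Stream:
    """One hackathon stream (hypothetical record type, for illustration)."""
    label: str
    analysis: str
    repository: str
    workflow_language: str


# Values below are copied from the description of the four streams.
STREAMS = [
    Stream("A", "Variant calling from WGS and WES data",
           "https://github.com/h3abionet/h3agatk", "CWL"),
    Stream("B", "16S rDNA diversity analysis",
           "https://github.com/h3abionet/h3abionet16S", "CWL"),
    Stream("C", "Genome-wide association studies (Illumina array data)",
           "https://github.com/h3abionet/h3agwas", "Nextflow"),
    Stream("D", "SNP imputation and phasing",
           "https://github.com/h3abionet/chipimputation", "Nextflow"),
]


def streams_using(language: str) -> List[str]:
    """Return the labels of the streams that chose the given workflow language."""
    wanted = language.lower()
    return [s.label for s in STREAMS if s.workflow_language.lower() == wanted]


if __name__ == "__main__":
    for stream in STREAMS:
        print(f"Stream {stream.label}: {stream.analysis} "
              f"[{stream.workflow_language}] {stream.repository}")
```

Calling `streams_using("CWL")` lists Streams A and B, and `streams_using("Nextflow")` lists Streams C and D.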
Stream membership respected participants' own interests, but we also sought to have streams of balanced composition. This included bioinformaticians with knowledge in the specific genomic analyses and computational tools required, strong computational skills to create the Docker containers and implement workflows, and strong system administration skills to assist with the installation of numerous software components as needed.
We also included bioinformaticians with experience in running the workflows or components of the workflows, and software developers who could assist with creating Docker containers, troubleshoot and implement workflow languages (CWL was still in draft-2 at the time of the hackathon, and some language features were added based on our experience).
To maximize the learning experience, upon selection, participants were given prerequisite tutorials and materials (GitHub, Nextflow, CWL, Docker and the SOPs) to go through. Communication and planning infrastructure in the form of Slack channels and Trello boards was created beforehand, with all the participants added, to allow them to brainstorm and share ideas with team members before the hackathon began (Table 2). Fortnightly planning meetings were held starting 3 months in advance so that hackathon participants could get involved in planning their proposed tools, get to know one another and develop a working rapport before the start of the hackathon.
The hackathon ran in August 2016 and was hosted at the University of Pretoria Bioinformatics and Computational Biology Unit in South Africa. The choice of the hackathon venue was based on the availability of Unix/Linux desktop machines with the facility for sudo/root access enabling participants to install software and deploy Docker containers for testing. Besides the local machines, participants also had access to cloud computing platforms such as Azure and Amazon, Nebula (made available by the National Center for Supercomputing Applications, University of Illinois at Urbana-Champaign), and the African Research Cloud (through a collaboration with the University of Cape Town eResearch initiative). After the hackathon, more testing was also done on EGI Federated Cloud resources (as a courtesy allocation to the University of Khartoum).

Hackathon week activities
The initial day of the H3ABioNet Cloud Computing hackathon was dedicated to introductions, expectations by the participants and practical tutorials covering the use of CWL, Nextflow and creation of Docker containers to ensure all participants had the same basic level of knowledge. The teams had a breakout session where overall milestones for the streams during the hackathon week were refined, tasks were identified and assigned to team members and Trello boards updated with the specific tasks. Each stream reported back on their progress and overall work plan for the coming hackathon days. For the remaining days of the hackathon, participants were split into their respective streams to work on developing and containerizing their pipelines as well as creating the related documentation.
To ensure a successful hackathon with concrete outcomes, the streams spent the first 30 minutes of each hackathon day reviewing their prior progress, updating their Trello boards and reporting to the group what they would be working on. At the end of the day, each stream provided a progress report to the whole group on what they had achieved, what they had struggled with and what they would be working on next. The start and end of day reporting proved useful as it allowed groups that had encountered and solved an issue to share the implemented solution with another stream, and for different streams to work together to solve any shared issues encountered, thus speeding up the development of the pipelines. Area experts and collaborators would switch between the streams to provide necessary technical expertise.
Communication during the hackathon was facilitated by Slack integration with Trello (for tasks management and progress tracking) and code developed was pushed to GitHub (for live code integration). Table 2 lists the various communication media used during the hackathon. Some groups also utilized Google docs for documenting their progress prior to migrating documentation into GitHub README files.
Remote participation in the hackathon was facilitated through the MConf conference system. One stream had a participant with very strong coding skills working remotely from the US, who managed to make progress on the corresponding workflow when the other group members were not working, owing to the large time difference between the USA and South Africa (SA). This ensured continuous development of the workflow: when the team in SA clocked off, they provided a to-do list, which was then completed by the participant in the US. Noticeable during the hackathon was the team spirit created and the increasingly later end time for the days (with most days ending at 8:30 pm as participants continued working after the different streams provided their daily reports). All participants wished for an extra day or two to complete their pipelines.

Post-hackathon activities
After the week-long hackathon at the University of Pretoria, members of each stream continued working on their respective pipelines communicating via Slack and Trello. Meetings were held over MConf every two weeks to report on the progress of each pipeline. Upon completion, each group handed their pipeline to other groups to test on different platforms, and thereby avoid bias in implementation and improve the documentation. Consequently, this facilitated the use of the four pipelines developed within H3Africa projects as highlighted in Table 1.

Discussion
The H3ABioNet Cloud Computing Hackathon was aimed at producing portable, cloud-deployable Docker containers for a variety of bioinformatics workflows including variant calling, 16S rDNA diversity analysis, quality control, genotype calling, and imputation and phasing for genome-wide association studies. The workflows developed in this hackathon benefited from workflow management systems, and also come with Docker recipe files that can be used to build container images when downloading images might be an issue. Thus, Dockerization provided a method to package and manage software, tools and workflows within a portable environment/container, similar to virtualization but with a smaller computing overhead. The novelty of the H3ABioNet Cloud Computing Hackathon was that all the participants selected were involved in the latter stages of the planning and the setting of some of the outcomes for the hackathon. Critical recommendations during the hackathon planning meetings were that the resulting Docker containers and pipelines should be compatible with heterogeneous African research compute environments, with portability and good documentation being key. This is especially important considering that access to Cloud computing environments within Africa is still in its infancy.
Hence, it was decided that development and testing of the pipelines should occur on a single machine, with the ability to be ported to a cluster or an HPC environment, and ultimately tested and deployed on cloud-based platforms (Amazon, Microsoft Azure, EGI FedCloud, IBM Bluemix, and the new African Research Cloud initiative).
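To make the idea of targeting a single machine first, then a cluster or HPC system, more concrete, the sketch below probes a host for a batch scheduler and a container runtime. It is an illustration under stated assumptions and not part of the hackathon pipelines: the `detect_environment` function, the choice of `sbatch` and `qsub` as scheduler markers, and the backend labels are all hypothetical.

```python
import shutil
from typing import Callable, Dict, Optional

# Assumption: the presence of these commands is used as a rough marker
# for a batch scheduler; the labels on the right are illustrative only.
SCHEDULER_COMMANDS = {"sbatch": "slurm", "qsub": "pbs"}

# Container runtimes to look for, in order of preference (an assumption).
CONTAINER_COMMANDS = ("docker", "singularity")


def detect_environment(
    which: Callable[[str], Optional[str]] = shutil.which,
) -> Dict[str, Optional[str]]:
    """Report which scheduler and container runtime appear to be available.

    `which` is injectable so the logic can be exercised without the
    actual tools being installed.
    """
    scheduler = "local"  # default: run on the single machine at hand
    for command, label in SCHEDULER_COMMANDS.items():
        if which(command):
            scheduler = label
            break

    container = None
    for command in CONTAINER_COMMANDS:
        if which(command):
            container = command
            break

    return {"scheduler": scheduler, "container": container}


if __name__ == "__main__":
    print(detect_environment())
```

A wrapper script could use such a report to decide how to launch a workflow on a given host; how each workflow engine is actually configured for a scheduler or a cloud platform is documented in the respective pipeline repositories.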
Besides contributing solutions to African problems, three factors contributed to the success of this highly ICT-based activity in an African setting: 1) Almost all the communications tools used (Table 2) had equivalent apps that work right off a smartphone, a feature that many people within Africa (and less developed countries) tend to make use of 17 .
2) The tools used were complementary to each other, and integration was sought whenever possible (for example, between Slack and Trello).
3) The hackathon was timed at the end of the 4th year of the initial H3ABioNet round of funding. At that point, the consortium (via its Infrastructure Working Group) had already invested in improving the computational infrastructure within the network 11 , including tools for regular communications and webinars 18 . In a sense, Table 2 also represents our vetted list of collaborative tools in the light of 4 years of feedback from the consortium.

Lessons learnt and concluding remarks
The opportunity to link people physically and focus solely on one project has been highly effective in providing the main outline and proof of concept outputs. However, once people were back home, continuing the tasks has been a challenge. Clearly defining the roles and commitment of all the participants in the papers reporting the results should encourage them to complete the work, and increase their accountability.
The communication and management tools used for this hackathon (Table 2) were important as these tools facilitated interaction between and across team members and enabled the participants to continue to work in a structured manner once back at their respective institutions, despite time zones differences.
The H3ABioNet Cloud Computing Hackathon has been an important milestone for the Network as it brought together people with various skills to work on focused projects. It signalled the shift from capacity building to utilizing the capacity developed in order to tackle problems specific to the heterogeneous African computing environments, as defined and implemented by the mostly African participants. Equally important, this hackathon was not done in isolation from the rest of the scientific community nor could it have succeeded without local collaborations. This aspect, i.e. welcoming input and actively seeking it when needed from outside the consortium, is key to truly empowering the local community.
As software packages and computing environments evolve with varying build cycles and new bioinformatics tools become available, we envision further hackathons to keep these pipelines current, adopt new technology implementations such as Singularity, and develop new workflows such as for RNA-Seq analysis. The pipelines developed during the H3ABioNet Cloud Computing hackathon will be used for training and data analyses for intermediate level bioinformatics workshops, and for scientific collaborations requiring bioinformatics expertise for data analysis such as with the H3Africa genotyping chip and GWAS analyses. Future H3ABioNet hackathons would also provide an opportunity to utilize the skills of trained bioinformaticians at intermediate and advanced levels, who would not otherwise attend bioinformatics training workshops, to come together to derive practical solutions that are of benefit to the African and wider scientific community.

Data and software availability
All data underlying the results are available as part of the article and no additional source data are required.

Grant information
H3ABioNet is supported by the National Institutes of Health Common Fund [U41HG006941]. H3ABioNet is an initiative of the Human Health and Heredity in Africa Consortium (H3Africa) programme of the African Academy of Sciences (AAS). The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health.
The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

does it explain the meaning of the arrows. According to the AAS guidelines, "the legend should be sufficiently detailed so that it can stand alone from the main text". Additionally, it is not clear from first glance that "4 portable computational workflows" is the goal of the hackathon. This could be made more clear. Table 1 provides a nice overview of communication channels. Can you elaborate and add what tools you used for sharing documents (e.g notes, slides, pdf)?
On page 4, the authors state: "Vital in setting up the teams...". Does this paragraph refer to the expertise of the learners/participants or to the people who are leading the stream or both?
On page 5, the first sentence of the discussion provides the first clear statement of the aim of the hackathon (in my opinion). The goal is mentioned in the abstract and intro, but it isn't as clear. It is in this paragraph that I realized that you had working pipelines, but they were not "dockerized". After reading this, the figure made a lot more sense. I recommend revising the abstract and intro to make it clear what the starting point (5 bioinformatic workflows not in the cloud) and the endpoints (4 bioinformatic workflows in the cloud).
The paragraph on "Post-hackathon feedback and actions" seems incomplete or perhaps is mislabelled. This paragraph describes communication and work that extended past the hackathon, but it does not describe any assessment or feedback mechanisms that were used. Also, what happened after groups traded platforms? Was the documentation improved or was this simply the goal?
The article jumps from "Introduction" to "Discussion" without providing a clear description of some of the results or outcomes. Is there a reason why some statistics regarding participation or progress toward the goal are not reported? If there is a reason why demographic information cannot be provided, that is fine, but a report about how much progress was made toward the goal of dockerization would be useful.
While I appreciate that you provided a link to the GitHub repo and Docker container, I highly recommend getting a DOI for these repositories. The AAS data guidelines page provides a list of providers. Some of your GitHub repositories already have versions, so it should be fairly easy to import these into Zenodo for a DOI. https://aasopenresearch.org/for-authors/data-guidelines.

Are sufficient details provided to allow replication of the method development and its use by others? Partly
If any results are presented, are all the source data underlying the results available to ensure full reproducibility?

No source data required
Are the conclusions about the method and its performance adequately supported by the

Steffen Möller
Institute for Biostatistics and Informatics in Medicine and Ageing Research, Rostock University Medical Center, Rostock, Germany The paper describes an in-person meeting in Africa to further develop workflows for genomic analyses and distribute that skill throughout the continent. As a Northern European I can only remotely assess the difficulties of computational biology and bioinformatics services in Africa in local areas. I can see how important such Hackathons are to edutRain research groups with difficulties to attend international conferences. And, with some experience in attending and organizing such events, I was very curious about what may be different in Africa. After all, in Europe there are communities having difficulties to commute, too. The European Union has extra funds for graduate students from Eastern European countries for instance. But there are also highly talented high school students who are under age or do not have the funds or scientific contacts to travel/be trained. So, if you have developed principles for African talents to overcome such obstacles to learn about such Hackathons and prepare for them, then I would be very much interested to see what we can adopt over here.
From what I then read in the article, there was not so much that the authors did differently, except that they seem to have done it particularly well. Table 1 describes a whole bunch of communication channels when typically it is a mere Wiki or co-editing site orchestrating the participants. Still, every attendee had to physically attend, except that the event was in Africa describing an African set of resources and there was external expertise flown in. I tend to think that here the event fell short of what could have been (and is often) done to invite remote participation. Also, one could videograph training material to support the further distribution of these Africa-specific analysis skills.
I am also a bit critical about the dominance of online resources that demand good internet connections as in the download of gigabytes of data: Docker. Here I wish more would be done to promote offline services. But I am biased, the authors NM and MRC know me as a contributor to Debian Med, MRC being a Debian contributor himself, which basically means that participants could take a DVD home and perform analyses of their samples without Internet access involving as many local-to-them machines as they like. For the workflows one doesn't need most of the complicated bits for which one needs computational expertise the article is describing. And that may have helped the post-Hackathon drop in participation, while, of course every event sees that drop and that is a main reason to have such a dedicated time to jointly develop our research environments in the first place.
Concerning the scientific results, I understand that there is a separate paper prepared. Still, can you say as much as if there are scientific papers out there that already employed the pipelines for their analysis? I mean, from the time before the Hackathon? This would emphasize that you are indeed redistributing a very current set of skills with practical acceptance in the community.
There is something else to it all. Hackathons form a social network of trust. And you need trust in locally well-described samples. The genetic diversity of Africa is a gem, but one needs to be aware of the demands for population stratification. And because of the harsh environment conditions, one can expect considerable batch effects on samples taken, for which the emphasis on equally collected control samples is important. You can certainly read about it all, but it will help to hear in person about that study that was ruined because cases and controls were kept separate and one box had the dry ice evaporated early; factor variation and confounding technical parameters cannot be communicated enough. Well-established workflows allow participants to analyse their local data independently, have extra parameters matching local concerns, with matching local controls, which will all improve the quality of the pan-African study at large. So, I do not think that anything performed for the organisation of this Hackathon was specific to Africa. On the contrary, there should have been some teleconferencing to it. The "initial day to get everyone up to speed" reminds me a bit more of a Summer School than a Hackathon, for which it is not uncommon that most participants already know each other for some time and would not need that day. Could you possibly elaborate a bit more on how the work performed was structured? And how was the external expertise intermingled with your local needs for Africa? I would want to retitle the paper towards something like "Hackathon on workflows for genomics held in Africa". The H3ABioNet workflows are nicely accepted everywhere, I tend to think, so, why not have an H3ABioNet workflow Hackathon in Europe with some Africans flying in? Clarifying that a bit more may help the paper.

Are sufficient details provided to allow replication of the method development and its use by others? Yes
If any results are presented, are all the source data underlying the results available to ensure full reproducibility? Yes
Are the conclusions about the method and its performance adequately supported by the findings presented in the article? Partly

Competing Interests: No competing interests were disclosed.
I confirm that I have read this submission and believe that I have an appropriate level of expertise to confirm that it is of an acceptable scientific standard, however I have significant reservations, as outlined above.
Author Response 21 Apr 2018
Victor Jongeneel, University of Illinois at Urbana-Champaign, Champaign, USA
Thanks very much to Steffen Möller for his pointed comments. It is certainly true that this hackathon did not differ in any major way from similar events held in high-income countries, and did not incorporate any features specific to the African context. Its primary value was to expose African scientists to the practical aspects of community development of computer code, and to try to create a community around the maintenance of a set of workflows that implement methods that are useful to the H3Africa research community and beyond.
There are aspects of the work that may not have come across in the paper. For one thing, the workflows implementing haplotyping, imputation, and GWAS analysis were based on work done in the framework of the H3Africa AWIGen project, and are in production for the analysis of data generated by this project. Similarly, the workflow for variant calling from WGS data was used in the analysis of 350 African genomes that has led to the design of a novel genotyping chip optimized for African populations, and the 16S rDNA sequence analysis was derived from work done to analyze bacterial populations present in leg ulcers of sickle cell patients in Nigeria. Therefore, all of the code developed during the hackathon was solidly anchored in existing genomic analysis projects in Africa.
Secondly, the workflows developed in the hackathon serve as practical implementations of Standard Operating Procedures for the H3Africa Accreditation Exercises, which are used to evaluate the capacity of African research groups to analyze complex genomic datasets being produced by its research projects (see Jongeneel et al, PLoS Comput Biol, PMC5453403). Success in taking one of the exercises is considered a landmark for African groups who are preparing to step into the existing gap between data production and data analysis, where the analysis is typically undertaken by First World groups.
It is true that the authors, including myself, could have done a better job at explaining how the hackathon and its products are anchored in the H3Africa research ecosystem. I hope that the above clarifies this.
As a final remark, while highly Internet-dependent tools were used extensively during the hackathon, to my knowledge none required a very high bandwidth. At least two of the participants attended remotely from North America, and were able to contribute substantially in part because of their time zone differences and asynchronous contributions.
No competing interests were disclosed.

Steffen Möller
Thank you for your constructive reply to my initial comments. This addresses all my reservations and I am looking forward to a revised upload. SM
Competing Interests: No competing interests were disclosed.