These resources have been contributed and “vetted” by the community of cyberinfrastructure professionals (researchers, research computing facilitators, research software engineers and HPC system administrators) that are participating in programs such as this one, that are supported by the ConnectCI community management platform. Additional Knowledge Base Resources are always welcome!
VisIt is a prominent open-source, interactive parallel visualization and graphical analysis tool predominantly used for viewing scientific data. Its GitHub repository offers a detailed insight into the software's source code, documentation, and contribution guidelines. In particular, it offers useful examples on how it
A tutorial entitled "How the Little Jupyter Notebook Became a Web App: Managing Increasing Complexity with nbdev" presented at SciPy 2023 in Austin, TX. This tutorial is hosted in a series of Jupyter Notebooks which can be accessed in the click of a button using Binder. See the README for more information.
HPCwire is a prominent news and information source for the HPC community. Their website offers articles, analysis, and reports on HPC technologies, applications, and industry trends.
As LLMs get larger fine-tuning to the full extent can become difficult to train on consumer hardware. Storing and deploying these tuned models can also be quite expensive and difficult to store. With PEFT (parameter -efficent fine tuning), it approaches fine-tune on a smaller scale of model parameters while freezing most parameters of the pretrained LLMs. Basically it is providing full performance that which is similar if not better than full fine tuning while only having a small number of trainable parameters. This source explains that as well as going over LORA diagrams and a code walk through.
A tutorial on the effective use of Dask on HPC resources. The four-hour tutorial will be split into two sections, with early topics focused on novice Dask users and later topics focused on intermediate usage on HPC and associated best practices. The knowledge areas covered include (but are not limited to):
Beginner section
High-level collections including dask.array and dask.dataframe
Distributed Dask clusters using HPC job schedulers
Earth Science data analysis using Dask with Xarray
Using the Dask dashboard to understand your computation
Intermediate section
Optimizing the number of workers and memory allocation
Choosing appropriate chunk shapes and sizes for Dask collections
Querying resource usage and debugging errors
pandas is a fast, powerful, flexible and easy to use open source data analysis and manipulation tool, built on top of the Python programming language. It lets you store data in easy to manage and display data frames, with column names and datatypes.
Snakemake is a powerful and versatile workflow management system that simplifies the creation, execution, and management of data analysis pipelines. It uses a user-friendly, Python-based language to define workflows, making it particularly valuable for automating and reproducibly managing complex computational tasks in research and data analysis.
Purdue University is the home of Anvil, a powerful supercomputer that provides advanced computing capabilities to support a wide range of computational and data-intensive research spanning from traditional high-performance computing to modern artificial intelligence applications.
Slurm is an open source, fault-tolerant, and highly scalable cluster management and job scheduling system for large and small Linux clusters. Slurm requires no kernel modifications for its operation and is relatively self-contained. As a cluster workload manager, Slurm has three key functions. First, it allocates exclusive and/or non-exclusive access to resources (compute nodes) to users for some duration of time so they can perform work. Second, it provides a framework for starting, executing, and monitoring work (normally a parallel job) on the set of allocated nodes. Finally, it arbitrates contention for resources by managing a queue of pending work.
The Open Storage Network, a national resource available through the XSEDE resource allocation system, is high quality, sustainable, distributed storage cloud for the research community.
R for Data Science is a comprehensive resource for individuals looking to harness the power of the R programming language for data analysis, visualization, and statistical modeling. Whether you're a beginner or an experienced data scientist, this guide will help you unlock the full potential of R in the realm of data science.
The tutorial is intended to provide a brief overview of the extensive and broad topic of Parallel Computing. It covers the basics of parallel computing, and is intended for someone who is just becoming acquainted with the subject .
The Julia Programming Language is one of the fastest growing software languages for AI/ML development. It writes in manner that's similar to Python while being nearly as fast as C++, while being open source, and reproducible across platforms and environments. The following link provide an introduction to using Julia including the basic syntax, data structures, key functions, and a few key packages.
R GIS packages "rgdal", "rgeos", and "maptools" are package set to be archived and no longer supported by end of 2023. Many other R GIS packages are build on top of these packages, including "sp" and "raster". The packages recommended as replacement for "sp" is "sf" and the replacement for "raster" is "terra". Below are links to published articles regarding this transition. Additionally, I am including links to the documentation for the new packages recommended to be used "sf" and "terra".
An AWS Tutorial for Beginners is a course that teaches the basics of Amazon Web Services (AWS), a cloud computing platform that offers a wide range of services, including compute, storage, networking, databases, analytics, machine learning, and artificial intelligence.
Chameleon is an NSF-funded testbed system for Computer Science experimentation. It is designed to be deeply reconfigurable, with a wide variety of capabilities for researching systems, networking, distributed and cluster computing and security.
The Why & How seminar series is designed to introduce research assistants, graduate students, and postdoctoral and clinical fellows – really, anyone who is interested – to the many tools used in medical imaging. These include software tools and most of the major imaging modalities wielded by investigators (MRI, PET, EEG, MEG, optical, TMS and others). As the name of the series suggests, the talks cover both the reasons researchers might need a particular tool and the nuts and bolts of how to apply it. You can watch videos of the overviews below.
Fastai offers many tools to people working with machine learning and artifical intelligence including tutorials on PyTorch in addition to their own library built on PyTorch, news articles, and other resources to dive into this realm.
This article provides step-by-step instructions on how to build AirSim, a simulator for autonomous vehicles, on Linux. It includes both Docker and host machine setup options, along with details on building Unreal Engine, AirSim, and the Unreal environment. It also provides guidance on how to use AirSim once it is set up.
A guide for Duke OIT on how to advise users on using ACCESS and allocation credits to jetstream 2 for Duke University members. This can be used for non Duke members. Assumes the reader has basic knowledge of ACCESS.