Knowledge Base Resources

These resources have been contributed and “vetted” by the community of cyberinfrastructure professionals (researchers, research computing facilitators, research software engineers and HPC system administrators) that are participating in programs such as this one, that are supported by the ConnectCI community management platform. Additional Knowledge Base Resources are always welcome!

Add a Resource

fast.ai

fast.ai Homepage

Fastai offers many tools to people working with machine learning and artifical intelligence including tutorials on PyTorch in addition to their own library built on PyTorch, news articles, and other resources to dive into this realm.

ai machine-learning pytorch training

0 Likes

Type

website

Level

Regulated Research Community of Practice

Regulated Research Community of Practice

The daily news clearly shows the increasing threat to safety and privacy of data, personal as well as intellectual property. While the requirements such as DFARS 7012, HIPAA, and Cybersecurity Maturity Model Certification (CMMC) improve the consistency of data handling between agencies and contractors and grantees, it leaves academic institutions to figure out how to meet such requirements in a cost-effective way that fits the research and education mission of the institution. Most institutions, agencies, and companies act in isolation with one-off contract language to address data security and safeguarding concerns. Even though cybersecurity has a clear and uniform goal of protecting data, a onesize solution does not fit all academic institutions. By supporting this community with development of a community strategic roadmap, regular discussions and workshops, and a repository of generalized and specific resources for handling regulated research programs RRCoP lowers the barrier to entry for institutions handling new regulations.

community-outreach cybersecurity

0 Likes

Type

website

Level

Reinforcement Learning For Beginners with Python

Reinforcement Learning For Beginners with Python

This course takes through the fundamentals required to get started with reinforcement learning with Python, OpenAI Gym and Stable Baselines. You'll be able to build deep learning powered agents to solve a varying number of RL problems including CartPole, Breakout and CarRacing as well as learning how to build your very own/custom environment!

deep-learning machine-learning tensorflow training programming-best-practices python

0 Likes

Type

video_link

Level

The Official Documentation of Pandas

pandas documentation

Pandas is one of the most essential Python libraries for data analysis and manipulation. It provides high-performance, easy-to-use data structures, and data analysis tools for the Python programming language. The official documentation serves as an in-depth guide to using this powerful tool including explanations and examples.

plotting visualization

0 Likes

Type

documentation

Level

Numpy - a Python Library

NumPY Docs

Numpy is a python package that leverages types and compiled C code to make many math operations in Python efficient. It is especially useful for matrix manipulation and operations.

documentation big-data data-analysis deep-learning opencv pytorch tensorflow data-science

0 Likes

Type

tool

Level

Neurodesk

Neurodesk

Neurodesk provides a containerised data analysis environment to facilitate reproducible analysis of neuroimaging data. Analysis pipelines for neuroimaging data typically rely on specific versions of packages and software, and are dependent on their native operating system. These dependencies mean that a working analysis pipeline may fail or produce different results on a new computer, or even on the same computer after a software update. Neurodesk provides a platform in which anyone, anywhere, using any computer can reproduce your original research findings given the original data and analysis code.

psychology containers software-installation version-control

0 Likes

Type

website

Level

MATLAB bioinformatics toolbox

https://www.mathworks.com/products/bioinfo.html

Bioinformatics Toolbox provides algorithms and apps for Next Generation Sequencing (NGS), microarray analysis, mass spectrometry, and gene ontology. Using toolbox functions, you can read genomic and proteomic data from standard file formats such as SAM, FASTA, CEL, and CDF, as well as from online databases such as the NCBI Gene Expression Omnibus and GenBank.

visualization data-analysis bioinformatics genomics matlab

0 Likes

Type

tool

Level

Spack Documentation

Spack is a package manager for supercomputers that can help administrators install scientific software and libraries for multiple complex software stacks.

spack

0 Likes

Type

documentation

Level

Automated Machine Learning Book

Automated Machine Learning: Methods, Systems, Challenges

The authoritative book on automated machine learning, which allows practitioners without ML expertise to develop and deploy state-of-the-art machine learning approaches. Describes the background of techniques used in detail, along with tools that are available for free.

ai data-analysis deep-learning machine-learning neural-networks python r

0 Likes

Type

learning

Level

Introduction to Parallel Computing Tutorial

Introduction to Parallel Computing Tutorial

The tutorial is intended to provide a brief overview of the extensive and broad topic of Parallel Computing. It covers the basics of parallel computing, and is intended for someone who is just becoming acquainted with the subject .

parallelization

0 Likes

Type

learning

Level

Globus Documentation

Globus Documentation

Globus is a data transfer, sharing, automation, and discovery service used by hundreds of thousands of researchers to manage "big data" at universities, research labs, and national systems such as ACCESS. The Globus documentation website provides how-to guides, reference documentation, and examples for Globus's web application, command-line interface, Python software development kit (SDK), and APIs.

cloud-storage data-sharing data-management data-management-software data-transfer data-wrangling file-transfer globus dtn python data-security data-compliance federated-authentication secure-data-architecture

0 Likes

Type

documentation

Level

Fundamentals of Cloud Computing

Fundamentals of Cloud Computing

An introduction to Cloud Computing

cloud-computing

0 Likes

Type

website

Level

Slurm Scheduling Software Documentation

Slurm Documentation

Slurm is an open source, fault-tolerant, and highly scalable cluster management and job scheduling system for large and small Linux clusters. Slurm requires no kernel modifications for its operation and is relatively self-contained. As a cluster workload manager, Slurm has three key functions. First, it allocates exclusive and/or non-exclusive access to resources (compute nodes) to users for some duration of time so they can perform work. Second, it provides a framework for starting, executing, and monitoring work (normally a parallel job) on the set of allocated nodes. Finally, it arbitrates contention for resources by managing a queue of pending work.

cluster-management cluster-support slurm

0 Likes

Type

website

Level

Introductory Tutorial to Numpy and Pandas for Data Analysis

Numpy and Pandas for Data Analysis

In this tutorial, I present an overview with many examples of the use of Numpy and Pandas for data analysis. Beginners in the field of data analysis can find It incredibly helpful, and at the same time, anyone who already has experience in data analysis and needs a refresher can find value in it. I discuss the use of Numpy for analyzing 1D and 2D multidimensional data and an introduction on using Pandas to manipulate CSV files.

ai big-data data-analysis vectorization

0 Likes

Type

documentation

Level

Set Up VSCode for Python and Github

VSCode for Python plus Github Integration

VSCode is a popular IDE that runs on Windows, MacOS, and Linux. This tutorial will explain how to get set up with VSCode to code in Python. It will also provide a tutorial on how to set up Github integration within VSCode.

git python

0 Likes

Type

learning

Level

Anvil Home Page

Purdue University is the home of Anvil, a powerful supercomputer that provides advanced computing capabilities to support a wide range of computational and data-intensive research spanning from traditional high-performance computing to modern artificial intelligence applications.

anvil

0 Likes

Type

website

Level

Fine-tuning LLMs with PEFT and LoRA

Fine-tuning LLMs with PEFT and LoRA

As LLMs get larger fine-tuning to the full extent can become difficult to train on consumer hardware. Storing and deploying these tuned models can also be quite expensive and difficult to store. With PEFT (parameter -efficent fine tuning), it approaches fine-tune on a smaller scale of model parameters while freezing most parameters of the pretrained LLMs. Basically it is providing full performance that which is similar if not better than full fine tuning while only having a small number of trainable parameters. This source explains that as well as going over LORA diagrams and a code walk through.

faster optimization performance-tuning tuning

0 Likes

Type

video_link

Level

NERSC Training and Tutorials

A comprehensive collection of NERSC developed training and tutorial events, offered on regular schedules. All sessions are archived, including slide decks, video recordings, and software examples as are available. Some examples of past training and tutorial topics are listed below Deep Learning for Sciences Webinar Series BerkeleyGW Tutorial Workshop VASP Trainings Timemory Software Monitoring Tutorial, April 2021 HPCToolkit to Measure and Analyzing GPU Applications Performance Tutorial Totalview Tutorial NVidia HPCSDK - OpenMP Target Offload Training Parallelware Training Series ARM Debugging and Profiling Tools Tutorial Roofline on NVIDIA GPUs GPUs for Science events 3-part OpenACC Training Series 9-part CUDA Training Series

training

0 Likes

Type

learning

Level

Python Tools for Data Science

Python Tools for Data Science

Python has become a very popular programming language and software ecosystem for work in Data Science, integrating support for data access, data processing, modeling, machine learning, and visualization. In this webinar, we will describe some of the key Python packages that have been developed to support that work, and highlight some of their capabilities. This webinar will also serve as an introduction and overview of topics addressed in two Cornell Virtual Workshop tutorials, available at https://cvw.cac.cornell.edu/pydatasci1 and https://cvw.cac.cornell.edu/pydatasci2

ai machine-learning big-data data-analysis data-wrangling data-science training workforce-development python scikit-learn sql

0 Likes

Type

video_link

Level

Raftlib: Open Source library for concurrent data processing pipelines

RaftLib

Raftlib is an open-source C++ Library that provides a framework for implementing parallel and concurrent data processing pipelines. It is designed to simplify the development of high-performance data processing applications by abstracting away the complexities of parallelism, concurrency, and data flow management. It enables stream/data-flow parallel computation by linking parallel compute kernels together using simple right shift operators, similar to C++ streams for string manipulation. RaftLib eliminates the need for explicit usage of traditional threading libraries such as pthreads, std::thread, or OpenMP, which can lead to non-deterministic behavior when misused.

parallelization pthreads openmp

0 Likes

Type

tool

Level

Data Visualization Tools for Julia

Plots.jl is the most widely used plotting library for the Julia programming language. It's known for being especially powerful in its versatility and intuitiveness. It's limited set of dependencies and wide applicability across different graphics packages make it especially helpful in visualizing the results of your latest Julia implementation. However, there are still multiple options available for Julia programmers to visualize their datasets. The second link details a comparison against a variety of Julia packages.

plotting visualization julia

0 Likes

Type

tool

Level

InsideHPC

InsideHPC HomePage

InsideHPC is an informational site offers videos, research papers, articles, and other resources focused on machine learning and quantum computing among other topics within high performance computing.

ai machine-learning community-outreach

0 Likes

Type

website

Level

Warewulf documentation

Warewulf Documentation

Warewulf is an operating system provisioning platform for Linux that is designed to produce secure, scalable, turnkey cluster deployments that maintain flexibility and simplicity. It can be used to setup a stateless provisioning in HPC environment.

documentation administering-hpc distributed-computing hpc-cluster-architecture provisioning containers

0 Likes

Type

website

Level

Data visualization with Matplotlib

Guide to data visualization with matplotlib

Data visualization is a critical aspect of data analysis. It allows for a clear and concise representation of data, making it easier for users to understand and interpret complex datasets. One of the most popular libraries for data visualization in Python is Matplotlib. The included website aims to provide a brief overview of Matplotlib, its features, and examples/exercises to dive deeper into its functionalities.

plotting visualization

0 Likes

Type

website

Level

ACES: Charliecloud Containers for Scientific Workflows (Tutorial)

This tutorial introduces the use of Containers using the Charliecloud software suite. This tutorial will provide participants with background and hands-on experience to use basic Charliecloud containers for HPC applications. We discuss what containers are, why they matter for HPC, and how they work. We'll give an overview of Charliecloud, the unprivileged container solution from Los Alamos National Laboratory's HPC Division. Students will learn how to build toy containers and containerize real HPC applications, and then run them on a cluster. Exercises are demonstrated using the ACES cluster, a composable accelerator testbed at Texas A&M University. Students with an allocation on the ACES cluster can follow along with the ACES-specific exercises.

ACES TAMU scratch lammps tensorflow ondemand gpu nfs slurm bash training python containers

0 Likes

Type

learning

Level

Knowledge Base Resources

Topics

Programming Language

Science Domain

Skill Level

Content Type