Contributed by cyberinfrastructure professionals (researchers, research computing facilitators, research software engineers and HPC system administrators), these resources are shared through the ConnectCI community platform. Add resources you find helpful!
RELION (REgularised LIkelihood OptimisatioN, pronounced rely-on) is a stand-alone software package developed by Sjors Scheres' group at the MRC Laboratory of Molecular Biology. It employs an empirical Bayesian approach for electron cryo-microscopy (cryo-EM) structure determination, specifically for refining multiple 3D reconstructions or 2D class averages.
This Udacity article lists the most frequently used R packages for data science and statistics. For each package, the article links to its official documentation. It is a good starting point if you want to begin your data science journey in R.
PyTorch is an optimized tensor computation library that supports automatic differentiation and is designed to accelerate deep learning research and production on both GPUs and CPUs. Built with flexibility and performance in mind, PyTorch provides a dynamic computational graph and a rich ecosystem of tools for building and deploying deep learning models.
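As a minimal sketch of what automatic differentiation looks like in practice, the snippet below differentiates a simple function at one point. The function and the variable names are illustrative choices, not taken from the PyTorch resource itself, and the snippet assumes the `torch` package is installed.

```python
import torch

# Illustrative function: y = x^2 + 3x, so dy/dx = 2x + 3.
# requires_grad=True asks PyTorch to record operations on x.
x = torch.tensor(2.0, requires_grad=True)

# The computation graph is built dynamically as this line runs.
y = x ** 2 + 3 * x

# Backpropagate from y to fill in x.grad with dy/dx at x = 2.
y.backward()

grad = x.grad.item()
print(grad)  # 2 * 2 + 3 = 7.0
```

Because the graph is recorded while ordinary Python code executes, loops and conditionals can change the computation from one run to the next.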
This workshop covers the different ways Python packages can be managed in a cluster environment using conda and Python virtual environments, both in batch mode from the command line and with Jupyter Notebooks and JupyterLab on the cluster. The examples are run on the GMU HOPPER Cluster.
This beginner-friendly guide introduces Retrieval-Augmented Generation (RAG), a technique to enhance Large Language Models (LLMs) by integrating external data sources. It covers the fundamentals of AI, LLMs, and RAG, providing step-by-step instructions, examples, and visual aids. The guide also discusses tools like Milvus, Faiss, and LangChain, offering a practical approach to building smarter AI systems.
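The retrieve-then-prompt idea behind RAG can be sketched with the Python standard library alone. This is a toy illustration and not the guide's own code: the bag-of-words `embed` function stands in for a real embedding model, the in-memory list stands in for a vector store such as Milvus or Faiss, and all function names and sample documents are hypothetical.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy 'embedding': word counts (a stand-in for a real embedding model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[t] * v[t] for t in u if t in v)
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def retrieve(query, documents, k=1):
    """Return the k documents most similar to the query."""
    q = embed(query)
    ranked = sorted(documents, key=lambda d: cosine(q, embed(d)), reverse=True)
    return ranked[:k]


def build_prompt(query, documents, k=1):
    """Prepend the retrieved context to the question before calling an LLM."""
    context = "\n".join(retrieve(query, documents, k))
    return f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"


# Hypothetical external knowledge base.
docs = [
    "Faiss is a library for similarity search over dense vectors.",
    "Samtools processes SAM and BAM sequencing files.",
    "NumPy provides n-dimensional arrays for Python.",
]

query = "Which library does similarity search?"
prompt = build_prompt(query, docs)
print(prompt)
```

In a real pipeline the resulting prompt would be sent to an LLM, which answers using the retrieved context instead of relying only on what it learned during training.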
This documentation contains introductory material on Python Programming for Digital Humanities and Computational Research. This can be a go-to material for a beginner trying to learn Python programming and for anyone wanting a Python refresher.
A tutorial entitled "How the Little Jupyter Notebook Became a Web App: Managing Increasing Complexity with nbdev" presented at SciPy 2023 in Austin, TX. This tutorial is hosted as a series of Jupyter Notebooks which can be accessed at the click of a button using Binder. See the README for more information.
NumPy is a Python package that uses typed arrays and compiled C code to make many math operations in Python efficient. It is especially useful for matrix manipulation and operations.
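A short sketch of the kind of matrix work this refers to, assuming the `numpy` package is installed. The array values are arbitrary examples.

```python
import numpy as np

# Arrays hold a single element type (float64 here), which lets
# NumPy run operations in compiled code rather than Python loops.
a = np.array([[1.0, 2.0], [3.0, 4.0]])
b = np.array([[0.0, 1.0], [1.0, 0.0]])

product = a @ b           # matrix multiplication
scaled = a * 2            # elementwise arithmetic, no explicit loop
col_sums = a.sum(axis=0)  # reduce down each column

print(product)   # [[2. 1.] [4. 3.]]
print(col_sums)  # [4. 6.]
```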
Network packet processing faces significant performance challenges due to kernel overheads. These issues have become more pronounced with the rapid growth of network traffic. To address these challenges, the Data Plane Development Kit (DPDK) was developed. DPDK bypasses the kernel and operates directly in user space, offering significant improvements in performance and latency for packet processing tasks. However, DPDK's steep learning curve presents a barrier to entry for developers and network administrators. In recent years, P4 has emerged as a language specifically designed for expressing packet processing data paths. Building on this development, P4-DPDK has been introduced as a new technology that bridges P4 and DPDK. It allows developers to create P4 code which is then translated into a DPDK pipeline, combining the expressiveness of P4 with the performance benefits of DPDK. This lab series offers a hands-on experience on the basics of P4-DPDK.
This repository contains information about Jupyter Widgets and how they can be used to develop interactive workflows, data dashboards, and web applications that can be run on HPC systems and science gateways. Easy-to-build web applications are not only useful for scientists. They can also be used by software engineers and system admins who want to quickly create tools for file management and more!
An ongoing collection of RSE training material, workshops, and resources. We are compiling this list as a starting point for future activities. We are especially seeking material that goes beyond basic research computing competency (e.g., what The Carpentries does so well) and is general enough to span multiple domains. Specific tools and technologies used only in one domain, or applicable to only one subset of computing (e.g., HPC), are typically too narrowly focused. When in doubt, submit it to be included or reach out and we’d be happy to discuss.
A master’s degree in data science helps prepare professionals to take the next career step. This article will focus primarily on data science, a graduate degree in this field, and a data scientist or data analyst career. With many employers preferring a master’s degree in data science for those seeking to fill roles as data scientists or analysts, we will discuss the data science master’s degree in detail.
Kaggle Learn is an accessible, hands-on platform offering free, beginner-friendly courses in data science and machine learning. Designed for learners ranging from novices to aspiring professionals, it provides interactive tutorials that focus on practical skills you can apply immediately.
Key Features:
Structured Micro-Courses: Each course is concise, typically taking 2–5 hours to complete, and includes interactive coding exercises.
Real-World Tools: Courses cover essential tools and libraries like Python, Pandas, Matplotlib, SQL, and TensorFlow.
Practice-Oriented Learning: The platform emphasizes learning by doing, allowing you to work directly with code and datasets.
No Setup Required: Courses run in your browser using Kaggle Notebooks, eliminating the need for local installations.
Community Support: Engage with a global community of learners and experts through discussions and shared notebooks.
The research paper provides an overview of various datasets that have been used to study fairness in machine learning. It discusses the characteristics of these datasets, such as their size, diversity, and the fairness-related challenges they address. The paper also examines the different domains and applications covered by these datasets.
This webinar series is an orientation to R. We start with an overview of R’s history and place in the larger data science ecosystem. Next, we introduce the RStudio user interface and how to access R’s excellent documentation. Finally, we present the fundamental concepts you need to use the R environment and language for data analysis. Along the way, we compare R script files (.R) to R Notebook (.Rmd) files and show how the features of R Notebook support better communication and encourage more dynamic engagement with statistical analysis and code. It is helpful to be familiar with tabular data analysis using statistical software, database tools, or spreadsheet programs.
Workshop materials, including setup directions and slides, are available at https://github.com/CornellCAC/r_for_edu/. The RStudio Cloud project used in the workshop is https://rstudio.cloud/project/4044219.
The "Fairness and Machine Learning" book offers a rigorous exploration of fairness in ML and is suitable for researchers, practitioners, and anyone interested in understanding the complexities and implications of fairness in machine learning.
R for Data Science is a comprehensive resource for individuals looking to harness the power of the R programming language for data analysis, visualization, and statistical modeling. Whether you're a beginner or an experienced data scientist, this guide will help you unlock the full potential of R in the realm of data science.
Computing Module: Introduces fundamental concepts and skills of Cyberinfrastructure (CI) and High-Performance Computing (HPC) to lower the barrier to becoming CI users in disaster management research. The module will cover the critical topics of CI and HPC with hands-on sessions.
Disaster Data Module: Introduces concepts of geospatial big data in disaster management. Students will learn how to access and process disaster data.
Geospatial Analytic Module: Introduces geospatial analytics skills to address real-world challenges in disaster management. The module will use the data introduced in the Disaster Data Module and cover various geospatial analytics topics such as geosimulation, spatial optimization, network analysis, terrain analysis, Geospatial Artificial Intelligence (GeoAI), social sensing, and CyberGIS.
Python has become a very popular programming language and software ecosystem for work in Data Science, integrating support for data access, data processing, modeling, machine learning, and visualization. In this webinar, we will describe some of the key Python packages that have been developed to support that work, and highlight some of their capabilities. This webinar will also serve as an introduction and overview of topics addressed in two Cornell Virtual Workshop tutorials, available at https://cvw.cac.cornell.edu/pydatasci1 and https://cvw.cac.cornell.edu/pydatasci2
HPCwire is a prominent news and information source for the HPC community. Their website offers articles, analysis, and reports on HPC technologies, applications, and industry trends.
Samtools is a suite of programs for interacting with high-throughput sequencing data, especially in the SAM/BAM format. It offers various utilities for processing, analyzing, and managing sequence data generated from next-generation sequencing (NGS) experiments. Samtools is widely used in bioinformatics and genomics research for tasks such as sorting, indexing, and filtering read alignments, and for preparing data for variant calling.
Data science is a field that onlookers regard with curiosity and respect. It only recently became hugely relevant, and much about it remains unfamiliar. It is important to get the facts on how expertise in data science is transforming the world. This article examines what a bachelor’s degree means in today’s market and in the future.