These resources have been contributed and “vetted” by the community of cyberinfrastructure professionals (researchers, research computing facilitators, research software engineers and HPC system administrators) that are participating in programs such as this one, that are supported by the ConnectCI community management platform. Additional Knowledge Base Resources are always welcome!
Weka is a collection of machine learning algorithms for data mining tasks. It contains tools for data preparation, classification, regression, clustering, association rules mining, and visualization.
This webinar series is an orientation to R. We start with an overview of R’s history and place in the larger data science ecosystem. Next, we introduce the R Studio user interface and how to access R’s excellent documentation. Finally, we present the fundamental concepts you need to use the R environment and language for data analysis. Along the way, we compare R script files (.R) to R Notebook (.Rmd) files and show how the features of R Notebook support better communication and encourage more dynamic engagement with statistical analysis and code. It is helpful to be familiar with tabular data analysis using statistical software, database tools, or spreadsheet programs.
Workshop materials, including setup directions and slides are available at https://github.com/CornellCAC/r_for_edu/ The Rstudio Cloud project used in the workshop is https://rstudio.cloud/project/4044219.
Scikit-learn is free software machine learning library for Python. It has a variety of features you can use on data, from linear regression classifiers to xg-boost and random forests. It is very useful when you want to analyze small parts of data quickly.
Tableau is a popular and capable software product for creating charts that present data and dashboards that allow you to explore data. It is typically used to present business or statistical data, but can also create compelling visualizations of scientific data. However, scientific data is often generated or stored in formats that are not immediately accessible by Tableau. This seminar will explore the data formats that work best with Tableau and the available mechanisms for generating scientific data in (or converting it to) those formats so that you can apply the full power of Tableau to create the best possible visualizations of your data.
Numpy is a python package that leverages types and compiled C code to make many math operations in Python efficient. It is especially useful for matrix manipulation and operations.
Python has become a very popular programming language and software ecosystem for work in Data Science, integrating support for data access, data processing, modeling, machine learning, and visualization. In this webinar, we will describe some of the key Python packages that have been developed to support that work, and highlight some of their capabilities. This webinar will also serve as an introduction and overview of topics addressed in two Cornell Virtual Workshop tutorials, available at https://cvw.cac.cornell.edu/pydatasci1 and https://cvw.cac.cornell.edu/pydatasci2
Bioinformatics Toolbox provides algorithms and apps for Next Generation Sequencing (NGS), microarray analysis, mass spectrometry, and gene ontology. Using toolbox functions, you can read genomic and proteomic data from standard file formats such as SAM, FASTA, CEL, and CDF, as well as from online databases such as the NCBI Gene Expression Omnibus and GenBank.
This slices and videos introduced how to use K-Nearest-Neighbors method to impute climate data and how to use Bayesian Spatio-Temporal models in R-INLA to impute mortality data. The demos will be added soon.