Training Program to Study Text Data Analytics on HPC Systems

Project Information

ai, natural-language-processing, python
Project Status: Complete
Project Region: CAREERS
Submitted By: Udi Zelzion
Project Email: jim.samuel@rutgers.edu
Anchor Institution: CR-Rutgers

Students: Tanya Khanna

Project Description

The volume of text data has increased tremendously across multiple disciplines.
It has become necessary to develop easy-to-use research frameworks that draw on high performance computing
(HPC) capabilities for research with text data, because analyzing even medium-sized text datasets is
nearly impossible on a typical personal computer. For example, an attempt to run sentiment analysis
algorithms on a social media text data file with just 100,000 records would fail on a computer with 16 GB or less RAM.
Such frameworks need to be beginner-friendly and user-friendly, and need to be customized to Rutgers'
computing environments to benefit researchers, faculty, students, and other users and stakeholders. This
will empower all relevant users to focus on the core aspects of their research rather than struggle with
HPC-related technological challenges.
To put this concept into effect at Rutgers University, we propose the development of standardized
processes for basic multidisciplinary natural language processing (NLP) analyses to support beginners
and current users of the Amarel system.
Our work will focus on preparing Jupyter Notebooks in Python for textual data analysis, NLP, and textual
data visualization. We anticipate producing materials that will help researchers at Rutgers.
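
As an illustration only, a notebook cell for basic textual data analysis might look like the sketch below. It is not taken from the project's materials: the function names, the sample records, and the batch size are all hypothetical, and it uses only the Python standard library. It counts word frequencies while holding one batch of records in memory at a time, which is one simple way to keep memory use bounded on larger files.

```python
import re
from collections import Counter
from itertools import islice

# Hypothetical tokenizer: runs of lowercase letters and apostrophes.
TOKEN_RE = re.compile(r"[a-z']+")


def batched(records, batch_size):
    """Yield lists of up to batch_size records from any iterable."""
    iterator = iter(records)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


def word_frequencies(records, batch_size=1000):
    """Count word tokens, keeping only one batch of records in memory."""
    counts = Counter()
    for batch in batched(records, batch_size):
        for text in batch:
            counts.update(TOKEN_RE.findall(text.lower()))
    return counts


# Made-up sample records standing in for a social media text file.
sample_records = [
    "HPC makes text analysis faster",
    "Text analysis on HPC systems",
    "Notebooks make analysis easier",
]

frequencies = word_frequencies(sample_records, batch_size=2)
top_words = frequencies.most_common(3)
print(top_words)
```

Because `word_frequencies` accepts any iterable, the same function could be given an open file object so that records are read line by line instead of being loaded all at once.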