Hadoop Infrastructure for Big Data Analytics

Project Title: Hadoop Infrastructure for Big Data Analytics

Project Lead’s Name: T M Rajkumar

Email: rajkumtm@miamioh.edu

Phone: (513) 529-4830

Please Choose the Primary Affiliation: FSB

Are There Other Project Team Members?: Yes

Other Project Team Member: Arthur Carvalho

Other Team Member Email Address: arthur.carvalho@miamioh.edu

Does this project focus on graduate student education or graduate student life?: No

If yes, please explain: No. The project focuses primarily on undergraduate students, but the proposed solution would also be used with graduate students. In particular, students in the ISA 636 (Data Management for Analytics) and ISA 625 (Management Information Systems) classes will benefit.

Describe the problem you are attempting to solve and your approach for solving that problem.: Business organizations are increasingly data driven. Managers use data to guide their decisions in running organizations. In addition, many innovations in industry have their foundations in the use of big data. Big data is a term for data sets that are so large and complex (unstructured) that traditional data processing applications cannot deal with them adequately (Wikipedia). Training all our undergraduate and graduate business students to analyze, manage, and work with big data is critical. Within the Farmer School of Business (FSB), we have heard this need articulated by many of our stakeholders (employers, advisory boards, alumni, and others).

Within the ISA department, we have offered big data management courses as part of the analytics program for the last four years at the undergraduate level and for two years at the graduate level. The undergraduate course (ISA 414 Big Data Management) is part of the ISA major within FSB and part of the Business Analytics (BA) co-major. The BA co-major is interdisciplinary and has students from Statistics and Geography in the College of Arts and Sciences. The course thus currently serves students from more than one division.

A significant part of the ISA 414 course involves using a Hadoop system. So far, we have been providing the students with a virtual machine (VM) within FSB's server and desktop systems. The problem is that we currently cannot perform big data analytics, because we cannot load large data sets within the VMs. Hence, we are teaching big data techniques that students actually apply to small data (perhaps 1-10 gigabytes). There is no significant use of truly large data sets (hundreds of gigabytes or a few terabytes).

We currently have fifty students in our classes each semester, and creating that many VMs with large memory sizes (needed for Spark environments) and large disk space requirements is stretching our resources to their limits. The substantial strain on our resources also hampers our ability to support the data and analytic needs of the larger FSB student body and of other divisions that use our virtual desktop cluster.

The problem of supporting big data is expected to worsen as the analytics program within the school and the co-major (which supports Arts and Science students) experience steady growth. In addition, in other classes, such as our ISA 401 (Business Intelligence and Visualization) class, the faculty are starting to work with data sets of 1-5 gigabytes.

FSB has introduced a "BQ" or Business Quotient curriculum for all its incoming students as of the 2016-17 academic year. BQ prepares students to develop computational thinking along with creativity, ideation, and communication skills. Thus, there is a push to build on the computational thinking and communication skills in FSB core classes such as ISA 235 by including more data analysis and visualization skills using Tableau and other software. Our core class, ISA 235, serves about 900-1000 students each year.

Thus, supporting the data (big data) and analytic needs of FSB and of a growing student body within the analytics programs requires us to have an FSB Hadoop cluster with a high-speed shared disk. Such an academic Hadoop cluster would help us meet the academic needs of FSB students and also of students from the College of Arts and Sciences who take our analytics co-major.

Using the cloud to support the classes is an alternative we have considered. We currently use IBM Bluemix in one of our ISA 401 classes this semester (Spring 2017). This is the only cloud vendor that provides student access free of charge for six months. Amazon, which has provided educational grants for the use of its cloud in the past (and from which I have previously obtained grants), has tightened its requirements and no longer readily provides grants for students.

In addition, we obtain data from companies for analysis, which we use in our experiential classes such as ISA 496. The data that faculty obtain often come with confidentiality requirements, which means we need our own cluster, where we can host the data and control access to it.

Our approach to solving these problems is to provide a Hadoop cluster infrastructure that would support storing big data and performing big data analytics, such as with Spark, in our classes. Such an infrastructure would help relieve computing constraints and enable better support for data-driven decision making in all our classes.

The criteria state that technology fee projects should benefit students in innovative and/or significant ways. How would you describe the innovation and/or significance of your project?:

Significant: It enables the development of cutting-edge analytic skills, such as applying machine learning techniques with Spark and analyzing big data, within the ISA major and BA co-major. We graduate about 125-140 majors each year, and about another 200 minors are potentially affected each year.

More importantly, it enables all FSB (core class) students to improve their computational thinking in significant ways by allowing them to visualize and make decisions using big data. Thus, it impacts about 1000 FSB students each year and all FSB students over a four-year cycle.

In addition to serving FSB students in ISA 235, such a system would help introduce the ideas of big data in the BUS104 (Computational Thinking) core class. Currently, the students in that class analyze data as part of their final project. Thus, this project has the potential to impact another 1000 students each year.

It also helps address significant feedback that FSB receives from its stakeholders (employers, alumni, advisory boards), who ask that our students be skilled in modern data analytic techniques. The analytics co-major is not only part of FSB but is an interdisciplinary major with students drawn from the College of Arts and Sciences (both Statistics and Geography).

FSB has a philosophy of not blocking students from other divisions from using its computing facilities. Our Virtualpc environment is primarily for use by FSB students, but we are aware that students from other divisions frequently use our systems. In keeping with that philosophy, if we receive the grant, we would be happy to make our excess capacity available to other divisions for classroom teaching purposes.

How will you assess the project?: We will assess usage of the Hadoop cluster and report its usage across FSB classes, including ISA 414, ISA 401, and ISA 235 (our core class). We will report on the types of big data projects that it has enabled within our classes. We will also report on our students' ability to perform big data analysis by assessing the final projects that they complete in their ISA 414 classes.

Have you applied for and/or received Tech Fee awards in past years?: Yes

If funded, what results did you achieve?: One author (Rajkumar) has received two previous grants. The last, in academic year 2011-12, was a $15,000 grant to support the acquisition of an Apple server for mobile application development in ISA classes. I am pleased to report that we modified our syllabi and now teach mobile development as part of our ISA 403 (Web and Mobile Application Development) class.

Did you submit a final report?: Yes

What happens to the project in year two and beyond? Will there be any ongoing costs such as software or hardware maintenance, supplies, staffing, etc.? How will these be funded?: FSB will provide ongoing hardware maintenance. Staffing that we have within FSB is sufficient to handle the system.

Budget: Hardware

Hardware Title(s) & Vendor(s): Dell Rx630 Computers (Four in number) and EMC VNX 3200 Storage (18 TB)

Hardware Costs: $84,015.00

What is the total budget amount requested?: $84,000.00

Comments:

Computing Nodes: We are using a four-node cluster rather than the eight-node cluster recommended in Dell's reference architecture. I believe it will be sufficient for academic usage.
Dell Rx 630 Nodes: one with 512 GB RAM and three with 128 GB RAM each (10971 + 6600*3) =
(Quote from Dell)
Storage from EMC VNX 3200 Storage (18 TB) - 53,314 (Quote from Roundrock consulting for EMC)
Total: 84,015
Note: This pricing does not include any networking costs, as I did not want to use the HPCC network structure for Hadoop systems. I hope to be able to get FSB to fund this portion if we can get the major chunk of money for the Hadoop system. This budget is an approximate figure. Hence, we are requesting a sum of $84,000.