Web-based Virtual Organic Chemistry Learning Helper

Project Title: Web-based Virtual Organic Chemistry Learning Helper

Project Lead's Name: Benjamin W. Gung

Project Lead's Email: gungbw@MiamiOH.edu

Project Lead's Phone: 513-529-2825

Project Lead's Division: CAS

Primary Department: Chemistry & Biochemistry

Other Team Members:

  • Scott Hartley (Dept of Chemistry & Biochemistry)
  • Dominik Konkolewicz (Dept of Chemistry & Biochemistry)
  • Meredith Erb (Dept of Chemistry & Biochemistry)
  • John Femiani (Dept of Computer Science and Software Engineering)

List Departments Benefiting or Affected by this proposal:

  • Chemistry & Biochemistry
  • Chemical Engineering
  • Biology
  • Psychology
  • Kinesiology
  • Nursing
  • Nutrition

Estimated Number of Under-Graduate students affected per year (should be number who will actually use solution, not just who is it available to): 575

Estimated Number of Graduate students affected per year (should be number who will actually use solution, not just who is it available to): 20

Describe the problem you are attempting to solve and your approach for solving that problem:

Retention of students in science majors remains a problem in the organic chemistry classes (combined enrollment of about 575 in CHM 231 and CHM 241/242). To improve the retention rate and reduce the proportion of D, W, and F grades, we propose to develop a new web-based virtual organic chemistry learning helper.

Organic chemistry is one of those historically difficult courses (those with a high percentage of Ds, Fs, and Ws). The first reason students struggle with the subject is that it involves a huge number of experimental facts that are difficult to absorb. Organic chemistry is the study of organic compounds and materials. It is an enormous area of study that has applications in medicine, biology, engineering, and many other fields. The current estimate is that there are around 20 million known organic compounds.

The second reason organic chemistry is difficult for most students is that it has many different rules and many exceptions. For example, a few weeks into a sophomore organic chemistry course, students face the fundamental concept of reaction mechanisms of the SN1, SN2, E1, and E2 types and are asked to identify the dominant mechanism under a given set of conditions. Which mechanism dominates depends on many factors, including the structures of the reactants and reagents, the solvent, and the temperature. Students often become frustrated by these types of problems. Not only are there more than 20 million organic compounds, but different rules govern the reactions and properties of each of these compounds. A further complication is how often the rules change. Depending on the environment in which a reaction takes place, the same starting material can lead to wildly different results.

Because of the difficulties in learning organic chemistry and the large student/faculty ratio, students cannot get enough timely help from faculty members and teaching assistants. Many students seek individual help by hiring a tutor, and so online tutoring for organic chemistry has become a booming business. Currently, an organic chemistry tutor charges anywhere from $35 to $65 an hour according to the website of wyzant.com (https://www.wyzant.com/organic_chemistry_tutors.aspx). Other websites have tutors asking for up to $115 per hour. Human tutors are expensive and have limited availability. Moreover, there is no guarantee of the quality of the hired tutors. Miami University has a supplemental instruction program (SI), which employs students who have successfully completed the course in the past. SI offers weekly recitation sections that do not provide individualized support. Tutoring support can be difficult to get through Rinella and is of uneven quality. Both the time and effort for a Miami student to hire an online tutor are extremely burdensome.

How can we help our students learn organic chemistry without hiring more faculty and teaching assistants? The answer to this problem is a web-based virtual organic chemistry learning helper. If completely developed, it will be available to help our students learn organic chemistry 24/7, and it would significantly benefit Miami students who take organic chemistry classes. We plan to take advantage of the rapidly evolving field of machine learning and artificial intelligence (AI). For any given topic, machine learning programs will be able to help find a pattern as long as there is sufficient data of adequate quality. We will develop a new AI-assisted website that allows students to ask organic chemistry questions and receive instant answers provided by the AI algorithm. Students will no longer have to wait or ask the professor for the answer. They can simply input their problems into the AI-backed website and get a suggestion. The AI-suggested solution does not provide all aspects of the correct answer but does provide the key points to the question. This should keep the students interested in the problem-solving process and encourage them to develop critical thinking skills.

The new website will cover several layers of organic chemistry knowledge: (1) nomenclature of organic compounds; (2) reaction mechanism classification; and (3) organic reaction predictions. To name one of the 20 million known organic compounds, we have developed a script that matches an input structure (drawn in a JavaScript molecular editor) to the same compound in the PubChem database, which provides an IUPAC name for the compound (Naming-Organic Compounds). The PubChem database is maintained by the National Center for Biotechnology Information, a component of the National Library of Medicine, which is part of the United States National Institutes of Health.
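The name lookup described above could be sketched roughly as follows. This is not the project's actual script. It is a minimal illustration that assumes the structure has already been exported from the molecular editor as a SMILES string and that PubChem's PUG REST service is queried with the SMILES passed as a URL argument. The function names, the URL layout, and the shape of the JSON response are assumptions based on our reading of the PUG REST documentation and should be checked against it.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Base address of PubChem's PUG REST service (assumed layout, see lead-in).
PUG_REST = "https://pubchem.ncbi.nlm.nih.gov/rest/pug"


def iupac_name_url(smiles: str) -> str:
    """Build a request URL asking PubChem for the IUPAC name of a SMILES."""
    # The SMILES is URL-encoded because it may contain '=', '#', '/', etc.
    query = urlencode({"smiles": smiles})
    return f"{PUG_REST}/compound/smiles/property/IUPACName/JSON?{query}"


def parse_iupac_name(response_text: str):
    """Extract the first IUPAC name from a JSON response, or None if absent."""
    data = json.loads(response_text)
    properties = data.get("PropertyTable", {}).get("Properties", [])
    if not properties:
        return None
    return properties[0].get("IUPACName")


def lookup_iupac_name(smiles: str, timeout: float = 10.0):
    """Query PubChem over the network (hypothetical helper, needs internet)."""
    with urlopen(iupac_name_url(smiles), timeout=timeout) as response:
        return parse_iupac_name(response.read().decode("utf-8"))
```

Keeping URL construction and response parsing separate from the network call means the first two pieces can be checked offline, and only `lookup_iupac_name` depends on the service being reachable.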

To help students learn how to identify a reaction mechanism, the planned module will ask a series of questions concerning the structures of the reactants, reagents, and solvents used for the reaction and the temperature at which the reaction was carried out. Based on the input information, an AI model developed on datasets of textbook question banks should provide a suggestion that has a high probability of being the correct answer. This module needs support from Tech Fee funds.

For the third module, students input reactants and reagents, and a neural machine translation model trained on a reaction dataset predicts the reaction products. Such a model can be trained starting from chemical patent datasets that contain millions of organic compounds and reactions; once trained, it works in the background to answer questions. Students can compare their own predictions with the suggestion of the artificial intelligence program. Thus, an otherwise laborious memorization process may become fun and interesting.

The third module also needs support from Tech Fee funds. The potential of reducing the time and cost for students to find help should have a strong beneficial impact on students learning organic chemistry at Miami University.

How would you describe the innovation and/or the significance of your project:

The mastery of organic compound structures and nomenclature as well as familiarity with organic reactions and mechanisms are required for success in organic chemistry courses. For that reason, it is important that these concepts are reinforced through practice problems. The AI-generated answers will help students with their homework for solving the problems in their textbooks. The algorithm-generated answers also enhance student understanding of structures and reactions as well as mechanisms and will provide students with a modern learning experience.

Two different AI models will be used, one for reaction mechanism categorization and one for organic reaction prediction. For the categorization and nomenclature of organic reaction mechanisms, we will use the open-source machine learning library scikit-learn (see its Wikipedia entry). It features various classification, regression, and clustering algorithms and is designed to interoperate with the Python numerical and scientific libraries NumPy and SciPy. The functional groups and carbon frameworks of organic compounds can be correlated with their definitions and nomenclature, which are required for identifying reaction mechanisms in sophomore organic chemistry. A database for each type of structure and functional group can be prepared by collecting structure data from organic chemistry textbooks. Student assistants can be trained to enter the data rows manually, with the structure data as the input (features) and the category definitions as the output (answers). The resulting datasets are suitable for training a classification model by so-called "supervised learning".
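To make the idea of feature rows and answer labels concrete, the sketch below encodes a few reaction conditions as numbers and suggests a mechanism for a new input. It is only an illustration under stated assumptions: the feature encoding is hypothetical, the four training rows are toy examples of standard textbook heuristics and not project data, and a dependency-free nearest-neighbour lookup stands in for the scikit-learn classifier the project intends to use.

```python
# Hypothetical feature order, chosen only for this illustration:
# (substrate degree 1-3, strong nucleophile, strong base,
#  polar protic solvent, heated); yes/no features are coded 1/0.
FEATURES = ("substrate_degree", "strong_nucleophile", "strong_base",
            "polar_protic_solvent", "heated")

# Toy rows reflecting common textbook heuristics (not a real dataset).
TRAINING_ROWS = [
    ((1, 1, 0, 0, 0), "SN2"),  # primary substrate, strong nucleophile
    ((3, 0, 0, 1, 0), "SN1"),  # tertiary substrate, protic solvent
    ((3, 0, 1, 0, 1), "E2"),   # tertiary substrate, strong base, heat
    ((3, 0, 0, 1, 1), "E1"),   # tertiary substrate, protic solvent, heat
]


def distance(a, b):
    """Manhattan distance between two feature tuples."""
    return sum(abs(x - y) for x, y in zip(a, b))


def suggest_mechanism(features, rows=TRAINING_ROWS):
    """Return the label of the closest training row (1-nearest neighbour)."""
    _, label = min(rows, key=lambda row: distance(row[0], features))
    return label


# A secondary substrate with a strong nucleophile in an aprotic solvent.
print(suggest_mechanism((2, 1, 0, 0, 0)))
```

With a real dataset, the same feature tuples and labels could be passed as `X` and `y` to the `fit` method of a scikit-learn classifier in place of the hand-written lookup.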

For the reaction predictions, we plan to use a deep learning neural network. We propose to apply this artificial intelligence method to predict organic reactions and support student learning. Several recent publications, including papers in Chemical Science and ACS Central Science, form the basis for our project. A pioneering idea was proposed by Nam & Kim (J. Nam and J. Kim, 2016, https://arxiv.org/pdf/1612.09529.pdf), who described a method of applying a deep learning model from natural language processing (NLP) to the prediction of organic chemical reactions. Soon after, two papers by Schwaller et al. improved upon the original proposal by Nam & Kim, introducing a multi-head attention NLP model with better success in predicting complex reaction products (P. Schwaller et al., Chem. Sci. 2018, 9 (28), 6091; and ACS Cent. Sci. 2019, 5, 1572−1583).

It should be pointed out that these developments are of great interest to the pharmaceutical industry, but none of the published reports had an educational component. In developing the virtual organic chemistry helper, we plan to adopt the NLP-based algorithm using the open-source neural machine translation library OpenNMT-py (https://github.com/OpenNMT/OpenNMT-py), a deep learning library for natural language processing that was originally developed for translation between two different languages. Most of the publicly available reaction datasets were derived from the patent mining work of Lowe (Lowe, D. M. Ph.D. thesis, University of Cambridge, 2012), in which the reactions are described using a text-based representation called SMILES.

A relatively small, lower-quality dataset is available for free at https://figshare.com/articles/Chemical_reactions_from_US_patents_1976-Sep2016_/5104873. A larger, cleaner, and better-annotated dataset is sold commercially by the software company NextMove as a product called Pistachio ($14K for a perpetual site license): https://www.youtube.com/watch?v=mQ82niDhWMY. Pistachio contains the reaction SMILES for about 13 million reactions extracted from the patent literature. We performed some preliminary work with the free dataset during the summer of 2021. Because that dataset was limited and derived from patent literature, the model did not perform well on some simple textbook reactions such as SN1, SN2, E1, and E2 reactions, especially those of small molecules. In this project, we request funds to purchase a perpetual Miami University site license for the Pistachio dataset, which includes reactions not only from US patents but also from European and WIPO patents.

During the development of this project, student assistants will be trained to enter textbook reactions manually to supplement the Pistachio dataset. This involves using the JavaScript Molecular Editor (JSME) or Ketcher software to draw reactions and export the drawings as SMILES strings. SMILES strings consist of letters, numbers, and symbols that can be processed by the NLP program. Using the deep learning neural network from OpenNMT-py with the new database, the resulting model should be able to generate correct answers to most textbook problems.
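As an illustration of how a drawn reaction might be turned into text that a translation model can read, the sketch below splits a reaction SMILES into space-separated tokens for the source (reactants and reagents) and target (products) sides. The regular expression is a simplified pattern written for this example, and the choice to merge reagents into the source side is an assumption; neither is taken from the project's code or from OpenNMT-py itself.

```python
import re

# Simplified SMILES token pattern (an assumption for this sketch):
# bracket atoms, two-letter halogens, single-letter atoms, bonds,
# branches, ring-closure digits, and the '.' separator.
TOKEN_PATTERN = re.compile(
    r"\[[^\]]+\]|Br|Cl|[BCNOSPFI]|[bcnosp]|[()=#+\-\\/:~@?>*$.]|%\d{2}|\d"
)


def tokenize(smiles: str) -> list:
    """Split a SMILES string into tokens, failing loudly on unknown input."""
    tokens = TOKEN_PATTERN.findall(smiles)
    if "".join(tokens) != smiles:
        raise ValueError(f"could not tokenize: {smiles!r}")
    return tokens


def to_nmt_pair(reaction_smiles: str) -> tuple:
    """Turn 'reactants>reagents>products' into (source, target) lines."""
    parts = reaction_smiles.split(">")
    if len(parts) != 3:
        raise ValueError("expected the form reactants>reagents>products")
    reactants, reagents, products = parts
    # Reagents, if any, are appended to the reactants on the source side.
    source = reactants if not reagents else f"{reactants}.{reagents}"
    return " ".join(tokenize(source)), " ".join(tokenize(products))


# Textbook SN2 example: bromoethane plus hydroxide gives ethanol plus bromide.
source_line, target_line = to_nmt_pair("CCBr.[OH-]>>CCO.[Br-]")
print(source_line)
print(target_line)
```

Each source line and its matching target line would then be written to parallel text files, which is the general input format sequence-to-sequence translation toolkits expect.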

The AI-generated answers should help students learn organic chemistry and reduce the need to hire a human tutor, and should provide students with a modern learning experience. Preliminary machine learning website under construction: Benjamin Gung Website.

How will you assess the success of the project?

Retention in the related chemistry courses will be used as one of the assessment tools. To assess retention in the affected courses, we will compare the withdrawal rate from CHM 231 and CHM 241/242, and persistence into the second-semester course, with baseline data from previous academic years. Beginning in the second year, we will also track the number of students who have taken CHM 231 or CHM 241/242 and who remain in or switch to a major or Thematic Sequence in Chemistry or another STEM discipline.

The new AI-assisted website for CHM 231 and CHM 241/242 will be tried out first by graduate TAs to assess the viability and timing of the tools. Starting in the fall of 2022, the project will be implemented in the CHM 231 and CHM 241/242 courses once students have started their organic chemistry coursework. During the project, all our available TAs will help introduce the new learning tools to undergraduate students in these courses. Students will fill out a survey to evaluate the extent to which the project meets its overall goals. Based on student feedback, the implementation will be revised for subsequent years in CHM 231 and CHM 241/242.

No, we do not plan to buy the updated dataset every year.

Financial Information

Total Amount Requested: $23,675

Budget Details:

  • Dataset Pistachio: $14,000
  • Summer support for two student workers: $6,000
  • Dell Alienware Aurora R12 (NVIDIA® GeForce RTX™ 3090 24GB GDDR6X), purchased via Miami Buyway punch-out to the Dell website: $3,675

Please address how, if at all, this project aligns with University, Divisional, Departmental or Center strategic goals:

Retention of STEM majors and reduction in D, W, and F rates in science courses are goals for the Department of Chemistry and Biochemistry and the College of Arts & Sciences. This project is certainly aligned with these two goals.