Machine Learning Pattern Recognition for Forensic Analysis of Detected Per- and Polyfluoroalkyl Substances in Environmental Samples
Dr. Tohren Kibbey | University of Oklahoma
Aqueous film-forming foam (AFFF) formulations based on per- and polyfluoroalkyl substances (PFAS) were used extensively at Department of Defense (DoD) sites, and contamination is a concern at hundreds of locations. Because PFAS have also been widely used in non-AFFF applications, the DoD must be able to distinguish detected PFAS contamination that originates from offsite non-AFFF sources from contamination that originates from onsite AFFF sources, so that areas of potential liability can be delineated. This proof-of-concept project will explore the use of modern machine learning algorithms to evaluate the probability that PFAS detected in environmental samples come from AFFF sources. The approach will use machine learning to search for recognizable patterns in PFAS-containing samples, with the objective of assigning probabilities that the contamination originates from specific sources.
Existing data from PFAS sites around the world will be used as inputs to train machine learning algorithms to distinguish PFAS contaminants from different origins. Open-source machine learning libraries will be used to ensure that the results are accessible to practitioners and can be readily applied at DoD sites beyond those considered in this project. The project will be organized around three main tasks:
- Task 1. Data collection and preprocessing. An extensive, machine-readable database will be created from PFAS concentration data from water samples collected from around the world, specifically formatted to be used as an input to machine learning applications. The use of PFAS data from around the world will vastly increase the available training set size for machine learning.
- Task 2. Evaluation of machine learning algorithms for source identification. A range of supervised and unsupervised learning methods will be evaluated to assess which perform best for source identification. A tiered approach will incorporate increasing quantities of information, as needed, for accurate source identification. Work will also be conducted to identify the individual PFAS whose concentrations are most critical to source identification.
- Task 3. Preliminary data mining and exploration of deep learning algorithms. Preliminary data mining will be conducted to evaluate what can be learned about PFAS environmental transformations from machine learning models. While many PFAS are highly resistant to degradation, others are known or suspected precursors, capable of environmental transformation to resistant forms such as perfluorooctanesulfonic acid (PFOS). Understanding these transformations is critical to understanding the environmental aging of PFAS formulations. Preliminary tests of deep neural networks will also be conducted for source identification.
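As a rough illustration of how Tasks 1 and 2 fit together, the sketch below converts raw concentrations into a normalized composition profile and then fits a small Gaussian naive Bayes classifier that returns source probabilities. It is a minimal sketch under stated assumptions, not the project's method: the four-analyte panel, the handling of non-detects, the choice of classifier, and every concentration value are hypothetical and were invented for illustration. It is written in plain Python so that it runs without dependencies; in practice an open-source machine learning library would take the place of the hand-written classifier.

```python
import math

# Hypothetical analyte panel; a real panel would contain many more PFAS.
ANALYTES = ["PFOS", "PFOA", "PFHxS", "PFBA"]


def to_profile(sample, floor=1e-3):
    """Convert raw concentrations into log relative fractions.

    Non-detects (None or missing) are replaced by a small floor value.
    This substitution is an illustrative assumption, not a recommendation.
    """
    values = [max(sample.get(a) or 0.0, floor) for a in ANALYTES]
    total = sum(values)
    return [math.log(v / total) for v in values]


def fit_gaussian_nb(profiles, labels, var_floor=1e-2):
    """Estimate a log prior plus per-feature mean and variance for each class."""
    model = {}
    for label in set(labels):
        rows = [p for p, y in zip(profiles, labels) if y == label]
        n = len(rows)
        columns = list(zip(*rows))
        means = [sum(col) / n for col in columns]
        # var_floor keeps variances away from zero on tiny training sets.
        variances = [
            sum((x - m) ** 2 for x in col) / n + var_floor
            for col, m in zip(columns, means)
        ]
        model[label] = (math.log(n / len(labels)), means, variances)
    return model


def predict_proba(model, profile):
    """Return the posterior probability of each source class for one profile."""
    log_scores = {}
    for label, (log_prior, means, variances) in model.items():
        score = log_prior
        for x, m, v in zip(profile, means, variances):
            score += -0.5 * math.log(2 * math.pi * v) - (x - m) ** 2 / (2 * v)
        log_scores[label] = score
    # Subtract the maximum before exponentiating for numerical stability.
    top = max(log_scores.values())
    weights = {k: math.exp(s - top) for k, s in log_scores.items()}
    norm = sum(weights.values())
    return {k: w / norm for k, w in weights.items()}


# Invented training samples with no environmental meaning.
TRAIN = [
    ({"PFOS": 120.0, "PFOA": 10.0, "PFHxS": 60.0, "PFBA": 5.0}, "AFFF"),
    ({"PFOS": 90.0, "PFOA": 12.0, "PFHxS": 40.0, "PFBA": None}, "AFFF"),
    ({"PFOS": 200.0, "PFOA": 15.0, "PFHxS": 95.0, "PFBA": 8.0}, "AFFF"),
    ({"PFOS": 5.0, "PFOA": 40.0, "PFHxS": 2.0, "PFBA": 30.0}, "non-AFFF"),
    ({"PFOS": 8.0, "PFOA": 55.0, "PFHxS": None, "PFBA": 25.0}, "non-AFFF"),
    ({"PFOS": 3.0, "PFOA": 35.0, "PFHxS": 1.0, "PFBA": 45.0}, "non-AFFF"),
]

profiles = [to_profile(sample) for sample, _ in TRAIN]
labels = [label for _, label in TRAIN]
model = fit_gaussian_nb(profiles, labels)

# Invented "unknown" sample to score against the fitted model.
unknown = {"PFOS": 150.0, "PFOA": 11.0, "PFHxS": 70.0, "PFBA": 6.0}
probs = predict_proba(model, to_profile(unknown))
print({label: round(p, 3) for label, p in probs.items()})
```

Normalizing to relative fractions makes the profile insensitive to overall dilution, which is one plausible way to compare samples collected at very different concentrations. Whether that choice, or a naive Bayes model at all, is appropriate for real PFAS data is exactly the kind of question Task 2 is meant to answer.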
The results of this project will provide the building blocks for a user-level tool that can be applied at DoD sites to distinguish between AFFF and non-AFFF sources. The tool would calculate probabilities that samples originated from AFFF or non-AFFF formulations. This project will provide significant information about the best approaches for applying machine learning to the identification of AFFF sources, as well as the information needed to understand prediction performance. Preliminary data mining will provide initial insights into environmental transformations to guide future work in the area. (Projected completion date: November 2020)