April 19, 2022 / Jennifer Shin

How Machine Learning Can Improve Worker Health Research: An FAQ

Industrial hygienists have been collecting personal air monitoring data to characterize worker exposure for decades. And while this data has been, and continues to be, invaluable for ensuring the health and safety of workers in the workplace, it is typically not linked to worker health or clinical data, such as injury and illness statistics, occupational exams (for example, hearing test results), and so on. If IH data were linked to worker health data, epidemiologists and other public health researchers and professionals could more accurately estimate exposures to workplace hazards when evaluating health outcomes. As explained by a 2017 paper in Current Environmental Health Reports, this is one of the foremost challenges in conducting occupational epidemiology studies. The following FAQ illustrates how the use of machine learning can potentially meet this challenge.

Is connecting IH data to worker health data difficult?

Yes. Connecting IH exposure data with worker health data is difficult for a couple of reasons, especially for large, global companies that must adhere to strict data privacy laws and cannot use employee names or unique IDs (such as social security numbers) across company databases. For these companies, industrial hygienists typically group workers together into similar exposure groups (SEGs), which differ from the job titles that are assigned to employees by Human Resources (HR). These SEGs are only used by IHs, whereas databases that contain health-related information for workers only refer to employee name or the HR-assigned job title.

What is machine learning and how can it be used to link IH SEGs to HR job titles?

Machine learning (ML) is a branch of artificial intelligence (AI) that uses features of data to train algorithms, or classifiers, to capture patterns in datasets. In essence, computers examine datasets and identify patterns in them through statistical and optimization methods.

There are four types of ML models: supervised, unsupervised, semi-supervised, and reinforcement learning. These are distinguished mainly by how much labeled training data they use and by the kind of feedback the algorithm receives. The idea for making the connection between IH SEGs and HR job titles is to use supervised learning, where the machine (that is, the computer) is taught through labeled data, or what is referred to as a "training dataset." The investigator (the epidemiologist or IH) creates a training dataset, which includes the desired inputs and outputs. In this case, the input is the HR-assigned job title, and the output is the matching IH SEG. The computer looks for patterns in the labeled data that help it make predictions for unlabeled data. In other words, once the machine has identified patterns in the training dataset, it can predict which IH SEG should be assigned to an HR job title it has not seen before.
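To make the idea concrete, here is a minimal sketch of supervised learning on job titles. It is not the algorithm under development; it assumes a simple word-based naive Bayes classifier, and every job title, SEG name, and function name in it is hypothetical and chosen only for illustration.

```python
import math
from collections import Counter, defaultdict

# Hypothetical training dataset: (HR job title, IH SEG) pairs labeled by an investigator.
TRAINING = [
    ("Senior Process Operator", "Operations"),
    ("Process Operator II", "Operations"),
    ("Control Room Operator", "Operations"),
    ("Maintenance Mechanic", "Maintenance"),
    ("Mechanic Apprentice", "Maintenance"),
    ("Maintenance Electrician", "Maintenance"),
    ("Laboratory Technician", "Laboratory"),
    ("Senior Laboratory Analyst", "Laboratory"),
]

def tokens(title):
    """Split a job title into lowercase words."""
    return title.lower().split()

def train(examples):
    """Count how often each word appears in the titles of each SEG."""
    word_counts = defaultdict(Counter)
    label_counts = Counter()
    vocab = set()
    for title, seg in examples:
        label_counts[seg] += 1
        for word in tokens(title):
            word_counts[seg][word] += 1
            vocab.add(word)
    return word_counts, label_counts, vocab

def predict(model, title):
    """Return the SEG with the highest naive Bayes score for this title."""
    word_counts, label_counts, vocab = model
    total = sum(label_counts.values())
    best_seg, best_score = None, -math.inf
    for seg, n in label_counts.items():
        score = math.log(n / total)  # how common the SEG is in the training data
        denom = sum(word_counts[seg].values()) + len(vocab)
        for word in tokens(title):
            if word in vocab:  # words never seen in training are ignored
                # Add-one smoothing so an unseen word/SEG pair is not scored as impossible.
                score += math.log((word_counts[seg][word] + 1) / denom)
        if score > best_score:
            best_seg, best_score = seg, score
    return best_seg

model = train(TRAINING)
print(predict(model, "Process Operator III"))  # title not in the training data
print(predict(model, "Laboratory Analyst"))
```

With this toy data, the two unseen titles are assigned to the Operations and Laboratory groups, respectively, because they share words with training titles in those groups. Real job titles are messier (abbreviations, site-specific codes), so a production approach would likely need richer features than whole words.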

Since a training dataset is needed anyway, why not have investigators create all the linkages?

Because it’s time-consuming to do so by hand. For large companies with tens of thousands of employees and hundreds of work sites, it can take several years for these manual linkages to be created. A machine, on the other hand, can create linkages nearly instantaneously. The training dataset is usually developed using a subset of the data; once developed, the machine can do the rest. In addition, through feedback from the investigator, the algorithm can learn over time to make more accurate predictions. Of course, there are caveats: the training dataset must be accurate and have enough data to allow the computer to identify patterns.

How can we trust the ML output?

There are ways to evaluate the success of an ML algorithm. For the approach proposed here, the labeled dataset can be randomly split into two parts, with one half used to train the algorithm and the other half held back to test it. Once the algorithm is developed using the training half, it can be used to predict IH SEGs for the HR job titles in the test half, and those predictions can be compared against the investigator-provided SEGs to measure accuracy. Alternatively, the investigators can spot-check the predicted data for accuracy.
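The split-and-test check described above can be sketched in a few lines. This is an illustration under stated assumptions: the labeled pairs are invented, and the word-overlap matcher stands in for whatever classifier is actually being evaluated.

```python
import random

# Hypothetical investigator-labeled pairs: (HR job title, IH SEG).
LABELED = [
    ("Senior Process Operator", "Operations"),
    ("Process Operator II", "Operations"),
    ("Control Room Operator", "Operations"),
    ("Field Operator", "Operations"),
    ("Maintenance Mechanic", "Maintenance"),
    ("Mechanic Apprentice", "Maintenance"),
    ("Maintenance Electrician", "Maintenance"),
    ("Senior Maintenance Mechanic", "Maintenance"),
    ("Laboratory Technician", "Laboratory"),
    ("Senior Laboratory Analyst", "Laboratory"),
    ("Laboratory Analyst", "Laboratory"),
    ("Quality Laboratory Technician", "Laboratory"),
]

def split_half(pairs, seed=0):
    """Shuffle a copy of the labeled pairs and cut it into training and test halves."""
    shuffled = pairs[:]
    random.Random(seed).shuffle(shuffled)  # seeded so the split is repeatable
    mid = len(shuffled) // 2
    return shuffled[:mid], shuffled[mid:]

def predict_seg(train_pairs, title):
    """Stand-in classifier: use the SEG of the training title sharing the most words."""
    words = set(title.lower().split())
    best = max(train_pairs, key=lambda p: len(words & set(p[0].lower().split())))
    return best[1]

def accuracy(train_pairs, test_pairs):
    """Fraction of held-out titles whose predicted SEG matches the investigator's label."""
    correct = sum(predict_seg(train_pairs, title) == seg for title, seg in test_pairs)
    return correct / len(test_pairs)

train_pairs, test_pairs = split_half(LABELED, seed=42)
print(f"held-out accuracy: {accuracy(train_pairs, test_pairs):.2f}")
```

Because the test half is never shown to the classifier during training, the resulting accuracy is a fairer estimate of how it would perform on job titles that have not yet been linked by hand. With a dataset this small the number will swing widely from one random split to another, which is one reason a real training dataset needs enough examples.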

There are already a number of ways in which ML is being used to improve worker and public health. For example, as described by a recent paper in the Journal of Occupational and Environmental Hygiene, ML has been used to predict mask comfort for respiratory protective equipment to improve user experience and prevent discomfort-induced noncompliance. In another example discussed on the NIOSH Science Blog, ML has been used to code occupational surveillance data.

What’s next?

ExxonMobil Biomedical Sciences Inc. is collaborating with the School of Information at the University of Texas at Austin to develop an ML algorithm that will be used to create linkages between ExxonMobil IH worker groups and HR job titles. Once this work is completed, the linkages will be used to evaluate specific health outcomes for workers, using IH data to more accurately estimate exposures.



Jennifer Shin

Jennifer Shin is an exposure sciences research associate at ExxonMobil Biomedical Sciences Inc., a member of the AIHA Technology Initiatives Strategic Advisory Group, and past chair of the Emerging Digital Technologies Committee (the former Computer Applications Committee).


Re: Protecting PII

Paul, thank you for the feedback. You bring up a very good point about sitting down with the right folks within the company to understand and work through real and perceived obstacles. I'm interested in hearing about what you've done and other tips to improve worker health research. If you are willing and available, please email me at [email protected]

By Jennifer Shin on April 20, 2022 1:47pm

Protecting PII

First, protecting personally identifiable information (PII) is important. However, despite the impression that this is a barrier to analysis of health and exposure data, protecting PII is not that hard to do. The applicable Federal laws, HIPAA and the OSHA recordkeeping standards, quite explicitly authorize the use of medical and exposure records for occupational health risk management. For extra protection of analytical files, generate a random 3- or 4-digit encryption key and add it to the company's employee number or SSN to create a unique identifier that replaces the name. Truncate birth dates to year of birth. Collapse unique job titles (e.g., Plant Manager) into larger groups (Manager). The important thing is to sit down with company IT and legal departments and work through actual requirements and standards rather than those you might imagine exist.

By Paul Wambach on April 20, 2022 11:19am
