Solving pharma’s ‘big text’ problem with NLP

There are about 27m articles in PubMed and 280,000 studies on ClinicalTrials.gov – so how does the industry begin to make sense of all this data?

Malaikannan Sankarasubbu, vice president of AI research at Saama Technologies, is discussing the practical applications of natural language processing this week at the 10th Annual SCOPE Summit in Orlando, FL.

Ahead of his presentation, Outsourcing-Pharma (OSP) caught up with Sankarasubbu (MS) for some insights into pharma’s big text problem and how it is being addressed.

OSP: So, what is pharma’s ‘big text problem’?

MS: Pharma has a lot of unstructured text, and the surface of this data has hardly been scratched in terms of deriving insights. There are about 27 million articles in PubMed, and 280,000 studies in ClinicalTrials.gov. Language is difficult for AI to understand.

The reason for this complexity is the sheer variety in the ways we put our thoughts into writing: the same idea can be expressed in many different forms. The number of possible permutations and combinations is so huge that traditional rules-driven programming cannot handle it effectively. This is where Artificial Intelligence systems based on Deep Learning step in.

OSP: How is natural language processing one solution to this?

MS: Computers were meant to crunch numbers. So, for a computer to understand text, it has to be converted to numbers. There were traditional techniques used to achieve this, like frequency-based conversion of a word to a number format, but in 2013 Google open-sourced an algorithm called Word2vec. This algorithm works much the way we humans fill in the blanks. Let’s consider the following example:

I live in California and I _______ to New Jersey to work every week.

Humans would fill in the missing word as fly, commute, drive, or some brave soul might even say bike. We are good at looking at the surrounding words and predicting the center word. Word2vec works this way too, converting words into a number form that computers can understand. Similar to Word2vec, quite a few other embedding techniques (methods for converting a word to vector format), such as ULMFiT, ELMo, and BERT, have come up in the last year or so.
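To make the fill-in-the-blank intuition concrete, here is a minimal, hedged sketch using the open-source gensim library; the toy corpus, hyperparameters, and expected output are illustrative assumptions rather than anything from Saama’s pipeline.

```python
# Minimal sketch of Word2vec's fill-in-the-blank behaviour using gensim (assumed library).
# The toy corpus is far too small for meaningful results; it only illustrates the mechanics.
from gensim.models import Word2Vec

corpus = [
    ["i", "live", "in", "california", "and", "i", "fly", "to", "new", "jersey", "to", "work"],
    ["i", "live", "in", "texas", "and", "i", "drive", "to", "austin", "to", "work"],
    ["i", "live", "in", "boston", "and", "i", "commute", "to", "new", "york", "to", "work"],
]

# CBOW (sg=0) learns to predict a centre word from its surrounding words.
model = Word2Vec(corpus, vector_size=50, window=3, min_count=1, sg=0, epochs=200, seed=42)

# Ask the model to "fill in the blank" given the surrounding words.
context = ["i", "live", "in", "california", "and", "i", "to", "new", "jersey"]
print(model.predict_output_word(context, topn=3))

# Every word now has a dense numeric vector that downstream algorithms can consume.
print(model.wv["fly"][:5])
```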

We at Saama Technologies first trained an embedding model on pharma data to give it domain-specific understanding, and then used that model as input to our later algorithms. Such application of AI offers major advantages to the life sciences industry.
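As a rough illustration of what training a domain-specific embedding might look like, here is a hedged sketch with gensim; the file name pubmed_abstracts.txt, the preprocessing, the hyperparameters, and the query term are assumptions for illustration, not Saama’s actual setup.

```python
# Hedged sketch: training word embeddings on a domain corpus (file name and settings assumed).
from gensim.models import Word2Vec
from gensim.utils import simple_preprocess

# Assume one abstract per line in a local text file.
with open("pubmed_abstracts.txt", encoding="utf-8") as f:
    sentences = [simple_preprocess(line) for line in f]

# Skip-gram (sg=1) embeddings trained on the domain text.
domain_model = Word2Vec(sentences, vector_size=200, window=5, min_count=5, sg=1, workers=4)

# Domain understanding shows up as sensible nearest neighbours for medical terms
# (assuming the term actually occurs in the corpus).
print(domain_model.wv.most_similar("metformin", topn=5))

# The learned vectors can then be saved and fed into later algorithms,
# e.g. as the embedding layer of a relation-extraction or NER model.
domain_model.wv.save("pharma_word_vectors.kv")
```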

OSP: Are there other ways in which the industry is addressing this issue? What appears to be the most promising path forward?

MS: There is a rules-driven approach, but it is not going to be effective because of the number of permutations and combinations that are possible. If there were an army of workers, the task could be handled manually, but the most effective way to handle it is with an Artificial Intelligence system.

Approximately one million articles are added to PubMed every year. It is impossible for a company’s pharmacovigilance team to read through the documents manually, extract the drug and the event mentioned in an article, find the relationship between them, and then tag whether it constitutes an adverse event.

In NLU terminology, this is a relationship-extraction problem, and it can be tackled as a reading-comprehension question-and-answer problem. A scientific paper has been published on this approach, reporting a good level of accuracy.
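As an illustration of the question-and-answer framing (not the specific system described in that paper), here is a minimal sketch using the Hugging Face transformers question-answering pipeline; the model name and the example abstract are assumptions, and a production system would use a biomedical model fine-tuned for pharmacovigilance.

```python
# Hedged sketch: adverse-event relation extraction framed as reading-comprehension QA.
# Model name and example text are illustrative assumptions.
from transformers import pipeline

qa = pipeline("question-answering", model="distilbert-base-cased-distilled-squad")

abstract = (
    "In this case series, three patients treated with drug X developed "
    "severe hepatotoxicity within two weeks of starting therapy."
)

# Phrase the drug-event relationship as questions against the article text.
for question in [
    "Which drug is mentioned?",
    "What adverse event is associated with the drug?",
]:
    answer = qa(question=question, context=abstract)
    print(f"{question} -> {answer['answer']} (score={answer['score']:.2f})")
```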

OSP: Are there ways to approach this problem differently, such as addressing disparate text sources? (i.e., how can the industry work to standardize how and where data is collected?) What challenges does this pose?

MS: Attacking the source of the problem is definitely a good approach if you have control over how the data is collected. For example, with patient-matching or patient-finder solutions that use unstructured text, it is very tough to get doctors to enter the data in a specific format, since the goal of the medical practitioner is to treat patients, not to enter data in a format that can be used by pharma.

OSP: How are both the big text and big data problems affecting the progress of clinical research?

MS: Protocol design is a time-consuming process, and including the correct inclusion/exclusion criteria with the right set of values is a make-or-break juncture for a clinical trial. All of the 280,000 trials on ClinicalTrials.gov contain the inclusion/exclusion criteria that were used to run each specific trial.

If the protocol designer of a new trial were able to aggregate and access other trials with similar inclusion and exclusion criteria, together with information on how those trials turned out, that data would be very useful. Such similarity searches are not keyword-based but context-based, and they take place in vector space. AI systems can help a protocol designer to a great extent in this regard.
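To show what a context-based similarity search over criteria might look like, here is a hedged sketch using the sentence-transformers library; the model name all-MiniLM-L6-v2 and the example criteria are assumptions, chosen purely for illustration.

```python
# Hedged sketch: context-based (vector-space) similarity search over trial criteria.
# Library, model name, and example criteria are illustrative assumptions.
from sentence_transformers import SentenceTransformer, util

encoder = SentenceTransformer("all-MiniLM-L6-v2")

# Criteria hypothetically pulled from previously registered trials.
past_criteria = [
    "Adults aged 18-65 with type 2 diabetes and HbA1c between 7% and 10%",
    "Patients with stage III non-small cell lung cancer, ECOG status 0-1",
    "Postmenopausal women with osteoporosis and no prior bisphosphonate use",
]

new_criteria = "Type 2 diabetes, age 18 to 64, HbA1c 7.0-9.5%"

# Encode everything into the same vector space and rank by cosine similarity.
past_vecs = encoder.encode(past_criteria, convert_to_tensor=True)
new_vec = encoder.encode(new_criteria, convert_to_tensor=True)
scores = util.cos_sim(new_vec, past_vecs)[0]

for criterion, score in sorted(zip(past_criteria, scores.tolist()), key=lambda x: -x[1]):
    print(f"{score:.2f}  {criterion}")
```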

Additionally, cohort building with inclusion/exclusion criteria is a laborious process. AI-based systems can extract the inclusion and exclusion criteria from a trial’s protocol design document, isolate entities like gender, age, sex, diagnoses, procedures, and lab values, and then convert them into a SQL query to find matching patients in electronic health record (EHR) databases. This approach can help identify patients not only for the sponsor’s trial but also for current competitor trials and past trials.
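As a simplified illustration of that last step, here is a hedged sketch that turns already-extracted criteria entities into a SQL query; the table and column names (patients, diagnoses, lab_results, and so on) are hypothetical, and real EHR schemas such as OMOP look quite different.

```python
# Hedged sketch: converting extracted inclusion/exclusion entities into a SQL query.
# All table and column names are hypothetical placeholders.
criteria = {
    "sex": "F",
    "min_age": 18,
    "max_age": 65,
    "diagnosis_codes": ("E11.9",),        # type 2 diabetes (ICD-10), assumed
    "lab": {"test": "HbA1c", "min": 7.0, "max": 10.0},
}

query = """
SELECT p.patient_id
FROM patients p
JOIN diagnoses d   ON d.patient_id = p.patient_id
JOIN lab_results l ON l.patient_id = p.patient_id
WHERE p.sex = %(sex)s
  AND p.age BETWEEN %(min_age)s AND %(max_age)s
  AND d.icd10_code IN %(diagnosis_codes)s
  AND l.test_name = %(lab_test)s
  AND l.value BETWEEN %(lab_min)s AND %(lab_max)s
"""

params = {
    "sex": criteria["sex"],
    "min_age": criteria["min_age"],
    "max_age": criteria["max_age"],
    "diagnosis_codes": criteria["diagnosis_codes"],
    "lab_test": criteria["lab"]["test"],
    "lab_min": criteria["lab"]["min"],
    "lab_max": criteria["lab"]["max"],
}

# With a database connection `conn` (e.g. psycopg2, assumed), the query could be run as:
# with conn.cursor() as cur:
#     cur.execute(query, params)
#     matching_patients = [row[0] for row in cur.fetchall()]
print(query)
print(params)
```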

Matching the right patients to clinical trials is a complicated and time-consuming task. Most of the current matching algorithms use only the structured data in EHRs and Real World Evidence (RWE) to match patients.

EHR systems were not designed for matching patients to clinical trials; they were designed to optimize billing from health insurance companies.

The data model was designed with that purpose in mind. The really rich data is contained in physicians’ notes, but these notes cannot be tapped for patient matching unless the protected health information (PHI) is scrubbed from them.

Identifying and scrubbing a person’s name in a document is difficult because of the variation in names, which makes it very hard for a rules-based system to handle. This is where AI-based Natural Language Understanding (NLU) systems can help: state-of-the-art NLU systems can effectively scrub PHI elements from physicians’ notes and make them available for patient matching. Once the data is scrubbed, these notes can be mined for diagnoses, disease progression, procedures, etc.
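As a rough illustration of NER-based PHI scrubbing (not the production-grade de-identification described above), here is a hedged sketch using spaCy’s general-purpose English model; a real system would rely on models trained specifically for clinical PHI categories, and the example note is invented.

```python
# Hedged sketch: scrubbing PHI-like entities from a clinical note with spaCy NER.
# The general-purpose model and the invented note are illustrative assumptions.
import spacy

nlp = spacy.load("en_core_web_sm")  # assumes the small English model is installed

note = ("John Smith, 54, seen on 2019-01-12 at Mercy Hospital. "
        "Reports worsening dyspnea; started on furosemide 20 mg daily.")

# Entity labels treated as PHI for this illustration.
PHI_LABELS = {"PERSON", "DATE", "ORG", "GPE"}

doc = nlp(note)
scrubbed = note
# Replace detected spans right-to-left so character offsets stay valid.
for ent in reversed(doc.ents):
    if ent.label_ in PHI_LABELS:
        scrubbed = scrubbed[:ent.start_char] + f"[{ent.label_}]" + scrubbed[ent.end_char:]

print(scrubbed)
```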

OSP: As the industry collects more and more data, how will this challenge evolve? And how will the technology to address it evolve in turn?

MS: The industry is definitely collecting a lot of data, and it is coming from disparate sources. Consolidating that data into a useful format is a challenge, and Artificial Intelligence will be applied to combine those disparate sources into a usable form.

Then there is the challenge of extracting meaningful insights from unstructured text; this area has been evolving at a rapid pace over the last year or so and is yet another place where AI can be leveraged to make an impact.