Accuracy and reliability are key in clinical trial data management. While manual data queries can help clean up the data, the process is time-consuming; automating that process improves things, and harnessing advanced analytical technology can save even more time.
Outsourcing-Pharma recently connected with two data-management experts from contract research organization (CRO) Phastar to explore some useful tools and technology:
- Sheelagh Aird, senior director of data operations
- Jennifer Bradford, director of data science
OSP: Could you please explain what ‘data cleaning’ is?
S&J: Quality data is critical for a successful clinical trial. Data managers inspect the data to check for errors or inconsistencies using a combination of programmed checks and manual sight review. This is known as data cleaning.
As a trial progresses, data is entered into the clinical database from the different clinical trial sites; traditionally, this data is then reviewed while the trial is ongoing by data management teams from the sponsor/CRO. They look in detail across the data, and if they identify any entries that are potentially inaccurate or non-uniform, they can send a query to the trial site about the specific data point(s). The site can respond in a number of different ways, including changing a data point if it was found to have been entered incorrectly.
A variety of different data science techniques are being explored in the context of data cleansing and data management, to build usage and expertise and to drive efficiencies. These include the application of AI to deliver insights into the most problematic areas of data collection on a trial. Issues identified could relate to database design, a particular data source, local trends, or training and support requirements. The insights are interpreted together with a data management study expert to understand the context and act accordingly.
Additionally, a rule-based approach has been successfully implemented to highlight complex inconsistencies. In a human-machine hybrid approach, this method narrows down the search space for data reviewers so efforts can be prioritized.
OSP: Please tell us about some of the problems associated with cleaning up trial data—feel free to touch upon manual queries, automatic processes, etc.
S&J: The data cleaning process can be extremely time-consuming, particularly on larger, complex trials where large volumes of data can be collected. Data cleansing has historically been a very manual process, and this has left less time to focus on insightful analysis of the data.
Automated edit checks can be implemented within the clinical database; these detect errors as the data is inputted (for example, out-of-range laboratory data or inconsistent dates) and alert site personnel before it is uploaded. The challenge is that these automated checks are limited by the capability of the clinical database system and must also be defined up front.
It is not always possible to identify in advance all the potential issues that may occur within the data. Automated checks are useful for errors in logic but cannot be used on free text, where there are often nuanced discrepancies or information requiring further investigation.
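To illustrate the kind of logic-based edit check described above, here is a minimal sketch in Python. The field names and the hemoglobin reference range are assumptions for the example; real checks would come from the trial's data validation specification and run inside the EDC system.

```python
from datetime import date

# Hypothetical reference range and field names -- illustrative only.
HEMOGLOBIN_RANGE_G_DL = (5.0, 20.0)

def check_record(record):
    """Return a list of query messages for a single lab record."""
    issues = []

    # Out-of-range laboratory value
    value = record.get("hemoglobin_g_dl")
    low, high = HEMOGLOBIN_RANGE_G_DL
    if value is not None and not (low <= value <= high):
        issues.append(
            f"Hemoglobin {value} g/dL outside plausible range {low}-{high}"
        )

    # Inconsistent dates: a sample cannot be collected before enrollment
    collected = record.get("collection_date")
    enrolled = record.get("enrollment_date")
    if collected and enrolled and collected < enrolled:
        issues.append("Collection date precedes enrollment date")

    return issues

record = {
    "hemoglobin_g_dl": 42.0,
    "enrollment_date": date(2023, 5, 1),
    "collection_date": date(2023, 4, 20),
}
print(check_record(record))  # flags both the range and the date issue
```

Note what such a check cannot do: it only tests conditions defined up front over structured fields, which is exactly why free-text discrepancies still need human review.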
From a data management perspective, data cleaning can be a very time-consuming and repetitive process. It can be subject to human error and, as has been previously demonstrated, may result in very little change to the underlying data.
The process of raising manual queries back to the trial site when data errors or inconsistencies are identified is also time-consuming, both for the data management team and for the clinical site in the provision of a response.
OSP: How can advanced tools like AI help streamline the process?
S&J: Promising results can be seen with advancements in AI being used to study the queries in context. AI assists in improving automated data checks as well as providing additional processes to identify potential issues earlier in a study. Applying ML to historic manual queries to understand common issues across and within studies could enable a more targeted approach to process optimization for clinical trial data cleaning.
When thinking about the application of these tools, we should consider what is driving the requirement. Not only could the current processes be made more efficient and potentially more effective, but it is also expected that in the next few years (if not already) both the volume and variety of data collected on a clinical trial will increase significantly. This means the traditional approaches to data cleaning are likely to become cumbersome, if not nearly impossible, or could incur significant costs. The use of advanced tools can support this evolution of data and help drive efficiencies in time and cost while ensuring the data is sufficient for accurate analysis.
The specific application of these advanced tools will be driven by the particular problem or question, the availability of data, and the technical skills to implement the tools, and is likely to evolve over time. There is also the potential for a shift in approach, moving away from correcting data that has already been collected and towards using tools to support the identification (and subsequent correction) of the behaviors that lead to data issues in the first place.
We are already starting to see examples where advanced tools for monitoring and reviewing data from the start of and throughout the clinical trial (for example, through a risk-based approach) have huge benefits, certainly in terms of efficiency and more rapid availability of quality data.
OSP: Please tell us about a specific machine learning approach—how does it work, and how does it help trial teams end up with a more manageable data set?
S&J: One example we have from Phastar is utilizing an unsupervised machine learning approach called Latent Dirichlet Allocation (LDA) to identify themes in studies with large numbers of manual queries. This approach can provide insights into areas of data collection that may be problematic in a particular study.
Insight generation from this data can be challenging given it is unstructured, free-text, and often generated by multiple individuals on a single study with different writing and question styles. The queries can be complex and refer to relationships between data points across multiple forms and they don’t always result in a data change.
LDA is a generative probabilistic model for collections of discrete data. It takes as input multiple documents (in this example, the free-text manual queries) and the number of topics we expect from those documents, and it outputs the distribution of topics within each document.
To do this, LDA considers each document as a bag-of-words (a list of words in any order) and applies the statistical model. The model can be thought of as essentially guessing what the topic and topic distribution of each document might be and comparing each of those guesses to the original documents to find the best guesses. This produces a list of topics, each described by a distribution of words, and from this word distribution, the model generates a topic distribution, which is how much a document (or in this example a query) is made up of each topic based on its words.
When LDA is applied to the manual queries, the team can use the resulting topic distributions to identify topics that are associated with high numbers of queries and use the word distributions within those topics together with other metadata to identify where to target their efforts. Their overall aim is to reduce the number of queries. Understanding why these queries are generated in the first place is the first step in that process.
OSP: I understand your clinical teams have used machine learning to provide more concentrated checks. Could you please tell us a bit more about that?
S&J: The LDA approach described above can be used to identify where data issues are common during data collection on the clinical trial. Some of these may be addressed through additional training for trial sites or amendments to data collection forms, for example; however, some data issues must be identified and resolved either during or after data collection. This can be achieved through changes to the automated edit checks that occur as the data is inputted into the clinical database, or, as we have found in a number of instances where the checks are more complex, through additional checks performed outside the clinical database.
To address this, we have implemented a rule-based system that applies a series of rules, based on the insights generated by the LDA approach, which can identify potential data issues and present those to the data management team. This provides a targeted approach for the team, highlighting data that break those rules and allowing them to review those data points before further action is taken. This targeted approach reduces the data cleaning time and enables the team to mark issues as problematic, thereby creating a feedback loop to enable improvement of the underlying rules approach.
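A rule-based system of this kind can be sketched as a set of named predicates over records, where anything breaking a rule is surfaced to the reviewer. The rules and field names below are invented for illustration, not Phastar's actual rules:

```python
# Each rule is a (name, predicate) pair; a record breaking any rule is
# flagged for review. Field names and rules are illustrative only.
RULES = [
    ("dose_without_drug",
     lambda r: r.get("dose") is not None and not r.get("drug_name")),
    ("ae_end_before_start",
     lambda r: r.get("ae_end") is not None
               and r.get("ae_start") is not None
               and r["ae_end"] < r["ae_start"]),
]

def flag_records(records):
    """Return (record_id, rule_name) pairs for records that break a rule."""
    flagged = []
    for record in records:
        for name, broken in RULES:
            if broken(record):
                flagged.append((record["id"], name))
    return flagged

records = [
    {"id": 1, "dose": 10, "drug_name": ""},
    {"id": 2, "dose": 10, "drug_name": "DrugX", "ae_start": 3, "ae_end": 1},
    {"id": 3, "drug_name": "DrugX"},
]
print(flag_records(records))
# [(1, 'dose_without_drug'), (2, 'ae_end_before_start')]
```

The feedback loop described above would then adjust or retire individual rules based on how often reviewers confirm the flagged data points as genuine issues.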
OSP: What advice do you have for data managers looking to step up their game with AI, machine learning, and other advanced analytical tools?
S&J: I think looking at examples of how AI, ML, and other advanced analytics have been applied successfully (or not!), both within the data management space and in other areas, is a good start. This will really help in understanding the potential of these approaches; the type, quality, and volume of data and metadata that is generated and can be analyzed; and, critically, the limitations, not just of the tools themselves but also of how the data on which they are built can limit the overall result.
The key to the successful implementation of these tools is identifying the areas where these technologies could potentially make an impact. This is where data managers could really make a difference: they are best placed to identify those opportunities and, as a result, provide a better understanding of what is possible. Even a high-level familiarity with AI and other analytical and predictive tools will help ensure the application of these technologies is effective and provides the most value for the data management community.
OSP: Anything to add?
S&J: In summary, I think we shouldn't expect these technologies to provide solutions overnight. It is more realistic that implementation will happen gradually, with integration into standard data management practices and technology platforms to support data managers in their day-to-day work. The best approach, in my opinion anyway, is to maximize the potential of the technology while making the best use of the specialist knowledge of the data management community.