Most organizations and projects face a data quality dilemma. Analysis of project data may leave the relevant personnel with reservations about the authenticity of the data, the enumerators or even the project impacts. M&E and other management staff may even contemplate redoing the process for the purposes of validation. Here, we look at what data quality is, the dimensions of data quality, the causes of poor data quality and, finally, ways of improving it.
WHAT IS DATA QUALITY?
Data quality is the extent to which data collected for project purposes is sufficient, accurate, reliable, valid and acceptable. It is the ability of data to serve the purposes for which it was gathered. If data fails to meet any of these criteria, then it may not be referred to as quality data.
Indeed, by the very nature of its definition, data quality is quite relative and subjective. While one person may consider a dataset to be of good quality, another might think otherwise. Building consensus about the quality of data within an organization is therefore critical. One practical way of doing so is for the organization to establish standards against which acceptable quality can be judged.
DIMENSIONS OF DATA QUALITY
Quality data must satisfy the following key dimensions. It must be:
Valid/Accurate: Validity refers to the ability of a tool or process to measure what it purports to measure. It is very easy to assume that the data tool is valid. However, this must be verified through scientific processes. For example, if a data tool intends to measure the use of chlorinated water in households, it must indeed measure that and not the use of treated water in general, since there are several methods of treating water.
Reliable: Reliable data is data that will show similar results on repeated attempts. One major way of testing data reliability is the split-half method. In this method, the data is split into two (for example, 1000 observations split into two datasets of 500 each) and the same test is run on both. For the data to be considered quality data, the trends in the results from the two datasets should be similar, though not necessarily identical.
Complete: Without complete data, it is difficult to rely on the results. A large amount of missing data in a dataset is a strong indication that the dataset is of poor quality.
Beneficial (useful): If data is not useful, then it is probably of poor quality. Data must serve the purpose for which it was collected, otherwise it is irrelevant. In other words, the usefulness of the data is a good sign of its quality.
Acceptable: Again, if data is unacceptable to the stakeholders, then that is a good sign that it is not quality data. Data can only be considered quality data if it is acceptable to those who wish to use it. Since it might be difficult for everyone to agree wholly on quality based on this dimension, a certain level of acceptability may suffice.
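The split described under the Reliable dimension can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names, the random split, the use of the mean as the "test" and the tolerance are all hypothetical choices, not a prescribed procedure. Note that it splits observations, as described above, rather than questionnaire items.

```python
import random
import statistics


def split_half_summary(values, seed=0):
    """Randomly split observations into two halves and summarise each half."""
    if len(values) < 2:
        raise ValueError("need at least two observations to split")
    shuffled = list(values)
    # A fixed seed keeps the split reproducible between runs.
    random.Random(seed).shuffle(shuffled)
    mid = len(shuffled) // 2
    first, second = shuffled[:mid], shuffled[mid:]
    mean_first = statistics.mean(first)
    mean_second = statistics.mean(second)
    return {
        "n_first": len(first),
        "n_second": len(second),
        "mean_first": mean_first,
        "mean_second": mean_second,
        "abs_difference": abs(mean_first - mean_second),
    }


def halves_are_similar(values, tolerance, seed=0):
    """True when the two half-sample means differ by no more than `tolerance`."""
    return split_half_summary(values, seed)["abs_difference"] <= tolerance


# Made-up indicator: 1 = household uses chlorinated water, 0 = it does not.
observations = [1 if i % 5 < 3 else 0 for i in range(1000)]
print(split_half_summary(observations))
print(halves_are_similar(observations, tolerance=0.1))
```

What counts as "similar" is a judgment call; the tolerance here is an assumption that an organization would set as part of its own quality standards.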
CAUSES OF POOR DATA QUALITY
1. Errors of omission (Incomplete data): Incomplete data is one of the most prominent causes of poor data quality. It usually takes the form of missing values. When many values are missing, quality is, more often than not, compromised. In some cases a whole entry (observation) is rendered unusable by a single missing component, especially in stratified data, where a missing indicator may make it impossible to assign an observation to a stratum.
2. Errors of commission: Although a rare cause of poor data, additional and unnecessary information in data may compromise its quality. This is especially true where questions have skip patterns. In skip patterns, particular questions may not apply, depending on the response given to a previous question. If responses are recorded where a skip pattern should have applied, then questions regarding data quality begin to arise.
3. Coding errors: This is yet another major cause of data quality concerns. Consider even a simple question such as gender. If the response options were “1= Male” and “2= Female” but during coding, either at collection or at entry, responses are entered as “1= Female” and “2= Male”, then data quality is already compromised.
4. Poor collection methods: Even a simple factor such as the collection method may adversely affect the quality of data. Manual collection methods tend to produce more errors than digital methods, because factors such as illegible handwriting, carelessness and poor understanding of skip patterns all lead to poor quality data. In digital data gathering, by contrast, entry limits and skip patterns can be set in the tool itself, which helps to preserve the quality of the process.
5. Incompetent field staff: The type of field staff hired for the data collection process has a huge impact on data quality. Not only should they be competent, they should also be honest and accountable. Without honesty, field staff may submit false data, mainly by fabricating entries to meet targets.
6. Poor collection tools: In some instances, data quality issues boil down to poor collection tools. Either the tool does not capture what is intended or it is insufficient to capture the whole range of required information. Poor collection tools mainly affect the validity and reliability of the data.
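Some of the causes listed above, namely errors of omission, errors of commission and out-of-range codes, can be detected automatically once data is in electronic form. The sketch below assumes a made-up questionnaire: the field names, the codes and the skip rule (the treatment method is asked only when the household treats its water) are hypothetical and serve only to illustrate the checks.

```python
# Hypothetical questionnaire definition, for illustration only.
REQUIRED_FIELDS = ("household_id", "gender", "treats_water")
VALID_CODES = {
    "gender": {1, 2},               # 1 = Male, 2 = Female
    "treats_water": {1, 2},         # 1 = Yes, 2 = No
    "treatment_method": {1, 2, 3},  # asked only if treats_water == 1
}


def find_issues(record):
    """Return a list of data quality issues found in one observation."""
    issues = []

    # Errors of omission: a required answer is missing.
    for field in REQUIRED_FIELDS:
        if record.get(field) is None:
            issues.append(f"omission: {field} is missing")
    if record.get("treats_water") == 1 and record.get("treatment_method") is None:
        issues.append("omission: treatment_method is missing")

    # Errors of commission: an answer recorded where the skip applies.
    if record.get("treats_water") == 2 and record.get("treatment_method") is not None:
        issues.append("commission: treatment_method should have been skipped")

    # Coding errors: a value outside the preset codes.
    for field, allowed in VALID_CODES.items():
        value = record.get(field)
        if value is not None and value not in allowed:
            issues.append(f"coding: {field} has unknown code {value!r}")

    return issues


sample = {"household_id": "H-001", "gender": 3, "treats_water": 2, "treatment_method": 1}
for issue in find_issues(sample):
    print(issue)
```

A check of this kind only catches codes that fall outside the allowed set. It cannot detect the swapped coding described under coding errors (Male and Female entered the wrong way round), since both values remain valid; that still requires follow-up with the enumerators.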
SOME WAYS OF ADDRESSING DATA QUALITY ISSUES
1. Digital data gathering
One way of addressing data quality issues is to adopt digital processes. For example, to guard against errors of omission, the designer of a digital form is able to specify whether each question is mandatory or optional, and the enumerator then has to fill in the mandatory questions. Additionally, coding challenges are easily addressed by digital data entry, as the codes are preset prior to collection.
2. Data cleaning
In cases where both collection and entry are done manually, one way of addressing quality issues is to invest in data cleaning. This needs to be done both before and after entry. Data cleaning ensures that any issues with the data are identified and followed up with the enumerators.
3. Staff training
Investing in staff training is one of the surest ways of addressing data quality issues. When their skills and knowledge are built, staff not only become more involved in ensuring that data integrity is maintained but are also able to identify barriers to data quality and take steps to avoid them.
4. Integrating data policy into organizational culture
Organizations need to develop data quality policies and integrate them into the organizational culture. In this way, each and every employee becomes an active agent in maintaining data integrity within the organization and in providing alternatives or solutions to data challenges.
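Returning to digital data gathering, the two safeguards mentioned there, mandatory questions and preset codes, can be sketched as follows. This is a minimal illustration and not the behaviour of any particular data collection tool: the `Question` class, the `capture` function and the example form are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Question:
    name: str
    mandatory: bool
    options: dict  # label shown to the enumerator -> preset code


# Hypothetical two-question form.
FORM = (
    Question("gender", mandatory=True, options={"Male": 1, "Female": 2}),
    Question("owns_phone", mandatory=False, options={"Yes": 1, "No": 2}),
)


def capture(form, answers):
    """Turn the enumerator's selections into preset codes.

    Raises ValueError if a mandatory question is unanswered or a
    selection is not one of the preset options.
    """
    coded = {}
    for question in form:
        label = answers.get(question.name)
        if label is None:
            if question.mandatory:
                raise ValueError(f"{question.name} is mandatory")
            continue  # optional question left blank
        if label not in question.options:
            raise ValueError(f"{label!r} is not an option for {question.name}")
        coded[question.name] = question.options[label]
    return coded


print(capture(FORM, {"gender": "Female"}))
```

Because the enumerator picks a label and the form stores the code, the mapping between the two is fixed once at design time rather than retyped for every observation.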