Report Finds Microsoft Excel Causes Errors in 20 Percent of Genomics Studies


Microsoft Excel, that ubiquitous tool for data crunching, has been playing an unexpected role in the scientific world: It has been corrupting data in genomics studies. A new report in the journal Genome Biology estimates that around 20 percent of scientific papers published in leading genome-focused journals that include gene lists from Excel contain errors due to the program’s default automatic formatting, Slate reports.

The problem is that several genes have symbols that look a lot like dates. The program has a tendency to convert gene symbols like SEPT2 (Septin 2) and MARCH1 (Membrane Associated Ring-CH-Type Finger) into what Excel thinks is proper date form, turning them into 2-Sep and 1-Mar instead. In some cases, SEPT2 became “2006/09/02.”

"Inadvertent gene symbol conversion is problematic because these supplementary files are an important resource in the genomics community that are frequently reused," the paper’s authors write. They reviewed the supplementary gene list Excel files from 18 journals, examining studies published between 2005 and 2015—Excel’s gene-typo issue was first reported in 2004—for date formatting within lists of genes. The analysis was performed by a program that flagged supplementary materials that seemed to be lists of genes, then searched them for date formatting. Out of more than 35,000 supplementary files, they confirmed 987 files with gene errors that were published as part of 704 studies.

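The article does not reproduce the researchers' scanning program, but the idea of searching a gene list for date formatting can be sketched in a few lines of Python. The patterns and the function name find_date_like below are illustrative assumptions, not the authors' actual code, and they cover only the converted forms mentioned above.

```python
import re

# Hypothetical patterns for the converted forms described in the article,
# e.g. "2-Sep", "1-Mar", and "2006/09/02". Real files may contain other forms.
MONTHS = "Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Sept|Oct|Nov|Dec"
DATE_PATTERNS = [
    re.compile(rf"^\d{{1,2}}-(?:{MONTHS})$", re.IGNORECASE),  # day-month, like 2-Sep
    re.compile(r"^\d{4}/\d{1,2}/\d{1,2}$"),                   # year/month/day
]

def find_date_like(symbols):
    """Return the entries that look like dates rather than gene symbols."""
    return [s for s in symbols if any(p.match(s.strip()) for p in DATE_PATTERNS)]

genes = ["TP53", "2-Sep", "BRCA1", "1-Mar", "2006/09/02", "SEPT2"]
print(find_date_like(genes))  # ['2-Sep', '1-Mar', '2006/09/02']
```

Note that an intact symbol such as SEPT2 is not flagged; only entries that already look like dates are reported.
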
Overall, 19.6 percent of the papers with supplementary Excel gene lists in the 18 journals contained gene name errors caused by Excel’s automatic date conversion, but some journals were worse than others. High-impact journals, typically the most prestigious outlets for publishing research, actually had more affected gene lists, which the researchers speculate may be because studies published in these journals are more likely to have larger and more numerous data sets.

The highest proportion of gene lists with errors (more than 20 percent) came from the journals Nucleic Acids Research, Genome Biology, Nature Genetics, Genome Research, Genes and Development, and Nature; conversely, the journals Molecular Biology and Evolution, Bioinformatics, DNA Research, and Genome Biology and Evolution showed errors in less than 10 percent of genomics papers.

While this isn’t the worst scientific error to end up in a journal, since it’s pretty clear that 2006/09/02 isn’t a gene symbol, it’s also fairly disturbing that this many papers could make it through the editing process without anyone noticing that they contained lists of nonexistent genes.

The researchers highlight Google Sheets as a potential alternative to Excel, because it doesn’t suffer from the same symbol-date mix-up, and it seems that when you open Sheets documents in other programs like Excel, the data is protected from Excel’s default date conversion. They suggest that journal editors and reviewers should look out for these errors, pasting gene name lists into blank files and sorting them so that any dates that have been mistakenly inserted will become apparent.
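
The sorting check can be approximated outside a spreadsheet as well. The following is a minimal Python sketch, assuming the gene list has been exported as plain text; the function names are hypothetical. It relies on the fact that in an ordinary string sort, entries beginning with a digit come before entries beginning with a letter, so converted dates rise to the top of a list of symbols that typically start with letters.

```python
def sort_gene_list(symbols):
    """Sort the symbols as plain strings after trimming whitespace."""
    return sorted(s.strip() for s in symbols)

def leading_digit_entries(symbols):
    """Return sorted entries that start with a digit, which are worth a manual look."""
    return [s for s in sort_gene_list(symbols) if s[:1].isdigit()]

genes = ["TP53", "2-Sep", "BRCA1", "1-Mar", "MARCH1"]
print(sort_gene_list(genes))         # ['1-Mar', '2-Sep', 'BRCA1', 'MARCH1', 'TP53']
print(leading_digit_entries(genes))  # ['1-Mar', '2-Sep']
```

This is only a rough stand-in for sorting inside Excel, where true date cells are stored as numbers and are grouped apart from text.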

[h/t Slate]
