Enron Becomes Unlikely Data Source for Computer Science Researchers

April 29, 2015 Matt Shipman

Image credit: Roscoe Ellis, shared under a Creative Commons license via Flickr.

For Immediate Release

Emerson Murphy-Hill, emerson@csc.ncsu.edu, 919.513.0234

Matt Shipman, matt_shipman@ncsu.edu, 919.515.6386

Computer science researchers have turned to unlikely sources – including Enron – for assembling huge collections of spreadsheets that can be used to study how people use this software. The goal is for the data to facilitate research to make spreadsheets more useful.

“We study spreadsheets because spreadsheet software is used to track everything from corporate earnings to employee benefits, and even simple errors can cost organizations millions of dollars,” says Emerson Murphy-Hill, an assistant professor of computer science at NC State and co-author of two new papers on the work.

However, there are relatively few public collections of spreadsheet data available for research purposes. For example, the collection currently used by most researchers consists of approximately 4,500 spreadsheets.

But researchers are now making two new collections available – one has 15,000 spreadsheets and the other has more than 249,000.

“In addition, we are publishing a technique that other researchers can use to collect additional spreadsheet data,” Murphy-Hill says.

The 15,000-spreadsheet collection consists entirely of spreadsheets attached to internal Enron emails, which were made public after the emails were subpoenaed by prosecutors.

“Our focus is on how users interact with spreadsheets,” Murphy-Hill says. “And these spreadsheets actually tell us a lot about how users represent and manipulate data.”

To assemble the second set of spreadsheets, called Fuse, the researchers developed their own technique to identify and extract spreadsheets from an online archive of over 5 billion webpages. Using their technique, the researchers collected 249,376 spreadsheets – including spreadsheets made as recently as 2014.

“Fuse used cloud infrastructure to search through billions of webpages to identify and extract the spreadsheets we write about in this paper,” says Titus Barik, a Ph.D. student at NC State, researcher at ABB Corporate Research, and lead author of the paper on Fuse. “Commodity cloud computing is incredibly exciting – searching those pages would take about seven years of continuous computation on a single computer, but the economies of scale with cloud computing allowed us to accomplish this with Fuse in only a few days.”

“And the fact that Fuse includes recent spreadsheets is a significant advantage over other spreadsheet collections, because the information is more up-to-date and reflects changes in Excel and other spreadsheet software,” Murphy-Hill says.

“Fuse is also more reproducible than other spreadsheet collections,” says Kevin Lubick, a Ph.D. student at NC State and co-author of a paper about Fuse. “Reproducibility is the cornerstone of good scientific research, but many existing spreadsheet collections are difficult to reproduce. Our technique can be used by anyone, and they’ll get the same results we get. But the results will also include any new spreadsheets made available since the last time the program was run.”

The Enron collection is the subject of a paper called “Enron’s Spreadsheets and Related Emails: A Dataset and Analysis,” which is being presented at the International Conference on Software Engineering May 20-22 in Florence, Italy. The lead author of the paper is Felienne Hermans of Delft University of Technology. The Fuse paper, “Fuse: A Reproducible, Extendable, Internet-scale Corpus of Spreadsheets,” is being presented at the Working Conference on Mining Software Repositories, May 16-17, in Florence, Italy. In addition to Barik, Lubick and Murphy-Hill, the Fuse paper was co-authored by NC State Ph.D. students Justin Smith and John Slankas.

-shipman-

Note to Editors: The study abstracts follow.

“Enron’s Spreadsheets and Related Emails: A Dataset and Analysis”

Authors: Felienne Hermans, Delft University of Technology and Emerson Murphy-Hill, North Carolina State University

Presented: May 20-22, International Conference on Software Engineering, Florence, Italy

Abstract: Spreadsheets are used extensively in business processes around the world and as such, a topic of research interest. Over the past few years, many spreadsheet studies have been performed on the EUSES spreadsheet corpus. While this corpus has served the spreadsheet community well, the spreadsheets it contains are mainly gathered with search engines and as such do not represent spreadsheets used in companies. This paper presents a new dataset, extracted from the Enron Email Archive, containing over 15,000 spreadsheets used within the Enron Corporation. In addition to the spreadsheets, we also present an analysis of the associated emails, where we look into spreadsheet-specific email behavior. Our analysis shows that 1) 24% of Enron spreadsheets with at least one formula contain an Excel error, 2) there is little diversity in the functions used in spreadsheets: 76% of spreadsheets in the presented corpus only use the same 15 functions and, 3) the spreadsheets are substantially more smelly than the EUSES corpus, especially in terms of long calculation chains. Regarding the emails, we observe that spreadsheets 1) are a frequent topic of email conversation with 10% of emails either sending or referring spreadsheets and 2) the emails are frequently discussing errors in and updates to spreadsheets.

“Fuse: A Reproducible, Extendable, Internet-scale Corpus of Spreadsheets”

Authors: Titus Barik, ABB Corporate Research and North Carolina State University; Kevin Lubick, Justin Smith, John Slankas and Emerson Murphy-Hill, North Carolina State University

Presented: May 16-17, Working Conference on Mining Software Repositories, Florence, Italy

Abstract: Spreadsheets are perhaps the most ubiquitous form of end-user programming software. This paper describes a corpus, called Fuse, containing 2,127,284 URLs that return spreadsheets (and their HTTP server responses), and 249,376 unique spreadsheets, contained within a public web archive of over 26.83 billion pages. Obtained using nearly 60,000 hours of computation, the resulting corpus exhibits several useful properties over prior spreadsheet corpora, including reproducibility and extendability. Our corpus is unencumbered by any license agreements, available to all, and intended for wide usage by end-user software engineering researchers. In this paper, we detail the data and the spreadsheet extraction process, describe the data schema, and discuss the trade-offs of Fuse with other corpora.