Data and query provenance: reliably recreating data queries for replication

Our current work uses data from several IPUMS data sources, including data derived from the full-count (100% sample) census datasets.

We would like to provide readers of our paper with a convenient means of recreating our results directly from the raw data.

Currently, we provide a README along with our code that walks the end user through the IPUMS interface to recreate our data extracts. However, this process is tedious and introduces the potential for errors and confusion.

Question: Is there a means of recreating a query based on a script (for example, uploading a YAML file) and/or recreating a query based on a stable URL? Ideally, we would be able to provide researchers aiming to replicate our results with a single script, control file, and/or URL that they could then use to download the data from IPUMS.
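
To make the request concrete, here is a minimal sketch in Python of the kind of control file we have in mind. Everything in it is an assumption for illustration: the field names, the placeholder sample identifiers, and the helper functions `canonical_json`, `spec_fingerprint`, and `file_sha256` are hypothetical and do not correspond to any existing IPUMS format or interface. The sketch only records an extract definition and fingerprints it; it does not submit anything to IPUMS.

```python
import hashlib
import json

# Hypothetical extract specification. Every field name and value below is an
# illustrative assumption, not an IPUMS-defined format.
EXTRACT_SPEC = {
    "description": "Replication extract (illustrative)",
    "samples": ["sample_a", "sample_b"],   # placeholder sample identifiers
    "variables": ["AGE", "SEX"],           # example variable names
    "data_format": "csv",
}


def canonical_json(spec: dict) -> str:
    """Serialize the spec so that key order and list order do not matter."""
    normalized = dict(spec)
    normalized["samples"] = sorted(spec["samples"])
    normalized["variables"] = sorted(spec["variables"])
    return json.dumps(normalized, sort_keys=True, separators=(",", ":"))


def spec_fingerprint(spec: dict) -> str:
    """Return a SHA-256 hex digest identifying the extract definition."""
    return hashlib.sha256(canonical_json(spec).encode("utf-8")).hexdigest()


def file_sha256(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a downloaded file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


if __name__ == "__main__":
    print(canonical_json(EXTRACT_SPEC))
    print(spec_fingerprint(EXTRACT_SPEC))
```

With something like this, a replicator could rebuild the extract by hand, compare the fingerprint of their selections against ours, and check the downloaded file against a published checksum. This assumes that extracts built from identical selections produce byte-identical files, which we have not verified.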

Staff Answer


Jeff Bloem


This is a neat idea, and it is functionality we could consider adding someday. At present, however, the only way to access IPUMS data is through the online data extract system.


Jun 25, 2018 - 09:05 AM

