Data Provenance Initiative
The Data Provenance Initiative’s goal is to audit popular and widely used datasets with large-scale Legal and AI expert-guided annotation.
It also develops a set of indicators necessary for tracing dataset lineage and understanding dataset risks.
The initiative initially focuses on alignment finetuning datasets because the community increasingly emphasizes them for improving helpfulness, reducing harmfulness, and orienting models to human values.
The DPCollection annotation pipeline uses human and human-assisted procedures to annotate dataset Identifiers, Characteristics, and Provenance.
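As an illustration only, the three annotation categories could be grouped into a single record per dataset. The sketch below is not the initiative's actual schema: the class name `DatasetRecord`, the method `missing_categories`, and the example keys (`name`, `licenses`) are all assumptions made for this example.

```python
from dataclasses import dataclass, field


@dataclass
class DatasetRecord:
    """Hypothetical record; field contents are illustrative assumptions."""

    # One mapping per annotation category named in the text.
    identifiers: dict[str, str] = field(default_factory=dict)
    characteristics: dict[str, list[str]] = field(default_factory=dict)
    provenance: dict[str, list[str]] = field(default_factory=dict)

    def missing_categories(self) -> list[str]:
        """Return the names of categories that have no annotations yet."""
        groups = {
            "identifiers": self.identifiers,
            "characteristics": self.characteristics,
            "provenance": self.provenance,
        }
        return [name for name, values in groups.items() if not values]


# Example: a partially annotated dataset (all values are made up).
record = DatasetRecord(
    identifiers={"name": "example-dataset"},
    provenance={"licenses": ["CC BY 4.0"]},
)
print(record.missing_categories())
```

A check like `missing_categories` is one simple way an annotation workflow might flag records that still need human review before release.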
Data Provenance Explorer (DPExplorer)
"We release our extensive audit, as two tools: a data explorer interface, the Data Provenance Explorer (DPExplorer) for widespread use, and an accompanying repository for practitioners to download the data"
Collecting comprehensive metadata for each dataset required drawing on several sources, including linking to resources already on the web (e.g., dataset websites, papers, and GitHub repositories).
License Annotation Process
One of our central contributions is to validate the licenses associated with widely used and adopted datasets.