Sponsoring groups are Ludwig Schmidt, Tim Althoff, and Pang Wei Koh. Student users are Josh Gardner, Mike Merrill, and Vinayak Gupta.
TabLib (paper, blog) is a dataset consisting of 627M tables and 867B tokens of context. The data is extracted from the Common Crawl and public repositories on GitHub, drawn from web pages, Excel spreadsheets, CSV files, SQLite databases, and more. Beyond the tables themselves, TabLib also includes context from the surrounding content related to each table, such as filenames, source URLs, and text surrounding the table. This makes TabLib one of the largest and most diverse tabular datasets ever publicly released.
Some more figures:
- TabLib contains nearly 7 trillion cells of data.
- TabLib contains ~650 billion rows and ~8 billion columns.
Tables are stored as parquet files, with serialized Arrow bytes in the arrow_bytes column. Each parquet file can contain many individual serialized tables, so to read them you will need to deserialize the bytes.
Users who access the data should also apply for public, credentialized access to the dataset on Hugging Face Datasets.
TabLib is a collection of publicly available data. As noted in the TabLib preprint, it is worth noting that under U.S. copyright law, facts and data are not subject to copyright protection (see Feist v. Rural Telephone).