Note
Version 0.1.0 is coming to PyPI soon.
The aim of this project is simple: create a basic Python library to explore and interact with open data sources.
This will improve and speed up how users:
- Navigate open data catalogues
- Find the data that they need
- Get that data into a format and/or location for further analysis
TBC...
pip install HerdingCats
or
poetry add HerdingCats
Note
Herding-CATs is currently under active development. Features may change as the project evolves.
Because organisations set up and deploy their open data catalogues in slightly different ways, some methods may not work with every catalogue.
We will do our best to ensure that most methods work across all catalogues and that a good variety of data catalogues is present.
Note
Where the data justifies it, we will maintain methods for bespoke catalogues that go beyond typical data catalogue backends such as CKAN and OpenDataSoft.
Herding-CATs supports the following data sources by default:
Catalogue Name | Website | Catalogue Backend |
---|---|---|
London Datastore | data.london.gov.uk | CKAN |
Subak Data Catalogue | data.subak.org | CKAN |
UK Gov Open Data | data.gov.uk | CKAN |
Humanitarian Data Exchange | data.humdata.org | CKAN |
UK Power Networks | ukpowernetworks.opendatasoft.com | Open Datasoft |
Infrabel | opendata.infrabel.be | Open Datasoft |
Paris | opendata.paris.fr | Open Datasoft |
Toulouse | data.toulouse-metropole.fr | Open Datasoft |
Elia Belgian Energy | opendata.elia.be | Open Datasoft |
EDF Energy | opendata.edf.fr | Open Datasoft |
Cadent Gas | cadentgas.opendatasoft.com | Open Datasoft |
French Gov Open Data | data.gouv.fr | Bespoke API |
Gestionnaire de Réseaux de Distribution | opendata.agenceore.fr | Open Datasoft |
ONS Nomis | nomisweb.co.uk | Bespoke API |
This Python library provides a way to explore and interact with CKAN, OpenDataSoft, and French Government data catalogues - as well as other bespoke sources.
HerdingCATs follows a Session -> Explorer -> Loader pattern.
It is structured around the following main classes:
- CkanCatExplorer: For exploring CKAN-based data catalogues
- OpenDataSoftCatExplorer: For exploring OpenDataSoft-based data catalogues
- FrenchGouvCatExplorer: For exploring the French Government data catalogue
- NomisCatExplorer: For exploring ONS data
- CkanLoader: For loading CKAN catalogue data
- OpenDataSoftLoader: For loading OpenDataSoft catalogue data
- FrenchGouvLoader: For loading French Government catalogue data
- NomisLoader: For loading ONS Nomis data
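The Session -> Explorer -> Loader flow can be pictured with a few stand-in classes. The sketch below is not HerdingCATs code: `DemoSession`, `DemoExplorer`, `DemoLoader`, and the in-memory `FAKE_CATALOGUE` are hypothetical names that only illustrate how the three roles might hand off to one another.

```python
from dataclasses import dataclass


@dataclass
class DemoSession:
    """Hypothetical stand-in for a catalogue session (not the real CatSession)."""

    base_url: str
    is_open: bool = False

    def __enter__(self):
        # A real session would set up an HTTP connection here.
        self.is_open = True
        return self

    def __exit__(self, exc_type, exc, tb):
        # Release the connection even if the body of the 'with' block raised.
        self.is_open = False
        return False


# Hypothetical in-memory catalogue standing in for a remote API.
FAKE_CATALOGUE = {
    "air-quality": [{"format": "csv", "url": "https://example.org/air.csv"}],
}


class DemoExplorer:
    """Finds datasets and their resources through an open session."""

    def __init__(self, session: DemoSession):
        if not session.is_open:
            raise RuntimeError("session must be opened with a 'with' block")
        self.session = session

    def list_packages(self) -> list:
        return sorted(FAKE_CATALOGUE)

    def resources_for(self, name: str) -> list:
        return FAKE_CATALOGUE[name]


class DemoLoader:
    """Turns resource metadata into something ready for analysis."""

    def load(self, resources: list) -> str:
        # A real loader would download and parse; this one just picks the URL.
        return resources[0]["url"]


with DemoSession("https://example.org") as session:
    explorer = DemoExplorer(session)
    packages = explorer.list_packages()
    picked = DemoLoader().load(explorer.resources_for(packages[0]))
```

Keeping the connection inside a context manager means the explorer never has to manage its own cleanup, and the loader only depends on the metadata the explorer returns.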
All explorer classes work with a CatSession object that handles the connection to the chosen data catalogue.
import HerdingCats as hc

def main():
    with hc.CatSession(hc.CkanDataCatalogues.LONDON_DATA_STORE) as session:
        explore = hc.CkanCatExplorer(session)

if __name__ == "__main__":
    main()
- check_site_health(): Checks the health of the CKAN site
- get_package_count(): Returns the total number of packages in a catalogue
- get_package_list(): Returns a dictionary of all available packages
- get_package_list_dataframe(df_type: Literal["pandas", "polars"]): Returns a dataframe of all available packages
- get_package_list_extra(): Returns a list with extra package information
- get_package_list_dataframe_extra(df_type: Literal["pandas", "polars"]): Returns a dataframe with extra package information
- get_organisation_list(): Returns total number of organizations and their details
- show_package_info(package_name: Union[str, dict, Any]): Returns package metadata including resource information
- show_package_info_dataframe(package_name: Union[str, dict, Any], df_type: Literal["pandas", "polars"]): Returns package metadata as a dataframe
- package_search(search_query: str, num_rows: int): Searches for packages and returns results
- package_search_condense(search_query: str, num_rows: int): Returns a condensed view of package information
- package_search_condense_dataframe(search_query: str, num_rows: int, df_type: Literal["pandas", "polars"]): Returns a condensed view with packed resources as a dataframe
- package_search_condense_dataframe_unpack(search_query: str, num_rows: int, df_type: Literal["pandas", "polars"]): Returns a condensed view with unpacked resources as a dataframe
- extract_resource_url(package_info: List[Dict]): Extracts resource URLs and metadata from package info. This is used to get the resource URL and format for the CKAN data loader class.
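To make the resource-extraction step more concrete, here is a minimal sketch of pulling names, formats, and URLs out of CKAN-style package metadata. CKAN's `package_show` response does list resources with `name`, `format`, and `url` keys, but the helper `pick_resource_fields`, the sample record, and the returned row shape are assumptions for illustration; the actual output of `extract_resource_url` may differ.

```python
from typing import Dict, List


def pick_resource_fields(package_info: List[Dict]) -> List[List[str]]:
    """Hypothetical helper: collect [name, format, url] for every resource."""
    rows = []
    for package in package_info:
        for resource in package.get("resources", []):
            url = resource.get("url")
            if not url:
                continue  # skip resources that have no download link
            rows.append(
                [
                    resource.get("name") or "",
                    (resource.get("format") or "").lower(),
                    url,
                ]
            )
    return rows


# Made-up metadata shaped like a CKAN package record.
sample_package_info = [
    {
        "name": "example-package",
        "resources": [
            {"name": "Data 2023", "format": "CSV", "url": "https://example.org/2023.csv"},
            {"name": "Data 2023 (parquet)", "format": "Parquet", "url": "https://example.org/2023.parquet"},
            {"name": "Broken entry", "format": "CSV", "url": ""},
        ],
    }
]

resource_rows = pick_resource_fields(sample_package_info)
```

Normalising the format to lower case makes it easier for a loader to match a requested format such as "parquet" regardless of how the publisher capitalised it.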
import HerdingCats as hc

def main():
    with hc.CatSession(hc.OpenDataSoftDataCatalogues.UK_POWER_NETWORKS) as session:
        explore = hc.OpenDataSoftCatExplorer(session)

if __name__ == "__main__":
    main()
- check_site_health(): Checks the health of the OpenDataSoft site
- fetch_all_datasets(): Retrieves all datasets from an OpenDataSoft catalogue
- show_dataset_info(dataset_id): Returns detailed metadata about a specific dataset
- show_dataset_export_options(dataset_id): Returns available export formats and download URLs
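For context on what export options can look like, the Opendatasoft Explore API v2.1 exposes dataset exports at `/api/explore/v2.1/catalog/datasets/{dataset_id}/exports/{format}`. The sketch below composes such URLs; `build_export_url` and the dataset id `example-dataset` are hypothetical, and whether HerdingCATs builds its URLs this way is an assumption.

```python
from urllib.parse import quote

EXPLORE_API_PATH = "/api/explore/v2.1/catalog/datasets"


def build_export_url(domain: str, dataset_id: str, export_format: str) -> str:
    """Hypothetical helper: compose an Explore API v2.1 export URL."""
    # Percent-encode the path segments so unusual ids cannot break the URL.
    safe_id = quote(dataset_id, safe="")
    safe_format = quote(export_format, safe="")
    return f"https://{domain}{EXPLORE_API_PATH}/{safe_id}/exports/{safe_format}"


# One URL per export format for a placeholder dataset id.
export_urls = {
    fmt: build_export_url("ukpowernetworks.opendatasoft.com", "example-dataset", fmt)
    for fmt in ("csv", "json", "parquet")
}
```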
import HerdingCats as hc

def main():
    with hc.CatSession(hc.FrenchGouvCatalogue.GOUV_FR) as session:
        explore = hc.FrenchGouvCatExplorer(session)

if __name__ == "__main__":
    main()
- check_health_check(): Checks the health of the French Government data portal
- get_all_datasets(): Returns a dictionary of all available datasets
- get_dataset_meta(identifier: str): Returns metadata for a specific dataset
- get_dataset_meta_dataframe(identifier: str, df_type: Literal["pandas", "polars"]): Returns dataset metadata as a dataframe
- get_multiple_datasets_meta(identifiers: list): Fetches metadata for multiple datasets
- get_dataset_resource_meta(data: dict): Returns metadata for dataset resources
- get_dataset_resource_meta_dataframe(data: dict, df_type: Literal["pandas", "polars"]): Returns resource metadata as a dataframe
- get_all_orgs(): Returns all organizations in the catalogue
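Several explorer methods take a `df_type` argument of `"pandas"` or `"polars"`. One plausible way to honour such a switch is sketched below; `records_to_dataframe` is a hypothetical helper, not a HerdingCATs function, and it assumes the metadata has already been flattened into a list of dictionaries.

```python
from typing import Any, Dict, List, Literal


def records_to_dataframe(
    records: List[Dict[str, Any]],
    df_type: Literal["pandas", "polars"],
):
    """Hypothetical helper: build a dataframe with the requested library."""
    if df_type == "pandas":
        # Imported lazily so that only the requested library must be installed.
        import pandas as pd

        return pd.DataFrame(records)
    if df_type == "polars":
        import polars as pl

        return pl.DataFrame(records)
    raise ValueError(f"df_type must be 'pandas' or 'polars', got {df_type!r}")
```

Importing each library only inside its own branch keeps both dependencies optional, and rejecting unknown values early surfaces typos at the call site.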
import HerdingCats as hc

def main():
    with hc.CatSession(hc.NomisDataCatalogues.ONS_NOMIS) as session:
        explore = hc.NomisCatExplorer(session)

if __name__ == "__main__":
    main()
- get_datasets(): Returns a list of all available datasets
- get_dataset_info(dataset_id: str): Returns metadata for a specific dataset
- get_dataset_overview(dataset_id: str): Returns an overview of a specific dataset
- get_codelist_info(codelist_id: str): Returns metadata for a specific codelist
- generate_full_dataset_download_url(dataset_id: str, geography_template: ONSNomisGeographyTemplates | None = None): Generates a full dataset download URL
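As a rough illustration of what a dataset download URL can look like, the public Nomis API serves CSV data at `/api/v01/dataset/{dataset_id}.data.csv` and accepts a `geography` query parameter. The sketch below is an assumption about the general shape only: `build_nomis_csv_url` is a hypothetical helper, and `PLACEHOLDER_GEOGRAPHY` stands in for a real geography code, which this document does not list.

```python
from typing import Optional
from urllib.parse import urlencode

NOMIS_API_BASE = "https://www.nomisweb.co.uk/api/v01"


def build_nomis_csv_url(dataset_id: str, geography: Optional[str] = None) -> str:
    """Hypothetical helper: compose a Nomis CSV download URL."""
    url = f"{NOMIS_API_BASE}/dataset/{dataset_id}.data.csv"
    if geography is None:
        return url  # no filter: request the dataset as-is
    query = urlencode({"geography": geography})
    return f"{url}?{query}"


full_url = build_nomis_csv_url("NM_57_1")
# The geography value here is a placeholder, not a real Nomis code.
filtered_url = build_nomis_csv_url("NM_57_1", geography="PLACEHOLDER_GEOGRAPHY")
```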
All resource loader classes (CkanLoader, OpenDataSoftLoader, FrenchGouvLoader, NomisLoader) support the following methods:
- polars_data_loader(): Loads data into a Polars DataFrame
- pandas_data_loader(): Loads data into a Pandas DataFrame
- aws_s3_data_loader(): Loads data into AWS S3 as either raw data (depending on the format) or parquet file (if you choose to load as parquet)
Note
We will be supporting DuckDB and MotherDuck soon.
import HerdingCats as hc

def main():
    with hc.CatSession(hc.CkanDataCatalogues.HUMANITARIAN_DATA_STORE) as session:
        explore = hc.CkanCatExplorer(session)
        loader = hc.CkanCatResourceLoader()

        # Get list of all packages
        packages = explore.get_package_list()

        # Get info for a specific package
        data = explore.show_package_info("package_name")

        # Extract resource URLs
        resources = explore.extract_resource_url(data)

        # Load into different formats
        df_polars = loader.polars_data_loader(resources)

        # Optionally specify the desired format; otherwise it defaults to the first dataset in the list
        df_pandas = loader.pandas_data_loader(resources, desired_format="parquet")

if __name__ == "__main__":
    main()
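The comment in the example above says the loader falls back to the first dataset in the list when no format is given. A minimal sketch of that selection rule follows; `choose_resource` and the dictionary shape are hypothetical, and the real loader may resolve formats differently.

```python
from typing import Dict, List, Optional


def choose_resource(
    resources: List[Dict[str, str]],
    desired_format: Optional[str] = None,
) -> Dict[str, str]:
    """Hypothetical helper: pick a resource by format, else take the first."""
    if not resources:
        raise ValueError("no resources to choose from")
    if desired_format is None:
        return resources[0]  # default: first resource in the list
    wanted = desired_format.lower()
    for resource in resources:
        if resource["format"].lower() == wanted:
            return resource
    raise ValueError(f"no resource with format {desired_format!r}")


# Made-up resources for illustration.
example_resources = [
    {"format": "CSV", "url": "https://example.org/data.csv"},
    {"format": "Parquet", "url": "https://example.org/data.parquet"},
]

default_choice = choose_resource(example_resources)
parquet_choice = choose_resource(example_resources, desired_format="parquet")
```

Raising an error when the requested format is missing, rather than silently falling back, avoids loading a file the caller did not ask for.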
import HerdingCats as hc

def main():
    with hc.CatSession(hc.OpenDataSoftDataCatalogues.UK_POWER_NETWORKS) as session:
        explore = hc.OpenDataSoftCatExplorer(session)
        loader = hc.OpenDataSoftResourceLoader()

        # Get export options for a dataset
        data = explore.show_dataset_export_options("package_name")

        # Load into Polars DataFrame (some catalogues require an API key)
        df = loader.polars_data_loader(data, format_type="parquet", api_key="your_api_key")

if __name__ == "__main__":
    main()
import HerdingCats as hc

def main():
    with hc.CatSession(hc.FrenchGouvCatalogue.GOUV_FR) as session:
        explore = hc.FrenchGouvCatExplorer(session)
        loader = hc.FrenchGouvResourceLoader()

        # Get all datasets
        datasets = explore.get_all_datasets()

        # Get metadata for a specific dataset
        meta_data = explore.get_dataset_meta("dataset-id")

        # Get resource metadata for a specific dataset
        resource_meta = explore.get_dataset_resource_meta(meta_data)

        # Load resource metadata into Polars DataFrame and specify the format of the data you want to load
        df = loader.polars_data_loader(resource_meta, "xlsx")

if __name__ == "__main__":
    main()
import HerdingCats as hc

def main():
    with hc.CatSession(hc.ONSNomisAPI.ONS_NOMI) as session:
        explore = hc.ONSNomisCatExplorer(session)
        loader = hc.ONSNomisLoader()

        data = explore.get_dataset_overview("NM_57_1")
        print(data)

        url = explore.generate_full_dataset_download_url("NM_57_1", hc.ONSNomisGeographyTemplates.LA_COUNTY_UNITARY_APR_23)
        print(url)

        df = loader.polars_data_loader(url)
        print(df)

        loader.upload_data(url, "test-herding-cats", "test-herding-cats-nomis", "raw", "local")

if __name__ == "__main__":
    main()
Contributions are welcome! Please feel free to submit a PR.
For major changes, please open an issue first to discuss what you would like to change.