The corpus of TOC, masthead, and ad index page image files and djvuXML files for Softalk magazine

SoftalkAppleProject/datasets_toc-masthead-adindex

Softalk Magazine: The TOC, Masthead, and Advertiser Index Corpus - Work In Process

Project: The Softalk Apple Project
URL: http://www.SoftalkApple.com
Internet Archive collection: https://archive.org/details/softalkapple

This repository contains the Softalk magazine (Apple edition) Table of Contents (TOC), Masthead, and Advertisers Index Corpus as a dataset.

UPDATE (27 Feb 2017): We have uploaded the COMPLETE "ppg2leaf" map: the metadata tuples that relate each printed page of a Softalk issue to its "leaf" ID, the unique component of that page's digitized image filename at the Internet Archive. The ppg2leaf_map subdirectory contains 48 interim data files in JSON format and an Excel spreadsheet that combines all 9,547 leafs. The spreadsheet includes a pivot table that breaks out, issue by issue, the ratio of actual (54%) to inferred (46%) print page numbers. The full ppg2leaf map is also provided in CSV format.
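As a rough illustration of how the ppg2leaf map might be used, the sketch below builds a lookup from (issue, printed page) to leaf number and computes the share of actual vs. inferred page numbers. The column names (`issue`, `leaf`, `print_page`, `page_source`) and the sample rows are assumptions made for this example; check the header of the CSV file in ppg2leaf_map for the real field names.

```python
import csv
import io
from collections import Counter

# Hypothetical sample rows; the real CSV's columns and values may differ.
SAMPLE_CSV = """issue,leaf,print_page,page_source
1981-01,0,1,inferred
1981-01,1,2,actual
1981-01,2,3,actual
1981-01,3,4,inferred
"""


def load_ppg2leaf(text):
    """Return a dict mapping (issue, print_page) to the leaf number."""
    reader = csv.DictReader(io.StringIO(text))
    return {(row["issue"], row["print_page"]): int(row["leaf"]) for row in reader}


def source_ratio(text):
    """Return the fraction of rows for each page_source value."""
    reader = csv.DictReader(io.StringIO(text))
    counts = Counter(row["page_source"] for row in reader)
    total = sum(counts.values())
    return {source: count / total for source, count in counts.items()}


ppg2leaf = load_ppg2leaf(SAMPLE_CSV)
ratios = source_ratio(SAMPLE_CSV)
```

With the real CSV, the text would come from reading the file rather than from an inline string, and the same per-source counting could be grouped by issue to mirror the spreadsheet's pivot table.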

Video Updates on our ppg2leaf_ferret Project

FactMiners and The Softalk Apple Project have submitted two applied research papers to DATeCH2017. The second paper is all about the "ppg2leaf" challenge: recognizing the importance of the foundational metadata tuple that relates a document's printed page number (whether printed on the page or confirmed via human #GroundTruth metadata discovery and curation) to its digitized image filename, identified by a "leaf" number at the Internet Archive.

Click on the images below to watch the YouTube videos.

ppg2leaf_ferret Video Update #1

This second video project update shows a much improved ppg2leaf_ferret app and includes a brief demonstration of using the ferret to develop the ppg2leaf map for the historic August 1981 issue of Byte magazine which is all about Smalltalk.

ppg2leaf_ferret Video Update #2

The images subdirectory contains the full-resolution individual page images for each of the 91 pages that contain one or more of the document structures included in this dataset. The djvu_text directory contains both the djvuXML and djvu text files generated by the Internet Archive during the stock digitization process. The magpage directory -- currently containing incomplete work-in-process files -- will collect the files in MAGAZINE and PAGE format as part of the (upper) #GroundTruth edition of the corpus.
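For readers who want to work with the djvuXML files directly, here is a minimal sketch that pulls the OCR text out line by line. It assumes the usual Internet Archive djvuXML layout, in which `WORD` elements are nested inside `LINE` elements; the inline sample document is made up for illustration and is far smaller than a real page file.

```python
import xml.etree.ElementTree as ET

# Made-up miniature document following the assumed LINE/WORD nesting.
SAMPLE_XML = """<DjVuXML><BODY><OBJECT><HIDDENTEXT><PAGECOLUMN><REGION><PARAGRAPH>
<LINE><WORD coords="10,20,50,5">Softalk</WORD> <WORD coords="60,20,120,5">Magazine</WORD></LINE>
<LINE><WORD coords="10,40,70,25">Contents</WORD></LINE>
</PARAGRAPH></REGION></PAGECOLUMN></HIDDENTEXT></OBJECT></BODY></DjVuXML>"""


def page_lines(xml_text):
    """Return the OCR text as a list of strings, one per LINE element."""
    root = ET.fromstring(xml_text)
    lines = []
    for line in root.iter("LINE"):
        # Join the words of each line with single spaces, skipping empty words.
        words = [word.text.strip() for word in line.iter("WORD") if word.text]
        if words:
            lines.append(" ".join(words))
    return lines


lines = page_lines(SAMPLE_XML)
```

The `coords` attribute on each word is ignored here; a layout-aware tool would use it to locate structures such as the masthead or the advertiser index on the page.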

The scripts directory includes the Python scripts used to generate the text and CSV files for the masthead staff and Ad Index structures within this corpus. Non-developers should note that both the staff and ad gathering scripts are meant to be run from the command line, with the source directory and then the output directory supplied as arguments. Developers extending these scripts will need to supply the same two arguments in their project's run/debug configuration.
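The calling convention described above (source directory first, output directory second) could be handled along the following lines. This is a sketch only, not the code of the scripts in this repository; the function names and the choice of `argparse` are assumptions made for illustration.

```python
import argparse
from pathlib import Path


def parse_args(argv=None):
    """Parse two positional arguments: source directory, then output directory."""
    parser = argparse.ArgumentParser(
        description="Gather structured data from a directory of source files."
    )
    parser.add_argument("source_dir", type=Path, help="directory holding the input files")
    parser.add_argument("output_dir", type=Path, help="directory for the generated files")
    return parser.parse_args(argv)


def prepare_dirs(args):
    """Check that the source directory exists and create the output directory."""
    if not args.source_dir.is_dir():
        raise SystemExit(f"Source directory not found: {args.source_dir}")
    args.output_dir.mkdir(parents=True, exist_ok=True)
    return args.source_dir, args.output_dir


# Example: parse an explicit argument list instead of reading sys.argv.
example_args = parse_args(["djvu_text", "out"])
```

Calling `parse_args()` with no argument reads the real command line, so a script built this way would be invoked as `python script_name.py <source_dir> <output_dir>`, where `script_name.py` stands for whichever script is being run.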

A manifest Excel spreadsheet -- and its equivalent in CSV, JSON, and XML formats -- is provided.

Softalk TOC/Masthead/AdIndex Dataset by The Softalk Apple Project is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.

Based on a work by the Citizen Scientists of The Softalk Apple Project (http://www.SoftalkApple.com) and by FactMiners' Citizen Scientists.
