-
Notifications
You must be signed in to change notification settings - Fork 1
/
Copy pathef.qmd
107 lines (59 loc) · 3.69 KB
/
ef.qmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
# Extracted Features {#sec-3}
## HTRC Extracted Features (EF) Dataset
- Open access under a [Creative Commons Attribution 4.0 International License](http://creativecommons.org/licenses/by/4.0/), and freely [downloadable](https://analytics.hathitrust.org/deriveddatasets)
- Structured data consisting of human-created (catalog) metadata and algorithmically-derived features
- Representing 17.1 million volumes, including those still under copyright (i.e., not quite in sync with the HathiTrust Digital Library )
- Linked-data compliant (JSON-LD)
- Complete documentation available at [https://go.illinois.edu/EF20_documentation](https://go.illinois.edu/EF20_documentation)
## What are "Extracted Features"?
Extracted features are computationally accessible data elements “extracted” (i.e., “derived”) from the HathiTrust Digital Library. They include:
- Volume- and page-level metadata
- Textual and statistical data and word-level metadata
- Extracted from the raw full text of volumes in the Digital Library
EF positions researchers to access the data they need and begin their analysis of text with some standard natural language & statistical preprocessing already done for them.
## Per-volume features
Excerpted from catalog metadata, including:
- Title
- Author
- Language
- Publication information
- Identifiers
- \[in future releases: Subjects\]

Diagram of an EF file.

JSON pretty print view of an EF file
Newly available as part of the TORCHLITE framework (but not yet included in older EF versions):
- Token counts aggregated by volume.
## Extracted Features API

The Extracted Features API allows programmatic access to the Extracted Features Files and not only enables HTRC's TORCHLITE dashboard, but also allows users to connect their own code notebooks, widgets, or applications.
The [EF API Documentation](https://htrc.stoplight.io/docs/ef-api/06db4dc572b49-ef) fully documents the available API calls and allows users to request sample calls in a number of different coding languages.
### API CALLS
- **GET** EF data for a volume by volume id
- https://data.htrc.illinois.edu/ef-api/volumes/{clean-htid}
- Check if a volume exists (**HEAD**)
- https://data.htrc.illinois.edu/ef-api/volumes/{clean-htid}
- **GET** volume metadata by volume id
- https://data.htrc.illinois.edu/ef-api/volumes/{clean-htid}/metadata
- **GET** subset of pages of volume by volume id
- https://data.htrc.illinois.edu/ef-api/volumes/{clean-htid}/pages
- Create workset (**POST**)
- https://data.htrc.illinois.edu/ef-api/worksets
- **DELETE** workset by workset id
- https://data.htrc.illinois.edu/ef-api/worksets/{workset-id}
- **GET** workset
- https://data.htrc.illinois.edu/ef-api/worksets/{workset-id}
- **GET** workset volumes by workset id
- https://data.htrc.illinois.edu/ef-api/worksets/{workset-id}/volumes
- **GET** workset volumes(aggregated) by workset id
- https://data.htrc.illinois.edu/ef-api/worksets/{workset-id}/volumes/aggregated
- **GET** workset volumes metadata by workset id o
- https://data.htrc.illinois.edu/ef-api/worksets/{workset-id}/metadata
## OBSERVABLE Notebooks
[OBSERVABLE Documentation](https://observablehq.com/@observablehq/documentation)
[OBSERVABLE Documentation (Shorter)](https://htrc.gitbook.io/torchlite/htrc-extracted-features)
- [Exploring API](https://observablehq.com/@jswatsch/torchlite-ef-api)
- [Word Cloud](https://observablehq.com/@jswatsch/torchlite-workset-word-cloud)
- [Contributor Map](https://observablehq.com/d/e69a3c5185393caa)
{fig-align="center"}