Commit 820984f

Initial commit
0 parents  commit 820984f

9 files changed

+672
-0
lines changed

.gitattributes

+2
@@ -0,0 +1,2 @@
1+
# Auto detect text files and perform LF normalization
2+
* text=auto

.gitignore

+5
@@ -0,0 +1,5 @@
1+
.DS_Store
2+
.Rproj.user
3+
.Rhistory
4+
/.quarto/
5+
/.luarc.json

README.md

+262
@@ -0,0 +1,262 @@
1+
2+
<!-- README.md is generated from README.qmd. Please edit that file -->
3+
4+
# Quarto word count
5+
6+
- <a href="#why-counting-words-is-hard"
7+
id="toc-why-counting-words-is-hard">Why counting words is hard</a>
8+
- <a href="#using-the-word-count-script"
9+
id="toc-using-the-word-count-script">Using the word count script</a>
10+
- <a href="#using-as-an-extension" id="toc-using-as-an-extension">Using
11+
as an extension</a>
12+
- <a href="#using-without-an-extension"
13+
id="toc-using-without-an-extension">Using without an extension</a>
14+
- <a href="#how-this-all-works" id="toc-how-this-all-works">How this all
15+
works</a>
16+
- <a href="#example" id="toc-example">Example</a>
17+
- <a href="#credits" id="toc-credits">Credits</a>
18+
19+
## Why counting words is hard
20+
21+
In academic writing and publishing, word counts are important, since
22+
many journals specify word limits for submitted articles. Counting how
23+
many words you have in a Quarto Markdown file is tricky, though, for a
24+
bunch of reasons:
25+
26+
1. **Compatibility with Word**: Academic publishing portals tend to
27+
care about Microsoft Word-like counts, but lots of R and Python
28+
functions for counting words in a document treat word boundaries
29+
differently.
30+
31+
For instance, Word considers hyphenated words to be one word (e.g.,
32+
“A super-neat kick-in-the-pants example” is 4 words in Word), while
33+
`stringi::stri_count_words()` counts them as multiple words (e.g., “A
34+
super-neat kick-in-the-pants example” is 8 words with {stringi}).
35+
Making matters worse, {stringi} counts “/” as a word boundary, so
36+
URLs can severely inflate your actual word count.
37+
38+
2. **Extra text elements**: Academic writing typically doesn’t count
39+
the title, abstract, table text, table and figure captions, or
40+
equations as words in the manuscript.
41+
42+
In computational documents like Quarto Markdown, these often don’t
43+
appear until the document is rendered, so simply running a
44+
word-counting function on a `.qmd` file will count the code
45+
generating tables and figures, again inflating the word count.
46+
47+
3. **Citations and bibliography**: Academic writing typically counts
48+
references as part of the word count (even though IT SHOULDN’T).
49+
However, in Quarto Markdown (and all other flavors of pandoc-based
50+
markdown), citations don’t get counted until the bibliography is
51+
generated, which only happens when the document is rendered.
52+
53+
Simply running a word-counting function on a `.qmd` file (or
54+
something like the super neat
55+
[{wordcountaddin}](https://github.com/benmarwick/wordcountaddin))
56+
will see citekeys in the document like `@Lovelace1842`, but it will
57+
only count them as individual words (e.g. not “(Lovelace 1842)” in
58+
in-text styles or ‘Ada Augusta Lovelace, “Sketch of the Analytical
59+
Engine…,” *Taylor’s Scientific Memoirs* 3 (1842): 666–731.’ in
60+
footnote styles), and more importantly, it will not count any of the
61+
automatically generated references in the final bibliography list.
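The word-boundary problem from the first item can be illustrated with a small Python sketch. It does not reproduce Word's or {stringi}'s actual algorithms; the two helper functions below are hypothetical stand-ins that only approximate "split on whitespace" versus "split on every non-alphanumeric character".

```python
import re


def count_whitespace_words(text):
    """Approximate a Word-like count: every whitespace-separated chunk is one word."""
    return len(text.split())


def count_alnum_runs(text):
    """Approximate a stricter tokenizer: hyphens, slashes, and dots all split words."""
    return len(re.findall(r"[A-Za-z0-9]+", text))


phrase = "A super-neat kick-in-the-pants example"
url = "https://github.com/crsh/rmdfiltr"

print(count_whitespace_words(phrase))  # 4
print(count_alnum_runs(phrase))        # 8
print(count_whitespace_words(url))     # 1
print(count_alnum_runs(url))           # 5
```

The same sentence yields 4 or 8 words depending only on the boundary rule, and a single URL goes from 1 word to 5.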
62+
63+
## Using the word count script
64+
65+
This extension fixes all three of these issues by relying on a [Lua
66+
filter](_extensions/wordcount/wordcount.lua) to count the words after
67+
the document has been rendered and before it has been converted to its
68+
final output format. [Frederik Aust (@crsh)](https://github.com/crsh)
69+
uses the same Lua filter for counting words in R Markdown documents with
70+
the [{rmdfiltr}](https://github.com/crsh/rmdfiltr) package (I actually
71+
just copied and slightly expanded [that package’s
72+
`inst/wordcount.lua`](https://github.com/crsh/rmdfiltr/blob/master/inst/wordcount.lua)).
73+
The filter works really well and [is generally comparable to Word’s word
74+
count](https://cran.r-project.org/web/packages/rmdfiltr/vignettes/wordcount.html).
75+
76+
The word count will appear in the terminal output when rendering the
77+
document. It shows three different values: (1) the total count, (2) the
78+
count for the document sans references, and (3) the count for the
79+
reference list alone.
80+
81+
``` text
82+
133 total words
83+
-----------------------------
84+
76 words in text body
85+
57 words in reference section
86+
```
87+
88+
There are two ways to use the filter: (1) as a formal Quarto format
89+
extension and (2) as a set of pandoc filters. You should definitely
90+
glance through the [“How this all works” section](#how-this-all-works)
91+
to understand… um… how it works.
92+
93+
### Using as an extension
94+
95+
Install the extension in your project by running this in your terminal:
96+
97+
``` bash
98+
quarto use template andrewheiss/quarto-wordcount
99+
```
100+
101+
You can then specify one of three different output formats in your YAML
102+
settings: `wordcount-html`, `wordcount-pdf`, and `wordcount-docx`:
103+
104+
``` yaml
title: Something
format:
  wordcount-html: default
```
109+
110+
The `wordcount-FORMAT` format type is really just a wrapper for each
111+
base format (HTML, PDF, and Word), so all other HTML-, PDF-, and
112+
Word-specific options work like normal:
113+
114+
``` yaml
title: Something
format:
  wordcount-html:
    toc: true
    fig-align: center
    cap-location: margin
```
122+
123+
### Using without an extension
124+
125+
Alternatively, if you don’t want to install the extension, download the
126+
two Lua scripts [`wordcount.lua`](_extensions/wordcount/wordcount.lua)
127+
and [`citeproc.lua`](_extensions/wordcount/citeproc.lua), put them
128+
somewhere in your project, and reference them in the YAML front matter
129+
of your document. Make sure you also disable citeproc so that it doesn’t
130+
run twice.
131+
132+
``` yaml
title: Something
format:
  html:
    citeproc: false
    filters: [citeproc.lua, wordcount.lua]
```
139+
140+
## How this all works
141+
142+
Behind the scenes, pandoc typically converts a Markdown document to an
143+
abstract syntax tree (AST), or an output-agnostic representation of all
144+
the document elements. In AST form, it’s easy to use the [Lua
145+
language](https://pandoc.org/lua-filters.html) to extract or exclude
146+
specific elements of the document (e.g., to exclude captions or look only at
147+
the references).
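To make the AST idea concrete, here is a minimal Python sketch that walks a tiny hand-written tree in the shape of pandoc's JSON AST (`t` for the element type, `c` for its contents). It is not the actual `wordcount.lua` logic: the `count_words` helper is hypothetical, it counts every `Str` element as one word, and it crudely treats any `Div` as the reference section.

```python
def count_words(node, skip_types=frozenset()):
    """Count Str elements under `node`, ignoring elements whose type is in skip_types."""
    if isinstance(node, list):
        return sum(count_words(child, skip_types) for child in node)
    if isinstance(node, dict):
        kind = node.get("t")
        if kind in skip_types:
            return 0
        if kind == "Str":
            return 1
        return count_words(node.get("c", []), skip_types)
    return 0  # plain strings and numbers (attributes, code text) are not words


# Hand-written example tree, not real pandoc output.
blocks = [
    {"t": "Para", "c": [
        {"t": "Str", "c": "A"}, {"t": "Space"},
        {"t": "Str", "c": "super-neat"}, {"t": "Space"},
        {"t": "Str", "c": "example"},
    ]},
    {"t": "CodeBlock", "c": [["", ["r"], []], "plot(cars)"]},
    {"t": "Div", "c": [["refs", [], []], [
        {"t": "Para", "c": [
            {"t": "Str", "c": "Lovelace"}, {"t": "Space"},
            {"t": "Str", "c": "1842"},
        ]},
    ]]},
]

total = count_words(blocks)
body = count_words(blocks, skip_types={"Div"})
print(total, body, total - body)  # 5 3 2
```

Because the code block's contents are stored as a plain string rather than as `Str` elements, the code is never counted, and skipping one element type is enough to separate the body from the references.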
148+
149+
Quarto was designed to be language-agnostic, so {rmdfiltr}’s approach of
150+
using R to dynamically set the path to its Lua filters in YAML front
151+
matter does not work with Quarto files. ([See this comment from the
152+
Quarto team stating that you cannot use R output in the Quarto YAML
153+
header](https://github.com/quarto-dev/quarto-cli/issues/1391#issuecomment-1185348644).)
154+
155+
But it’s still possible to use the fancy {rmdfiltr} Lua filter with
156+
Quarto with a little trickery!
157+
158+
In order to include citations in the word count, we have to feed the
159+
word count filter a version of the document that has been processed with
160+
the [`--citeproc`
161+
option](https://pandoc.org/MANUAL.html#citation-rendering) enabled.
162+
However, in both R Markdown/knitr and in Quarto, the `--citeproc` flag
163+
is designed to be the last possible option, resulting in pandoc commands
164+
that look something like this:
165+
166+
``` sh
167+
pandoc whatever.md --output whatever.html --lua-filter wordcount.lua --citeproc
168+
```
169+
170+
The order of these arguments matters, so having
171+
`--lua-filter wordcount.lua` come before `--citeproc` makes it so the
172+
words will be counted before the bibliography is generated, which isn’t
173+
great.
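The effect of that ordering can be shown with a toy Python model. Nothing here calls pandoc: `citeproc`, `wordcount`, and `run` are hypothetical stand-ins, the document is a plain dictionary, and the reference text is a placeholder rather than a real citation.

```python
def citeproc(doc):
    """Stand-in for citation processing: add one rendered reference per citekey."""
    refs = [doc["bibliography"][key] for key in doc["citekeys"]]
    return {**doc, "references": refs}


def wordcount(doc):
    """Stand-in for the word count filter: record counts for whatever exists so far."""
    body = len(doc["body"].split())
    refs = sum(len(ref.split()) for ref in doc.get("references", []))
    return {**doc, "counts": {"body": body, "references": refs}}


def run(doc, steps):
    """Apply each step in order, the way pandoc applies its arguments in order."""
    for step in steps:
        doc = step(doc)
    return doc


doc = {
    "body": "The analytical engine was described in 1842",
    "citekeys": ["Lovelace1842"],
    "bibliography": {"Lovelace1842": "Author. Year. Title of the work."},
}

count_first = run(doc, [wordcount, citeproc])["counts"]
citeproc_first = run(doc, [citeproc, wordcount])["counts"]

print(count_first)     # {'body': 7, 'references': 0}
print(citeproc_first)  # {'body': 7, 'references': 6}
```

When the count runs first, the reference list does not exist yet, so it contributes zero words.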
174+
175+
{rmdfiltr} gets around this ordering issue by editing the YAML front
176+
matter to (1) disable citeproc in general and (2) specify the
177+
`--citeproc` flag before running the filter:
178+
179+
``` yaml
output:
  html_document:
    citeproc: false
    pandoc_args:
      - '--citeproc'
      - '--lua-filter'
      - '/path/to/rmdfiltr/wordcount.lua'
```
188+
189+
That generates a pandoc command like this, with `--citeproc` first, so
190+
the generated references get counted:
191+
192+
``` sh
193+
pandoc whatever.md --output whatever.html --citeproc --lua-filter wordcount.lua
194+
```
195+
196+
Quarto doesn’t have a `pandoc_args` option though. Instead, it has a
197+
`filters` YAML key that lets you specify a list of Lua filters to apply
198+
to the document:
199+
200+
    format:
      html:
        citeproc: false
        filters:
          - '/path/to/wordcount.lua'
205+
206+
However, there’s no obvious way to reposition the `--citeproc` argument
207+
and it will automatically appear at the end, making it so generated
208+
references aren’t counted.
209+
210+
Fortunately, [this GitHub
211+
comment](https://github.com/quarto-dev/quarto-cli/issues/2294#issuecomment-1238954661)
212+
shows that it’s possible to make a Lua filter that basically behaves
213+
like `--citeproc` by feeding the whole document to
214+
`pandoc.utils.citeproc()`. That means we can create a little Lua script
215+
like `citeproc.lua`:
216+
217+
``` lua
-- Lua filter that behaves like `--citeproc`
function Pandoc (doc)
  return pandoc.utils.citeproc(doc)
end
```
223+
224+
…and then include *that* as a filter:
225+
226+
    format:
      html:
        citeproc: false
        filters:
          - '/path/to/citeproc.lua'
          - '/path/to/wordcount.lua'
232+
233+
This creates a pandoc command that looks something like this, feeding
234+
the document to the citeproc “filter” first, then feeding that to the
235+
word count script:
236+
237+
``` sh
238+
pandoc whatever.md --output whatever.html --lua-filter citeproc.lua --lua-filter wordcount.lua
239+
```
240+
241+
Eventually [the Quarto team is planning on allowing filter options to
242+
get injected at different stages in the rendering
243+
process](https://github.com/quarto-dev/quarto-cli/issues/4113), so
244+
someday we can skip the wrapper filter and just do something like this:
245+
246+
    format:
      html:
        filter:
          post:
            - '/path/to/wordcount.lua'
251+
252+
But that doesn’t work yet.
253+
254+
## Example
255+
256+
You can see a minimal sample document at [`template.qmd`](template.qmd).
257+
258+
## Credits
259+
260+
The [`wordcount.lua`](_extensions/wordcount/wordcount.lua) filter comes
261+
from [Frederik Aust’s (@crsh)](https://github.com/crsh)
262+
[{rmdfiltr}](https://github.com/crsh/rmdfiltr) package.
