| 1 | + |
| 2 | +<!-- README.md is generated from README.qmd. Please edit that file --> |
| 3 | + |
| 4 | +# Quarto word count |
| 5 | + |
| 6 | +- <a href="#why-counting-words-is-hard" |
| 7 | + id="toc-why-counting-words-is-hard">Why counting words is hard</a> |
| 8 | +- <a href="#using-the-word-count-script" |
| 9 | + id="toc-using-the-word-count-script">Using the word count script</a> |
| 10 | + - <a href="#using-as-an-extension" id="toc-using-as-an-extension">Using |
| 11 | + as an extension</a> |
| 12 | + - <a href="#using-without-an-extension" |
| 13 | + id="toc-using-without-an-extension">Using without an extension</a> |
| 14 | +- <a href="#how-this-all-works" id="toc-how-this-all-works">How this all |
| 15 | + works</a> |
| 16 | +- <a href="#example" id="toc-example">Example</a> |
| 17 | +- <a href="#credits" id="toc-credits">Credits</a> |
| 18 | + |
| 19 | +## Why counting words is hard |
| 20 | + |
| 21 | +In academic writing and publishing, word counts are important, since |
| 22 | +many journals specify word limits for submitted articles. Counting how |
| 23 | +many words you have in a Quarto Markdown file is tricky, though, for a |
| 24 | +bunch of reasons: |
| 25 | + |
| 26 | +1. **Compatibility with Word**: Academic publishing portals tend to |
| 27 | + care about Microsoft Word-like counts, but lots of R and Python |
| 28 | + functions for counting words in a document treat word boundaries |
| 29 | + differently. |
| 30 | + |
| 31 | + For instance, Word considers hyphenated words to be one word (e.g., |
| 32 | + “A super-neat kick-in-the-pants example” is 4 words in Word), while |
| 33 | + `stringi::stri_count_words()` counts them as multiple words (e.g. “A |
| 34 | + super-neat kick-in-the-pants example” is 8 words with {stringi}). |
| 35 | + Making matters worse, {stringi} counts “/” as a word boundary, so |
| 36 | + URLs can severely inflate your actual word count. |
| 37 | + |
| 38 | +2. **Extra text elements**: Academic writing typically doesn’t count |
| 39 | + the title, abstract, table text, table and figure captions, or |
| 40 | + equations as words in the manuscript. |
| 41 | + |
| 42 | + In computational documents like Quarto Markdown, these often don’t |
| 43 | + appear until the document is rendered, so simply running a |
| 44 | + word-counting function on a `.qmd` file will count the code |
| 45 | + generating tables and figures, again inflating the word count. |
| 46 | + |
| 47 | +3. **Citations and bibliography**: Academic writing typically counts |
| 48 | + references as part of the word count (even though IT SHOULDN’T). |
| 49 | + However, in Quarto Markdown (and all other flavors of pandoc-based |
| 50 | + markdown), citations don’t get counted until the bibliography is |
| 51 | + generated, which only happens when the document is rendered. |
| 52 | + |
| 53 | + Simply running a word-counting function on a `.qmd` file (or |
| 54 | + something like the super neat |
| 55 | + [{wordcountaddin}](https://github.com/benmarwick/wordcountaddin)) |
| 56 | + will see citekeys in the document like `@Lovelace1842`, but it will |
| 57 | + only count them as individual words (e.g. not “(Lovelace 1842)” in |
| 58 | + in-text styles or ‘Ada Augusta Lovelace, “Sketch of the Analytical |
| 59 | + Engine…,” *Taylor’s Scientific Memoirs* 3 (1842): 666–731.’ in |
| 60 | + footnote styles), and more importantly, it will not count any of the |
| 61 | + automatically generated references in the final bibliography list. |
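
The word-boundary problem from the first point can be sketched in a few lines of Python. This is only an approximation: `count_whitespace_tokens()` and `count_word_runs()` are hypothetical helper names, and neither reproduces Word's or {stringi}'s real rules. They only mimic the two behaviors described above (hyphens and slashes either keep a word together or split it).

```python
import re


def count_whitespace_tokens(text: str) -> int:
    """Count whitespace-delimited tokens, so hyphenated words stay whole."""
    return len(text.split())


def count_word_runs(text: str) -> int:
    """Count runs of word characters, so "-" and "/" act as word boundaries."""
    return len(re.findall(r"\w+", text))


sentence = "A super-neat kick-in-the-pants example"
print(count_whitespace_tokens(sentence))  # 4
print(count_word_runs(sentence))  # 8

# A URL is one token when split on whitespace, but several "words" otherwise
url = "https://github.com/benmarwick/wordcountaddin"
print(count_whitespace_tokens(url))  # 1
print(count_word_runs(url))  # 5
```

The same sentence yields 4 or 8 depending on the boundary rule, which is why the choice of counting tool matters for a journal word limit.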
| 62 | + |
| 63 | +## Using the word count script |
| 64 | + |
| 65 | +This extension fixes all three of these issues by relying on a [Lua |
| 66 | +filter](_extensions/wordcount/wordcount.lua) to count the words after |
| 67 | +the document has been rendered and before it has been converted to its |
| 68 | +final output format. [Frederik Aust (@crsh)](https://github.com/crsh) |
| 69 | +uses the same Lua filter for counting words in R Markdown documents with |
| 70 | +the [{rmdfiltr}](https://github.com/crsh/rmdfiltr) package (I actually |
| 71 | +just copied and slightly expanded [that package’s |
| 72 | +`inst/wordcount.lua`](https://github.com/crsh/rmdfiltr/blob/master/inst/wordcount.lua)). |
| 73 | +The filter works really well and [is generally comparable to Word’s word |
| 74 | +count](https://cran.r-project.org/web/packages/rmdfiltr/vignettes/wordcount.html). |
| 75 | + |
| 76 | +The word count will appear in the terminal output when rendering the |
| 77 | +document. It shows three different values: (1) the total count, (2) the |
| 78 | +count for the document sans references, and (3) the count for the |
| 79 | +reference list alone. |
| 80 | + |
| 81 | +``` text |
| 82 | +133 total words |
| 83 | +----------------------------- |
| 84 | +76 words in text body |
| 85 | +57 words in reference section |
| 86 | +``` |
| 87 | + |
| 88 | +There are two ways to use the filter: (1) as a formal Quarto format |
| 89 | +extension and (2) as a set of pandoc filters. You should definitely |
| 90 | +glance through the [“How this all works” section](#how-this-all-works) |
| 91 | +to understand… um… how it works. |
| 92 | + |
| 93 | +### Using as an extension |
| 94 | + |
| 95 | +Install the extension in your project by running this in your terminal: |
| 96 | + |
| 97 | +``` bash |
| 98 | +quarto use template andrewheiss/quarto-wordcount |
| 99 | +``` |
| 100 | + |
| 101 | +You can then specify one of three different output formats in your YAML |
| 102 | +settings: `wordcount-html`, `wordcount-pdf`, and `wordcount-docx`: |
| 103 | + |
| 104 | +``` yaml |
| 105 | +title: Something |
| 106 | +format: |
| 107 | +  wordcount-html: default |
| 108 | +``` |
| 109 | + |
| 110 | +The `wordcount-FORMAT` format type is really just a wrapper for each |
| 111 | +base format (HTML, PDF, and Word), so all other HTML-, PDF-, and |
| 112 | +Word-specific options work like normal: |
| 113 | + |
| 114 | +``` yaml |
| 115 | +title: Something |
| 116 | +format: |
| 117 | +  wordcount-html: |
| 118 | +    toc: true |
| 119 | +    fig-align: center |
| 120 | +    cap-location: margin |
| 121 | +``` |
| 122 | + |
| 123 | +### Using without an extension |
| 124 | + |
| 125 | +Alternatively, if you don’t want to install the extension, download the |
| 126 | +two Lua scripts [`wordcount.lua`](_extensions/wordcount/wordcount.lua) |
| 127 | +and [`citeproc.lua`](_extensions/wordcount/citeproc.lua), put them |
| 128 | +somewhere in your project, and reference them in the YAML front matter |
| 129 | +of your document. Make sure you also disable citeproc so that it doesn’t |
| 130 | +run twice. |
| 131 | + |
| 132 | +``` yaml |
| 133 | +title: Something |
| 134 | +format: |
| 135 | +  html: |
| 136 | +    citeproc: false |
| 137 | +    filters: [citeproc.lua, wordcount.lua] |
| 138 | +``` |
| 139 | + |
| 140 | +## How this all works |
| 141 | + |
| 142 | +Behind the scenes, pandoc typically converts a Markdown document to an |
| 143 | +abstract syntax tree (AST), or an output-agnostic representation of all |
| 144 | +the document elements. In AST form, it’s easy to use the [Lua |
| 145 | +language](https://pandoc.org/lua-filters.html) to extract or exclude |
| 146 | +specific elements of the document (e.g., exclude captions or only look at |
| 147 | +the references). |
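
To make the AST idea concrete, here is a minimal Python sketch that walks a small document in the shape of pandoc's JSON AST and counts `Str` elements while skipping excluded block types. The sample document is hand-written for illustration, `count_words()` and the `EXCLUDED` set are hypothetical names, and the counting rule (any `Str` containing a letter or digit is one word) is an assumption, not the logic of the actual Lua filter.

```python
import json

# Hand-written sample in the shape of pandoc's JSON AST (for illustration only)
SAMPLE_AST = json.loads("""
{
  "pandoc-api-version": [1, 23],
  "meta": {},
  "blocks": [
    {"t": "Para", "c": [
      {"t": "Str", "c": "Counting"}, {"t": "Space"},
      {"t": "Str", "c": "words"}, {"t": "Space"},
      {"t": "Str", "c": "is"}, {"t": "Space"},
      {"t": "Str", "c": "hard."}
    ]},
    {"t": "CodeBlock", "c": [["", [], []], "x <- 1 + 1"]},
    {"t": "Para", "c": [{"t": "Str", "c": "Done."}]}
  ]
}
""")

# Hypothetical list of element types to leave out of the count
EXCLUDED = {"CodeBlock", "Table", "Figure"}


def count_words(node, excluded=EXCLUDED):
    """Recursively count Str elements, skipping any excluded element type."""
    if isinstance(node, list):
        return sum(count_words(child, excluded) for child in node)
    if isinstance(node, dict):
        tag = node.get("t")
        if tag in excluded:
            return 0
        if tag == "Str":
            # Count the token only if it contains a letter or digit
            return 1 if any(ch.isalnum() for ch in node["c"]) else 0
        return count_words(node.get("c", []), excluded)
    return 0  # plain strings, numbers, etc. are attributes, not words


print(count_words(SAMPLE_AST["blocks"]))  # 5
```

A Lua filter does the same kind of traversal from inside pandoc itself, so it sees the document after code chunks have been replaced by their rendered output.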
| 148 | + |
| 149 | +Quarto was designed to be language-agnostic, so {rmdfiltr}’s approach of |
| 150 | +using R to dynamically set the path to its Lua filters in YAML front |
| 151 | +matter does not work with Quarto files. ([See this comment from the |
| 152 | +Quarto team stating that you cannot use R output in the Quarto YAML |
| 153 | +header](https://github.com/quarto-dev/quarto-cli/issues/1391#issuecomment-1185348644).) |
| 154 | + |
| 155 | +But it’s still possible to use the fancy {rmdfiltr} Lua filter with |
| 156 | +Quarto with a little trickery! |
| 157 | + |
| 158 | +In order to include citations in the word count, we have to feed the |
| 159 | +word count filter a version of the document that has been processed with |
| 160 | +the [`--citeproc` |
| 161 | +option](https://pandoc.org/MANUAL.html#citation-rendering) enabled. |
| 162 | +However, in both R Markdown/knitr and in Quarto, the `--citeproc` flag |
| 163 | +is designed to be the last possible option, resulting in pandoc commands |
| 164 | +that look something like this: |
| 165 | + |
| 166 | +``` sh |
| 167 | +pandoc whatever.md --output whatever.html --lua-filter wordcount.lua --citeproc |
| 168 | +``` |
| 169 | + |
| 170 | +The order of these arguments matters, so having |
| 171 | +`--lua-filter wordcount.lua` come before `--citeproc` makes it so the |
| 172 | +words will be counted before the bibliography is generated, which isn’t |
| 173 | +great. |
| 174 | + |
| 175 | +{rmdfiltr} gets around this ordering issue by editing the YAML front |
| 176 | +matter to (1) disable citeproc in general and (2) specify the |
| 177 | +`--citeproc` flag before running the filter: |
| 178 | + |
| 179 | +``` yaml |
| 180 | +output: |
| 181 | +  html_document: |
| 182 | +    citeproc: false |
| 183 | +    pandoc_args: |
| 184 | +      - '--citeproc' |
| 185 | +      - '--lua-filter' |
| 186 | +      - '/path/to/rmdfiltr/wordcount.lua' |
| 187 | +``` |
| 188 | + |
| 189 | +That generates a pandoc command like this, with `--citeproc` first, so |
| 190 | +the generated references get counted: |
| 191 | + |
| 192 | +``` sh |
| 193 | +pandoc whatever.md --output whatever.html --citeproc --lua-filter wordcount.lua |
| 194 | +``` |
| 195 | + |
| 196 | +Quarto doesn’t have a `pandoc_args` option though. Instead, it has a |
| 197 | +`filters` YAML key that lets you specify a list of Lua filters to apply |
| 198 | +to the document: |
| 199 | + |
| 200 | +    format: |
| 201 | +      html: |
| 202 | +        citeproc: false |
| 203 | +        filters: |
| 204 | +          - '/path/to/wordcount.lua' |
| 205 | + |
| 206 | +However, there’s no obvious way to reposition the `--citeproc` argument |
| 207 | +and it will automatically appear at the end, making it so generated |
| 208 | +references aren’t counted. |
| 209 | + |
| 210 | +Fortunately, [this GitHub |
| 211 | +comment](https://github.com/quarto-dev/quarto-cli/issues/2294#issuecomment-1238954661) |
| 212 | +shows that it’s possible to make a Lua filter that basically behaves |
| 213 | +like `--citeproc` by feeding the whole document to |
| 214 | +`pandoc.utils.citeproc()`. That means we can create a little Lua script |
| 215 | +like `citeproc.lua`: |
| 216 | + |
| 217 | +``` lua |
| 218 | +-- Lua filter that behaves like `--citeproc` |
| 219 | +function Pandoc (doc) |
| 220 | +  return pandoc.utils.citeproc(doc) |
| 221 | +end |
| 222 | +``` |
| 223 | + |
| 224 | +…and then include *that* as a filter: |
| 225 | + |
| 226 | +    format: |
| 227 | +      html: |
| 228 | +        citeproc: false |
| 229 | +        filters: |
| 230 | +          - '/path/to/citeproc.lua' |
| 231 | +          - '/path/to/wordcount.lua' |
| 232 | + |
| 233 | +This creates a pandoc command that looks something like this, feeding |
| 234 | +the document to the citeproc “filter” first, then feeding that to the |
| 235 | +word count script: |
| 236 | + |
| 237 | +``` sh |
| 238 | +pandoc whatever.md --output whatever.html --lua-filter citeproc.lua --lua-filter wordcount.lua |
| 239 | +``` |
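
Why the order matters can be shown with a toy model in Python. Nothing here calls pandoc: `toy_citeproc()` and `toy_wordcount()` are hypothetical stand-ins for the two filters, and the reference entry is an abbreviated placeholder rather than a properly formatted citation.

```python
# Toy bibliography: citekey -> in-text form and (abbreviated) reference entry
REFERENCES = {
    "@Lovelace1842": {
        "inline": "(Lovelace 1842)",
        "entry": "Ada Augusta Lovelace, Sketch of the Analytical Engine, 1842.",
    }
}


def toy_citeproc(paragraphs):
    """Stand-in for citeproc: expand citekeys and append a reference list."""
    rendered, cited = [], []
    for paragraph in paragraphs:
        for key, ref in REFERENCES.items():
            if key in paragraph:
                paragraph = paragraph.replace(key, ref["inline"])
                if ref["entry"] not in cited:
                    cited.append(ref["entry"])
        rendered.append(paragraph)
    return rendered + cited


def toy_wordcount(paragraphs):
    """Stand-in for the word count filter: count whitespace-separated tokens."""
    return sum(len(paragraph.split()) for paragraph in paragraphs)


document = ["Analytical engines are neat @Lovelace1842."]

count_before = toy_wordcount(document)  # word count runs before citeproc
count_after = toy_wordcount(toy_citeproc(document))  # citeproc runs first

print(count_before, count_after)  # 5 15
```

Counting before the citations are processed misses both the expanded in-text citation and the whole reference entry, which is the gap the `citeproc.lua` wrapper closes.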
| 240 | + |
| 241 | +Eventually [the Quarto team is planning on allowing filter options to |
| 242 | +get injected at different stages in the rendering |
| 243 | +process](https://github.com/quarto-dev/quarto-cli/issues/4113), so |
| 244 | +someday we can skip the wrapper filter and just do something like this: |
| 245 | + |
| 246 | +    format: |
| 247 | +      html: |
| 248 | +        filter: |
| 249 | +          post: |
| 250 | +            - '/path/to/wordcount.lua' |
| 251 | + |
| 252 | +But that doesn’t work yet. |
| 253 | + |
| 254 | +## Example |
| 255 | + |
| 256 | +You can see a minimal sample document at [`template.qmd`](template.qmd). |
| 257 | + |
| 258 | +## Credits |
| 259 | + |
| 260 | +The [`wordcount.lua`](_extensions/wordcount/wordcount.lua) filter comes |
| 261 | +from [Frederik Aust’s (@crsh)](https://github.com/crsh) |
| 262 | +[{rmdfiltr}](https://github.com/crsh/rmdfiltr) package. |