support Dominion format #119

Closed
tarheel opened this issue Aug 11, 2018 · 20 comments

Comments

@tarheel
Contributor

tarheel commented Aug 11, 2018

George's comment from #96: "I believe the guys at FairVote have the Dominion data from Santa Fe."

@CalebKleppner CalebKleppner added this to the v 1.0 Submission milestone Aug 31, 2018
@HEdingfield HEdingfield removed this from the v 1.0 Submission milestone Aug 31, 2018
@CalebKleppner CalebKleppner added this to the v 1.0 Submission milestone Aug 31, 2018
@HEdingfield HEdingfield modified the milestone: v 1.0 Submission Aug 31, 2018
@tarheel
Contributor Author

tarheel commented Jan 2, 2019

@gngilbert @CalebKleppner do you know how we can get this data?

@gngilbert
Collaborator

I believe I have the data from Santa Fe. I will try to get it uploaded tomorrow and we can discuss it. I could not figure it out. We may have to talk to Dominion.

@gngilbert
Collaborator

I think I have attached the Santa Fe CVR files here. Not sure how this works. If they are not attached, I can send them attached to an e-mail.

drive-download-20181214T175737Z-001.zip

@nealmcb

nealmcb commented Apr 8, 2019

Thanks - fascinating.
Wow - that is such an awful way to deliver CVRs of ranked ballots: a bloated set of CSVs with 205 columns for perhaps 21 candidates, enormous column names of over 40 characters each, and so much extraneous and badly organized data. And you have to join at least 3 CSV files together to see contest and candidate names!
I dare say I'd much prefer the JSON-format output that I expect they have (and used to produce this output).

@gngilbert
Collaborator

That was my reaction as well and why we have not taken on the task of converting it to run in the RCVRC Tabulator. At some point, however, we need to do this and I am looking for recommendations as to how to do this in the most efficient manner.

@gngilbert
Collaborator

Jon, should you bring David in on this issue? (I would but don't know how. Thanks.)

@tarheel
Contributor Author

tarheel commented Apr 12, 2019

@davidryal

@gngilbert
Collaborator

We'll keep this on the shelf for now.

@chughes297
Collaborator

Pedro at FairVote has been working with the San Francisco CVRs I'm attaching to this post, and developed this process for converting the JSON into a human-readable format:
https://docs.google.com/document/d/1uR94xFn-oB3B_17lftP2gZkLZAGtvs6Wu5rw2vryDsE/edit?usp=sharing. Wanted to share in case it's useful as you guys get started on Dominion.
CVR_Export_20191125163446.zip

@tarheel
Contributor Author

tarheel commented Feb 10, 2020

More questions from @moldover:

- Do we need to handle multiple contests?
- How are column headers parsed? There's a bunch of different text:
    Original/Cards/0/Contests/0/Marks/0/Rank
    Original/Cards/0/PaperIndex
    Original/Cards/0/Contests/0/Marks/8/MarkDensity
- Do we need to interpret selections, using the outstack condition manifest?
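
One possible reading of the column headers quoted above is that each one is a slash-delimited path into a nested record, with the numeric parts acting as list indexes. The Python sketch below works under that assumption only; it is not confirmed by Dominion or by anyone in this thread. The function names `parse_header` and `unflatten` are hypothetical, and the cell values in `row` are made up for illustration.

```python
def parse_header(header):
    """Split a flattened column header into keys, turning digit-only parts into ints."""
    return [int(part) if part.isdigit() else part for part in header.split("/")]


def unflatten(row):
    """Rebuild a nested dict from a {flattened header: value} mapping.

    Numeric path parts become integer dict keys rather than list positions,
    so sparse indexes (e.g. Marks/8 without Marks/1..7) need no padding.
    """
    root = {}
    for header, value in row.items():
        *parents, leaf = parse_header(header)
        node = root
        for key in parents:
            node = node.setdefault(key, {})
        node[leaf] = value
    return root


# Headers are taken from the question above; the values are invented.
row = {
    "Original/Cards/0/PaperIndex": "0",
    "Original/Cards/0/Contests/0/Marks/0/Rank": "1",
    "Original/Cards/0/Contests/0/Marks/8/MarkDensity": "97",
}
nested = unflatten(row)
card = nested["Original"]["Cards"][0]
print(card["PaperIndex"])
print(card["Contests"][0]["Marks"][0]["Rank"])
```

Whether `Rank`, `PaperIndex`, and `MarkDensity` carry the meaning their names suggest is still an open question for Dominion; the sketch only recovers the nesting.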

@gngilbert
Collaborator

gngilbert commented Feb 11, 2020 via email

@tarheel
Contributor Author

tarheel commented Feb 11, 2020

These are the same questions that Jon asked previously over email; I'm just reproducing them here to keep things organized. Presumably Keith has already shared them with Dominion folks.

@gngilbert
Collaborator

gngilbert commented Feb 11, 2020 via email

@nealmcb

nealmcb commented Feb 11, 2020

Thanks for digging out the data and a parsing method. I'm pretty sure I've parsed similar Dominion JSON files before in Python.

Also, as yet another approach, the RLA (SHANGRLA) audit of the election seems to have used the JavaScript in this file to parse out just the votes on each ballot in a given contest:

https://github.com/pbstark/SHANGRLA/blob/master/ConvertCVRToRAIRE.html

The RAIRE format (for later processing) is a CSV file:
- First line: the number of contests.
- Next, one line for each contest:
    Contest,ID,N,C1,C2,C3 ...
  where ID is the contest ID, N is the number of candidates in that contest, and C1, ... are the candidate IDs relevant to that contest.
- Then one line for every ranking that appears on a ballot:
    Contest ID,Ballot ID,R1,R2,R3,...
  where the Ri's are the unique candidate IDs.

Yielding this output file for Mayor

But I've only glanced at that stuff - I might have misinterpreted something there....
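
For anyone who wants to check that reading of the RAIRE layout, here is a minimal Python sketch that parses a file laid out exactly as described above. It has not been checked against real SHANGRLA output. The function name `parse_raire` is hypothetical, and the contest, candidate, and ballot IDs in the sample are invented.

```python
import csv
import io


def parse_raire(text):
    """Parse RAIRE-style CSV text into (contests, ballots).

    contests maps contest ID -> list of candidate IDs.
    ballots is a list of (contest ID, ballot ID, ranking) tuples,
    where ranking is the list of candidate IDs in ranked order.
    """
    rows = [[cell.strip() for cell in row]
            for row in csv.reader(io.StringIO(text)) if row]
    num_contests = int(rows[0][0])

    contests = {}
    for row in rows[1:1 + num_contests]:
        # row[0] is the literal word "Contest"; row[2] is the candidate count N.
        contest_id, n = row[1], int(row[2])
        contests[contest_id] = row[3:3 + n]

    ballots = []
    for row in rows[1 + num_contests:]:
        contest_id, ballot_id, *ranking = row
        ballots.append((contest_id, ballot_id, ranking))
    return contests, ballots


# Invented sample: one contest (ID 339) with three candidates and two ballots.
SAMPLE = """
1
Contest,339,3,15,16,17
339,99813-1-1,17,16
339,99813-1-3,16
"""

contests, ballots = parse_raire(SAMPLE)
print(contests)
for contest_id, ballot_id, ranking in ballots:
    print(contest_id, ballot_id, ranking)
```

All IDs are kept as strings, since the description does not say whether they are always numeric.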

@CalebKleppner
Collaborator

CalebKleppner commented Feb 17, 2020 via email

@gngilbert
Collaborator

gngilbert commented Feb 17, 2020 via email

@tarheel
Contributor Author

tarheel commented Feb 21, 2020

Here's a parser that @catrope wrote for the new format: https://github.com/catrope/sf-rcv/blob/master/parse-new-format.js

And the rest of the code in that repo handles the old format.

More explanation from him:

I apologize for the lack of documentation, so I'll briefly explain it here instead.
1. Download a ZIP file from the SF elections web site (e.g. this one)
2. Create a new directory and extract the ZIP file into it
3. Run my script in this directory, and capture its output in a file
So, in a terminal/shell:

$ wget https://www.sfelections.org/results/20191105/data/20191114/CVR_Export_20191114160248.zip
$ mkdir sf-20191114
$ cd sf-20191114
$ unzip ../CVR_Export_20191114160248.zip
$ node parse-new-format.js > reformatted.json

The attached screenshot illustrates what the output looks like. It's a big array, where every element is a ballot card, and every race is an array of choices. For RCV contests, the first element is the first choice, the second element the second choice, etc; for non-RCV contests (including measures), there is only one element. If a choice is null that means it was left blank (undervote), if the choice is itself an array of multiple values that means multiple choices were selected (overvote). So in the attached example, the District Attorney rankings were 1) blank, 2) Loftus, 3) overvote for both Tung and Boudin, 4) Dautch, and their Mayor rankings were 1) Breed, 2) Pang, 3) Ventresca, 4) blank, 5) Zhou, 6) Jordan+Robertson overvote.
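
Since the screenshot is not reproduced here, the following Python sketch shows how the conventions just described (a plain value for a choice, null for an undervote, a nested array for an overvote) might be read back. It assumes each ballot card is a JSON object keyed by contest name and that choices are candidate-name strings; the real `reformatted.json` may use different keys or IDs. The helper names `describe_choice` and `describe_card` are hypothetical, and the sample card reuses the rankings from the example above.

```python
import json


def describe_choice(choice):
    """Label one ranking slot: None is an undervote, a list is an overvote."""
    if choice is None:
        return "undervote"
    if isinstance(choice, list):
        return "overvote: " + "+".join(choice)
    return choice


def describe_card(card):
    """Map each contest on a card to a list of human-readable ranking labels."""
    return {contest: [describe_choice(c) for c in choices]
            for contest, choices in card.items()}


# Assumed shape of the parser output, using the rankings from the example above.
SAMPLE_JSON = """
[
  {
    "District Attorney": [null, "Loftus", ["Tung", "Boudin"], "Dautch"],
    "Mayor": ["Breed", "Pang", "Ventresca", null, "Zhou", ["Jordan", "Robertson"]]
  }
]
"""

cards = json.loads(SAMPLE_JSON)
described = describe_card(cards[0])
for contest, labels in described.items():
    print(contest, labels)
```

To run this on real output, replace `json.loads(SAMPLE_JSON)` with `json.load` on the opened `reformatted.json` file, after checking that the card structure matches the assumption.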

I haven't yet adapted my RCV code to ingest this format, but it shouldn't be too much work, and the format should be relatively easy to deal with for other scripts as well. Since the data for non-RCV races is also all there, you should also be able to compute correlations between contests that appear on the same card (e.g. local measures: how many people voted Yes on A but No on E or vice versa, and where were they located?). One thing I want to look at at some point is the geographic distribution of Nancy Tung's transferred votes: Tung->Loftus, Tung->Boudin and Tung->exhausted were each over 30%, and I'm curious to see if those three groups are concentrated anywhere in particular. I also want to look at the second choices of Boudin and Loftus voters.
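
The cross-contest correlation idea could be sketched in Python roughly as follows. This again assumes each card is an object keyed by contest name, with a one-element array for non-RCV contests. The contest keys "Measure A" and "Measure E", the "Yes"/"No" values, and the function names are all hypothetical, and the sample cards are invented rather than taken from the SF data.

```python
from collections import Counter


def first_choice_label(choices):
    """Return a hashable label for the first slot of a contest's choices."""
    if not choices or choices[0] is None:
        return "blank"
    if isinstance(choices[0], list):
        return "overvote"
    return choices[0]


def cross_tab(cards, contest_a, contest_b):
    """Count (choice in contest_a, choice in contest_b) pairs over cards that carry both contests."""
    counts = Counter()
    for card in cards:
        if contest_a in card and contest_b in card:
            pair = (first_choice_label(card[contest_a]),
                    first_choice_label(card[contest_b]))
            counts[pair] += 1
    return counts


# Invented cards; the last one holds neither measure and is skipped.
cards = [
    {"Measure A": ["Yes"], "Measure E": ["No"]},
    {"Measure A": ["Yes"], "Measure E": ["Yes"]},
    {"Measure A": ["No"], "Measure E": [None]},
    {"Measure A": ["Yes"], "Measure E": ["No"]},
    {"Mayor": ["Breed", "Pang"]},
]

counts = cross_tab(cards, "Measure A", "Measure E")
for pair, n in sorted(counts.items()):
    print(pair, n)
```

Only contests printed on the same card can be compared this way, as noted above. The geographic breakdown would also need a precinct field per card, which this sketch does not assume.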

@HEdingfield
Contributor

Latest update: I believe we mostly have this issue addressed with the closing of #404, #406, #407, #408, and #415.

Remaining related open issues (which could probably supersede the need to keep this one open): #434, #437, #438.

@moldover @tarheel, could you please look closely over this issue and file any other necessary issues to address any last loose ends here? Then I think we should be good to close it.

@tarheel
Contributor Author

tarheel commented Mar 28, 2020

Sounds right to me. Will let @moldover make the final call.

@moldover
Contributor

Closed via #470

8 participants