Add daily Federal Tax Revenue scraper and resulting data files #79

benjaminarjun · 2018-01-14T02:21:16Z

devinaconley · 2018-01-15T00:26:26Z

Thanks @benarthur91 - this scraper looks good.

The output format we have been trying to conform to is a table with these columns:

year, month, day, metric, count

We have usually been dumping into a CSV for testing, and will push directly to a Postgres database once integrated.
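For reference, a minimal sketch of writing rows in that column order to a CSV for testing. The function name `write_rows` and the sample row (metric name and value) are made up for illustration and are not taken from the project.

```python
import csv
import io

# Column order taken from the schema described above.
COLUMNS = ["year", "month", "day", "metric", "count"]


def write_rows(rows, handle):
    """Write dict rows to an open text handle using the shared column order."""
    writer = csv.DictWriter(handle, fieldnames=COLUMNS)
    writer.writeheader()
    writer.writerows(rows)


# Hypothetical sample row; the metric name and count are placeholders.
sample = [
    {"year": 2018, "month": 1, "day": 12, "metric": "example_metric", "count": 100}
]

buffer = io.StringIO()
write_rows(sample, buffer)
csv_text = buffer.getvalue()
```

Swapping the `io.StringIO` buffer for a file opened with `newline=""` would produce the test CSV on disk.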

Do you want to try building the table parser for those raw text files as well?

benjaminarjun · 2018-01-15T02:57:04Z

Definitely! I'll get started.

benjaminarjun · 2018-01-26T04:14:59Z

I'm moving into the testing phase with this parser. I expect to update the pull request sometime over the weekend.

I've noticed the file contains a variety of metrics, and adequately describing them in the standard schema may be difficult. For example, a label that uses all available descriptions might look something like:

Deposits and Withdrawals of Operating Cash::Deposits::Federal Reserve Account: Deposits by States: Supplemental Security Income::This month to date

It may be helpful to either:

Define the database schema similarly to the file structure, so the actual labels in the data don't need to be fully qualified.
Home in on the metrics that are actually meaningful to this project, so they don't need to be described as precisely to distinguish them from others.

I'll continue to build the parser to write fully qualified names in the CSV, but let me know if you have any input on this.
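As a rough illustration of what "fully qualified" means here, the sketch below joins the levels of the file's hierarchy with `::`. The helper names `qualify` and `unqualify` are hypothetical, and the choice of which levels count as separate parts is an assumption based on the example label above, not the parser's actual code.

```python
# Separator observed in the example label above.
SEPARATOR = "::"


def qualify(*parts):
    """Join non-empty label parts into one fully qualified metric name."""
    return SEPARATOR.join(p.strip() for p in parts if p and p.strip())


def unqualify(name):
    """Split a fully qualified metric name back into its parts."""
    return name.split(SEPARATOR)


label = qualify(
    "Deposits and Withdrawals of Operating Cash",  # table title
    "Deposits",                                    # section
    "Federal Reserve Account: Deposits by States: Supplemental Security Income",
    "This month to date",                          # column
)
```

Keeping the parts recoverable with `unqualify` means a later filtering step can match on any single level without re-parsing the raw file.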

devinaconley · 2018-01-27T02:18:53Z

I think we should avoid making any changes to the database schema for this specific data source. Want to keep things as general as possible.

On the "this month to date" specifier, we actually only need to scrape the raw daily value. Something like "this month to date" would be calculated on the graphing and visualization side.
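To make that concrete, here is a minimal sketch of deriving a month-to-date figure downstream from raw daily rows. The function name `month_to_date` and the tuple layout are assumptions for illustration; the real graphing side may compute this differently.

```python
from collections import defaultdict


def month_to_date(rows):
    """Append a running per-month total to each daily row.

    rows: iterable of (year, month, day, metric, count) tuples.
    Returns a list of (year, month, day, metric, count, month_total) tuples
    in chronological order.
    """
    totals = defaultdict(int)
    result = []
    for year, month, day, metric, count in sorted(rows):
        key = (year, month, metric)  # the total resets for each new month
        totals[key] += count
        result.append((year, month, day, metric, count, totals[key]))
    return result


# Hypothetical daily values, deliberately out of order.
daily = [
    (2018, 1, 3, "example_metric", 5),
    (2018, 1, 2, "example_metric", 10),
    (2018, 2, 1, "example_metric", 7),
]
running = month_to_date(daily)
```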

Also think about what makes sense to keep as a single table. For example in this case, it might make sense to separate deposits and withdrawals into their own tables.

benjaminarjun · 2018-01-28T23:17:51Z

Currently the parser grabs all the data for Today and writes it into a single file (one output file per source file). If it's possible to populate multiple DB tables per source file, I can look to separate the output into multiple files.

devinaconley · 2018-02-15T13:13:35Z

I think that's a good idea. Looking at how many different metrics are in each file, we will likely want a configurable way to filter specific metrics and map them to a specific database.

Also, when we push a certain set of metrics to the database, data from different days will all be in the same table.

This is looking good!

benjaminarjun · 2018-05-02T05:00:03Z

I've added a config file for the parser. This allows the caller to specify the names of the files that should be output, and which data fields should go in which file. The metric name is still the "fully qualified" attribute name as mentioned above. Quick guide to configuration:

A collection of file targets is specified. Each file target has a default regex pattern; attributes whose fully qualified name matches the default pattern will go to that file. The default set of file targets is already specified, but targets can be modified or deleted, and new targets added.
The caller can also specify mapping overrides: each consists of a pattern for an attribute name and a file ID. Attributes matching the pattern will go to the specified file. This takes precedence over a match on a particular file's default attribute pattern.
If any attributes are not matched by the file or override config, the parser will raise an exception and exit.
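The routing rules above can be sketched roughly as follows. This is not the parser's actual code: the config shape, the patterns, the `route` function, and the `UnmatchedAttributeError` class are all hypothetical, and using `re.search` rather than a full match is an assumption.

```python
import re

# Hypothetical file targets: file ID -> default regex pattern.
FILE_TARGETS = {
    "deposits": r"::Deposits::",
    "withdrawals": r"::Withdrawals::",
}

# Hypothetical mapping overrides: (attribute pattern, file ID).
OVERRIDES = [
    (r"Example Override Metric", "withdrawals"),
]


class UnmatchedAttributeError(Exception):
    """Raised when an attribute matches no override and no file target."""


def route(name, file_targets=FILE_TARGETS, overrides=OVERRIDES):
    """Return the file ID that a fully qualified attribute name belongs to."""
    # Overrides take precedence over the per-file default patterns.
    for pattern, file_id in overrides:
        if re.search(pattern, name):
            return file_id
    for file_id, pattern in file_targets.items():
        if re.search(pattern, name):
            return file_id
    raise UnmatchedAttributeError(name)
```

For example, `route("Table::Deposits::Example Override Metric")` returns `"withdrawals"` even though the name also matches the `deposits` default pattern, because overrides are checked first.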

One thing I noticed is that some of the values in the source file are asterisks rather than integers, and at the bottom is a note explaining the asterisk: "Statutory debt limit is temporarily suspended through December 8, 2017". How should the parser handle these?

Add daily Federal Tax Revenue scraper and resulting data files

72e96ab

benjaminarjun added 3 commits January 28, 2018 13:15

Add raw file parser. Fix regex bug in scraper

6d913dd

Add parsed versions of raw files from initial scraper commit (72e96a)

c46e6e4

Fix bug where footnotes were being captured in tax revenue scraper

3febf6d

benjaminarjun added 6 commits May 1, 2018 19:26

Add config file

ca4d886

Modify parser class to write CSVs according to config file

269d715

Update config file: fix bug in example, remove unnecessary OrderedDict

ed94945

Fix some broken mapping override code

7bd05d6

Data commit: replace sample parsed files with output of current parser

06434ff

Fix over-indented line

18a6026

devinaconley approved these changes Aug 3, 2018

devinaconley merged commit 44417e3 into Data4Democracy:master Aug 3, 2018
