Add daily Federal Tax Revenue scraper and resulting data files #79


benjaminarjun
Contributor

#74

@devinaconley
Collaborator

Thanks @benarthur91 - this scraper looks good.

The output format we have been trying to conform to is a table with these columns:

year, month, day, metric, count

We have usually been dumping the output into a CSV for testing; once integrated, it will be pushed directly to a Postgres database.
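As a rough illustration of that target format, here is a minimal sketch that writes rows with the five columns above to CSV. The helper names (`to_row`, `write_rows`) and the sample metric are hypothetical and are not taken from this repository.

```python
import csv
import io
from datetime import date

# Column order of the shared output schema described above.
FIELDS = ["year", "month", "day", "metric", "count"]


def to_row(day, metric, count):
    """Build one output row (hypothetical helper) from a date, a metric name and a value."""
    return {
        "year": day.year,
        "month": day.month,
        "day": day.day,
        "metric": metric,
        "count": count,
    }


def write_rows(rows, handle):
    """Write a header line followed by the rows to any file-like object."""
    writer = csv.DictWriter(handle, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(rows)


# Usage: an in-memory buffer stands in for the test CSV file.
buffer = io.StringIO()
write_rows([to_row(date(2018, 1, 26), "example_metric", 123)], buffer)
csv_text = buffer.getvalue()
```

Writing to a file-like object keeps the same function usable for a test CSV now and for another sink later.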

Do you want to try building the table parser for those raw text files as well?

@benjaminarjun
Contributor Author

Definitely! I'll get started.

@benjaminarjun
Contributor Author

benjaminarjun commented Jan 26, 2018

I'm moving into the testing phase with this parser. I expect to update the pull request sometime over the weekend.

I've noticed the file contains a variety of metrics, and adequately describing them in the standard schema may be difficult. For example, a label that uses all available descriptions might look something like:

Deposits and Withdrawals of Operating Cash::Deposits::Federal Reserve Account: Deposits by States: Supplemental Security Income::This month to date

It may be helpful to either:

  1. Define the database schema similarly to the file structure, so the actual labels in the data don't need to be fully qualified.
  2. Home in on the metrics that are actually meaningful to this project, so they don't need to be described as precisely to distinguish them from others.

I'll continue to build the parser to write fully qualified names in the CSV, but let me know if you have any input on this.
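For illustration only, a fully qualified name like the one above could be assembled by joining the levels of the file hierarchy with `::`. The function name `qualify` and the way the example label is split into four levels are assumptions, not the parser's actual code.

```python
# Separator between hierarchy levels, as seen in the example label above.
SEPARATOR = "::"


def qualify(*parts):
    """Join non-empty hierarchy levels into one fully qualified metric name."""
    return SEPARATOR.join(p.strip() for p in parts if p and p.strip())


# Assumed split of the example label into table, section, line item and column.
label = qualify(
    "Deposits and Withdrawals of Operating Cash",
    "Deposits",
    "Federal Reserve Account: Deposits by States: Supplemental Security Income",
    "This month to date",
)
```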

@devinaconley
Collaborator

I think we should avoid making any changes to the database schema for this specific data source. Want to keep things as general as possible.

On the "this month to date" specifier, we actually only need to scrape the raw daily value. Something like "this month to date" would be calculated on the graphing and visualization side.
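As a sketch of why the daily value is enough, a month-to-date figure can be derived downstream as a running total per month and metric. The function name `month_to_date` and the tuple layout (mirroring the `year, month, day, metric, count` columns) are assumptions for illustration.

```python
from collections import defaultdict


def month_to_date(rows):
    """Turn daily (year, month, day, metric, count) rows into running monthly totals."""
    totals = defaultdict(int)
    result = []
    # Sorting the tuples orders them by year, month and day before accumulating.
    for year, month, day, metric, count in sorted(rows):
        key = (year, month, metric)
        totals[key] += count
        result.append((year, month, day, metric, totals[key]))
    return result


# Usage with made-up daily values; the total restarts in a new month.
daily = [
    (2018, 1, 3, "m", 5),
    (2018, 1, 2, "m", 10),
    (2018, 2, 1, "m", 7),
]
mtd = month_to_date(daily)
```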

Also think about what makes sense to keep as a single table. For example in this case, it might make sense to separate deposits and withdrawals into their own tables.

@benjaminarjun
Contributor Author

Currently the parser grabs all the data for Today and writes it into a single file (one output file per source file). If it's possible to populate multiple DB tables per source file, I can look to separate the output into multiple files.

@devinaconley
Collaborator

I think that's a good idea. Looking at how many different metrics are in each file, we will likely want a configurable way to filter specific metrics and map them to a specific database.

Also, when we push a certain set of metrics to the database, data from different days will all be in the same table.

This is looking good!

@benjaminarjun
Contributor Author

I've added a config file for the parser. This allows the caller to specify the names of the files that should be output, and which data fields should go in which file. The metric name is still the "fully qualified" attribute name as mentioned above. Quick guide to configuration:

  • A collection of file targets is specified. Each file target has a default regex pattern; attributes whose fully qualified name matches the default pattern will go to that file. The default set of file targets is already specified, but targets can be modified or deleted, and new targets added.
  • The caller can also specify mapping overrides: this consists of a pattern for an attribute name and a file ID. Attributes matching the pattern will go to the specified file. This takes precedence over a match on a particular file's default attribute pattern.
  • If any attributes are not matched by the file or override config, the parser will raise an exception and exit.
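The three rules above could be sketched roughly as follows. This is not the actual config format or parser code: the target IDs, patterns, `route` function and `UnmatchedAttributeError` class are all hypothetical names chosen for the example.

```python
import re


class UnmatchedAttributeError(Exception):
    """Raised when an attribute matches neither an override nor a file target."""


# Hypothetical file targets: file ID -> default regex pattern.
FILE_TARGETS = {
    "deposits": r"::Deposits::",
    "withdrawals": r"::Withdrawals::",
    "benefits": r"::Benefits::",
}

# Hypothetical mapping overrides: (attribute pattern, file ID).
OVERRIDES = [
    (r"Supplemental Security Income", "benefits"),
]


def route(name, file_targets, overrides):
    """Return the file ID for a fully qualified attribute name."""
    # Overrides are checked first, so they take precedence over default patterns.
    for pattern, file_id in overrides:
        if re.search(pattern, name):
            return file_id
    for file_id, pattern in file_targets.items():
        if re.search(pattern, name):
            return file_id
    # Nothing matched: fail loudly instead of silently dropping the attribute.
    raise UnmatchedAttributeError(name)


# Usage: the second name also matches the "deposits" default, but the override wins.
plain = route("Operating Cash::Deposits::Taxes", FILE_TARGETS, OVERRIDES)
overridden = route(
    "Operating Cash::Deposits::Supplemental Security Income", FILE_TARGETS, OVERRIDES
)
```

Checking overrides before defaults is what gives them precedence; the final `raise` corresponds to the "raise an exception and exit" rule.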

One thing I noticed is that some of the values in the source file are asterisks rather than integers, and at the bottom is a note explaining the asterisk: "Statutory debt limit is temporarily suspended through December 8, 2017". How should the parser handle these?

@devinaconley devinaconley merged commit 44417e3 into Data4Democracy:master Aug 3, 2018