Improved docs #335

Open: tony wants to merge 5 commits into master

tony (Member) commented Feb 27, 2025

Changes

Improved Docs

Summary by Sourcery

Adds example code demonstrating various use cases of the unihan-etl library, including linguistic analysis, educational tools, data integration, software development, research analysis, input method development, stroke order extraction, and API development.

Tests:

  • Adds example tests showcasing the usage of the unihan-etl library for different applications.
  • Adds tests for extracting character learning data for educational applications.
  • Adds tests for database population with UNIHAN data.
  • Adds tests for extracting dictionary data from UNIHAN for software development.
  • Adds tests for extracting and analyzing etymology data with UNIHAN.
  • Adds tests for input method development with UNIHAN data.
  • Adds tests for extracting and analyzing stroke order data.
  • Adds tests for custom data processing with UNIHAN data.
  • Adds tests for developing an API with unihan-etl data.
  • Adds tests for using custom fields with UNIHAN data.
  • Adds tests for filtering characters in the UNIHAN dataset.
  • Adds tests for accessing UNIHAN fields metadata.
  • Adds tests for basic usage of the Packager class to get data.
  • Adds tests for retrieving specific character information.

sourcery-ai bot commented Feb 27, 2025

Reviewer's Guide by Sourcery

This pull request adds a comprehensive suite of example tests to the unihan-etl library. These tests demonstrate various use cases, including linguistic analysis, educational tools, data integration, software development, research analysis, input method development, stroke order extraction, advanced API usage, custom fields, character filtering, and basic data retrieval. Each test provides a practical example of how to leverage the library for specific tasks, enhancing its usability and showcasing its versatility.

No diagrams generated as the changes look simple and do not need a visual representation.

File-Level Changes

Change:
Added example tests demonstrating various use cases of the unihan-etl library, such as linguistic analysis, educational tools, data integration, software development, research analysis, input method development, stroke order extraction, advanced API usage, custom fields, character filtering, and character lookup.

Details:
  • Created test_linguistic_analysis.py to demonstrate linguistic analysis using UNIHAN data.
  • Created test_educational_tools.py to demonstrate extracting character learning data for educational applications.
  • Created test_data_integration.py to demonstrate integrating UNIHAN data with database systems.
  • Created test_software_dev.py to demonstrate extracting dictionary data for software development.
  • Created test_research_analysis.py to demonstrate extracting etymology data for research analysis.
  • Created test_input_method.py to demonstrate input method development using UNIHAN data.
  • Created test_stroke_order.py to demonstrate extracting and analyzing stroke order data.
  • Created test_advanced_api.py to demonstrate building advanced processing pipelines with UNIHAN data.
  • Created test_api_development.py to demonstrate developing an API with unihan-etl data.
  • Created test_custom_fields.py to demonstrate using custom fields with UNIHAN data.
  • Created test_character_filtering.py to demonstrate filtering characters in the UNIHAN dataset.
  • Created test_unihan_fields.py to demonstrate working with UNIHAN field metadata.
  • Created test_basic_usage.py to demonstrate basic usage of the Packager class to get data.
  • Created test_character_lookup.py to demonstrate retrieving specific character information.

Files:
  • tests/examples/test_linguistic_analysis.py
  • tests/examples/test_educational_tools.py
  • tests/examples/test_data_integration.py
  • tests/examples/test_software_dev.py
  • tests/examples/test_research_analysis.py
  • tests/examples/test_input_method.py
  • tests/examples/test_stroke_order.py
  • tests/examples/test_advanced_api.py
  • tests/examples/test_api_development.py
  • tests/examples/test_custom_fields.py
  • tests/examples/test_character_filtering.py
  • tests/examples/test_unihan_fields.py
  • tests/examples/test_basic_usage.py
  • tests/examples/test_character_lookup.py

codecov bot commented Feb 27, 2025

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 59.53%. Comparing base (a47f637) to head (b15dcfb).

Additional details and impacted files
@@             Coverage Diff             @@
##           master     #335       +/-   ##
===========================================
- Coverage   70.03%   59.53%   -10.51%     
===========================================
  Files          13        8        -5     
  Lines        1325      939      -386     
  Branches      114       99       -15     
===========================================
- Hits          928      559      -369     
+ Misses        372      361       -11     
+ Partials       25       19        -6     
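The percentages in the table can be reproduced from the hit, miss, and partial counts. This is a minimal sketch, assuming coverage is hits divided by total lines (hits + misses + partials) and that the displayed value is truncated to two decimals; the truncation is inferred from the figures above, not stated by the report. The function name is illustrative.

```python
import math


def coverage_percent(hits: int, misses: int, partials: int) -> float:
    """Return coverage as a percentage, truncated to two decimal places."""
    lines = hits + misses + partials
    return math.floor(hits / lines * 100 * 100) / 100


# Counts copied from the base (master) and head (#335) columns.
base = coverage_percent(hits=928, misses=372, partials=25)  # 1325 lines
head = coverage_percent(hits=559, misses=361, partials=19)  # 939 lines
print(base, head)
```

Under these assumptions the sketch reproduces the 70.03% and 59.53% figures. The head column tracks 386 fewer lines and 369 fewer hits than the base.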

@sourcery-ai sourcery-ai bot left a comment

Hey @tony - I've reviewed your changes - here's some feedback:

Overall Comments:

  • These examples are great, but consider adding a README or tutorial to guide users on how to run them.
  • It might be helpful to include a section on error handling and edge cases within the examples.
Here's what I looked at during the review
  • 🟢 General issues: all looks good
  • 🟢 Security: all looks good
  • 🟢 Testing: all looks good
  • 🟢 Complexity: all looks good
  • 🟢 Documentation: all looks good

filtered_packager = Packager(options)

# Download the filtered data
filtered_packager.download()

issue (code-quality): We've found these issues:


try:
    # Create a table for the UNIHAN data
    cursor.execute("""

issue (code-quality): Extract code out into function (extract-method)
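As a sketch of what the extract-method suggestion could look like, the table setup can move into its own function so the `try` block stays short. The schema, table name, function names, and sample row below are hypothetical; only the cursor and its `execute` call come from the snippet under review. The standard-library `sqlite3` module is an assumed backend.

```python
import sqlite3


def create_unihan_table(cursor: sqlite3.Cursor) -> None:
    """Create the table that holds UNIHAN rows (hypothetical schema)."""
    cursor.execute(
        """
        CREATE TABLE IF NOT EXISTS unihan (
            char TEXT PRIMARY KEY,
            kDefinition TEXT
        )
        """
    )


def populate(conn: sqlite3.Connection, rows: list[dict[str, str]]) -> int:
    """Insert rows and return how many are stored afterwards."""
    cursor = conn.cursor()
    try:
        create_unihan_table(cursor)
        cursor.executemany(
            "INSERT OR REPLACE INTO unihan (char, kDefinition) "
            "VALUES (:char, :kDefinition)",
            rows,
        )
        conn.commit()
    except sqlite3.Error:
        # Undo partial work, then let the caller see the failure.
        conn.rollback()
        raise
    return cursor.execute("SELECT COUNT(*) FROM unihan").fetchone()[0]


conn = sqlite3.connect(":memory:")
stored = populate(conn, [{"char": "好", "kDefinition": "good"}])
conn.close()
```

Keeping the DDL in one function also makes it reusable from a test fixture without repeating the SQL.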

)

# Verify we created some educational data
assert len(educational_data) > 0

suggestion (code-quality): Simplify sequence length comparison (simplify-len-comparison)

Suggested change
- assert len(educational_data) > 0
+ assert educational_data
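The suggestion relies on Python's truthiness rules: built-in containers such as lists and dicts are falsy when empty, so `assert data` fails in the same cases as `assert len(data) > 0`. A short sketch with made-up sample data follows; it also shows that the two forms differ for `None`, which is falsy but raises `TypeError` under `len()`.

```python
# Illustrative data only; the real test builds this list from UNIHAN fields.
educational_data = [{"char": "好", "kDefinition": "good"}]

# Both assertions pass for a non-empty list.
assert len(educational_data) > 0
assert educational_data

# Both conditions are false for an empty list.
empty: list[dict[str, str]] = []
assert not (len(empty) > 0)
assert not empty

# The forms diverge for None: truthiness works, len() raises TypeError.
maybe_data = None
assert not maybe_data
try:
    len(maybe_data)
    raised = False
except TypeError:
    raised = True
```

If a value may legitimately be `None`, the shorter assertion reports it as a plain assertion failure rather than a `TypeError`.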

pinyin_to_chars[pinyin_key].append(item["char"])

# Verify our input method dictionary has entries
assert len(pinyin_to_chars) > 0

suggestion (code-quality): Simplify sequence length comparison (simplify-len-comparison)

Suggested change
- assert len(pinyin_to_chars) > 0
+ assert pinyin_to_chars
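For context, a pinyin-to-character index of the kind this test asserts on can be sketched with `collections.defaultdict`. The sample records and the choice of `kMandarin` as the reading field are assumptions for illustration; only `pinyin_to_chars`, `pinyin_key`, and `item["char"]` appear in the diff.

```python
from collections import defaultdict

# Hypothetical records in the shape the examples iterate over.
data = [
    {"char": "好", "kMandarin": "hǎo"},
    {"char": "郝", "kMandarin": "hǎo"},
    {"char": "中", "kMandarin": "zhōng"},
    {"char": "丂"},  # sample record without a reading field; skipped below
]

pinyin_to_chars: defaultdict[str, list[str]] = defaultdict(list)
for item in data:
    pinyin_key = item.get("kMandarin")
    if not pinyin_key:
        continue  # nothing to index for this record
    pinyin_to_chars[pinyin_key].append(item["char"])

# An empty defaultdict is falsy, so this matches the suggested assertion.
assert pinyin_to_chars
```

Using `.get()` for the reading keeps the loop from raising `KeyError` on records that lack the field.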

)

# Verify we found some correspondences
assert len(sound_correspondences) > 0

suggestion (code-quality): Simplify sequence length comparison (simplify-len-comparison)

Suggested change
- assert len(sound_correspondences) > 0
+ assert sound_correspondences


# Verify we extracted data
if data is not None:
    assert len(variants_data) > 0

suggestion (code-quality): Simplify sequence length comparison (simplify-len-comparison)

Suggested change
- assert len(variants_data) > 0
+ assert variants_data

f"for {item.get('char', 'Unknown')}"
)

char = item.get("char", "")

issue (code-quality): We've found these issues:

Comment on lines +43 to +48
fields_per_file = {}
for filename, fields in UNIHAN_MANIFEST.items():
    fields_per_file[filename] = len(fields)

# Verify we have field counts
assert len(fields_per_file) > 0

issue (code-quality): We've found these issues:
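The individual findings are not shown in this view. As one possible tidy-up for the loop on lines +43 to +48, offered as a guess rather than as what Sourcery reported, the counts could be built with a dict comprehension. The `UNIHAN_MANIFEST` below is a small stand-in with an illustrative subset of entries, not the library's real manifest.

```python
# Stand-in mapping of filename -> field names (illustrative subset only).
UNIHAN_MANIFEST = {
    "Unihan_Readings.txt": ("kDefinition", "kMandarin", "kCantonese"),
    "Unihan_Variants.txt": ("kSimplifiedVariant", "kTraditionalVariant"),
}

# One expression replaces the empty-dict-plus-loop pattern.
fields_per_file = {
    filename: len(fields) for filename, fields in UNIHAN_MANIFEST.items()
}

# Verify we have field counts
assert fields_per_file
```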

@tony tony force-pushed the improved-docs branch 2 times, most recently from 1ee19cc to 34763ba Compare February 28, 2025 10:00
tony added 5 commits February 28, 2025 13:34
… examples

This commit resolves all test failures in the example test suite by:

1. Adding proper type annotations across all example tests:
   - Use modern Python type hints (e.g., `list[dict[str, Any]]` instead of `List[Dict[str, Any]]`)
   - Add proper type casts (`cast()`) for handling ambiguous return types
   - Fix incorrect type signatures for function parameters and return values
   - Ensure consistent type annotation style across all test files

2. Fixing test implementation issues:
   - Replace invalid field 'kFrequency' with supported fields in educational_tools test
   - Change 'kRSKangXi' to 'kRSUnicode' in stroke_order test
   - Simplify stroke_order test to use existing data rather than creating a new packager
   - Fix handling of list-type values by properly converting them to strings
   - Add proper null-checks and defensive programming for external data

3. Improving code quality:
   - Replace if-else blocks with ternary operators for conciseness
   - Convert for-loops to list comprehensions where appropriate
   - Add proper error handling for data conversion operations
   - Fix line length issues to comply with style guidelines
   - Add meaningful debug output for troubleshooting

4. Ensuring test robustness:
   - Add fallback mechanisms for tests that depend on specific data patterns
   - Improve assertions to verify data integrity
   - Add type guards to prevent runtime errors with ambiguous types

All tests now pass consistently, type checking with mypy succeeds with zero issues,
and code formatting conforms to project standards. These changes improve code
maintainability, readability, and reliability while providing example code that
demonstrates best practices for using the unihan-etl library.
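
A minimal sketch of the annotation style and list handling the commit message describes: built-in generics (`list[dict[str, Any]]`), `typing.cast()` for a loosely typed return value, and conversion of list-valued fields to strings with a null check. The helper names and sample records are hypothetical and not taken from the PR.

```python
from typing import Any, cast


def load_records() -> Any:
    """Stand-in for a loosely typed data source (hypothetical)."""
    return [
        {"char": "好", "kDefinition": ["good", "well"]},
        {"char": "中", "kDefinition": "middle"},
        {"char": "丂"},  # sample record with the field missing
    ]


def as_text(value: Any) -> str:
    """Normalize a field value that may be a list, a string, or missing."""
    if value is None:
        return ""
    if isinstance(value, list):
        return ", ".join(str(part) for part in value)
    return str(value)


def definitions(records: list[dict[str, Any]]) -> dict[str, str]:
    """Map each character to its definition rendered as a single string."""
    return {r["char"]: as_text(r.get("kDefinition")) for r in records}


# cast() only informs the type checker; it performs no runtime conversion.
records = cast(list[dict[str, Any]], load_records())
result = definitions(records)
```

Because `cast()` does nothing at runtime, the `isinstance` check in `as_text` is what actually guards against list-valued fields.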