Skip to content
This repository has been archived by the owner on Mar 2, 2024. It is now read-only.

Clean up code #18

Closed
wants to merge 10 commits into from
Closed

Clean up code #18

wants to merge 10 commits into from

Conversation

mscofield0
Copy link

Hi, I stumbled upon this gem of a library when I had the idea of creating a personal cli dictionary for myself. Decided to pitch in somehow so that it gets released to crates.io finally.

Also made the interface of the previously named DictReaderDz not depend on BufRead, but rather have it universally Read + Seek.

@baskerville
Copy link
Contributor

You might want to consider the fixes and enhancements present in my own version.

This was referenced Apr 14, 2022
@mscofield0
Copy link
Author

mscofield0 commented Apr 14, 2022

You might want to consider the fixes and enhancements present in my own version.

Could you give a quick tl;dr of the changes? Also is there any reason why you didn't PR the changes here? Ah, likely because it's not on crates.io

@mscofield0
Copy link
Author

I took a deeper look myself, so only fuzzy searching, right? There's one thing that I don't understand though, what is this lazy parameter in the indexing file? What does it mean exactly?

@baskerville
Copy link
Contributor

Could you give a quick tl;dr of the changes?

Major bug fix : the headwords aren't unique, hence the structure holding the headwords can't be an hash map.

Additions:

  • The index is lazily loaded : initially the index file is only read far enough to retrieve the metadata. That's useful when presenting a list of dictionaries (their names) to the user: we don't want to have to read every index of every dictionary to do so.
  • Fuzzy searching.

@baskerville
Copy link
Contributor

I'm also dealing with the 00-database-allchars and 00-database-case-sensitive (or 00databasecasesensitive) index entries.

@baskerville
Copy link
Contributor

I've also added a fix (cf. the normalize function) for bad dictionaries, not generated by dictfmt, where the entries in the index don't honor the lowercased/non-alphanumeric transformation rules.

Another useful addition : I'm handling the optional fourth column (cf. --index-keep-orig in dictfmt's manual) when parsing index entries.

@mscofield0
Copy link
Author

Useful changes, thanks, I'll incorporate the changes into the PR

@mscofield0
Copy link
Author

I'm a bit confused about the freedict format @baskerville, dict.lookup("Bar") yields Bar\ntest for case-sensitivity\n. I've confirmed that this is the case with other dicts too. It's not really a problem, I was just surprised when I was testing. Do you happen to know why that's the case for the format?

1 similar comment
@mscofield0
Copy link
Author

I'm a bit confused about the freedict format @baskerville, dict.lookup("Bar") yields Bar\ntest for case-sensitivity\n. I've confirmed that this is the case with other dicts too. It's not really a problem, I was just surprised when I was testing. Do you happen to know why that's the case for the format?

@baskerville
Copy link
Contributor

I think you must be referring to the examples in testdata/ and you're right : there's something wrong with those dictionaries (I'm not the author), the headwords shouldn't be in the .dict. You can unpack a dictionary with dictunformat example.index < example.dict > example.txt. If you do that with the test examples, you'll be able to fix the problem in the .txt and regenerate the dictionary with dictfmt --utf8 --index-keep-orig -s Example -u "www.example.com" -t fixed_example < example.txt.

@mscofield0
Copy link
Author

It's not only in testdata, it's also in the original test dictionary lat-deu.dict.dz. Is that dictionary wrong too?

@baskerville
Copy link
Contributor

It's not only in testdata, it's also in the original test dictionary lat-deu.dict.dz. Is that dictionary wrong too?

The aforementioned dictionary is fine because the definitions don't repeat the headword literally:

afer
Afer (Aferis) <n, sg, m>
(Ägypter wurden nicht mit diesem Begriff bezeichnet)
 1. Nordafrikaner
 2. Karthager

@baskerville
Copy link
Contributor

If you're looking for good dictd dictionaries, you might want to checkout my own version of WordNet 3.1.

@baskerville
Copy link
Contributor

It might be worth mentioning that I've also wrote sdunpack.

@mscofield0
Copy link
Author

Still not really sure what's the difference between the two dicts besides one having additional information alongside the headword and the example one doesn't. Parsing both works as expected I guess.

@mscofield0
Copy link
Author

Also, do you think it would be a good idea to introduce a search where diacritics and special characters don't matter? There's a crate unidecode we could use to do that.

@baskerville
Copy link
Contributor

Still not really sure what's the difference between the two dicts besides one having additional information alongside the headword and the example one doesn't. Parsing both works as expected I guess.

To put it simply: the .dict file contains the definitions and the .index file contains the headwords.

The rest is just a matter of style.

@humenda
Copy link
Member

humenda commented Apr 17, 2022 via email

@humenda
Copy link
Member

humenda commented Apr 17, 2022 via email

@baskerville
Copy link
Contributor

Also, do you think it would be a good idea to introduce a search where diacritics and special characters don't matter?

I think it might be problematic because the format itself dictates the transformation that should be applied to the search terms (the one used on the headwords):

--allchars
        Specifies that all characters should be used for the search, by default only alphabetic,
        numeric characters and spaces are put to .index file and therefore are used in search.
        Creates the special entry 00-database-allchars.

@mscofield0
Copy link
Author

Would you please try to isolate changes into separate commits and make the commit descriptions more helpful?

Could I just make a changelog, the commits were already made and pushed so resetting isn't exactly a possibility. I could reset and open a new branch if you really need it to be in commit format.

@mscofield0
Copy link
Author

I think it might be problematic because the format itself dictates the transformation that should be applied to the search terms (the one used on the headwords):

My understanding of the allchars flag is that it accepts chars like - and _, but it doesn't ignore diacritics and special chars like that (ê, û). Am I mistaken?

@baskerville
Copy link
Contributor

I think it might be problematic because the format itself dictates the transformation that should be applied to the search terms (the one used on the headwords):

My understanding of the allchars flag is that it accepts chars like - and _, but it doesn't ignore diacritics and special chars like that (ê, û). Am I mistaken?

--allchars accepts all the chars, but I think you were trying to express the fact that the abscence of --allchars doesn't imply that é becomes e, in other words, the transformation that anyascii provides isn't the default transformation applied to the headwords and that's true.

@mscofield0
Copy link
Author

Yeah, that's what I was trying to say. Do you think we should add it?

@baskerville
Copy link
Contributor

Do you think we should add it?

We could, but I'm not sure how to implement it efficiently: using binary search would require the anyascii transformation to be already applied to all the headwords.

@mscofield0
Copy link
Author

We could map() it first, but yeah, I don't see a more efficient way of doing it.

I guess it doesn't matter though, it would be opt-in. What do you think?

Relaxed queries are queries in which diacritics and other special
characters are checked by their transliterated representation.
@mscofield0
Copy link
Author

mscofield0 commented Apr 19, 2022

Changelog:

Added

  • Added support for 00-database-allchars, 00-database-case-sensitive and 00-database-dictfmt
  • Metadata now exposes info (full dictionary info), short_name, url, all_chars, case_sensitive and should_normalize (if the dict was made by dictfmt)
  • The 4th optional column in index files is now supported and returned if it was found (since the 1st column would have a modified version of the headword)
  • Extracted a few of the errors from DictError into IndexError
  • Added fuzzy searching (up to 1 char)
  • Added "relaxed" searching where diacritics and other special characters are searched according to their transliteration
  • Added from_existing() constructor for Dict to allow for custom readers to be used
  • Added normalize() function that normalizes a non-dictfmt made dictionary according to the dictionary's metadata
  • Added IndexReader trait to support alternate index readers

Changed

  • Index now parses its reader lazily, though it parses metadata when it's created
  • The interfaces of DictReader and IndexReader have been normalized to Read + Seek and BufRead + Seek respectively
  • Index errors should now give a more detailed location of where the problem is (line + column)
  • Cleaned up the code

Many thanks to @baskerville for his input and his implementation over at plato, most of it is just his code reworked a tad bit.

You can copy the changelog if you merge-squash this PR into the commit message, @humenda.

@mscofield0
Copy link
Author

@humenda Can you merge the squash merge the PR with the above changelog?

@humenda
Copy link
Member

humenda commented Apr 24, 2022

Thanks for your work. Just to let you know, I am not able to use the review functionality of GitHub and am hence using plain text comments.

Great that you picked up the refactoring. I have merged the changes to
dict-staging. I would
like to see a few more changes before this goes to master, but wanted to show
you some progress :).

Two general comments (probably said before):

  • For the next PRs, please use more descriptive commit messages. For now, I've
    merged massive rework, but that is not very helpful and nothing I could
    ever revert or cherry-pick, etc. It also leaves me oblivious of what the
    intension of your changes were.
  • Please try to submit smaller PRs. I have left you waiting for quite a while.
    If you could submit smaller PRs, I would merge them much faster.

I would like you to ask to rework the following things:

  • Could we avoid two error modules, i.e. merge this into one error type in the
    top-level, i.e. dict::error?
  • Would it make sense to have a dedicated module for the .dict readers, as you
    have created for the .index file?
  • Could you feature-gate unidecode?
  • Please consider making find return an iterator instead of a Vec.
  • find is complex and hard to complrehend. It might make sense to have
    find_relaxed and find_fuzzy, called by find, to keep a
    easy-to-understand structure.
  • Less importantly, I find the style with # aguments a bit uncommon. That
    looks like good old Doxygen/Javadoc and repeats information that is usually
    inferred from the argument name and its type.

Would you be willing to make a PR with the requested changes to dict-staging?
Thanks!

@humenda humenda closed this Apr 24, 2022
@humenda
Copy link
Member

humenda commented Oct 11, 2022 via email

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants