Clean up code #18

mscofield0 · 2022-04-14T01:07:26Z

Hi, I stumbled upon this gem of a library when I had the idea of creating a personal cli dictionary for myself. Decided to pitch in somehow so that it gets released to crates.io finally.

Also made the interface of the previously named DictReaderDz not depend on BufRead, but rather have it universally Read + Seek.

baskerville · 2022-04-14T12:16:48Z

You might want to consider the fixes and enhancements present in my own version.

mscofield0 · 2022-04-14T13:09:53Z

You might want to consider the fixes and enhancements present in my own version.

Could you give a quick tl;dr of the changes? ~~Also is there any reason why you didn't PR the changes here?~~ Ah, likely because it's not on crates.io

mscofield0 · 2022-04-14T13:54:58Z

I took a deeper look myself, so only fuzzy searching, right? There's one thing that I don't understand though, what is this lazy parameter in the indexing file? What does it mean exactly?

baskerville · 2022-04-14T16:52:04Z

Could you give a quick tl;dr of the changes?

Major bug fix : the headwords aren't unique, hence the structure holding the headwords can't be an hash map.

Additions:

The index is lazily loaded : initially the index file is only read far enough to retrieve the metadata. That's useful when presenting a list of dictionaries (their names) to the user: we don't want to have to read every index of every dictionary to do so.
Fuzzy searching.

baskerville · 2022-04-14T17:08:52Z

I'm also dealing with the 00-database-allchars and 00-database-case-sensitive (or 00databasecasesensitive) index entries.

baskerville · 2022-04-14T19:02:09Z

I've also added a fix (cf. the normalize function) for bad dictionaries, not generated by dictfmt, where the entries in the index don't honor the lowercased/non-alphanumeric transformation rules.

Another useful addition : I'm handling the optional fourth column (cf. --index-keep-orig in dictfmt's manual) when parsing index entries.

mscofield0 · 2022-04-14T19:12:40Z

Useful changes, thanks, I'll incorporate the changes into the PR

mscofield0 · 2022-04-17T12:39:16Z

I'm a bit confused about the freedict format @baskerville, dict.lookup("Bar") yields Bar\ntest for case-sensitivity\n. I've confirmed that this is the case with other dicts too. It's not really a problem, I was just surprised when I was testing. Do you happen to know why that's the case for the format?

mscofield0 · 2022-04-17T12:49:02Z

I'm a bit confused about the freedict format @baskerville, dict.lookup("Bar") yields Bar\ntest for case-sensitivity\n. I've confirmed that this is the case with other dicts too. It's not really a problem, I was just surprised when I was testing. Do you happen to know why that's the case for the format?

baskerville · 2022-04-17T13:16:45Z

I think you must be referring to the examples in testdata/ and you're right : there's something wrong with those dictionaries (I'm not the author), the headwords shouldn't be in the .dict. You can unpack a dictionary with dictunformat example.index < example.dict > example.txt. If you do that with the test examples, you'll be able to fix the problem in the .txt and regenerate the dictionary with dictfmt --utf8 --index-keep-orig -s Example -u "www.example.com" -t fixed_example < example.txt.

mscofield0 · 2022-04-17T13:18:25Z

It's not only in testdata, it's also in the original test dictionary lat-deu.dict.dz. Is that dictionary wrong too?

baskerville · 2022-04-17T16:40:08Z

It's not only in testdata, it's also in the original test dictionary lat-deu.dict.dz. Is that dictionary wrong too?

The aforementioned dictionary is fine because the definitions don't repeat the headword literally:

afer
Afer (Aferis) <n, sg, m>
(Ägypter wurden nicht mit diesem Begriff bezeichnet)
 1. Nordafrikaner
 2. Karthager

baskerville · 2022-04-17T17:00:49Z

If you're looking for good dictd dictionaries, you might want to checkout my own version of WordNet 3.1.

baskerville · 2022-04-17T17:09:59Z

It might be worth mentioning that I've also wrote sdunpack.

mscofield0 · 2022-04-17T17:38:31Z

Still not really sure what's the difference between the two dicts besides one having additional information alongside the headword and the example one doesn't. Parsing both works as expected I guess.

mscofield0 · 2022-04-17T17:40:47Z

Also, do you think it would be a good idea to introduce a search where diacritics and special characters don't matter? There's a crate unidecode we could use to do that.

baskerville · 2022-04-17T18:50:08Z

Still not really sure what's the difference between the two dicts besides one having additional information alongside the headword and the example one doesn't. Parsing both works as expected I guess.

To put it simply: the .dict file contains the definitions and the .index file contains the headwords.

The rest is just a matter of style.

humenda · 2022-04-17T18:54:58Z

If you're looking for good dictd dictionaries, you might want to checkout [my own version of WordNet 3.1](https://www.mobileread.com/forums/showpost.php?p=4006583&postcount=772).

Actually, the largest collection is FreeDict -- https://freedict.org . Guess why this crate is within this organisation :-).

humenda · 2022-04-17T18:57:18Z

Thanks for your work. I am still on your other PR and haven't found enough time to summarise my review. A short meta comment: would you please try to isolate changes into separate commits and make the commit descriptions more helpful? More than a one-liner and explanations on what was changed why would be great. Thanks again.

baskerville · 2022-04-17T19:14:11Z

Also, do you think it would be a good idea to introduce a search where diacritics and special characters don't matter?

I think it might be problematic because the format itself dictates the transformation that should be applied to the search terms (the one used on the headwords):

--allchars
        Specifies that all characters should be used for the search, by default only alphabetic,
        numeric characters and spaces are put to .index file and therefore are used in search.
        Creates the special entry 00-database-allchars.

mscofield0 · 2022-04-17T19:17:27Z

Would you please try to isolate changes into separate commits and make the commit descriptions more helpful?

Could I just make a changelog, the commits were already made and pushed so resetting isn't exactly a possibility. I could reset and open a new branch if you really need it to be in commit format.

mscofield0 · 2022-04-17T19:20:28Z

I think it might be problematic because the format itself dictates the transformation that should be applied to the search terms (the one used on the headwords):

My understanding of the allchars flag is that it accepts chars like - and _, but it doesn't ignore diacritics and special chars like that (ê, û). Am I mistaken?

baskerville · 2022-04-17T19:46:57Z

I think it might be problematic because the format itself dictates the transformation that should be applied to the search terms (the one used on the headwords):

My understanding of the allchars flag is that it accepts chars like - and _, but it doesn't ignore diacritics and special chars like that (ê, û). Am I mistaken?

--allchars accepts all the chars, but I think you were trying to express the fact that the abscence of --allchars doesn't imply that é becomes e, in other words, the transformation that anyascii provides isn't the default transformation applied to the headwords and that's true.

mscofield0 · 2022-04-17T19:49:12Z

Yeah, that's what I was trying to say. Do you think we should add it?

baskerville · 2022-04-17T20:10:15Z

Do you think we should add it?

We could, but I'm not sure how to implement it efficiently: using binary search would require the anyascii transformation to be already applied to all the headwords.

mscofield0 · 2022-04-17T20:12:16Z

We could map() it first, but yeah, I don't see a more efficient way of doing it.

I guess it doesn't matter though, it would be opt-in. What do you think?

Relaxed queries are queries in which diacritics and other special characters are checked by their transliterated representation.

mscofield0 · 2022-04-19T23:51:44Z

Changelog:

Added

Added support for 00-database-allchars, 00-database-case-sensitive and 00-database-dictfmt
Metadata now exposes info (full dictionary info), short_name, url, all_chars, case_sensitive and should_normalize (if the dict was made by dictfmt)
The 4th optional column in index files is now supported and returned if it was found (since the 1st column would have a modified version of the headword)
Extracted a few of the errors from DictError into IndexError
Added fuzzy searching (up to 1 char)
Added "relaxed" searching where diacritics and other special characters are searched according to their transliteration
Added from_existing() constructor for Dict to allow for custom readers to be used
Added normalize() function that normalizes a non-dictfmt made dictionary according to the dictionary's metadata
Added IndexReader trait to support alternate index readers

Changed

Index now parses its reader lazily, though it parses metadata when it's created
The interfaces of DictReader and IndexReader have been normalized to Read + Seek and BufRead + Seek respectively
Index errors should now give a more detailed location of where the problem is (line + column)
Cleaned up the code

Many thanks to @baskerville for his input and his implementation over at plato, most of it is just his code reworked a tad bit.

You can copy the changelog if you merge-squash this PR into the commit message, @humenda.

src/index/mod.rs

mscofield0 · 2022-04-24T09:00:09Z

@humenda Can you merge the squash merge the PR with the above changelog?

humenda · 2022-04-24T17:44:17Z

Thanks for your work. Just to let you know, I am not able to use the review functionality of GitHub and am hence using plain text comments.

Great that you picked up the refactoring. I have merged the changes to
dict-staging. I would
like to see a few more changes before this goes to master, but wanted to show
you some progress :).

Two general comments (probably said before):

For the next PRs, please use more descriptive commit messages. For now, I've
merged massive rework, but that is not very helpful and nothing I could
ever revert or cherry-pick, etc. It also leaves me oblivious of what the
intension of your changes were.
Please try to submit smaller PRs. I have left you waiting for quite a while.
If you could submit smaller PRs, I would merge them much faster.

I would like you to ask to rework the following things:

Could we avoid two error modules, i.e. merge this into one error type in the
top-level, i.e. dict::error?
Would it make sense to have a dedicated module for the .dict readers, as you
have created for the .index file?
Could you feature-gate unidecode?
Please consider making find return an iterator instead of a Vec.
find is complex and hard to complrehend. It might make sense to have
find_relaxed and find_fuzzy, called by find, to keep a
easy-to-understand structure.
Less importantly, I find the style with # aguments a bit uncommon. That
looks like good old Doxygen/Javadoc and repeats information that is usually
inferred from the argument name and its type.

Would you be willing to make a PR with the requested changes to dict-staging?
Thanks!

humenda · 2022-10-11T06:58:52Z

Thanks, I'll review the changes today. Could you do me a favour and rebase and squash the changes yourself?

mscofield0 added 4 commits April 14, 2022 02:38

Massive rework

b03a4ba

Tiny rework

0e6e50f

Update readme

b5f4142

Update interface of Index

e0b5854

This was referenced Apr 14, 2022

Add mmap support #19

Draft

Publish on crates.io? #6

Open

Rework according to @baskerville's changes

0cc5d5d

mscofield0 added 4 commits April 20, 2022 01:01

Added support for "relaxed" queries

4bd457f

Relaxed queries are queries in which diacritics and other special characters are checked by their transliterated representation.

Use 4th column as a result headword if possible

9dcf343

Fixed missing normalize check

4b94b8f

Added more docs

b415c17

baskerville reviewed Apr 20, 2022

View reviewed changes

src/index/mod.rs Outdated Show resolved Hide resolved

Fixed bug with using binary search for searching transliterated words

436d837

humenda closed this Apr 24, 2022

baskerville mentioned this pull request Oct 11, 2023

Improve performance a lot baskerville/sdunpack#1

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Clean up code #18

Clean up code #18

mscofield0 commented Apr 14, 2022

baskerville commented Apr 14, 2022

mscofield0 commented Apr 14, 2022 •

edited

Loading

mscofield0 commented Apr 14, 2022

baskerville commented Apr 14, 2022

baskerville commented Apr 14, 2022

baskerville commented Apr 14, 2022

mscofield0 commented Apr 14, 2022

mscofield0 commented Apr 17, 2022

mscofield0 commented Apr 17, 2022

baskerville commented Apr 17, 2022

mscofield0 commented Apr 17, 2022

baskerville commented Apr 17, 2022

baskerville commented Apr 17, 2022

baskerville commented Apr 17, 2022

mscofield0 commented Apr 17, 2022

mscofield0 commented Apr 17, 2022

baskerville commented Apr 17, 2022

humenda commented Apr 17, 2022 via email

humenda commented Apr 17, 2022 via email

baskerville commented Apr 17, 2022

mscofield0 commented Apr 17, 2022

mscofield0 commented Apr 17, 2022

baskerville commented Apr 17, 2022

mscofield0 commented Apr 17, 2022

baskerville commented Apr 17, 2022

mscofield0 commented Apr 17, 2022

mscofield0 commented Apr 19, 2022 •

edited

Loading

mscofield0 commented Apr 24, 2022

humenda commented Apr 24, 2022

humenda commented Oct 11, 2022 via email

Clean up code #18

Clean up code #18

Conversation

mscofield0 commented Apr 14, 2022

baskerville commented Apr 14, 2022

mscofield0 commented Apr 14, 2022 • edited Loading

mscofield0 commented Apr 14, 2022

baskerville commented Apr 14, 2022

baskerville commented Apr 14, 2022

baskerville commented Apr 14, 2022

mscofield0 commented Apr 14, 2022

mscofield0 commented Apr 17, 2022

mscofield0 commented Apr 17, 2022

baskerville commented Apr 17, 2022

mscofield0 commented Apr 17, 2022

baskerville commented Apr 17, 2022

baskerville commented Apr 17, 2022

baskerville commented Apr 17, 2022

mscofield0 commented Apr 17, 2022

mscofield0 commented Apr 17, 2022

baskerville commented Apr 17, 2022

humenda commented Apr 17, 2022 via email

humenda commented Apr 17, 2022 via email

baskerville commented Apr 17, 2022

mscofield0 commented Apr 17, 2022

mscofield0 commented Apr 17, 2022

baskerville commented Apr 17, 2022

mscofield0 commented Apr 17, 2022

baskerville commented Apr 17, 2022

mscofield0 commented Apr 17, 2022

mscofield0 commented Apr 19, 2022 • edited Loading

Changelog:

Added

Changed

mscofield0 commented Apr 24, 2022

humenda commented Apr 24, 2022

humenda commented Oct 11, 2022 via email

mscofield0 commented Apr 14, 2022 •

edited

Loading

mscofield0 commented Apr 19, 2022 •

edited

Loading