If it's convenient for you, could you kindly provide the data you are using? #1

Open
GaoMengnana opened this issue Dec 18, 2024 · 6 comments

@GaoMengnana

While reading papers, I am unsure about the form in which data is provided to LLMs. Is the OCR output directly provided to the model as a whole, asking it to make improvements, or are the errors manually identified first, and then the model is tasked with correcting only the erroneous parts?

@JonnoB
Owner

JonnoB commented Dec 18, 2024

This is for correction of OCR errors only, so you provide the LLM with the OCR output as is, and it attempts to correct it.

For example, paste the following into one of the frontier models (ChatGPT, Sonnet 3.5, Mistral Large, Gemini, etc.):
"""
You are an expert at OCR extraction please correct the corrupted OCR below

FOR CHEAP WATCHES , Clocks , Gold Chains , and Jewellery , go to Lombard KIBBLE ' street S , 33 , G and racechurch 51 Ludgate street hill , , op one posite door the from Old fr si B lver ailey om Nine . ditto Gold Shillings , One watches Pound and . T Sixpence wo Five Pourids Shillings each Fifteen . ; Every time Shillings -pieces article , ; exchanird warranted . . List Plate of , prices watches post , and freft jewellery bought or ;
"""

The LM will correct the text. This is the basic idea of CLOCR-C. It is useful when you have a large amount of text from an OCR process and don't have the resources to extract the text again.
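To apply the same prompt to many documents, the idea above can be sketched in a few lines of Python. This is only an illustration, not the code from the paper: `build_prompt`, `correct_ocr`, and the `complete` callable are hypothetical names, and `complete` stands in for whichever model client you use (it takes a prompt string and returns the model's reply).

```python
from typing import Callable

# Instruction text copied from the example prompt in this thread.
INSTRUCTION = (
    "You are an expert at OCR extraction please correct the corrupted OCR below"
)


def build_prompt(ocr_text: str) -> str:
    """Put the raw OCR output, unchanged, underneath the instruction."""
    return f"{INSTRUCTION}\n\n{ocr_text.strip()}"


def correct_ocr(ocr_text: str, complete: Callable[[str], str]) -> str:
    """Send the whole OCR output to the model and return its corrected text.

    `complete` is an assumption: any function that maps a prompt string to
    the model's reply, e.g. a thin wrapper around your provider's client.
    """
    return complete(build_prompt(ocr_text)).strip()
```

No errors are marked up beforehand; the model sees the full OCR text and returns a full corrected version.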

However, this paper is somewhat out of date now. I have a newer paper called scrambledtext. The rate of change in the LM space is such that I think the whole CLOCR-C concept will be outdated soon. But it is interesting and may be useful to you, depending on your use case.

@GaoMengnana
Author

GaoMengnana commented Dec 20, 2024 via email

@JonnoB
Owner

JonnoB commented Dec 20, 2024

If this is the Boros et al. paper, it is difficult to know. I first thought that the differences were due to older versions of the language model. However, I now think it is because the distribution of correction improvement is quite skewed. In my paper I used the median, whilst Boros et al. appear to have used the mean. In certain cases, particularly when there is hallucination, the "corrected" error can be massive compared to the original. This causes the mean and median results to diverge.

I recomputed my results using the mean instead of the median, and they are generally very poor. Although I haven't checked the code of Boros et al., I would assume they used the mean, so their results were overly influenced by the very poor corrections that are sometimes produced. This has a particularly big impact on the normalised Levenshtein distance they use, which would disproportionately return a value of -1, pushing their mean values negative.

I expected the results to be highly skewed because LMs tend to hallucinate on particularly corrupted data, so I checked the distributions and chose the median to be robust to skew.

I think this would explain the majority of the difference between our results.
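A small Python sketch of the effect, under stated assumptions: the score below is a relative reduction in Levenshtein distance clipped at -1, which is my reading of the metric and may not match the exact definition in Boros et al. The function names and the score list are made up purely for illustration.

```python
from statistics import mean, median


def levenshtein(a: str, b: str) -> int:
    """Dynamic-programming edit distance (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]


def relative_improvement(truth: str, ocr: str, corrected: str) -> float:
    """Share of the OCR error removed by correction, clipped below at -1.

    1.0 means every error was fixed; -1.0 means the "correction" is at
    least twice as far from the ground truth as the OCR was.
    The clipping rule is an assumption, not taken from either paper.
    """
    before = levenshtein(ocr, truth)
    after = levenshtein(corrected, truth)
    if before == 0:
        return 0.0 if after == 0 else -1.0
    return max(-1.0, (before - after) / before)


# Made-up scores: four good corrections and two hallucinated ones.
scores = [1.0, 0.8, 0.7, 0.6, -1.0, -1.0]
print("mean:", mean(scores))
print("median:", median(scores))
```

With a few hallucinated outputs pinned at -1, the mean drops well below the median even though most corrections are good, which is the divergence described above.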

@JonnoB
Owner

JonnoB commented Dec 20, 2024

I have submitted the data repository for publication. It has to be reviewed first, and as it is Christmas this may not happen until the new year. However, the scrambled text repo is already public and contains the same data in Hugging Face format. You can access it here:

https://rdr.ucl.ac.uk/articles/dataset/Scrambled_text_training_Language_Models_to_correct_OCR_errors_using_synthetic_data/27108334

When the CLOCR-C repo is made public, the data will be stored as text files, which you may find more convenient.

@GaoMengnana
Author

GaoMengnana commented Dec 23, 2024 via email

@JonnoB
Owner

JonnoB commented Jan 7, 2025

The data is now publicly available on the UCL data repository:

https://rdr.ucl.ac.uk/articles/dataset/Transcribed_newspaper_articles_from_the_NCSE_collection/25805008?file=46281040

Your email is not public, so I can only reply in the discussion thread on the repo. If you would like me to email you, please write to the corresponding author's email address given in the paper.
