If it's convenient for you, could you kindly provide the data you are using? #1

GaoMengnana · 2024-12-18T16:42:58Z

While reading papers, I am unsure about the form in which data is provided to LLMs. Is the OCR output directly provided to the model as a whole, asking it to make improvements, or are the errors manually identified first, and then the model is tasked with correcting only the erroneous parts?

JonnoB · 2024-12-18T23:29:07Z

This is for correction of OCR error only, so you provide the LLM with the OCR output as is, and it attempts to correct it.

For example, paste the following into one of the frontier models (ChatGPT, Sonnet 3.5, Mistral Large, Gemini, etc.):
"""
You are an expert at OCR extraction please correct the corrupted OCR below

FOR CHEAP WATCHES , Clocks , Gold Chains , and Jewellery , go to Lombard KIBBLE ' street S , 33 , G and racechurch 51 Ludgate street hill , , op one posite door the from Old fr si B lver ailey om Nine . ditto Gold Shillings , One watches Pound and . T Sixpence wo Five Pourids Shillings each Fifteen . ; Every time Shillings -pieces article , ; exchanird warranted . . List Plate of , prices watches post , and freft jewellery bought or ;
"""

The LM will correct the text; this is the basic idea of CLOCR-C. It is useful when you have a large amount of text from an OCR process and don't have the resources to extract the text again.
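The same idea can be sketched in Python. This is only an illustration, not the code from the paper: `build_prompt`, `correct_ocr`, and `echo_model` are hypothetical names, and `call_model` stands for whatever function you use to send a prompt to your chosen model and get its reply back as a string.

```python
from typing import Callable

# Instruction taken from the example prompt above; the OCR text is appended as is.
PROMPT_TEMPLATE = (
    "You are an expert at OCR extraction please correct the corrupted OCR below\n\n"
    "{ocr_text}"
)


def build_prompt(ocr_text: str) -> str:
    """Wrap the raw OCR output in the correction instruction."""
    return PROMPT_TEMPLATE.format(ocr_text=ocr_text.strip())


def correct_ocr(ocr_text: str, call_model: Callable[[str], str]) -> str:
    """Send the whole OCR output to the model and return its corrected text.

    `call_model` is an assumption: any function that takes a prompt string
    and returns the model's reply as a string.
    """
    return call_model(build_prompt(ocr_text)).strip()


def echo_model(prompt: str) -> str:
    """Stand-in model for testing the plumbing: returns the OCR text unchanged."""
    return prompt.split("\n\n", 1)[1]


if __name__ == "__main__":
    sample = "FOR CHEAP WATCHES , Clocks , Gold Chains , and Jewellery"
    print(correct_ocr(sample, echo_model))
```

Note that no errors are marked up in advance: the model receives the full OCR output and decides for itself what to change.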

However, this paper is somewhat out of date now. I have a newer paper called scrambledtext. The rate of change in the LM space is such that I think the whole CLOCR-C concept will be outdated soon, but it is interesting and may be useful to you depending on your use case.

GaoMengnana · 2024-12-20T10:13:06Z

Thank you for your response. While reviewing your work, I noticed that your conclusions differ from those of an article published in February. May I ask what factors you think might have contributed to this difference? Could it possibly be due to variations in the datasets used?

JonnoB · 2024-12-20T10:58:38Z

If this is the Boros et al. paper, it is difficult to know. I first thought that the differences were due to older versions of the language models. However, I now think it is because the distribution of correction improvement is quite skewed. In my paper I used the median, whilst Boros et al. appear to have used the mean. In certain cases, particularly when there is hallucination, the error in the "corrected" text can be massive compared to the original. This causes a divergence between the mean and median results.

I re-checked my results using the mean instead of the median, and they are generally very poor. Although I haven't checked the code of Boros et al., I would assume they used the mean instead of the median, and their results were overly influenced by the very poor corrections that can sometimes be produced. This has a particularly big impact on the normalised Levenshtein distance they use, which would disproportionately return a value of -1, pushing their mean values negative.

I assumed that the results would be highly skewed because of the tendency of LMs to hallucinate on particularly corrupted data, so I checked the distributions and chose the median to be robust to skew.

I think this would explain the majority of the difference between our results.
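A small sketch of how a single hallucinated output can pull the mean and median apart. The metric here is an assumption chosen for illustration (relative reduction in edit distance against the ground truth), not necessarily the exact metric used in either paper, and the four documents are made up.

```python
from statistics import mean, median


def levenshtein(a: str, b: str) -> int:
    """Edit distance between two strings (insertions, deletions, substitutions)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def error_reduction(reference: str, ocr: str, corrected: str) -> float:
    """Relative reduction in edit distance: 1.0 is a perfect fix, negative is worse than the OCR."""
    before = levenshtein(reference, ocr)
    after = levenshtein(reference, corrected)
    if before == 0:
        raise ValueError("OCR text already matches the reference")
    return (before - after) / before


reference = "gold watches"
# (ocr, corrected) pairs; the last "correction" is a hallucinated continuation.
examples = [
    ("g0ld watchcs", "gold watches"),
    ("go1d watches", "gold watches"),
    ("gold w4tches", "gold watchs"),
    ("gold watchcs", "gold watches and fine silver chains for sale"),
]
scores = [error_reduction(reference, ocr, fixed) for ocr, fixed in examples]

if __name__ == "__main__":
    print("scores:", scores)
    print("mean:  ", mean(scores))
    print("median:", median(scores))
```

Three of the four corrections are as good as or better than the OCR input, yet the one hallucinated output is enough to drive the mean strongly negative, while the median stays positive.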

JonnoB · 2024-12-20T17:30:55Z

I have submitted the data repository for publication. It has to be reviewed first, and as it is Christmas this may not happen until the new year. However, the scrambled text repo is already public and contains the same data in Hugging Face format. You can access it here:

https://rdr.ucl.ac.uk/articles/dataset/Scrambled_text_training_Language_Models_to_correct_OCR_errors_using_synthetic_data/27108334

When the CLOCR-C repo is made public, the data will be stored as text files, which you may find more convenient.

GaoMengnana · 2024-12-23T12:46:40Z

Thank you for all your explanations; they have been very helpful to me. I will also take some time to read your latest work. If you have any new progress you'd like to share, feel free to reach out to me via this email. Lastly, I wish you a Merry Christmas and a wonderful holiday season!

JonnoB · 2025-01-07T18:00:53Z

The data is now publicly available on the UCL data repository:

https://rdr.ucl.ac.uk/articles/dataset/Transcribed_newspaper_articles_from_the_NCSE_collection/25805008?file=46281040

Your email is not public, so I can only reply to the discussion thread on the repo. If you would like me to email you, please use the corresponding author's email address on the paper.
