If it's convenient for you, could you kindly provide the data you are using? #1
This is for correction of OCR errors only, so you provide the LLM with the OCR output as is, and it attempts to correct it. For example, paste the following into one of the frontier models (ChatGPT, Sonnet 3.5, Mistral Large, Gemini, etc.):
"""
You are an expert at OCR extraction please correct the corrupted OCR below
FOR CHEAP WATCHES , Clocks , Gold Chains , and Jewellery , go to Lombard KIBBLE ' street S , 33 , G and racechurch 51 Ludgate street hill , , op one posite door the from Old fr si B lver ailey om Nine . ditto Gold Shillings , One watches Pound and . T Sixpence wo Five Pourids Shillings each Fifteen . ; Every time Shillings -pieces article , ; exchanird warranted . . List Plate of , prices watches post , and freft jewellery bought or ;
"""
The LM will correct the text; this is the basic idea of CLOCR-C. It is useful when you have a large amount of text from an OCR process and don't have the resources to extract the text again.
However, this paper is somewhat out of date now. I have a newer paper called scrambledtext. The rate of change in the LM space is such that I think the whole CLOCR-C concept will be outdated soon. But it is interesting and may be useful to you depending on your use case.
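The same prompt can also be sent programmatically rather than pasted into a chat interface. Below is a minimal sketch using the OpenAI Python client; the client, model name, and prompt wording are illustrative assumptions, not the exact setup used in the paper.

# Minimal sketch: prompting an LLM to correct corrupted OCR output (the CLOCR-C idea).
# Assumes `pip install openai` and OPENAI_API_KEY set in the environment.
# The model name and prompt wording are illustrative only.
from openai import OpenAI

client = OpenAI()

ocr_text = (
    "FOR CHEAP WATCHES , Clocks , Gold Chains , and Jewellery , "
    "go to Lombard KIBBLE ' street S , 33 , ..."  # the corrupted OCR passage
)

response = client.chat.completions.create(
    model="gpt-4o",  # any frontier chat model could be substituted here
    messages=[
        {"role": "system", "content": "You are an expert at OCR extraction please correct the corrupted OCR below"},
        {"role": "user", "content": ocr_text},
    ],
)

print(response.choices[0].message.content)  # the model's corrected text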
Thank you for your response. While reviewing your work, I noticed that your conclusions differ from those of an article published in February. May I ask what factors you think might have contributed to this difference? Could it possibly be due to variations in the datasets used?
If this is the Boros et al. paper, it is difficult to know for certain. I first thought the differences were due to older versions of the language models. However, I now think it is because the distribution of correction improvement is quite skewed. In my paper I used the median, whilst Boros et al. used the mean. In certain cases, particularly when there is hallucination, the "corrected" error can be massive compared to the original, which causes the mean and median results to diverge.
I checked my results using the mean instead of the median, and it shows generally very poor results. Although I haven't checked the code of Boros et al., I would assume they used the mean, and their results were overly influenced by the very poor corrections that can sometimes be produced. This has a particularly big impact on the normalised Levenshtein distance they use, which would disproportionately return a value of -1, pushing their mean values negative. I expected the results to be highly skewed because of the tendency of LMs to hallucinate on particularly corrupted data, so I checked the distributions and chose the median to be robust to skew. I think this explains the majority of the difference between our results.
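As a toy illustration of the skew argument (the numbers below are invented for demonstration, not taken from either paper): if most documents improve modestly but a few hallucinated corrections are clamped at the metric's floor of -1, the mean is dragged negative while the median still reflects the typical document.

# Toy illustration (invented numbers): why mean and median diverge when a few
# hallucinated "corrections" hit the worst possible improvement score (-1 here).
import statistics

# Hypothetical per-document improvement scores: mostly modest gains,
# with three hallucinated corrections clamped at -1.
improvements = [0.4, 0.35, 0.5, 0.45, 0.3, 0.42, 0.38, -1.0, -1.0, -1.0]

print("mean:  ", round(statistics.mean(improvements), 3))    # -0.02, slightly negative
print("median:", round(statistics.median(improvements), 3))  # 0.365, the typical case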
I have submitted the data repository to be published; it has to be reviewed first, and as it is Christmas this may not happen until the new year. However, the scrambled text repo is already public and contains the same data in Hugging Face format. You can access it here:
https://rdr.ucl.ac.uk/articles/dataset/Scrambled_text_training_Language_Models_to_correct_OCR_errors_using_synthetic_data/27108334
When the CLOCR-C repo is made public, the data is stored as text files, which you may find more convenient.
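For reference, data stored in the Hugging Face datasets format can typically be loaded as below once downloaded; the path and the "train" split name are placeholders, not confirmed names from the repository.

# Sketch of loading a dataset saved in Hugging Face `datasets` format.
# The path is a placeholder for wherever you put the downloaded data.
from datasets import load_from_disk

dataset = load_from_disk("path/to/scrambled_text_data")  # placeholder path
print(dataset)               # shows the splits and column names
print(dataset["train"][0])   # inspect the first record, assuming a "train" split exists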
Thank you for all your explanations; they have been very helpful to me. I will also take some time to read your latest work. If you have any new progress you'd like to share, feel free to reach out to me via this email. Lastly, I wish you a Merry Christmas and a wonderful holiday season!
The data is now publicly available on the UCL data repository. Your email is not public, so I can only reply to the discussion thread on the repo. If you would like me to email you, please use the corresponding email address on the paper.
While reading the papers, I am unsure about the form in which the data is provided to the LLMs. Is the OCR output provided to the model directly as a whole, asking it to make improvements, or are the errors identified manually first, with the model then tasked with correcting only the erroneous parts?