Document Classifier

The idea behind using a document classifier for a job portal is to give a seamless experience to job seekers as well as recruiters. The Document Classifier classifies a document by its content, even if the name of the .pdf or .jpg file does not match the actual document enclosed in it.

Dataset:

For this POC, I collected datasets from several websites. The dataset mainly consists of personal documents, including Aadhar Card, PAN Card, Marksheets...

Some examples of inferring topics from keywords:

Aadhar Card -- Government, India, Name, DOB, Gender, Aadhar
PAN Card -- Income, Tax, Department, India, Permanent, Account, Number, Card, Name, Father, Date, Birth, Signature

Implementation:

Identifying the file as an image or a PDF.
- If it is a PDF, convert it to images (using Pillow and the PyMuPDF library to extract all images in .png format).
- Otherwise, continue with the image as-is.
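
A minimal sketch of this first step, under stated assumptions. The file type is detected from the leading bytes of the file rather than from its name, since the file name cannot be trusted here. The helper names `detect_kind` and `pdf_to_png` are hypothetical, and `pdf_to_png` assumes PyMuPDF is installed and renders each page to a PNG, which may differ from how the notebook in this repository extracts images.

```python
from pathlib import Path

# Well-known file signatures (magic bytes) for the formats mentioned above.
PDF_MAGIC = b"%PDF-"
IMAGE_MAGICS = (b"\xff\xd8\xff", b"\x89PNG\r\n\x1a\n")  # JPEG, PNG


def detect_kind(header: bytes) -> str:
    """Return 'pdf', 'image', or 'unknown' based on the first bytes of a file."""
    if header.startswith(PDF_MAGIC):
        return "pdf"
    if any(header.startswith(magic) for magic in IMAGE_MAGICS):
        return "image"
    return "unknown"


def pdf_to_png(pdf_path, out_dir):
    """Render every page of a PDF to a PNG file (assumes PyMuPDF is installed)."""
    import fitz  # PyMuPDF; imported lazily so detect_kind works without it

    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    with fitz.open(pdf_path) as doc:
        for index, page in enumerate(doc):
            pix = page.get_pixmap()  # rasterize the page at default resolution
            target = out_dir / f"page_{index + 1}.png"
            pix.save(str(target))
            written.append(target)
    return written
```

For a file on disk, `detect_kind(Path(name).read_bytes()[:8])` would give the kind; only files reported as `"pdf"` need to go through `pdf_to_png`.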
Text Extraction - Tesseract (OCR) / Apache Tika is used to extract text from the images.
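
The extraction step could look roughly like the sketch below. It assumes the `pytesseract` wrapper and Pillow are installed along with the Tesseract binary; the Apache Tika route is not shown. The function names `ocr_image` and `normalize` are hypothetical, and the whitespace clean-up is an added convenience, not something the original pipeline is known to do.

```python
import re


def ocr_image(image_path):
    """Run Tesseract OCR on one image and return the raw text."""
    # Imported lazily: both packages are assumed to be installed.
    from PIL import Image
    import pytesseract

    return pytesseract.image_to_string(Image.open(image_path))


def normalize(text):
    """Collapse the line breaks and repeated spaces that OCR output tends to contain."""
    return re.sub(r"\s+", " ", text).strip()
```

A call such as `normalize(ocr_image("page_1.png"))` would then yield a single line of text per image, ready for the classification step.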
Identifying words - Supervised machine learning, where every data point in the training data (the text extracted from each document) is labelled with the right answer: the correct category for that document.
The job of the machine learning algorithm will then be to find similarities between the documents in each category.
Multinomial Naive Bayes classifier -
- Works by first going through all the training data and counting the occurrences of each word in each category.
- Using those counts, it calculates the probability of each word given each category.
- When a new document needs to be categorized, it calculates the probability of that document belonging to each category by combining the probabilities of the words in it, and then chooses the category with the highest probability.
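
The three steps above can be sketched in plain Python. This is an illustrative toy, not the code used in this project: the class name `TinyNaiveBayes`, the tokenizer, and the two training documents (built from the keyword examples earlier in this README) are assumptions. It also adds Laplace (add-one) smoothing and works in log space, two standard choices the description above does not mention.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and keep alphabetic words only."""
    return re.findall(r"[a-z]+", text.lower())


class TinyNaiveBayes:
    def fit(self, docs, labels):
        # Step 1: count the occurrences of each word in each category.
        self.word_counts = defaultdict(Counter)
        self.doc_counts = Counter(labels)
        for text, label in zip(docs, labels):
            self.word_counts[label].update(tokenize(text))
        self.vocab = {w for counts in self.word_counts.values() for w in counts}
        self.total_docs = len(labels)
        return self

    def log_scores(self, text):
        # Steps 2 and 3: combine per-word probabilities for each category.
        # Logs are summed instead of multiplying probabilities to avoid underflow.
        scores = {}
        for label, n_docs in self.doc_counts.items():
            counts = self.word_counts[label]
            total_words = sum(counts.values())
            score = math.log(n_docs / self.total_docs)  # category prior
            for word in tokenize(text):
                if word not in self.vocab:
                    continue  # words never seen in training carry no signal
                # Add-one smoothing keeps unseen (word, category) pairs above zero.
                score += math.log((counts[word] + 1) / (total_words + len(self.vocab)))
            scores[label] = score
        return scores

    def predict(self, text):
        scores = self.log_scores(text)
        return max(scores, key=scores.get)


# Toy training set assembled from the keyword examples in this README.
train_docs = [
    "Government of India Name DOB Gender Aadhar",
    "Income Tax Department India Permanent Account Number Card "
    "Name Father Date Birth Signature",
]
train_labels = ["aadhar", "pan"]
model = TinyNaiveBayes().fit(train_docs, train_labels)
```

With this toy model, `model.predict("Income Tax Department Permanent Account Number")` picks the `pan` category, because those words were only counted in the PAN training text.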
Testing - The classifier is evaluated using precision and recall.
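
For reference, precision is the share of documents predicted as a category that really belong to it, and recall is the share of documents in a category that the classifier found. A small sketch of computing both per category follows; the function name `precision_recall` and the example labels are made up for illustration and are not results from this project.

```python
def precision_recall(y_true, y_pred, positive):
    """Compute precision and recall for one category, treated as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if p == positive and t == positive)
    fp = sum(1 for t, p in pairs if p == positive and t != positive)
    fn = sum(1 for t, p in pairs if p != positive and t == positive)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Made-up labels, for illustration only.
y_true = ["aadhar", "aadhar", "pan", "pan", "marksheet"]
y_pred = ["aadhar", "pan", "pan", "pan", "marksheet"]
pan_precision, pan_recall = precision_recall(y_true, y_pred, "pan")
```

Here one Aadhar document is mislabelled as PAN, so the `pan` category has two true positives and one false positive, while the `aadhar` category loses recall.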

Made with ♥
