This project combines several processes and components: a web scraper that extracts a language corpus, text processing, subject extraction, and generative modeling for novel text creation. In sequential order:
Part 1: Scrape the Stanford Encyclopedia of Philosophy and process individual text paragraphs into 'bag of words' representations via word tokenization.
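A minimal sketch of the Part 1 crawl-and-tokenize step, assuming a Scrapy spider plus NLTK word tokenization; the spider name, start URL, and CSS selectors are illustrative guesses at the SEP page structure, not the project's actual spider:

```python
import scrapy
from nltk.tokenize import word_tokenize  # requires nltk.download("punkt")


class SEPSpider(scrapy.Spider):
    name = "sep_entries"  # hypothetical spider name
    start_urls = ["https://plato.stanford.edu/contents.html"]

    def parse(self, response):
        # Follow links from the table of contents to individual entries
        # (the "entries/" prefix is an assumption about SEP's URL layout).
        for href in response.css("a::attr(href)").getall():
            if href.startswith("entries/"):
                yield response.follow(href, callback=self.parse_entry)

    def parse_entry(self, response):
        # Tokenize each paragraph into a bag of words.
        for paragraph in response.css("div#main-text p::text").getall():
            yield {"url": response.url,
                   "bag_of_words": word_tokenize(paragraph.lower())}
```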
Part 2: Quantify individual word importance and relevance using TF-IDF scoring, and create a targeted search function based on cosine similarity of word embeddings. A built-in command-line option queries the corpus and returns relevant subject texts.
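A sketch of the Part 2 scoring-and-search idea using scikit-learn; the two-document corpus and the example query are placeholders, and the real project searches the paragraphs scraped in Part 1:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

corpus = ["Kant's categorical imperative ...",   # placeholder documents
          "Hume's account of causation ..."]

vectorizer = TfidfVectorizer(stop_words="english")
tfidf_matrix = vectorizer.fit_transform(corpus)  # documents x terms

def search(query, top_k=3):
    # Project the query into the same TF-IDF space, then rank documents
    # by cosine similarity against every corpus vector.
    query_vec = vectorizer.transform([query])
    scores = cosine_similarity(query_vec, tfidf_matrix).ravel()
    return sorted(zip(scores, corpus), reverse=True)[:top_k]

print(search("causation and necessity"))
```

The command-line querying option mentioned above could wrap a function like this with argparse.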
Part 3: Unsupervised learning to identify latent document categories and hierarchical structure, comparing Linear Discriminant Analysis (LDA), Latent Dirichlet Allocation (LDiA), K-means, and hierarchical clustering.
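As a sketch of the Part 3 comparison, the snippet below fits LDiA on term counts and then applies K-means and hierarchical (agglomerative) clustering to the resulting document-topic vectors; the placeholder corpus, topic count, and cluster counts are illustrative assumptions:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.cluster import KMeans, AgglomerativeClustering

corpus = ["epistemology of perception", "virtue ethics and character",
          "modal logic and possible worlds", "philosophy of mind"]  # placeholders

counts = CountVectorizer(stop_words="english").fit_transform(corpus)

# LDiA: each document becomes a distribution over latent topics.
ldia = LatentDirichletAllocation(n_components=3, random_state=0)
doc_topics = ldia.fit_transform(counts)

# Hard clusterings of the same topic vectors, for comparison.
kmeans_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(doc_topics)
hier_labels = AgglomerativeClustering(n_clusters=2).fit_predict(doc_topics)
```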
Part 4: Generate novel text using an artificial recurrent neural network (RNN) architecture: Long Short-Term Memory (LSTM), based on the LDiA classifications.
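A minimal character-level Keras LSTM sketch of the Part 4 generation step; the placeholder text, sequence length, and layer sizes are illustrative assumptions, and per the description above a model like this would be trained on the text of each LDiA category:

```python
import numpy as np
from tensorflow.keras import Input
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense

text = "a placeholder slice of text drawn from one LDiA topic cluster"
chars = sorted(set(text))
char_to_idx = {c: i for i, c in enumerate(chars)}

# One-hot encode sliding windows: each sample is seq_len characters,
# the target is the character that follows the window.
seq_len = 20
X = np.zeros((len(text) - seq_len, seq_len, len(chars)), dtype=np.float32)
y = np.zeros((len(text) - seq_len, len(chars)), dtype=np.float32)
for i in range(len(text) - seq_len):
    for t, c in enumerate(text[i:i + seq_len]):
        X[i, t, char_to_idx[c]] = 1.0
    y[i, char_to_idx[text[i + seq_len]]] = 1.0

model = Sequential([
    Input(shape=(seq_len, len(chars))),
    LSTM(128),
    Dense(len(chars), activation="softmax"),  # next-character distribution
])
model.compile(loss="categorical_crossentropy", optimizer="adam")
model.fit(X, y, epochs=1)  # train for many more epochs in practice

# To generate, feed a seed window and repeatedly sample the next character.
probs = model.predict(X[:1])[0]
print(chars[int(np.argmax(probs))])
```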
Languages & Packages: Python, Unix shell, Scrapy, BeautifulSoup, scikit-learn, TensorFlow, Keras, NLTK
Statistical, Machine Learning, and Deep Learning Methods: Term Frequency-Inverse Document Frequency (TF-IDF), Cosine Similarity, Linear Discriminant Analysis (LDA), Latent Dirichlet Allocation (LDiA), K-means, hierarchical clustering, Doc2Vec, Recurrent Neural Networks (RNN), Long Short-Term Memory (LSTM)
The Scrapy framework is required for text extraction. Web-scrape source: https://plato.stanford.edu/