
Natural Language Processing (NLP) neural network pipeline with individual components for web scraping (corpus creation), latent subject extraction, and generative text creation


davelobue/NLP


NLP

This project combines several components into a single pipeline: a web scraper that builds a language corpus, text processing, subject extraction, and generative modeling for creating novel text. The parts run in sequential order:

Part 1: Scrape the Stanford Encyclopedia of Philosophy and process individual text paragraphs into a 'bag of words' representation using word tokenization.
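
The tokenization step in Part 1 can be illustrated with a minimal sketch. It assumes a simple regular-expression tokenizer as a stand-in for whatever tokenizer the project actually uses (NLTK is listed among its packages); the function names and sample paragraphs are hypothetical.

```python
import re
from collections import Counter


def tokenize(paragraph):
    """Lowercase a paragraph and split it into word tokens (regex stand-in)."""
    return re.findall(r"[a-z]+(?:'[a-z]+)?", paragraph.lower())


def bag_of_words(paragraph):
    """Count token occurrences, discarding word order."""
    return Counter(tokenize(paragraph))


# Hypothetical paragraphs standing in for scraped encyclopedia text.
paragraphs = [
    "Knowledge is justified true belief.",
    "Belief without justification is not knowledge.",
]
bags = [bag_of_words(p) for p in paragraphs]
```

Each entry in `bags` maps a word to its count within one paragraph, which is the form that a frequency-based scoring step can consume.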

Part 2: Quantify individual word importance and relevance using TF-IDF scoring, and create a targeted search function based on the cosine similarity of word embeddings. A built-in command line option queries the corpus and returns relevant subject texts.
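
The scoring and search idea in Part 2 can be sketched in plain Python. This is an illustration under stated assumptions, not the project's implementation: it uses a whitespace tokenizer, a smoothed IDF of `log(N / df) + 1`, and sparse dictionaries as vectors. The function names and the three sample documents are hypothetical.

```python
import math
from collections import Counter


def tokenize(text):
    return text.lower().split()


def build_tfidf(docs):
    """Return one sparse TF-IDF vector (dict) per document, plus the IDF table."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents containing each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    idf = {term: math.log(n_docs / count) + 1.0 for term, count in df.items()}
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append({t: (c / len(tokens)) * idf[t] for t, c in counts.items()})
    return vectors, idf


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def search(query, docs, vectors, idf, top_k=1):
    """Rank documents by cosine similarity to the query's TF-IDF vector."""
    counts = Counter(t for t in tokenize(query) if t in idf)
    total = sum(counts.values())
    if total == 0:
        return []  # no query term appears in the corpus
    q_vec = {t: (c / total) * idf[t] for t, c in counts.items()}
    ranked = sorted(range(len(docs)),
                    key=lambda i: cosine(q_vec, vectors[i]), reverse=True)
    return [docs[i] for i in ranked[:top_k]]


# Hypothetical mini-corpus.
docs = [
    "free will and moral responsibility",
    "modal logic and possible worlds",
    "the ethics of moral duty",
]
vectors, idf = build_tfidf(docs)
best = search("possible worlds", docs, vectors, idf)
```

Terms that occur in fewer documents receive a higher IDF weight, so a query such as "possible worlds" ranks the document that shares those rarer terms first.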

Part 3: Use unsupervised learning to identify latent document categories and hierarchical structure, comparing Linear Discriminant Analysis (LDA), Latent Dirichlet Allocation (LDiA), K-means, and hierarchical clustering.
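
Of the methods compared in Part 3, K-means is the simplest to show in a few lines. The sketch below is a plain-Python version of Lloyd's algorithm, assuming small 2-D points as stand-ins for document vectors and a naive initialization from the first `k` points. The function name, parameters, and data are hypothetical; a library implementation such as scikit-learn's would normally be used instead.

```python
import math


def kmeans(points, k, n_iter=20):
    """Cluster points with Lloyd's algorithm; returns (labels, centroids)."""
    # Naive deterministic initialization: the first k points (an assumption).
    centroids = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(n_iter):
        # Assignment step: attach each point to its nearest centroid.
        labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids


# Hypothetical 2-D "document vectors" forming two loose groups.
points = [(0.9, 0.1), (0.1, 0.9), (0.8, 0.2), (0.2, 0.8)]
labels, centroids = kmeans(points, k=2)
```

Documents that share a label fall into the same candidate category, and those categories can then be compared with the groupings produced by the other methods.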

Part 4: Generate novel text using a recurrent neural network (RNN) architecture, specifically a long short-term memory (LSTM) network, based on the LDiA classifications.
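
A sequence model like the one in Part 4 is typically trained on fixed-length windows of tokens paired with the token that follows each window. The sketch below covers only that data-preparation step, not the LSTM itself, and assumes word-level tokens with a window length chosen for illustration. The function name and sample sentence are hypothetical.

```python
def make_training_pairs(tokens, seq_len):
    """Build (input window, next token) pairs as integer ids."""
    vocab = sorted(set(tokens))
    index = {word: i for i, word in enumerate(vocab)}
    ids = [index[word] for word in tokens]
    n_pairs = len(ids) - seq_len
    inputs = [ids[i:i + seq_len] for i in range(n_pairs)]
    targets = [ids[i + seq_len] for i in range(n_pairs)]
    return inputs, targets, vocab


# Hypothetical sentence standing in for text from one LDiA topic.
tokens = "the mind is what the brain does".split()
inputs, targets, vocab = make_training_pairs(tokens, seq_len=3)
```

Each row of `inputs` is a window of word ids and the matching entry of `targets` is the id of the word to predict. A network would usually receive these as arrays, with the targets one-hot encoded or used with a sparse categorical loss.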

Languages & Packages: Python, Unix, Scrapy, BeautifulSoup, scikit-learn, TensorFlow, Keras, NLTK

Statistical, Machine Learning, Deep Learning Methods: Term Frequency-Inverse Document Frequency (TF-IDF), Cosine Similarity, Linear Discriminant Analysis (LDA), Latent Dirichlet Allocation (LDiA), K-means, hierarchical clustering, Doc2Vec, Recurrent Neural Networks (RNN), Long Short-Term Memory (LSTM)

The Scrapy framework is required for text extraction. Text web scrape source: https://plato.stanford.edu/
