Natural Language Processing Notes
Part - 1
Popular Packages: NLTK, spaCy, TextBlob, tweepy, PyPDF2, python-docx
Data Sources
Twitter data can be extracted using the tweepy package.
PDF file data can be processed with the PyPDF2 library.
Word files can be read with the python-docx package (imported as docx); a minimal reading sketch follows.
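A minimal sketch of pulling raw text out of a PDF and a Word document. The file names are placeholders, and the PdfReader class assumes PyPDF2 3.x (older releases expose PdfFileReader instead).

# Minimal sketch: extracting raw text from a PDF and a Word document.
# File names are placeholders; PdfReader assumes PyPDF2 >= 3.0
# (older releases expose PdfFileReader instead).
from PyPDF2 import PdfReader   # pip install PyPDF2
from docx import Document      # pip install python-docx

# PDF: concatenate the extracted text of every page
reader = PdfReader("notes.pdf")
pdf_text = "\n".join(page.extract_text() or "" for page in reader.pages)

# Word: join the text of every paragraph
doc = Document("notes.docx")
docx_text = "\n".join(paragraph.text for paragraph in doc.paragraphs)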
Tokenizing Words with regular expressions.
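A small example of word tokenization with regular expressions, shown with the standard re module and with NLTK's RegexpTokenizer; the sample sentence is made up.

import re
from nltk.tokenize import RegexpTokenizer

text = "NLP isn't hard: tokenize, then analyze!"

# Plain re: every run of word characters becomes a token
tokens = re.findall(r"\w+", text)
# -> ['NLP', 'isn', 't', 'hard', 'tokenize', 'then', 'analyze']

# NLTK's RegexpTokenizer wraps the same pattern in a tokenizer object
tokenizer = RegexpTokenizer(r"\w+")
tokens_nltk = tokenizer.tokenize(text)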
Part - 2
Exploring and processing text data
Steps:
lowercasing
punctuation removal
stop words removal
text standardization
spelling correction
tokenization
stemming
lemmatization
exploratory data analysis (EDA)
end-to-end pipeline processing
Converting text data to lowercase
using Python's built-in lower() string method.
Removing Punctuation
punctuation can be removed either with a regex or with the string replace() method; the characters to strip are listed in the string module's punctuation constant. A combined lowercasing and punctuation-removal sketch follows.
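A small sketch combining the two steps on a made-up sentence: lowercasing with lower(), then stripping the characters in string.punctuation with a regex.

import re
import string

text = "Hello, World!! NLP is FUN... isn't it?"

# Step 1: lowercase everything
text = text.lower()

# Step 2: strip punctuation; string.punctuation lists the characters,
# and re.escape turns them into a safe character class for the regex
text = re.sub(f"[{re.escape(string.punctuation)}]", "", text)

print(text)   # hello world nlp is fun isnt it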
Removing Stop Words
The simplest way to do this is with NLTK's built-in list of English stop words.
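A short sketch of stop word removal with NLTK; the example sentence is made up, and the download calls are one-time setup.

import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

nltk.download("stopwords")   # one-time corpus download
nltk.download("punkt")       # tokenizer models used by word_tokenize

text = "this is a small example showing off stop word filtering"
stop_words = set(stopwords.words("english"))

tokens = word_tokenize(text)
filtered = [word for word in tokens if word not in stop_words]
# -> ['small', 'example', 'showing', 'stop', 'word', 'filtering']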
Standardizing Text
converting short forms, slang, and abbreviations into full, syntactically correct words and sentences.
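There is no single library call for this; a common approach is a lookup table of known short forms. The dictionary below is a hypothetical, illustrative example.

# Hypothetical lookup table: short forms -> standard words.
# A real project would use a much larger, domain-specific map.
lookup = {
    "nlp": "natural language processing",
    "u": "you",
    "b4": "before",
    "plz": "please",
}

def standardize(text):
    # replace each word found in the lookup table, keep the rest unchanged
    return " ".join(lookup.get(word, word) for word in text.split())

print(standardize("u should learn nlp b4 the interview"))
# -> you should learn natural language processing before the interview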
Correcting Spelling
using TextBlob's correct() method.
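A minimal TextBlob example; the misspelled sentence is made up, and since the correction is statistical it may not fix every word.

from textblob import TextBlob   # pip install textblob

text = "Natureal languagge procesing is awsome"
corrected = TextBlob(text).correct()
print(corrected)   # statistical correction, so results can vary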
Tokenizing Text
can be performed easily with libraries such as NLTK, spaCy, and TextBlob.
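A short sketch of word and sentence tokenization with NLTK and TextBlob; spaCy works similarly after loading a language model. The sample text is made up.

from nltk.tokenize import word_tokenize, sent_tokenize
from textblob import TextBlob

text = "Tokenization splits text into units. Words and sentences are the usual ones."

words = word_tokenize(text)       # word-level tokens
sentences = sent_tokenize(text)   # sentence-level tokens

# TextBlob exposes the same idea through properties
blob = TextBlob(text)
blob_words = blob.words
blob_sentences = blob.sentences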
Stemming
The process of reducing words to their root form by stripping suffixes; the result need not be a valid dictionary word.
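A small example with NLTK's PorterStemmer; the word list is illustrative and the outputs in the comment are typical Porter results.

from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
for word in ["fishing", "fished", "fisher", "studies"]:
    print(word, "->", stemmer.stem(word))
# typical output: fishing -> fish, fished -> fish, fisher -> fisher, studies -> studi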
Lemmatizing
Reducing a word to its dictionary base form (the lemma), using vocabulary and morphology rather than simple suffix stripping.
available libraries: NLTK (WordNetLemmatizer), spaCy, TextBlob.
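A minimal NLTK WordNetLemmatizer example; the download call is one-time setup, and the pos argument tells the lemmatizer which part of speech to assume.

import nltk
from nltk.stem import WordNetLemmatizer

nltk.download("wordnet")   # one-time download of the WordNet corpus

lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize("mice"))             # mouse (noun is the default)
print(lemmatizer.lemmatize("running", pos="v")) # run
print(lemmatizer.lemmatize("better", pos="a"))  # good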