{"nbformat":4,"nbformat_minor":0,"metadata":{"colab":{"name":"Feature Extraction I - Text.ipynb","provenance":[]},"kernelspec":{"name":"python3","display_name":"Python 3"}},"cells":[{"cell_type":"code","metadata":{"id":"52OWABfCpOJ2"},"source":["!pip install --upgrade scikit-learn"],"execution_count":null,"outputs":[]},{"cell_type":"code","metadata":{"id":"c7stLQzPpMF_"},"source":["%matplotlib inline\n","import matplotlib.pyplot as plt\n","import numpy as np\n","\n","# Some nice default configuration for plots\n","plt.rcParams['figure.figsize'] = 10, 7.5\n","plt.rcParams['axes.grid'] = True"],"execution_count":null,"outputs":[]},{"cell_type":"markdown","metadata":{"id":"ako48TBRog-b"},"source":["# Extracting Features from Text\n","\n","## The 20 Newsgroups Dataset\n","We download and extract the [20 Newsgroups Dataset](http://qwone.com/~jason/20Newsgroups/)"]},{"cell_type":"markdown","metadata":{"id":"RerG84i4xbzC"},"source":[""]},{"cell_type":"code","metadata":{"id":"CGvL709soBea"},"source":["!wget http://qwone.com/~jason/20Newsgroups/20news-bydate.tar.gz\n","!tar xzvf 20news-bydate.tar.gz"],"execution_count":null,"outputs":[]},{"cell_type":"code","metadata":{"id":"BA6Kn36dK_Wb"},"source":["!cat 20news-bydate-train/talk.religion.misc/84202"],"execution_count":null,"outputs":[]},{"cell_type":"markdown","metadata":{"id":"B0LVsN2-psJ3"},"source":["Let's start by implementing a canonical text classification example:\n","\n","* The 20 newsgroups dataset: around 18000 text posts from 20 newsgroups forums\n","* Bag of Words features extraction with TF-IDF weighting\n","* Naive Bayes classifier or Linear Support Vector Machine for the classifier itself\n","\n","\n"]},{"cell_type":"code","metadata":{"id":"Z71uan7YnWvS"},"source":["from sklearn.datasets import load_files\n","from sklearn.feature_extraction.text import TfidfVectorizer\n","from sklearn.naive_bayes import MultinomialNB\n","\n","# Load the text data\n","categories = [\n"," 'alt.atheism',\n"," 'talk.religion.misc',\n"," 'comp.graphics',\n"," 'sci.space',\n","]\n","twenty_train_small = load_files('20news-bydate-train/',\n"," categories=categories, encoding='latin-1')\n","twenty_test_small = load_files('20news-bydate-test/',\n"," categories=categories, encoding='latin-1')\n","\n","# Turn the text documents into vectors of word frequencies\n","vectorizer = TfidfVectorizer(min_df=2)\n","X_train = vectorizer.fit_transform(twenty_train_small.data)\n","y_train = twenty_train_small.target\n","\n","# Fit a classifier on the training set\n","classifier = MultinomialNB().fit(X_train, y_train)\n","print(\"Training score: {0:.1f}%\".format(\n"," classifier.score(X_train, y_train) * 100))\n","\n","# Evaluate the classifier on the testing set\n","X_test = vectorizer.transform(twenty_test_small.data)\n","y_test = twenty_test_small.target\n","print(\"Testing score: {0:.1f}%\".format(\n"," classifier.score(X_test, y_test) * 100))"],"execution_count":null,"outputs":[]},{"cell_type":"code","metadata":{"id":"s_1hWM11r5SD"},"source":["!ls -l \n","!ls -lh 20news-bydate-train\n","!ls -lh 20news-bydate-train/alt.atheism/"],"execution_count":null,"outputs":[]},{"cell_type":"code","metadata":{"id":"p1zWT5FAqQ4j"},"source":["!cat 20news-bydate-train/alt.atheism/54200"],"execution_count":null,"outputs":[]},{"cell_type":"markdown","metadata":{"id":"ZeMBacUfsTte"},"source":["The `load_files` function can load text files from a 2 levels folder structure assuming folder names represent 
categories:"]},{"cell_type":"code","metadata":{"id":"tQTQgn_fso3V"},"source":["all_twenty_train = load_files('20news-bydate-train/',\n"," encoding='latin-1', random_state=42)\n","all_twenty_test = load_files('20news-bydate-test/',\n"," encoding='latin-1', random_state=42)\n","\n","all_target_names = all_twenty_train.target_names\n","\n","print(all_target_names)\n","print(all_twenty_train.target)\n","print(all_twenty_train.target.shape)\n","\n","print(all_twenty_test.target.shape)\n","print(len(all_twenty_train.data))\n","\n","print(type(all_twenty_train.data[0]))"],"execution_count":null,"outputs":[]},{"cell_type":"code","metadata":{"id":"7NQd9vfutZvc"},"source":["def display_sample(i, dataset):\n"," print(\"Class name: \" + dataset.target_names[dataset.target[i]])\n"," print(\"Text content:\\n\")\n"," print(dataset.data[i])\n","\n","display_sample(0, all_twenty_train)\n","display_sample(1, all_twenty_train)"],"execution_count":null,"outputs":[]},{"cell_type":"markdown","metadata":{"id":"Z75YObw5s0su"},"source":["Let's compute the (uncompressed, in-memory) size of the training and test sets in MB assuming an 8-bit encoding (in this case, all chars can be encoded using the latin-1 charset)."]},{"cell_type":"code","metadata":{"id":"bS-78k9Etpk7"},"source":["def text_size(text, charset='iso-8859-1'):\n"," return len(text.encode(charset)) * 8 * 1e-6\n","\n","train_size_mb = sum(text_size(text) for text in all_twenty_train.data) \n","test_size_mb = sum(text_size(text) for text in all_twenty_test.data)\n","\n","print(\"Training set size: {0} MB\".format(int(train_size_mb)))\n","print(\"Testing set size: {0} MB\".format(int(test_size_mb)))"],"execution_count":null,"outputs":[]},{"cell_type":"markdown","metadata":{"id":"velJfdjRt3Qd"},"source":["If we only consider a small subset of the 4 categories selected from the initial example:"]},{"cell_type":"code","metadata":{"id":"TSdec71ut15n"},"source":["train_small_size_mb = sum(text_size(text) for text in twenty_train_small.data) \n","test_small_size_mb = sum(text_size(text) for text in twenty_test_small.data)\n","\n","print(\"Training set size: {0} MB\".format(int(train_small_size_mb)))\n","print(\"Testing set size: {0} MB\".format(int(test_small_size_mb)))"],"execution_count":null,"outputs":[]},{"cell_type":"markdown","metadata":{"id":"nOpg_Rz_uMyY"},"source":["## Extracting Text Features\n","\n","θα μετατρέψουμε τα κείμενα σε διανύσματα με την μέθοδο διανυσματικής αναπαράστασης tf-idf ([term frequency - inverse document frequency](https://nlp.stanford.edu/IR-book/html/htmledition/term-frequency-and-weighting-1.html)). Το κάθε κείμενο θα γίνει ένα σημείο σε ένα χώρο με διαστάσεις ίση με το μέγεθος του λεξιλόγιου μας (vocabulary). Χρησιμοποιούμε λοιπόν ένα μοντέλο διανυσματικής αναπαράστασης (Vector Space Model -VSM)."]},{"cell_type":"code","metadata":{"id":"J5AEI3lWt7tO"},"source":["from sklearn.feature_extraction.text import TfidfVectorizer\n","\n","TfidfVectorizer()\n","\n","vectorizer = TfidfVectorizer(min_df=1)\n","\n","%time X_train_small = vectorizer.fit_transform(twenty_train_small.data)\n"],"execution_count":null,"outputs":[]},{"cell_type":"markdown","metadata":{"id":"one5nEXsuVAI"},"source":["The results is not a numpy.array but instead a scipy.sparse matrix. 
{"cell_type":"markdown","metadata":{"id":"one5nEXsuVAI"},"source":["The result is not a numpy array but a scipy.sparse matrix. This data structure is quite similar to a 2D numpy array, but it does not store the zero entries."]},
{"cell_type":"code","metadata":{"id":"4RKgGYt1uV48"},"source":["# print(X_train_small)\n","\n","# scipy.sparse matrices also have a shape attribute to access the dimensions:\n","n_samples, n_features = X_train_small.shape\n","\n","# This dataset has around 2000 samples (the rows of the data matrix):\n","print(n_samples)\n","\n","# This is the same value as the number of strings in the original list of text documents:\n","print(len(twenty_train_small.data))\n","\n","# The columns represent the individual token occurrences:\n","print(n_features)\n","\n","# This number is the size of the vocabulary extracted during fit, stored as a Python dictionary:\n","print(type(vectorizer.vocabulary_))\n","print(len(vectorizer.vocabulary_))\n","\n","# The keys of the vocabulary_ attribute are also called feature names and can be accessed as an array of strings:\n","print(len(vectorizer.get_feature_names_out()))\n","\n","# Here are the first 10 elements (sorted in lexicographical order):\n","print(vectorizer.get_feature_names_out()[:10])\n","\n","# Let's have a look at the features from the middle:\n","vectorizer.get_feature_names_out()[int(n_features / 2):int(n_features / 2 + 100)]"],"execution_count":null,"outputs":[]},
{"cell_type":"markdown","metadata":{"id":"bjDhzTKcvdWP"},"source":["## Visualizing the Data with Principal Component Analysis (PCA)\n","\n","Now that we have extracted a vector representation of the data, it's a good idea to project the data onto the first two dimensions of a Principal Component Analysis to get a feel for the data. Note that the `TruncatedSVD` class can accept scipy.sparse matrices as input (as an alternative to dense numpy arrays):"]},
{"cell_type":"code","metadata":{"id":"TkImeCyww7bk"},"source":["from sklearn.decomposition import TruncatedSVD\n","\n","%time X_train_small_pca = TruncatedSVD(n_components=2).fit_transform(X_train_small)"],"execution_count":null,"outputs":[]},
{"cell_type":"code","metadata":{"id":"tJY5VTN4xA_f"},"source":["from itertools import cycle\n","\n","colors = ['b', 'g', 'r', 'c', 'm', 'y', 'k']\n","for i, c in zip(np.unique(y_train), cycle(colors)):\n","    plt.scatter(X_train_small_pca[y_train == i, 0],\n","                X_train_small_pca[y_train == i, 1],\n","                c=c, label=twenty_train_small.target_names[i], alpha=0.5)\n","\n","_ = plt.legend(loc='best')"],"execution_count":null,"outputs":[]},
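{"cell_type":"markdown","metadata":{},"source":["A 2D projection of such a high-dimensional space necessarily discards most of the information. As a quick sanity check (a minimal sketch that refits the same `TruncatedSVD` so we can keep the fitted estimator around), we can inspect how much variance the two plotted components actually capture:"]},
{"cell_type":"code","metadata":{},"source":["# Sketch: refit TruncatedSVD and keep the fitted estimator so we can inspect\n","# how much of the variance the 2 plotted components actually retain.\n","from sklearn.decomposition import TruncatedSVD\n","\n","svd = TruncatedSVD(n_components=2).fit(X_train_small)\n","print(svd.explained_variance_ratio_)        # variance ratio per component\n","print(svd.explained_variance_ratio_.sum())  # total variance captured in 2D"],"execution_count":null,"outputs":[]},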
{"cell_type":"markdown","metadata":{"id":"dIaZVhVmykQQ"},"source":["## Introspecting the Behavior of the Text Vectorizer\n","The text vectorizer has many parameters to customize its behavior, in particular how it extracts tokens:"]},
{"cell_type":"code","metadata":{"id":"F1jjzAeWznEJ"},"source":["print(TfidfVectorizer.__doc__)"],"execution_count":null,"outputs":[]},
{"cell_type":"markdown","metadata":{"id":"I1k9N-VZzplI"},"source":["The easiest way to introspect what the vectorizer is actually doing for a given set of parameters is to call `vectorizer.build_analyzer()` to get an instance of the text analyzer it uses to process the text:"]},
{"cell_type":"code","metadata":{"id":"0VVX7hfezuHg"},"source":["analyzer = TfidfVectorizer().build_analyzer()\n","analyzer(\"I love scikit-learn: this is a cool Python lib!\")"],"execution_count":null,"outputs":[]},
{"cell_type":"markdown","metadata":{"id":"o8iuhRapzzzE"},"source":["You can notice that all the tokens are lowercased, that the single-letter word \"I\" was dropped, and that the hyphen in \"scikit-learn\" acts as a token separator. Let's change some of that default behavior:"]},
{"cell_type":"code","metadata":{"id":"aWoqd1iaz3Az"},"source":["analyzer = TfidfVectorizer(\n","    preprocessor=lambda text: text,    # identity preprocessor: disable lowercasing\n","    token_pattern=r'(?u)\\b[\\w-]+\\b',   # keep hyphens inside tokens and keep single-letter tokens\n",").build_analyzer()\n","\n","analyzer(\"I love scikit-learn: this is a cool Python lib!\")"],"execution_count":null,"outputs":[]},
{"cell_type":"markdown","metadata":{"id":"NHB658S9z5o3"},"source":["The analyzer name comes from Lucene parlance: it wraps the sequential application of:\n","\n","* text preprocessing (processing the text documents as a whole, e.g. lowercasing)\n","* text tokenization (splitting the document into a sequence of tokens)\n","* token filtering and recombination (e.g. n-gram extraction, see the sketch below)\n","\n","The analyzer system of scikit-learn is much more basic than Lucene's though."]},
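{"cell_type":"markdown","metadata":{},"source":["As a small illustration of the token recombination step (a minimal sketch, not part of the original lab), setting `ngram_range=(1, 2)` makes the analyzer emit word bigrams in addition to the single tokens:"]},
{"cell_type":"code","metadata":{},"source":["# Sketch: the default analyzer, now also emitting word bigrams via ngram_range=(1, 2).\n","from sklearn.feature_extraction.text import TfidfVectorizer\n","\n","bigram_analyzer = TfidfVectorizer(ngram_range=(1, 2)).build_analyzer()\n","bigram_analyzer(\"I love scikit-learn: this is a cool Python lib!\")"],"execution_count":null,"outputs":[]}]}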