In [None]:
!pip install --upgrade scikit-learn

In [None]:
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np

# Some nice default configuration for plots
plt.rcParams['figure.figsize'] = 10, 7.5
plt.rcParams['axes.grid'] = True

# Extracting Features from Text

## The 20 Newsgroups Dataset
We download and extract the [20 Newsgroups Dataset](http://qwone.com/~jason/20Newsgroups/).

In [None]:
!wget http://qwone.com/~jason/20Newsgroups/20news-bydate.tar.gz
!tar xzvf 20news-bydate.tar.gz

In [None]:
!cat 20news-bydate-train/talk.religion.misc/84202

Let's start by implementing a canonical text classification example:

* The 20 newsgroups dataset: around 18000 text posts from 20 newsgroups
* Bag-of-words feature extraction with TF-IDF weighting
* A Naive Bayes or linear Support Vector Machine classifier




In [None]:
from sklearn.datasets import load_files
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

# Load the text data
categories = [
    'alt.atheism',
    'talk.religion.misc',
    'comp.graphics',
    'sci.space',
]
twenty_train_small = load_files('20news-bydate-train/',
    categories=categories, encoding='latin-1')
twenty_test_small = load_files('20news-bydate-test/',
    categories=categories, encoding='latin-1')

# Turn the text documents into vectors of word frequencies
vectorizer = TfidfVectorizer(min_df=2)
X_train = vectorizer.fit_transform(twenty_train_small.data)
y_train = twenty_train_small.target

# Fit a classifier on the training set
classifier = MultinomialNB().fit(X_train, y_train)
print("Training score: {0:.1f}%".format(
    classifier.score(X_train, y_train) * 100))

# Evaluate the classifier on the testing set
X_test = vectorizer.transform(twenty_test_small.data)
y_test = twenty_test_small.target
print("Testing score: {0:.1f}%".format(
    classifier.score(X_test, y_test) * 100))
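
The introduction mentions a linear Support Vector Machine as an alternative to Naive Bayes. As a minimal sketch of that variant, the vectorizer and the classifier can be chained with `make_pipeline`. The four-document corpus and its labels are made up for illustration and stand in for the newsgroup posts; no score is claimed for the real dataset.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Hypothetical toy corpus standing in for the newsgroup posts
toy_docs = [
    "the rocket reached orbit around the moon",
    "the shuttle launch was delayed by weather",
    "rendering polygons with a new shader",
    "the image format stores pixels and colors",
]
toy_labels = [0, 0, 1, 1]  # 0 = space, 1 = graphics (made-up labels)

# Chain TF-IDF extraction and a linear SVM into a single estimator
svm_pipeline = make_pipeline(TfidfVectorizer(), LinearSVC())
svm_pipeline.fit(toy_docs, toy_labels)

# The pipeline vectorizes raw strings before classifying them
toy_predictions = svm_pipeline.predict(["a new rocket launch"])
print(toy_predictions)
```

Wrapping both steps in one estimator means raw strings go in at `fit` and `predict` time, so the test documents are always transformed with the vocabulary learned on the training set.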

In [None]:
!ls -l 
!ls -lh 20news-bydate-train
!ls -lh 20news-bydate-train/alt.atheism/

In [None]:
!cat 20news-bydate-train/alt.atheism/54200

The `load_files` function loads text files from a two-level folder structure, assuming that the folder names represent the categories:

In [None]:
all_twenty_train = load_files('20news-bydate-train/',
    encoding='latin-1', random_state=42)
all_twenty_test = load_files('20news-bydate-test/',
    encoding='latin-1', random_state=42)

all_target_names = all_twenty_train.target_names

print(all_target_names)
print(all_twenty_train.target)
print(all_twenty_train.target.shape)

print(all_twenty_test.target.shape)
print(len(all_twenty_train.data))

print(type(all_twenty_train.data[0]))
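
To make the folder convention concrete, here is a small sketch that builds a throwaway directory and loads it with `load_files`. The category names, file names and contents are invented for this illustration.

```python
import os
import tempfile

from sklearn.datasets import load_files

# Build a throwaway two-level layout: <root>/<category>/<document>
# (category and file names are invented for this illustration)
toy_root = tempfile.mkdtemp()
toy_layout = {
    "cats": {"a.txt": "cats purr", "b.txt": "cats nap"},
    "dogs": {"c.txt": "dogs bark"},
}
for category, files in toy_layout.items():
    os.makedirs(os.path.join(toy_root, category))
    for name, content in files.items():
        path = os.path.join(toy_root, category, name)
        with open(path, "w", encoding="utf-8") as f:
            f.write(content)

# shuffle=False keeps the documents in sorted folder order
toy_bunch = load_files(toy_root, encoding="utf-8", shuffle=False)

print(toy_bunch.target_names)  # category names taken from the folder names
print(toy_bunch.target)        # one integer label per document
print(toy_bunch.data)          # decoded file contents
```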

In [None]:
def display_sample(i, dataset):
    print("Class name: " + dataset.target_names[dataset.target[i]])
    print("Text content:\n")
    print(dataset.data[i])

display_sample(0, all_twenty_train)
display_sample(1, all_twenty_train)

Let's compute the (uncompressed, in-memory) size of the training and test sets in MB assuming an 8-bit encoding (in this case, all chars can be encoded using the latin-1 charset).

In [None]:
def text_size(text, charset='iso-8859-1'):
    # Size in megabytes: one byte per encoded character, 1e6 bytes per MB
    return len(text.encode(charset)) * 1e-6

train_size_mb = sum(text_size(text) for text in all_twenty_train.data) 
test_size_mb = sum(text_size(text) for text in all_twenty_test.data)

print("Training set size: {0} MB".format(int(train_size_mb)))
print("Testing set size: {0} MB".format(int(test_size_mb)))
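
As a worked example of the arithmetic: latin-1 stores one byte per character, so the byte count equals the character count, and multiplying by 1e-6 gives megabytes. The sample string below is made up, and the UTF-8 figure is included only for comparison.

```python
sample = "café " * 200_000  # hypothetical text of 1,000,000 characters

latin1_bytes = len(sample.encode("iso-8859-1"))  # 1 byte per character
utf8_bytes = len(sample.encode("utf-8"))         # "é" takes 2 bytes in UTF-8

print(latin1_bytes * 1e-6, "MB as latin-1")
print(utf8_bytes * 1e-6, "MB as UTF-8")
```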

If we only consider the small subset of 4 categories selected in the initial example:

In [None]:
train_small_size_mb = sum(text_size(text) for text in twenty_train_small.data) 
test_small_size_mb = sum(text_size(text) for text in twenty_test_small.data)

print("Training set size: {0} MB".format(int(train_small_size_mb)))
print("Testing set size: {0} MB".format(int(test_small_size_mb)))

## Extracting Text Features

We will convert the texts into vectors using the tf-idf ([term frequency - inverse document frequency](https://nlp.stanford.edu/IR-book/html/htmledition/term-frequency-and-weighting-1.html)) vector representation. Each text becomes a point in a space whose number of dimensions equals the size of our vocabulary. In other words, we use a Vector Space Model (VSM).
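
To illustrate the weighting, here is a minimal pure-Python sketch that computes tf-idf vectors by hand on a made-up corpus. The smoothed idf, ln((1 + n) / (1 + df)) + 1, and the L2 normalization are one common variant, chosen here as an assumption; other weightings exist.

```python
import math
from collections import Counter

# Made-up three-document corpus, already tokenized
docs = [
    ["space", "rocket", "launch"],
    ["space", "image", "pixel"],
    ["image", "pixel", "pixel"],
]
n_docs = len(docs)
vocabulary = sorted({term for doc in docs for term in doc})

# Document frequency: number of documents containing each term
df = Counter(term for doc in docs for term in set(doc))

def tfidf_vector(doc):
    """Return the L2-normalized tf-idf vector of one tokenized document."""
    tf = Counter(doc)
    # Smoothed idf (assumed variant): ln((1 + n) / (1 + df)) + 1
    weights = [tf[t] * (math.log((1 + n_docs) / (1 + df[t])) + 1)
               for t in vocabulary]
    norm = math.sqrt(sum(w * w for w in weights))
    return [w / norm for w in weights]

vectors = [tfidf_vector(doc) for doc in docs]
for term, weight in zip(vocabulary, vectors[0]):
    print(f"{term:>7}: {weight:.3f}")
```

Each vector has one coordinate per vocabulary term. In the first document, "rocket" (present in one document) gets a larger weight than "space" (present in two), and terms absent from the document get a weight of zero.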

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

TfidfVectorizer()

vectorizer = TfidfVectorizer(min_df=1)

%time X_train_small = vectorizer.fit_transform(twenty_train_small.data)


The result is not a numpy.array but a scipy.sparse matrix. This data structure is quite similar to a 2D numpy array, except that it does not store the zeros.
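
As a small illustration of what "not storing the zeros" means, the sketch below converts a made-up, mostly empty array to the CSR format and inspects it. The values are arbitrary.

```python
import numpy as np
from scipy import sparse

# Made-up 3 x 5 matrix that is mostly zeros
dense = np.array([
    [0.0, 0.7, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.5, 0.0],
    [0.2, 0.0, 0.0, 0.0, 0.0],
])
X_toy = sparse.csr_matrix(dense)

print(X_toy.shape)  # same shape as the dense array
print(X_toy.nnz)    # number of explicitly stored values
print(X_toy.nnz / (X_toy.shape[0] * X_toy.shape[1]))  # density

# Convert back to a dense array only when the matrix is small
print(X_toy.toarray())
```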


In [None]:
# print(X_train_small) 

# scipy.sparse matrices also have a shape attribute to access the dimensions:
n_samples, n_features = X_train_small.shape

# This dataset has around 2000 samples (the rows of the data matrix):
print(n_samples)

# This is the same value as the number of strings in the original list of text documents:
print(len(twenty_train_small.data))

# The columns represent the individual token occurrences:
print(n_features)

# This number is the size of the vocabulary of the model extracted during fit in a Python dictionary:
print(type(vectorizer.vocabulary_))
print(len(vectorizer.vocabulary_))

# The keys of the vocabulary_ attribute are also called feature names and can be accessed as an array of strings.
feature_names = vectorizer.get_feature_names_out()
print(len(feature_names))

# Here are the first 10 elements (sorted in lexicographical order):
print(feature_names[:10])

# Let's have a look at the features from the middle:
feature_names[n_features // 2:n_features // 2 + 100]
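
The vocabulary size depends on the vectorizer parameters. As a sketch on a made-up three-sentence corpus, here is how `min_df` (used with two different values in this notebook) changes the extracted vocabulary:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up corpus: "the" and "sat" occur in two documents, other words in one
toy_corpus = ["the cat sat", "the dog sat", "a bird flew"]

vec_all = TfidfVectorizer(min_df=1).fit(toy_corpus)
vec_frequent = TfidfVectorizer(min_df=2).fit(toy_corpus)

# vocabulary_ maps each token to its column index
print(vec_all.vocabulary_)
print(sorted(vec_all.vocabulary_))       # every token of two or more characters
print(sorted(vec_frequent.vocabulary_))  # only tokens seen in at least 2 documents
```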

## Visualizing the Data with Principal Component Analysis (PCA)

Now that we have extracted a vector representation of the data, it's a good idea to project it onto the first two components of a Principal Component Analysis to get a feel for the data. Note that the `TruncatedSVD` class accepts scipy.sparse matrices as input (as an alternative to numpy arrays):

In [None]:
from sklearn.decomposition import TruncatedSVD

%time X_train_small_pca = TruncatedSVD(n_components=2).fit_transform(X_train_small)
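
Here is a minimal sketch of the same projection on a made-up sparse matrix. Note that `TruncatedSVD` does not center the data before the decomposition, which is what lets it work directly on sparse input; the values below are arbitrary.

```python
import numpy as np
from scipy import sparse
from sklearn.decomposition import TruncatedSVD

# Made-up 4 x 6 sparse matrix standing in for a tf-idf matrix
X_sparse = sparse.csr_matrix(np.array([
    [1.0, 0.0, 0.5, 0.0, 0.0, 0.0],
    [0.9, 0.1, 0.4, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.8, 0.6, 0.0],
    [0.0, 0.0, 0.0, 0.7, 0.5, 0.2],
]))

svd = TruncatedSVD(n_components=2, random_state=0)
X_2d = svd.fit_transform(X_sparse)  # works on the sparse input directly

print(X_2d.shape)
print(svd.explained_variance_ratio_)  # share of variance kept by each component
```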

In [None]:
from itertools import cycle

colors = ['b', 'g', 'r', 'c', 'm', 'y', 'k']
for i, c in zip(np.unique(y_train), cycle(colors)):
    plt.scatter(X_train_small_pca[y_train == i, 0],
               X_train_small_pca[y_train == i, 1],
               c=c, label=twenty_train_small.target_names[i], alpha=0.5)
    
_ = plt.legend(loc='best')

## Introspecting the Behavior of the Text Vectorizer
The text vectorizer has many parameters to customize its behavior, in particular how it extracts tokens:

In [None]:

TfidfVectorizer()

print(TfidfVectorizer.__doc__)


The easiest way to introspect what the vectorizer is actually doing for a given set of parameters is to call `vectorizer.build_analyzer()` to get an instance of the text analyzer it uses to process the text:


In [None]:
analyzer = TfidfVectorizer().build_analyzer()
analyzer("I love scikit-learn: this is a cool Python lib!")

Notice that all the tokens are lowercase, that the single-letter words "I" and "a" were dropped, and that the hyphen is treated as a token separator ("scikit-learn" is split into "scikit" and "learn"). Let's change some of that default behavior:


In [None]:
analyzer = TfidfVectorizer(
    preprocessor=lambda text: text,   # identity preprocessor: disables lowercasing
    token_pattern=r'(?u)\b[\w-]+\b',  # treat hyphen as a letter, keep single-letter tokens
).build_analyzer()

analyzer("I love scikit-learn: this is a cool Python lib!")


The analyzer name comes from the Lucene parlance: it wraps the sequential application of:

* text preprocessing (processing the text documents as a whole, e.g. lowercasing)
* text tokenization (splitting the document into a sequence of tokens)
* token filtering and recombination (e.g. n-grams extraction, see later)

The analyzer system of scikit-learn is much more basic than Lucene's though.
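
The three stages can be imitated in a few lines of plain Python. This is only an illustrative sketch, not scikit-learn's implementation: the `toy_analyzer` function and its `ngram_max` parameter are hypothetical names, and the regular expression is assumed to keep tokens of two or more word characters.

```python
import re

def toy_analyzer(text, ngram_max=2):
    """Imitate the three analyzer stages on a single document."""
    # 1. Preprocessing: operate on the document as a whole
    text = text.lower()
    # 2. Tokenization: keep runs of two or more word characters
    tokens = re.findall(r"(?u)\b\w\w+\b", text)
    # 3. Recombination: append n-grams of consecutive tokens
    ngrams = list(tokens)
    for n in range(2, ngram_max + 1):
        ngrams += [" ".join(tokens[i:i + n])
                   for i in range(len(tokens) - n + 1)]
    return ngrams

print(toy_analyzer("I love scikit-learn: this is a cool Python lib!"))
```

Keeping the stages separate makes it easy to swap one of them, for instance replacing the tokenization pattern, without touching the others.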