{"nbformat":4,"nbformat_minor":0,"metadata":{"colab":{"name":"Άσκηση text mining.ipynb","provenance":[]},"kernelspec":{"name":"python3","display_name":"Python 3"},"accelerator":"TPU"},"cells":[{"cell_type":"markdown","metadata":{"id":"CvdzwUp0mebD"},"source":["# Exercise: studying the effect of preprocessing on dimensionality and clustering\n","\n","\n","\n","First, we update and import the libraries we will need."]},{"cell_type":"code","metadata":{"id":"35-RlmEMmebG","colab":{"base_uri":"https://localhost:8080/"},"executionInfo":{"status":"ok","timestamp":1637501964585,"user_tz":-120,"elapsed":14216,"user":{"displayName":"Giorgos Siolas","photoUrl":"https://lh3.googleusercontent.com/a-/AOh14GjxnZOAObbc3X0z9X2rs1N_1geznqhrotkq3KF-p_M=s64","userId":"10127542075805046236"}},"outputId":"ff93079e-c91e-4ed3-a191-f7800df0862c"},"source":["!pip install --upgrade scikit-learn\n","!pip install --upgrade numpy\n","!pip install --upgrade scipy\n","!pip install --upgrade nltk"],"execution_count":1,"outputs":[{"output_type":"stream","name":"stdout","text":["Requirement already satisfied: scikit-learn in /usr/local/lib/python3.7/dist-packages (1.0.1)\n","Requirement already satisfied: scipy>=1.1.0 in /usr/local/lib/python3.7/dist-packages (from scikit-learn) (1.7.2)\n","Requirement already satisfied: joblib>=0.11 in /usr/local/lib/python3.7/dist-packages (from scikit-learn) (1.1.0)\n","Requirement already satisfied: numpy>=1.14.6 in /usr/local/lib/python3.7/dist-packages (from scikit-learn) (1.21.4)\n","Requirement already satisfied: threadpoolctl>=2.0.0 in /usr/local/lib/python3.7/dist-packages (from scikit-learn) (3.0.0)\n","Requirement already satisfied: numpy in /usr/local/lib/python3.7/dist-packages (1.21.4)\n","Requirement already satisfied: scipy in /usr/local/lib/python3.7/dist-packages (1.7.2)\n","Requirement already satisfied: numpy<1.23.0,>=1.16.5 in /usr/local/lib/python3.7/dist-packages (from scipy) (1.21.4)\n","Requirement already satisfied: nltk in 
/usr/local/lib/python3.7/dist-packages (3.6.5)\n","Requirement already satisfied: joblib in /usr/local/lib/python3.7/dist-packages (from nltk) (1.1.0)\n","Requirement already satisfied: click in /usr/local/lib/python3.7/dist-packages (from nltk) (7.1.2)\n","Requirement already satisfied: tqdm in /usr/local/lib/python3.7/dist-packages (from nltk) (4.62.3)\n","Requirement already satisfied: regex>=2021.8.3 in /usr/local/lib/python3.7/dist-packages (from nltk) (2021.11.10)\n"]}]},{"cell_type":"code","metadata":{"id":"aoGLewGiEomz","executionInfo":{"status":"ok","timestamp":1637501965000,"user_tz":-120,"elapsed":432,"user":{"displayName":"Giorgos Siolas","photoUrl":"https://lh3.googleusercontent.com/a-/AOh14GjxnZOAObbc3X0z9X2rs1N_1geznqhrotkq3KF-p_M=s64","userId":"10127542075805046236"}}},"source":["import numpy as np\n","from sklearn.feature_extraction.text import TfidfVectorizer\n","from sklearn.cluster import KMeans\n","from sklearn.metrics import silhouette_score\n","import matplotlib.pyplot as plt\n","%matplotlib inline"],"execution_count":2,"outputs":[]},{"cell_type":"markdown","metadata":{"id":"ua5-z6r4mebN"},"source":["We load our text corpus (Reuters)."]},{"cell_type":"code","metadata":{"id":"zCUw7JWfmebP","colab":{"base_uri":"https://localhost:8080/"},"executionInfo":{"status":"ok","timestamp":1637501966624,"user_tz":-120,"elapsed":1628,"user":{"displayName":"Giorgos Siolas","photoUrl":"https://lh3.googleusercontent.com/a-/AOh14GjxnZOAObbc3X0z9X2rs1N_1geznqhrotkq3KF-p_M=s64","userId":"10127542075805046236"}},"outputId":"61d05125-74f5-4e27-9442-a529710b17ea"},"source":["import nltk\n","\n","\n","nltk.download('reuters') # download the dataset\n","from nltk.corpus import reuters # import the corpus reader\n","\n","# List of document ids\n","documents = reuters.fileids()\n"," \n","train_docs_id = list(filter(lambda doc: doc.startswith(\"train\"),\n"," documents))\n","\n","train_docs = [reuters.raw(doc_id) for doc_id in 
train_docs_id]"],"execution_count":3,"outputs":[{"output_type":"stream","name":"stderr","text":["[nltk_data] Downloading package reuters to /root/nltk_data...\n","[nltk_data] Package reuters is already up-to-date!\n"]}]},{"cell_type":"markdown","metadata":{"id":"eajysVOHv_22"},"source":["Our data will be the first 500 documents of the training set."]},{"cell_type":"code","metadata":{"id":"4TQkSfnmv6xH","executionInfo":{"status":"ok","timestamp":1637501966625,"user_tz":-120,"elapsed":15,"user":{"displayName":"Giorgos Siolas","photoUrl":"https://lh3.googleusercontent.com/a-/AOh14GjxnZOAObbc3X0z9X2rs1N_1geznqhrotkq3KF-p_M=s64","userId":"10127542075805046236"}}},"source":["data = train_docs[:500]"],"execution_count":4,"outputs":[]},{"cell_type":"markdown","metadata":{"id":"dcWNXljdmebU"},"source":["We will use scikit-learn's [TF-IDF Vectorizer](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html) to convert our texts into an array of vectors. You are asked to experiment and choose the TF-IDF arguments yourselves. We will then use the k-means algorithm and the silhouette metric to determine the number of clusters the texts fall into. To also get some insight into the topics the texts belong to, we will print the most important terms of each cluster.\n","\n","1. Try running the TF-IDF Vectorizer without parameters. What is the shape of the TF-IDF array? Do the results make sense?\n","2. Try running the Vectorizer with the English stopwords removed (parameter stop_words='english'). What is the shape of the array now? Did the clustering quality improve at all?\n","3. Try running the Vectorizer while also removing the terms that appear in fewer than 10 documents (parameter min_df=10). What is the shape of the array? Did the clustering improve?\n","4. 
Try running the Vectorizer while also removing the terms that appear in more than 50% of the documents (parameter max_df=0.5). Does this improve the clustering?\n","5. Try to improve the clustering quality further. Increase the maximum k to 50. Into how many categories is our document collection split? What is the topic of each category? How could the results be improved further (e.g. what is missing from the TfidfVectorizer)?"]},{"cell_type":"markdown","metadata":{"id":"GxH5xO8B06gR"},"source":["## 1.\n","Shape: (500, 6998)"]},{"cell_type":"code","metadata":{"id":"WXIlUPsLvnPV","colab":{"base_uri":"https://localhost:8080/","height":334},"executionInfo":{"status":"ok","timestamp":1637502443735,"user_tz":-120,"elapsed":60053,"user":{"displayName":"Giorgos Siolas","photoUrl":"https://lh3.googleusercontent.com/a-/AOh14GjxnZOAObbc3X0z9X2rs1N_1geznqhrotkq3KF-p_M=s64","userId":"10127542075805046236"}},"outputId":"a0f567b4-0f9a-4e10-c43e-7763ecc1e101"},"source":["vectorizer = TfidfVectorizer()\n","\n","tf_idf_array = vectorizer.fit_transform(data).toarray()\n","print('TF-IDF array shape:', tf_idf_array.shape)\n","\n","silhouette_scores = []\n","for k in range(2, 20):\n","    km = KMeans(k)\n","    preds = km.fit_predict(tf_idf_array)\n","    silhouette_scores.append(silhouette_score(tf_idf_array, preds))\n","\n","plt.plot(range(2, 20), silhouette_scores)\n","best_k = np.argmax(silhouette_scores) + 2 # +2 because range() starts at k=2, whereas list indices start at 0\n","plt.scatter(best_k, silhouette_scores[best_k-2], color='r') # for the same reason, the score for the best k sits at list index best_k-2\n","plt.xlim([2,19])\n","plt.annotate(\"best k\", xy=(best_k, silhouette_scores[best_k-2]), xytext=(5, silhouette_scores[best_k-2]),arrowprops=dict(arrowstyle=\"->\")) # annotation\n","print('Maximum average silhouette score for k =', best_k)\n","\n","km = 
KMeans(best_k)\n","km.fit(tf_idf_array)\n","terms = vectorizer.get_feature_names_out()\n","order_centroids = km.cluster_centers_.argsort()[:, ::-1]\n","for i in range(best_k):\n","    out = \"Cluster %d:\" % i\n","    for ind in order_centroids[i, :10]:\n","        out += ' %s' % terms[ind]\n","    print(out)"],"execution_count":6,"outputs":[{"output_type":"stream","name":"stdout","text":["TF-IDF array shape: (500, 6998)\n","Maximum average silhouette score for k = 2\n","Cluster 0: the to of in said and pct for it dlrs\n","Cluster 1: vs cts loss mln net 000 shr revs profit qtr\n"]},{"output_type":"display_data","data":{"image/png":"\n","text/plain":["
"]},"metadata":{"needs_background":"light"}}]},{"cell_type":"markdown","metadata":{"id":"fITDN0cmtMQW"},"source":["## 2-5. "]}]}