{"nbformat":4,"nbformat_minor":0,"metadata":{"colab":{"provenance":[]},"kernelspec":{"display_name":"Python 3","language":"python","name":"python3"}},"cells":[{"cell_type":"code","metadata":{"id":"QJNKuWwUYzsJ","colab":{"base_uri":"https://localhost:8080/","height":800},"executionInfo":{"status":"ok","timestamp":1668433051424,"user_tz":-120,"elapsed":43701,"user":{"displayName":"Giorgos Siolas","userId":"10127542075805046236"}},"outputId":"d35a1cf7-1b46-433a-bac1-c46fc5258b7f"},"source":["!pip install --upgrade scikit-learn\n","!pip install --upgrade numpy\n","!pip install --upgrade scipy\n","!pip install --upgrade nltk\n","!pip install --upgrade matplotlib\n","%matplotlib inline"],"execution_count":1,"outputs":[{"output_type":"stream","name":"stdout","text":["Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/\n","Requirement already satisfied: scikit-learn in /usr/local/lib/python3.7/dist-packages (1.0.2)\n","Requirement already satisfied: scipy>=1.1.0 in /usr/local/lib/python3.7/dist-packages (from scikit-learn) (1.7.3)\n","Requirement already satisfied: threadpoolctl>=2.0.0 in /usr/local/lib/python3.7/dist-packages (from scikit-learn) (3.1.0)\n","Requirement already satisfied: numpy>=1.14.6 in /usr/local/lib/python3.7/dist-packages (from scikit-learn) (1.21.6)\n","Requirement already satisfied: joblib>=0.11 in /usr/local/lib/python3.7/dist-packages (from scikit-learn) (1.2.0)\n","Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/\n","Requirement already satisfied: numpy in /usr/local/lib/python3.7/dist-packages (1.21.6)\n","Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/\n","Requirement already satisfied: scipy in /usr/local/lib/python3.7/dist-packages (1.7.3)\n","Requirement already satisfied: numpy<1.23.0,>=1.16.5 in /usr/local/lib/python3.7/dist-packages (from scipy) (1.21.6)\n","Looking in indexes: 
https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/\n","Requirement already satisfied: nltk in /usr/local/lib/python3.7/dist-packages (3.7)\n","Requirement already satisfied: regex>=2021.8.3 in /usr/local/lib/python3.7/dist-packages (from nltk) (2022.6.2)\n","Requirement already satisfied: tqdm in /usr/local/lib/python3.7/dist-packages (from nltk) (4.64.1)\n","Requirement already satisfied: joblib in /usr/local/lib/python3.7/dist-packages (from nltk) (1.2.0)\n","Requirement already satisfied: click in /usr/local/lib/python3.7/dist-packages (from nltk) (7.1.2)\n","Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/\n","Requirement already satisfied: matplotlib in /usr/local/lib/python3.7/dist-packages (3.2.2)\n","Collecting matplotlib\n"," Downloading matplotlib-3.5.3-cp37-cp37m-manylinux_2_5_x86_64.manylinux1_x86_64.whl (11.2 MB)\n","\u001b[K |████████████████████████████████| 11.2 MB 4.2 MB/s \n","\u001b[?25hRequirement already satisfied: kiwisolver>=1.0.1 in /usr/local/lib/python3.7/dist-packages (from matplotlib) (1.4.4)\n","Requirement already satisfied: pillow>=6.2.0 in /usr/local/lib/python3.7/dist-packages (from matplotlib) (7.1.2)\n","Collecting fonttools>=4.22.0\n"," Downloading fonttools-4.38.0-py3-none-any.whl (965 kB)\n","\u001b[K |████████████████████████████████| 965 kB 52.8 MB/s \n","\u001b[?25hRequirement already satisfied: pyparsing>=2.2.1 in /usr/local/lib/python3.7/dist-packages (from matplotlib) (3.0.9)\n","Requirement already satisfied: packaging>=20.0 in /usr/local/lib/python3.7/dist-packages (from matplotlib) (21.3)\n","Requirement already satisfied: numpy>=1.17 in /usr/local/lib/python3.7/dist-packages (from matplotlib) (1.21.6)\n","Requirement already satisfied: cycler>=0.10 in /usr/local/lib/python3.7/dist-packages (from matplotlib) (0.11.0)\n","Requirement already satisfied: python-dateutil>=2.7 in /usr/local/lib/python3.7/dist-packages (from matplotlib) 
(2.8.2)\n","Requirement already satisfied: typing-extensions in /usr/local/lib/python3.7/dist-packages (from kiwisolver>=1.0.1->matplotlib) (4.1.1)\n","Requirement already satisfied: six>=1.5 in /usr/local/lib/python3.7/dist-packages (from python-dateutil>=2.7->matplotlib) (1.15.0)\n","Installing collected packages: fonttools, matplotlib\n"," Attempting uninstall: matplotlib\n"," Found existing installation: matplotlib 3.2.2\n"," Uninstalling matplotlib-3.2.2:\n"," Successfully uninstalled matplotlib-3.2.2\n","Successfully installed fonttools-4.38.0 matplotlib-3.5.3\n"]},{"output_type":"display_data","data":{"application/vnd.colab-display-data+json":{"pip_warning":{"packages":["matplotlib","mpl_toolkits"]}}},"metadata":{}}]},{"cell_type":"markdown","metadata":{"id":"j3AJwwzFsa6f"},"source":["\n","# Text Mining\n","\n","Text mining is a set of automated (machine) techniques that aim to extract high-quality information from text. The ultimate horizon of text mining is the full semantic understanding of human language, which however belongs to the hard goals of so-called **Strong AI**. 
On the way to that goal, text mining research focuses on a series of more specific and therefore more attainable tasks (*Weak AI*), such as (among others):\n","- **Text categorization** - classifying texts by content into specific thematic categories\n","- **Text clustering** - grouping semantically \"close\" texts\n","- **Topic extraction** - discovering the topics a text contains\n","- **Concept/entity extraction** - identifying which real-world concepts and entities the text refers to\n","- **Sentiment analysis** - characterizing the sentiment expressed\n","- **Document summarization** - automatically producing a summary\n","- **Entity relation modeling** - identifying the relations that hold between the entities found in the text\n","- **Question answering** - answering a question, where both the question and the answer are in natural language\n","\n","\n","Text mining combines techniques and approaches from information theory and statistics, pattern recognition, data mining, machine learning, information retrieval, Natural Language Processing (NLP), linguistics, knowledge representation, and ontologies, among others."]},{"cell_type":"markdown","metadata":{"id":"fmKJbF4UVJOL"},"source":["For text mining and natural language processing we will rely on Python's [Natural Language Toolkit](http://www.nltk.org/) (NLTK).\n","\n","You can find more in the [NLTK book](http://www.nltk.org/book/)."]},{"cell_type":"code","metadata":{"id":"64d0RcciVJOM","executionInfo":{"status":"ok","timestamp":1668433739096,"user_tz":-120,"elapsed":1143,"user":{"displayName":"Giorgos Siolas","userId":"10127542075805046236"}}},"source":["import numpy as np\n","import 
nltk"],"execution_count":5,"outputs":[]},{"cell_type":"markdown","metadata":{"id":"a_RhBsaIVJOR"},"source":["## Loading texts into the notebook\n","\n","\n","Out of the box, NLTK ships only its most basic functionality. For the more advanced features that we will need, we have to download additional resources for the library. When running Python locally, we can do this with the `nltk.download()` command, which opens a window where we pick the resources we want to download. On some cloud platforms this is not possible, so we have to download the extra packages one by one, as we will see below."]},{"cell_type":"markdown","metadata":{"id":"yoaSYPnWoL5C"},"source":["\n","\n","### From Python libraries\n","\n","For this exercise we will use the [Reuters dataset](https://archive.ics.uci.edu/ml/datasets/reuters-21578+text+categorization+collection). \n","\n","The Reuters corpus contains 10788 news documents. Each document belongs to one or more of 90 news topic categories, which mostly concern commodities and financial goods and services (e.g. \"fuel\", \"cotton\", \"ship\"). 
The Reuters corpus is already split into a train and a test set.\n","We can load Reuters through NLTK:\n"]},{"cell_type":"code","metadata":{"id":"RWZXceaPVJOR","colab":{"base_uri":"https://localhost:8080/"},"executionInfo":{"status":"ok","timestamp":1668433741700,"user_tz":-120,"elapsed":422,"user":{"displayName":"Giorgos Siolas","userId":"10127542075805046236"}},"outputId":"a32c6628-ff0a-401b-f1b6-b28901a973a8"},"source":["nltk.download('reuters') # download the dataset\n","\n","from nltk.corpus import reuters # import it\n"],"execution_count":6,"outputs":[{"output_type":"stream","name":"stderr","text":["[nltk_data] Downloading package reuters to /root/nltk_data...\n","[nltk_data] Package reuters is already up-to-date!\n"]}]},{"cell_type":"markdown","metadata":{"id":"MK0sHMNDc9Br"},"source":["\n","and print some basic statistics:\n","\n"]},{"cell_type":"code","metadata":{"id":"Yp4tsAJLclo6","colab":{"base_uri":"https://localhost:8080/"},"executionInfo":{"status":"ok","timestamp":1668433747032,"user_tz":-120,"elapsed":1794,"user":{"displayName":"Giorgos Siolas","userId":"10127542075805046236"}},"outputId":"b38489d2-cbb7-4151-d22b-089ef5b6c1de"},"source":["def collection_stats():\n","    # List of documents\n","    documents = reuters.fileids()\n","    print(str(len(documents)) + \" documents\")\n","\n","    train_docs = list(filter(lambda doc: doc.startswith(\"train\"),\n","                             documents))\n","    print(str(len(train_docs)) + \" total train documents\")\n","\n","    test_docs = list(filter(lambda doc: doc.startswith(\"test\"),\n","                            documents))\n","    print(str(len(test_docs)) + \" total test documents\")\n","\n","    # List of categories\n","    categories = reuters.categories()\n","    print(str(len(categories)) + \" categories\")\n","\n","collection_stats()\n","print(reuters.categories()[:20], '...')"],"execution_count":7,"outputs":[{"output_type":"stream","name":"stdout","text":["10788 documents\n","7769 total train documents\n","3019 total test 
documents\n","90 categories\n","['acq', 'alum', 'barley', 'bop', 'carcass', 'castor-oil', 'cocoa', 'coconut', 'coconut-oil', 'coffee', 'copper', 'copra-cake', 'corn', 'cotton', 'cotton-oil', 'cpi', 'cpu', 'crude', 'dfl', 'dlr'] ...\n"]}]},{"cell_type":"markdown","metadata":{"id":"T_qroRjPbhcS"},"source":["For any given document we can look up the categories it belongs to, as well as the text itself, using its id:"]},{"cell_type":"code","metadata":{"id":"TtI_Oqxsb8uJ","colab":{"base_uri":"https://localhost:8080/"},"executionInfo":{"status":"ok","timestamp":1668433752977,"user_tz":-120,"elapsed":594,"user":{"displayName":"Giorgos Siolas","userId":"10127542075805046236"}},"outputId":"a0b46a54-e180-4112-c332-72a05586514c"},"source":["def describe_doc(document_id):\n","    # Raw categories\n","    print(\"Categories\")\n","    doc_categories = reuters.categories(document_id)\n","    print(doc_categories)\n","    # Raw document\n","    print(\"Document\")\n","    print(reuters.raw(document_id))\n","\n","doc_id = 'training/9880'\n","describe_doc(doc_id)\n","doc_id = 'training/9865'\n","describe_doc(doc_id)\n"],"execution_count":8,"outputs":[{"output_type":"stream","name":"stdout","text":["Categories\n","['money-fx']\n","Document\n","U.K. 
MONEY MARKET GETS 25 MLN STG LATE HELP\n"," The Bank of England said it provided\n"," about 25 mln stg in late help to the money market, bringing the\n"," total assistance today to 266 mln stg.\n"," This compares with the bank's revised estimate of a 350 mln\n"," stg money market shortfall.\n"," \n","\n","\n","Categories\n","['barley', 'corn', 'grain', 'wheat']\n","Document\n","FRENCH FREE MARKET CEREAL EXPORT BIDS DETAILED\n"," French operators have requested licences\n"," to export 675,500 tonnes of maize, 245,000 tonnes of barley,\n"," 22,000 tonnes of soft bread wheat and 20,000 tonnes of feed\n"," wheat at today's European Community tender, traders said.\n"," Rebates requested ranged from 127.75 to 132.50 European\n"," Currency Units a tonne for maize, 136.00 to 141.00 Ecus a tonne\n"," for barley and 134.25 to 141.81 Ecus for bread wheat, while\n"," rebates requested for feed wheat were 137.65 Ecus, they said.\n"," \n","\n","\n"]}]},{"cell_type":"markdown","metadata":{"id":"yqg9kYW7moPW"},"source":["The Reuters corpus is often used to study the performance of machine learning algorithms on text categorization or clustering."]},{"cell_type":"markdown","metadata":{"id":"kcNsEV60VJOd"},"source":["### From the internet"]},{"cell_type":"code","metadata":{"id":"cMLHmozBVJOe","colab":{"base_uri":"https://localhost:8080/"},"executionInfo":{"status":"ok","timestamp":1668433771673,"user_tz":-120,"elapsed":3296,"user":{"displayName":"Giorgos Siolas","userId":"10127542075805046236"}},"outputId":"723a4f52-27de-4bc0-c641-a77c3ef46a94"},"source":["import urllib.request  # the request submodule must be imported explicitly\n","\n","# the URL that holds the text (here, Moby Dick)\n","url = 'https://www.mirrorservice.org/sites/ftp.ibiblio.org/pub/docs/books/gutenberg/2/7/0/2701/2701-0.txt'\n","\n","with urllib.request.urlopen(url) as response:\n","    raw = response.read()\n","\n","# print a chunk of the 
text\n","text_chunk=raw[10000:11000]\n","print(text_chunk)"],"execution_count":9,"outputs":[{"output_type":"stream","name":"stdout","text":["b'_.\\r\\n\\r\\n \\xe2\\x80\\x9cAnd what thing soever besides cometh within the chaos of this\\r\\n monster\\xe2\\x80\\x99s mouth, be it beast, boat, or stone, down it goes all\\r\\n incontinently that foul great swallow of his, and perisheth in the\\r\\n bottomless gulf of his paunch.\\xe2\\x80\\x9d \\xe2\\x80\\x94_Holland\\xe2\\x80\\x99s Plutarch\\xe2\\x80\\x99s Morals_.\\r\\n\\r\\n \\xe2\\x80\\x9cThe Indian Sea breedeth the most and the biggest fishes that are:\\r\\n among which the Whales and Whirlpooles called Balaene, take up as\\r\\n much in length as four acres or arpens of land.\\xe2\\x80\\x9d \\xe2\\x80\\x94_Holland\\xe2\\x80\\x99s Pliny_.\\r\\n\\r\\n \\xe2\\x80\\x9cScarcely had we proceeded two days on the sea, when about sunrise a\\r\\n great many Whales and other monsters of the sea, appeared. Among the\\r\\n former, one was of a most monstrous size.... This came towards us,\\r\\n open-mouthed, raising the waves on all sides, and beating the sea\\r\\n before him into a foam.\\xe2\\x80\\x9d \\xe2\\x80\\x94_Tooke\\xe2\\x80\\x99s Lucian_. \\xe2\\x80\\x9c_The True History_.\\xe2\\x80\\x9d\\r\\n\\r\\n\\r\\n\\r\\n\\r\\n \\xe2\\x80\\x9cHe visited this country also with a view of catching horse-whales,\\r\\n which had bones of very great value for the'\n"]}]},{"cell_type":"markdown","metadata":{"id":"sOIlekSrk7W4"},"source":["The downloaded text is a `bytes` object encoded in UTF-8. 
We will decode it into a string of printable characters."]},{"cell_type":"code","metadata":{"id":"lKb1lugrj9qU","colab":{"base_uri":"https://localhost:8080/"},"executionInfo":{"status":"ok","timestamp":1668433772216,"user_tz":-120,"elapsed":6,"user":{"displayName":"Giorgos Siolas","userId":"10127542075805046236"}},"outputId":"1c7dd935-4ca7-4bd4-d52b-2986c823ceb8"},"source":["s = text_chunk.decode('utf-8')\n","print(s)"],"execution_count":10,"outputs":[{"output_type":"stream","name":"stdout","text":["_.\r\n","\r\n"," “And what thing soever besides cometh within the chaos of this\r\n"," monster’s mouth, be it beast, boat, or stone, down it goes all\r\n"," incontinently that foul great swallow of his, and perisheth in the\r\n"," bottomless gulf of his paunch.” —_Holland’s Plutarch’s Morals_.\r\n","\r\n"," “The Indian Sea breedeth the most and the biggest fishes that are:\r\n"," among which the Whales and Whirlpooles called Balaene, take up as\r\n"," much in length as four acres or arpens of land.” —_Holland’s Pliny_.\r\n","\r\n"," “Scarcely had we proceeded two days on the sea, when about sunrise a\r\n"," great many Whales and other monsters of the sea, appeared. Among the\r\n"," former, one was of a most monstrous size.... This came towards us,\r\n"," open-mouthed, raising the waves on all sides, and beating the sea\r\n"," before him into a foam.” —_Tooke’s Lucian_. “_The True History_.”\r\n","\r\n","\r\n","\r\n","\r\n"," “He visited this country also with a view of catching horse-whales,\r\n"," which had bones of very great value for the\n"]}]},{"cell_type":"markdown","metadata":{"id":"EKDKy_NMVJOh"},"source":["### From a local file \n","\n","Suppose I have a file on my computer named `mydoc.txt` (you can download one from [here](https://drive.google.com/uc?export=download&id=1WF31UdA9kmM5vmqgtvjTxoW0hHi0CmAV)). It first has to be uploaded to the notebook environment. 
In general this procedure differs from one cloud platform to another, so the code below only runs in a Google Colaboratory environment. Equivalent procedures for uploading a file from the local machine exist for the other cloud platforms we have shown."]},{"cell_type":"code","metadata":{"id":"drJ5uCxVnZDp","colab":{"base_uri":"https://localhost:8080/","height":90},"executionInfo":{"status":"ok","timestamp":1668433673187,"user_tz":-120,"elapsed":6566,"user":{"displayName":"Giorgos Siolas","userId":"10127542075805046236"}},"outputId":"c6aa7745-99f2-4e8e-f8da-6d25264ddac3"},"source":["from google.colab import files\n","\n","uploaded = files.upload()\n","\n","for fn in uploaded.keys():\n","    print('User uploaded file \"{name}\" with length {length} bytes'.format(\n","        name=fn, length=len(uploaded[fn])))"],"execution_count":1,"outputs":[{"output_type":"display_data","data":{"text/plain":[""],"text/html":["\n"," \n"," \n"," Upload widget is only available when the cell has been executed in the\n"," current browser session. 
Please rerun this cell to enable.\n"," \n"," "]},"metadata":{}},{"output_type":"stream","name":"stdout","text":["Saving mydoc.txt to mydoc.txt\n","User uploaded file \"mydoc.txt\" with length 1042 bytes\n"]}]},{"cell_type":"code","metadata":{"id":"NzYTrrzBndhX","colab":{"base_uri":"https://localhost:8080/"},"executionInfo":{"status":"ok","timestamp":1668433789670,"user_tz":-120,"elapsed":469,"user":{"displayName":"Giorgos Siolas","userId":"10127542075805046236"}},"outputId":"58828da5-6c73-4bfc-c38c-7c7d0383ddde"},"source":["!ls"],"execution_count":11,"outputs":[{"output_type":"stream","name":"stdout","text":["mydoc.txt sample_data\n"]}]},{"cell_type":"markdown","metadata":{"id":"Oo4gforhnyHw"},"source":["We read the contents of the file into the string `document`:"]},{"cell_type":"code","metadata":{"id":"uJQ97LXDVJOp","colab":{"base_uri":"https://localhost:8080/"},"executionInfo":{"status":"ok","timestamp":1668433790540,"user_tz":-120,"elapsed":61,"user":{"displayName":"Giorgos Siolas","userId":"10127542075805046236"}},"outputId":"a5803aef-44ed-4bae-8352-3f18773ba1ea"},"source":["with open('mydoc.txt', 'r') as f:\n","    document = ''\n","    # append the file to the string line by line\n","    for line in f:\n","        document += line\n","\n","print(document)"],"execution_count":12,"outputs":[{"output_type":"stream","name":"stdout","text":["Commerce Secretary Malcolm Baldrige\n","said he supported efforts to persuade newly-industrialized\n","countries (NICS) to revalue currencies that are tied to the\n","dollar in order to help the United States cut its massive trade\n","deficit.\n"," \"We do need to do something with those currencies or we\n","will be substituting Japanese products for Taiwanese products,\"\n","or those of other nations with currencies tied to the dollar,\n","Baldrige told a House banking subcommittee.\n"," The U.S. 
dollar has declined in value against the Yen and\n","European currencies, but has changed very little against the\n","currencies of some developing countries such as South Korea and\n","Taiwan because they are linked to the value of the dollar.\n"," As a result, efforts to reduce the value of the dollar over\n","the past year and a half have done little to improve the trade\n","deficits with those countries.\n"," Baldrige told a House Banking subcommittee that the\n","Treasury Department was attempting to persuade those countries\n","to reach agreement with the United States on exchange rates.\n","\n"]}]},{"cell_type":"markdown","metadata":{"id":"XnlcPw_ivn_p"},"source":["## Vector Space Model\n","\n","As a starting point, we assume that we have a collection of texts (text files) and that the machine learning algorithms we use take numerical values (vectors) as input. A first and very basic question is therefore how we can convert the texts into a suitable vector form. 
What would count as a \"suitable vector form\", though?\n","\n","From a computational standpoint, one answer is that when each text in the collection is converted into a vector, the conversion should preserve the semantic information of the texts: texts whose content is semantically \"close\" (they talk about related topics) should map to points of the vector space that are close to each other, and texts with dissimilar content should map to points that are far apart.\n"]},{"cell_type":"markdown","metadata":{"id":"vOY46Yu_B0Cu"},"source":["\n","### Bag of words\n","\n","Consider, as an example, the following small collection of documents: \n","\n","d1 = \"a big black cat\"\n","\n","d2 = \"a cat and a dog\"\n","\n","d3 = \"a lovely town\"\n","\n","Documents d1 and d2 share semantic content with each other, but not with d3. We build a vector in which each feature corresponds to one unique word of the collection, in alphabetical order:\n","\n","\\[ a and big black cat dog lovely town \\]\n","\n","Using these 8 features, we represent each document as a vector whose entries equal the number of times each word occurs in that document (term frequency): \n","\n","d1 = \\[ 1 0 1 1 1 0 0 0 \\]\n","\n","d2 = \\[ 2 1 0 0 1 1 0 0 \\]\n","\n","d3 = \\[ 1 0 0 0 0 0 1 1 \\]\n","\n","This is the basic vector space model (representation), which uses the frequency of each word. Because we ignore word order (\"a big black cat\" has the same vector representation as \"cat big a black\"), we call it a bag of words (BOW). 
\n","\n","Let us move these vectors into numpy and try computing distances between them:"]},{"cell_type":"code","metadata":{"id":"9CIQMyt4Gv6q","colab":{"base_uri":"https://localhost:8080/"},"executionInfo":{"status":"ok","timestamp":1668433790541,"user_tz":-120,"elapsed":54,"user":{"displayName":"Giorgos Siolas","userId":"10127542075805046236"}},"outputId":"bf1a72e6-ab50-45af-dd69-17fb7fa64b1f"},"source":["d1 = np.array([1,0,1,1,1,0,0,0])\n","d2 = np.array([2,1,0,0,1,1,0,0])\n","d3 = np.array([1,0,0,0,0,0,1,1])\n","\n","# Euclidean distance (store the value first, then print it)\n","\n","d1d2 = np.linalg.norm(d1-d2)\n","d1d3 = np.linalg.norm(d1-d3)\n","d2d3 = np.linalg.norm(d2-d3)\n","print(\"d1 vs d2\", d1d2)\n","print(\"d1 vs d3\", d1d3)\n","print(\"d2 vs d3\", d2d3)"],"execution_count":13,"outputs":[{"output_type":"stream","name":"stdout","text":["d1 vs d2 2.23606797749979\n","d1 vs d3 2.23606797749979\n","d2 vs d3 2.449489742783178\n"]}]},{"cell_type":"markdown","metadata":{"id":"7pScjhAZJMTM"},"source":["So we have not yet achieved our original goal, namely that d1 and d2 are closer to each other than either of them is to d3. 
In practice we do not use the Euclidean distance. Consider, for example,\n","\n","d4 = \"a big black cat a big black cat\"\n","\n","which contains no semantic unit of information (word) that differs from d1, so we would want its distance from d1 to be 0, and whose vector representation is exactly twice that of d1:\n","\n","d4 = \\[ 2 0 2 2 2 0 0 0 \\]\n","\n","With the Euclidean distance we get:"]},{"cell_type":"code","metadata":{"id":"HlsolnvpLSBs","colab":{"base_uri":"https://localhost:8080/"},"executionInfo":{"status":"ok","timestamp":1668433790543,"user_tz":-120,"elapsed":48,"user":{"displayName":"Giorgos Siolas","userId":"10127542075805046236"}},"outputId":"405b1c82-8a2d-43d2-aea3-8fb75847f328"},"source":["d4 = np.array([2,0,2,2,2,0,0,0])\n","d1d4 = np.linalg.norm(d1-d4)\n","print(\"d1 vs d4\", d1d4)"],"execution_count":14,"outputs":[{"output_type":"stream","name":"stdout","text":["d1 vs d4 2.0\n"]}]},{"cell_type":"markdown","metadata":{"id":"Z6Zh4ryILpPP"},"source":["### Cosine similarity\n","\n","For this reason, the Vector Space Model uses the cosine distance (or, equivalently, the cosine similarity): \n","\n","$ {\\text{similarity}}=\\cos(\\theta )={\\mathbf {A} \\cdot \\mathbf {B} \\over \\|\\mathbf {A} \\|\\|\\mathbf {B} \\|}={\\frac {\\sum \\limits _{i=1}^{n}{A_{i}B_{i}}}{{\\sqrt {\\sum \\limits _{i=1}^{n}{A_{i}^{2}}}}{\\sqrt {\\sum \\limits _{i=1}^{n}{B_{i}^{2}}}}}}$\n"]},{"cell_type":"code","metadata":{"id":"FcoNf4CeL9Wy","colab":{"base_uri":"https://localhost:8080/"},"executionInfo":{"status":"ok","timestamp":1668433790544,"user_tz":-120,"elapsed":42,"user":{"displayName":"Giorgos Siolas","userId":"10127542075805046236"}},"outputId":"75415847-8e9c-4b86-a4dd-4221a7052463"},"source":["import scipy as sp\n","import scipy.spatial.distance  # explicitly load the submodule used below\n","\n","cosd1d4 = sp.spatial.distance.cosine(d1, 
d4)\n","print(cosd1d4)"],"execution_count":15,"outputs":[{"output_type":"stream","name":"stdout","text":["0\n"]}]},{"cell_type":"markdown","metadata":{"id":"ap8y7fDKMW3a"},"source":["and similarly:"]},{"cell_type":"code","metadata":{"id":"Jn5n6KSsMav0","colab":{"base_uri":"https://localhost:8080/"},"executionInfo":{"status":"ok","timestamp":1668433790545,"user_tz":-120,"elapsed":35,"user":{"displayName":"Giorgos Siolas","userId":"10127542075805046236"}},"outputId":"8104e110-17d0-4360-9239-77beccf78ecf"},"source":["cosd1d2 = sp.spatial.distance.cosine(d1, d2)\n","cosd1d3 = sp.spatial.distance.cosine(d1, d3)\n","cosd2d3 = sp.spatial.distance.cosine(d2, d3)\n","\n","print(\"d1 vs d2\", cosd1d2)\n","print(\"d1 vs d3\", cosd1d3)\n","print(\"d2 vs d3\", cosd2d3)"],"execution_count":16,"outputs":[{"output_type":"stream","name":"stdout","text":["d1 vs d2 0.43305329048615915\n","d1 vs d3 0.7113248654051871\n","d2 vs d3 0.5635642195280153\n"]}]},{"cell_type":"markdown","metadata":{"id":"IYdHyGzqN1ak"},"source":["Cosine similarity normalizes by the norms (lengths) of the document vectors, so differences in vector length no longer matter. The similarity is 1 (equivalently, the distance is 0) for vectors that have exactly the same \"angle\" in the N-dimensional space of the vector space model, where N is the number of unique words. Note that `scipy.spatial.distance.cosine` returns the cosine distance, i.e. 1 minus the similarity, so smaller values mean more similar documents. With this measure d1 and d2 are indeed the closest pair (0.43), as we wanted.\n","\n","\n","![Imgur](https://i.imgur.com/Kl6MFc8.png)\n","\n"]},{"cell_type":"markdown","metadata":{"id":"frHOFj8VoYui"},"source":["## Converting texts to vectors\n"]},{"cell_type":"markdown","metadata":{"id":"HSn9R6mUPWKT"},"source":["Next we present some standard steps used to convert texts into vectors in the VSM. 
These steps include some improvements over the basic model.\n","\n","With a large collection of texts, taking every unique word as a feature of the VSM can lead to a very large number of dimensions, so a central concern is to use techniques that reduce this dimensionality as much as possible without losing content (meaning). "]},{"cell_type":"markdown","metadata":{"id":"xrbaAeBTVJOu"},"source":["### Text preprocessing\n","\n","Now that the text is loaded into Python, we need to process it. Because the computer treats uppercase and lowercase letters as different characters, the first thing we need to do is convert everything to **lowercase**. We then want to **split the text into individual words**, producing a list whose elements are the words. Note that the code in this section tokenizes the text as it is, without lowercasing it, so `Football` and `football` remain distinct tokens in the outputs."]},{"cell_type":"code","metadata":{"id":"EQV7lIyYVJOu","colab":{"base_uri":"https://localhost:8080/"},"executionInfo":{"status":"ok","timestamp":1668433791217,"user_tz":-120,"elapsed":698,"user":{"displayName":"Giorgos Siolas","userId":"10127542075805046236"}},"outputId":"72b1a6d8-4bc6-49fb-e06d-90d1cc879406"},"source":["documents = [\"Lionel Messi is the best football player in the world! Messi plays for Barcelona Football Club. Barcelona Football Club plays in the Spanish Primera Division.\",\n","             \"Lionel Messi a football player, playing for Barcelona Football Club, a Spanish football team.\", \n","             \"Barcelona is a city in a northern spanish province called Catalonia. It is the largest city in Catalonia and the second most populated spanish city.\", \n","             \"Python is a programming language. Python is an object-oriented programming language. Unlike COBOL, Python is a interpreted programming language.\", \n","             \"COBOL is a compiled computer programming language designed for business use. This programming language is imperative, procedural and, since 2002, object-oriented. 
But Python is better.\"]\n","\n","document = documents[1]\n","\n","nltk.download('punkt') # needed by the tokenizer\n","words = nltk.word_tokenize(document)\n","\n","print(words)"],"execution_count":17,"outputs":[{"output_type":"stream","name":"stderr","text":["[nltk_data] Downloading package punkt to /root/nltk_data...\n","[nltk_data] Unzipping tokenizers/punkt.zip.\n"]},{"output_type":"stream","name":"stdout","text":["['Lionel', 'Messi', 'a', 'football', 'player', ',', 'playing', 'for', 'Barcelona', 'Football', 'Club', ',', 'a', 'Spanish', 'football', 'team', '.']\n"]}]},{"cell_type":"markdown","metadata":{"id":"6LhkTcBnVJOy"},"source":["The tokenizer essentially does what the built-in `.split()` string method does, but a bit more cleverly. To begin with, it splits on spaces (`' '`) as well as on tabs (`'\\t'`) and newlines (`'\\n'`). As the output above shows, it also separates punctuation marks (here the commas and the final period) from the words they are attached to.\n","\n","The next step is to remove the **punctuation marks** from our list. Once that is done, we also want to remove some frequently used words that add no semantic value to the text (**stopwords**). 
Typical English stopwords are words such as \"the\", \"a\", \"to\", \"and\", \"he\", \"she\", and so on."]},{"cell_type":"code","metadata":{"id":"p6pypWyaVJOz","colab":{"base_uri":"https://localhost:8080/"},"executionInfo":{"status":"ok","timestamp":1668433791218,"user_tz":-120,"elapsed":23,"user":{"displayName":"Giorgos Siolas","userId":"10127542075805046236"}},"outputId":"a3009678-ded6-49e7-8967-5bb0486b5e41"},"source":["nltk.download('stopwords') # download a file with English stopwords\n","from nltk.corpus import stopwords\n","import string\n","\n","'''\n","Equivalent two-step version:\n","\n","# string.punctuation is simply a string containing all punctuation marks\n","filtered_words = [word for word in words if word not in list(string.punctuation)]\n","\n","# stopwords.words('english') is a list of English stopwords\n","filtered_words = [word for word in filtered_words if word not in stopwords.words('english')]\n","'''\n","# build the exclusion list once instead of once per word\n","excluded = stopwords.words('english') + list(string.punctuation)\n","filtered_words = [word for word in words if word not in excluded]\n","\n","print(filtered_words)"],"execution_count":18,"outputs":[{"output_type":"stream","name":"stdout","text":["['Lionel', 'Messi', 'football', 'player', 'playing', 'Barcelona', 'Football', 'Club', 'Spanish', 'football', 'team']\n"]},{"output_type":"stream","name":"stderr","text":["[nltk_data] Downloading package stopwords to /root/nltk_data...\n","[nltk_data] Unzipping corpora/stopwords.zip.\n"]}]},{"cell_type":"markdown","metadata":{"id":"BlSuSAUzVJO3"},"source":["We need to do a more thorough job of removing punctuation, because tokens made up of more than one punctuation mark are not removed by the filter above."]},{"cell_type":"code","metadata":{"id":"VcLJukuIVJO4","colab":{"base_uri":"https://localhost:8080/"},"executionInfo":{"status":"ok","timestamp":1668433791219,"user_tz":-120,"elapsed":18,"user":{"displayName":"Giorgos 
Siolas","userId":"10127542075805046236"}},"outputId":"ad735b68-d322-4cd5-9020-1e5804ea4142"},"source":["def thorough_filter(words):\n","    # keep a token unless every one of its characters is a punctuation mark\n","    filtered_words = []\n","    for word in words:\n","        pun = []\n","        for letter in word:\n","            pun.append(letter in string.punctuation)\n","        if not all(pun):\n","            filtered_words.append(word)\n","    return filtered_words\n","\n","filtered_words = thorough_filter(filtered_words)\n","print(filtered_words)"],"execution_count":19,"outputs":[{"output_type":"stream","name":"stdout","text":["['Lionel', 'Messi', 'football', 'player', 'playing', 'Barcelona', 'Football', 'Club', 'Spanish', 'football', 'team']\n"]}]},{"cell_type":"markdown","metadata":{"id":"WCySdKVxVJO9"},"source":["### Stemming & Lemmatization\n","\n","For grammatical reasons, texts use different forms of a word, e.g. *play*, *plays*, *playing*, *played*. As a result, although these forms refer to similar semantic content, the computer treats them as different terms, which adds dimensions to the representation. To solve this problem we can use one of two linguistic transformations: suffix stripping (stemming) or lemmatization. The goal of both is to reduce the various forms of a word to a common base form. More specifically:\n","\n","**Stemming** refers to a crude heuristic process that chops off the ends of words in the hope of achieving this goal correctly most of the time.\n","\n","**Lemmatization** refers to removing the inflection of words and returning the form of the word as we would find it in a dictionary, using a vocabulary and morphological analysis of the words. 
This form is known as the *lemma*."]},{"cell_type":"code","metadata":{"id":"bTe8-vDgVJO-","colab":{"base_uri":"https://localhost:8080/"},"executionInfo":{"status":"ok","timestamp":1668433793553,"user_tz":-120,"elapsed":2346,"user":{"displayName":"Giorgos Siolas","userId":"10127542075805046236"}},"outputId":"5ec77328-6be1-4bc4-fbbb-8a827bac83d4"},"source":["import nltk\n","nltk.download('omw-1.4')\n","nltk.download('wordnet') # required downloads for the stemmer/lemmatizer\n","nltk.download('rslp')\n","\n","from nltk.stem import WordNetLemmatizer\n","wordnet_lemmatizer = WordNetLemmatizer()\n","\n","from nltk.stem.porter import PorterStemmer\n","porter_stemmer = PorterStemmer()\n","\n","lem_words = [wordnet_lemmatizer.lemmatize(word) for word in filtered_words]\n","stem_words = [porter_stemmer.stem(word) for word in filtered_words]\n","\n","print('\\n{:<20} {:<20} {:<20}'.format('Original', 'Stemmed', 'Lemmatized'))\n","print('-'*60)\n","for i in range(len(filtered_words)):\n","    print('{:<20} {:<20} {:<20}'.format(filtered_words[i], stem_words[i], lem_words[i]))"],"execution_count":20,"outputs":[{"output_type":"stream","name":"stderr","text":["[nltk_data] Downloading package omw-1.4 to /root/nltk_data...\n","[nltk_data] Downloading package wordnet to /root/nltk_data...\n","[nltk_data] Downloading package rslp to /root/nltk_data...\n","[nltk_data] Unzipping stemmers/rslp.zip.\n"]},{"output_type":"stream","name":"stdout","text":["\n","Original Stemmed Lemmatized \n","------------------------------------------------------------\n","Lionel lionel Lionel \n","Messi messi Messi \n","football footbal football \n","player player player \n","playing play playing \n","Barcelona barcelona Barcelona \n","Football footbal Football \n","Club club Club \n","Spanish spanish Spanish \n","football footbal football \n","team team team \n"]}]},{"cell_type":"markdown","metadata":{"id":"3828s8RpVJPB"},"source":["**Note:** we use either stemming (more commonly) or 
lemmatization, but not both together. The former improves recall, the latter precision.\n","\n","Now that we have completed the linguistic preprocessing, we define a small collection of texts so that we can move on to a text-clustering example. \n","\n","As we see below, the first two and the last two texts are semantically close to each other."]},{"cell_type":"code","metadata":{"id":"hjhcVYkhVJPI","executionInfo":{"status":"ok","timestamp":1668433793554,"user_tz":-120,"elapsed":50,"user":{"displayName":"Giorgos Siolas","userId":"10127542075805046236"}}},"source":["# Our new collection of texts\n","documents = [\"Lionel Messi is the best football player in the world! Messi plays for Barcelona Football Club. Barcelona Football Club plays in the Spanish Primera Division.\",\n","             \"Lionel Messi a football player, playing for Barcelona Football Club, a Spanish football team.\", \n","             \"Barcelona is a city in a northern spanish province called Catalonia. It is the largest city in Catalonia and the second most populated spanish city.\", \n","             \"Python is a programming language. Python is an object-oriented programming language. Unlike COBOL, Python is a interpreted programming language.\", \n","             \"COBOL is a compiled computer programming language designed for business use. This programming language is imperative, procedural and, since 2002, object-oriented. 
But Python is better.\"]\n"],"execution_count":21,"outputs":[]},{"cell_type":"markdown","metadata":{"id":"BhosxSZKUZaI"},"source":["We apply the preprocessing steps introduced above and print the number of occurrences of each token in each document."]},{"cell_type":"code","metadata":{"id":"j1_d_UMzTYrk","colab":{"base_uri":"https://localhost:8080/"},"executionInfo":{"status":"ok","timestamp":1668433793555,"user_tz":-120,"elapsed":50,"user":{"displayName":"Giorgos Siolas","userId":"10127542075805046236"}},"outputId":"146084f9-1537-4635-c79b-64dac3a7d570"},"source":["import collections\n","\n","def preprocess_document(document):\n","    # all the steps we have carried out so far\n","    words = nltk.word_tokenize(document.lower())\n","    # build the exclusion set once instead of rebuilding the list for every word\n","    excluded = set(stopwords.words('english')) | set(string.punctuation)\n","    filtered_words = [word for word in words if word not in excluded]\n","    filtered_words = thorough_filter(filtered_words)\n","    # each token is lemmatized first and then stemmed\n","    stemmed_words = [porter_stemmer.stem(wordnet_lemmatizer.lemmatize(word)) for word in filtered_words]\n","    cnt = collections.Counter(stemmed_words)\n","    return cnt\n","\n","preprocessed_documents = [preprocess_document(doc) for doc in documents]\n","\n","for doc in preprocessed_documents:\n","    print(doc)"],"execution_count":22,"outputs":[{"output_type":"stream","name":"stdout","text":["Counter({'footbal': 3, 'messi': 2, 'play': 2, 'barcelona': 2, 'club': 2, 'lionel': 1, 'best': 1, 'player': 1, 'world': 1, 'spanish': 1, 'primera': 1, 'divis': 1})\n","Counter({'footbal': 3, 'lionel': 1, 'messi': 1, 'player': 1, 'play': 1, 'barcelona': 1, 'club': 1, 'spanish': 1, 'team': 1})\n","Counter({'citi': 3, 'spanish': 2, 'catalonia': 2, 'barcelona': 1, 'northern': 1, 'provinc': 1, 'call': 1, 'largest': 1, 'second': 1, 'popul': 1})\n","Counter({'python': 3, 'program': 3, 'languag': 3, 'object-ori': 1, 'unlik': 1, 'cobol': 1, 'interpret': 1})\n","Counter({'program': 2, 'languag': 2, 'cobol': 1, 'compil': 1, 'comput': 1, 'design': 1, 'busi': 1, 'use': 1, 'imper': 1, 'procedur': 1, 'sinc': 1, '2002': 1, 
'object-ori': 1, 'python': 1, 'better': 1})\n"]}]},{"cell_type":"markdown","metadata":{"id":"X-JniL0wVJPN"},"source":["A typical way to reduce dimensionality is to discard terms that are extremely frequent, extremely rare, or that appear in very few documents (always measured over the whole collection, not over individual documents).\n","\n","In this small dataset we choose to discard the terms that appear only once in the whole collection."]},{"cell_type":"code","metadata":{"id":"bU2vMn9DDVeu","colab":{"base_uri":"https://localhost:8080/"},"executionInfo":{"status":"ok","timestamp":1668433793556,"user_tz":-120,"elapsed":46,"user":{"displayName":"Giorgos Siolas","userId":"10127542075805046236"}},"outputId":"a7b96d05-cf33-41fd-b923-a3cb95ea362e"},"source":["threshold = 1\n","\n","# start from an empty counter so that preprocessed_documents[0] is not modified in place\n","total_counter = collections.Counter()\n","\n","for doc_counter in preprocessed_documents:\n","    total_counter += doc_counter # counter holding the total counts over all documents\n","\n","print(total_counter, '\\n')\n","\n","vocabulary = [word for word in total_counter if total_counter[word] > threshold] # keep only the terms whose total count exceeds the threshold\n","\n","\n","print(vocabulary)"],"execution_count":23,"outputs":[{"output_type":"stream","name":"stdout","text":["Counter({'footbal': 6, 'program': 5, 'languag': 5, 'barcelona': 4, 'spanish': 4, 'python': 4, 'messi': 3, 'play': 3, 'club': 3, 'citi': 3, 'lionel': 2, 'player': 2, 'catalonia': 2, 'object-ori': 2, 'cobol': 2, 'best': 1, 'world': 1, 'primera': 1, 'divis': 1, 'team': 1, 'northern': 1, 'provinc': 1, 'call': 1, 'largest': 1, 'second': 1, 'popul': 1, 'unlik': 1, 'interpret': 1, 'compil': 1, 'comput': 1, 'design': 1, 'busi': 1, 'use': 1, 'imper': 1, 'procedur': 1, 'sinc': 1, '2002': 1, 'better': 1}) \n","\n","['lionel', 'messi', 'footbal', 'player', 'play', 
'barcelona', 'club', 'spanish', 'citi', 'catalonia', 'python', 'program', 'languag', 'object-ori', 'cobol']\n"]}]},{"cell_type":"markdown","metadata":{"id":"yFY2a06iVJPR"},"source":["For convenience we build an array whose rows are the documents and whose columns are the words, and we store in it the number of occurrences of each term."]},{"cell_type":"code","metadata":{"id":"pUIfU63sVJPS","colab":{"base_uri":"https://localhost:8080/"},"executionInfo":{"status":"ok","timestamp":1668433793558,"user_tz":-120,"elapsed":41,"user":{"displayName":"Giorgos Siolas","userId":"10127542075805046236"}},"outputId":"85b9e476-5060-4016-ef35-dbecc6870c31"},"source":["freq_array = np.zeros((len(preprocessed_documents), len(vocabulary)))\n","\n","# freq_array[i, j] = number of times vocabulary term j occurs in document i\n","for i in range(len(preprocessed_documents)):\n","    for j in range(len(vocabulary)):\n","        freq_array[i,j] = preprocessed_documents[i][vocabulary[j]]\n","\n","print(vocabulary, '\\n')\n","print(freq_array)"],"execution_count":24,"outputs":[{"output_type":"stream","name":"stdout","text":["['lionel', 'messi', 'footbal', 'player', 'play', 'barcelona', 'club', 'spanish', 'citi', 'catalonia', 'python', 'program', 'languag', 'object-ori', 'cobol'] \n","\n","[[1. 2. 3. 1. 2. 2. 2. 1. 0. 0. 0. 0. 0. 0. 0.]\n"," [1. 1. 3. 1. 1. 1. 1. 1. 0. 0. 0. 0. 0. 0. 0.]\n"," [0. 0. 0. 0. 0. 1. 0. 2. 3. 2. 0. 0. 0. 0. 0.]\n"," [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 3. 3. 3. 1. 1.]\n"," [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 2. 2. 1. 1.]]\n"]}]},{"cell_type":"markdown","metadata":{"id":"vmawczqZVJPW"},"source":["### TF-IDF\n","\n","Representing documents in the VSM by raw occurrence counts alone is not optimal. In practice we use **TF-IDF** (Term Frequency - Inverse Document Frequency).\n","\n","As its name suggests, tf-idf is made up of 2 factors. The first is the **Term Frequency (TF)**:\n","\n","$$ tf(i,d) = \\frac{f(i,d)}{\\sum_{j} f(j,d)}$$\n","\n","where *i* is a term of document *d*, *f(i,d)* is the number of occurrences of *i* in *d*, and the sum in the denominator runs over all terms *j* of *d*. 
The tf is essentially the relative frequency with which each term appears in the document. Words with a high frequency are more important for the document than words with a low one."]},{"cell_type":"code","metadata":{"id":"YXLO-r_RVJPX","colab":{"base_uri":"https://localhost:8080/"},"executionInfo":{"status":"ok","timestamp":1668433793559,"user_tz":-120,"elapsed":37,"user":{"displayName":"Giorgos Siolas","userId":"10127542075805046236"}},"outputId":"799cf5f5-ffae-40bd-bfaf-6e8260565228"},"source":["print(freq_array.sum(axis=1)) # the number of terms per document\n","\n","\n","for i in range(len(freq_array)):\n","    freq_array[i, :] = freq_array[i, :] / freq_array[i, :].sum() # term frequency (occurrences of the term / total terms in the document)\n","\n","print(freq_array)"],"execution_count":25,"outputs":[{"output_type":"stream","name":"stdout","text":["[14. 10. 8. 11. 7.]\n","[[0.07142857 0.14285714 0.21428571 0.07142857 0.14285714 0.14285714\n"," 0.14285714 0.07142857 0. 0. 0. 0.\n"," 0. 0. 0. ]\n"," [0.1 0.1 0.3 0.1 0.1 0.1\n"," 0.1 0.1 0. 0. 0. 0.\n"," 0. 0. 0. ]\n"," [0. 0. 0. 0. 0. 0.125\n"," 0. 0.25 0.375 0.25 0. 0.\n"," 0. 0. 0. ]\n"," [0. 0. 0. 0. 0. 0.\n"," 0. 0. 0. 0. 0.27272727 0.27272727\n"," 0.27272727 0.09090909 0.09090909]\n"," [0. 0. 0. 0. 0. 0.\n"," 0. 0. 0. 0. 0.14285714 0.28571429\n"," 0.28571429 0.14285714 0.14285714]]\n"]}]},{"cell_type":"markdown","metadata":{"id":"n-ft9F1eVJPa"},"source":["The second factor in tf-idf is the **Inverse Document Frequency**:\n","\n","$$ idf(i) = \\log \\frac{N}{df(i)}$$\n","\n","where *N* is the number of documents and *df(i)* is the number of documents in which term *i* appears. The idf indicates how much information each word carries. If a word appears in every document it carries no information: the fraction becomes 1, so the logarithm gives 0. Conversely, the fewer documents a word appears in, the larger the fraction and therefore the idf. 
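\n","\n","As a worked illustration of the formula (a sketch that assumes base-10 logarithms, which is what `np.log10` computes in the code cell that follows; the formula itself does not fix the base): in the count matrix printed earlier, *barcelona* appears in 3 of the $N = 5$ documents, while *citi* appears in only one. Their idf values are therefore\n","\n","$$ idf(\\text{barcelona}) = \\log_{10} \\frac{5}{3} \\approx 0.222, \\qquad idf(\\text{citi}) = \\log_{10} \\frac{5}{1} \\approx 0.699 $$\n","\n","so the rarer term receives roughly three times the weight of the more widespread one.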
"]},{"cell_type":"code","metadata":{"id":"YmsGVtFDVJPb","executionInfo":{"status":"ok","timestamp":1668433793562,"user_tz":-120,"elapsed":35,"user":{"displayName":"Giorgos Siolas","userId":"10127542075805046236"}}},"source":["# df(i): the number of documents (rows) in which each term (column) appears\n","non_zero_elements_per_column = np.zeros((len(freq_array[0])))\n","\n","for i in range(len(freq_array)):\n","    for j in range(len(freq_array[0])):\n","        if freq_array[i,j]>0.0:\n","            non_zero_elements_per_column[j] += 1\n","\n","# equivalent one-liner: np.count_nonzero counts the non-zero entries along an axis\n","#non_zero_elements_per_column = np.count_nonzero(freq_array, axis=0)\n","\n","# numerator: the number of documents (the number of rows of freq_array)\n","# denominator: the number of documents that contain each term\n","idf = np.log10(float(len(freq_array))/non_zero_elements_per_column)"],"execution_count":26,"outputs":[]},{"cell_type":"markdown","metadata":{"id":"BhOpEdJ6VJPd"},"source":["Finally, tf-idf is computed as the product of the two factors:\n","\n","$$ tfidf(i,d) = tf(i,d) \\cdot idf(i)$$"]},{"cell_type":"code","metadata":{"id":"oakbvK76VJPe","scrolled":true,"colab":{"base_uri":"https://localhost:8080/"},"executionInfo":{"status":"ok","timestamp":1668433793563,"user_tz":-120,"elapsed":36,"user":{"displayName":"Giorgos Siolas","userId":"10127542075805046236"}},"outputId":"8253d615-f155-4113-b7b1-f65659a996d9"},"source":["tf_idf = freq_array * idf # tf-idf is simply the product of tf and idf (idf is broadcast across the rows)\n","\n","print(tf_idf)"],"execution_count":27,"outputs":[{"output_type":"stream","name":"stdout","text":["[[0.02842429 0.05684857 0.08527286 0.02842429 0.05684857 0.03169268\n"," 0.05684857 0.01584634 0. 0. 0. 0.\n"," 0. 0. 0. ]\n"," [0.039794 0.039794 0.119382 0.039794 0.039794 0.02218487\n"," 0.039794 0.02218487 0. 0. 0. 0.\n"," 0. 0. 0. ]\n"," [0. 0. 0. 0. 0. 0.02773109\n"," 0. 0.05546219 0.26211375 0.1747425 0. 0.\n"," 0. 0. 0. ]\n"," [0. 0. 0. 0. 0. 0.\n"," 0. 0. 0. 0. 
0.10852909 0.10852909\n"," 0.10852909 0.03617636 0.03617636]\n"," [0. 0. 0. 0. 0. 0.\n"," 0. 0. 0. 0. 0.05684857 0.11369715\n"," 0.11369715 0.05684857 0.05684857]]\n"]}]},{"cell_type":"markdown","metadata":{"id":"eKCWHEfuVJPi"},"source":["To see which documents lie closest to one another, we simply compute the pairwise distances between their vectors."]},{"cell_type":"code","metadata":{"id":"QyW1IV6zVJPi","scrolled":true,"colab":{"base_uri":"https://localhost:8080/"},"executionInfo":{"status":"ok","timestamp":1668433793565,"user_tz":-120,"elapsed":32,"user":{"displayName":"Giorgos Siolas","userId":"10127542075805046236"}},"outputId":"ef7c48eb-69ff-4364-d38d-7835328b6536"},"source":["distances = np.zeros((len(tf_idf), len(tf_idf)))\n","\n","from scipy.spatial import distance as sp_distance # import the submodule explicitly\n","for i in range(len(tf_idf)):\n","    for j in range(len(tf_idf)):\n","        distances[i,j] = sp_distance.cosine(tf_idf[i], tf_idf[j]) # cosine distance = 1 - cosine similarity\n","        #distances[i,j] = sum(np.abs(tf_idf[i] - tf_idf[j])) # Manhattan (L1) alternative; same concern as with the Euclidean distance, though less of a problem here because we have normalized by document length\n","print(distances)"],"execution_count":28,"outputs":[{"output_type":"stream","name":"stdout","text":["[[0. 0.05358886 0.96113036 1. 1. ]\n"," [0.05358886 0. 0.96222231 1. 1. ]\n"," [0.96113036 0.96222231 0. 1. 1. ]\n"," [1. 1. 1. 0. 0.04818273]\n"," [1. 1. 1. 0.04818273 0. ]]\n"]}]},{"cell_type":"markdown","metadata":{"id":"N65CJy19VJPm"},"source":["As we can see, the first 2 and the last 2 vectors of the array are very close to each other (cosine distance of about 0.05). In contrast, every other pair of documents has a distance greater than 0.9.\n"]},{"cell_type":"markdown","metadata":{"id":"w0T-s4Zx19cL"},"source":["## Application: Text clustering"]},{"cell_type":"markdown","metadata":{"id":"oWmjWb8H159y"},"source":["\n","### Hierarchical Clustering\n","\n","In this application we will use hierarchical clustering with [Ward's minimum-variance method](https://en.wikipedia.org/wiki/Ward%27s_method). 
This algorithm repeatedly looks for the pair of clusters whose merger gives the smallest increase in the total within-cluster variance. (Note: the within-cluster variance is the variance of all points around the centre of their cluster and is defined for each cluster separately. The total within-cluster variance is the sum of the within-cluster variances over all clusters.)\n","\n","Initially it treats every point as a cluster of its own. It then looks for the pair of points that, if merged into one cluster, would lead to the smallest increase in the total within-cluster variance. Clearly, at this first step these are the two closest points. The process is repeated, merging clusters step by step, until only 2 groups remain, which are joined in the final merge. \n","\n","This can be implemented very simply with [SciPy](https://docs.scipy.org/doc/scipy/reference/generated/scipy.cluster.hierarchy.linkage.html) on the previous example with the sentences:"]},{"cell_type":"code","metadata":{"id":"0gzxgxaXVJPn","colab":{"base_uri":"https://localhost:8080/","height":268},"executionInfo":{"status":"ok","timestamp":1668433793567,"user_tz":-120,"elapsed":30,"user":{"displayName":"Giorgos Siolas","userId":"10127542075805046236"}},"outputId":"0004725e-eb6d-4fc9-a30c-76e4bc0d1209"},"source":["import matplotlib.pyplot as plt\n","\n","from scipy.cluster import hierarchy\n","Z = hierarchy.linkage(tf_idf, 'ward') # runs Ward's agglomerative clustering and returns the linkage matrix\n","plt.figure()\n","dn = hierarchy.dendrogram(Z) # draws a dendrogram of the hierarchical clustering result"],"execution_count":29,"outputs":[{"output_type":"display_data","data":{"text/plain":["
"],"image/png":"iVBORw0KGgoAAAANSUhEUgAAAXQAAAD7CAYAAAB68m/qAAAAOXRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjUuMywgaHR0cHM6Ly9tYXRwbG90bGliLm9yZy/NK7nSAAAACXBIWXMAAAsTAAALEwEAmpwYAAANDklEQVR4nO3df6zd9V3H8eeL1joD1IRw3UhbKM4OvYqWeQUXoyxKspZFajLIWtwyCUtxs8kM/mETkZhuidlm5l9VacJwYWL5kSmNXMcf05ks2UgvrhNbdlmtDFpL1gmuAwas7u0f9xTvLrf3nMLp/d5+zvOR3PR8v98P57zbkme/93vuOSdVhSTp7HdO1wNIkobDoEtSIwy6JDXCoEtSIwy6JDVieVcPfOGFF9batWu7enhJOis9+uij366qsfmOdRb0tWvXMjU11dXDS9JZKck3T3XMSy6S1AiDLkmNMOiS1AiDLkmNMOiS1AiDLkmNMOiS1AiDLkmN6OyFRUvBPY88xYP7jnQ9hjSvTetXceNVF3c9hs4iI32G/uC+Ixw4erzrMaTXOHD0uCcbOm0jfYYOMH7RSu695R1djyH9kPfe8eWuR9BZaKTP0CWpJQZdkhph0CWpEQZdkhph0CWpEQZdkhph0CWpEQZdkhph0CWpEQMFPcmGJNNJDibZvsC69ySpJBPDG1GSNIi+QU+yDNgJbATGgS1JxudZdz7wEeCRYQ8pSepvkDP0K4GDVXWoql4BdgOb5ln3UeDjwEtDnE+SNKBBgr4KeHrW9uHevlcleTuwpqoeWuiOkmxNMpVk6tixY6c9rCTp1N7wk6JJzgE+BfxBv7VVtauqJqpqYmxs7I0+tCRplkGCfgRYM2t7dW/fSecDPwd8McmTwC8De3xiVJIW1yBB3wusS3JpkhXAZmDPyYNV9Z2qurCq1lbVWuArwHVVNXVGJpYkzatv0KvqBLANeBh4HLivqvYn2ZHkujM9oCRpMAN9YlFVTQKTc/bdfoq173zjY0mSTpevFJWkRhh0SWqEQZekRhh0SWqEQZekRhh0SWqEQZekRhh0SWqEQZekRhh0SWqEQZekRhh0SWqEQZekRhh0SWqEQZekRhh0SWqEQZekRhh0SWqEQZekRhh0SWqEQZekRhh0SWqEQZekRizvegDppHseeYoH9x3peowl4cDR4wC8944vdzzJ0rBp/SpuvOrirsdY8jxD15Lx4L4jr4Zs1I1ftJLxi1Z2PcaScODocf+hH5Bn6FpSxi9ayb23vKPrMbSE+F3K4DxDl6RGGHRJaoRBl6RGGHRJaoRBl6RGGHRJaoRBl6RGGHRJaoRBl6RGGHRJaoRBl6RGGHRJaoRBl6RGDBT0JBuSTCc5mGT7PMd/N8ljSfYl+VKS8eGPKklaSN+gJ1kG7AQ2AuPAlnmCfU9VXV5V64FPAJ8a9qCSpIUNcoZ+JXCwqg5V1SvAbmDT7AVVNftTCc4FangjSpIGMcgHXKwCnp61fRi4au6iJL8H3AqsAH59vjtKshXYCnDxxX6clCQN09CeFK2qnVX1VuAPgdtOsWZXVU1U1cTY2NiwHlqSxGBBPwKsmbW9urfvVHYDv/UGZpIkvQ6DBH0vsC7JpUlWAJuBPbMXJFk3a/PdwDeGN6IkaRB9r6FX1Ykk24CHgWXAp6tqf5IdwFRV7QG2JbkG+D7wHPCBMzm0JOm1BnlSlKqaBCbn7Lt91u2PDHkuSdJp8pWiktQIgy5JjTDoktQIgy5JjTDoktQIgy5JjTDoktQIgy5JjTDoktQIgy5JjTDoktQIgy5JjTDoktQIgy5JjTDoktQIgy5JjTDoktQIgy5JjTDoktQIgy5JjTDoktQIgy5JjTDoktQIgy5JjTDoktQIgy5JjTDoktQIgy5JjTDoktQIgy5JjTDoktQIgy5JjTDoktQIgy5JjTDoktQIgy5JjTDoktQIgy5JjTDoktSIgYKeZEOS6SQHk2yf5/it
SQ4k+bckX0hyyfBHlSQtpG/QkywDdgIbgXFgS5LxOcu+CkxU1c8DDwCfGPagkqSFDXKGfiVwsKoOVdUrwG5g0+wFVfXPVfVib/MrwOrhjilJ6meQoK8Cnp61fbi371RuBv5xvgNJtiaZSjJ17NixwaeUJPU11CdFk7wPmAA+Od/xqtpVVRNVNTE2NjbMh5akkbd8gDVHgDWztlf39v2QJNcAfwRcXVUvD2c8SdKgBjlD3wusS3JpkhXAZmDP7AVJrgDuAK6rqm8Nf0xJUj99g15VJ4BtwMPA48B9VbU/yY4k1/WWfRI4D7g/yb4ke05xd5KkM2SQSy5U1SQwOWff7bNuXzPkuSRJp8lXikpSIwy6JDXCoEtSIwy6JDXCoEtSIwy6JDXCoEtSIwy6JDXCoEtSIwy6JDXCoEtSIwy6JDXCoEtSIwy6JDXCoEtSIwy6JDXCoEtSIwy6JDXCoEtSIwy6JDXCoEtSIwy6JDXCoEtSIwy6JDXCoEtSIwy6JDXCoEtSIwy6JDXCoEtSIwy6JDXCoEtSIwy6JDXCoEtSIwy6JDXCoEtSIwy6JDXCoEtSIwy6JDXCoEtSIwYKepINSaaTHEyyfZ7jv5bkX5OcSHL98MeUJPXTN+hJlgE7gY3AOLAlyficZU8BvwPcM+wBJUmDWT7AmiuBg1V1CCDJbmATcODkgqp6snfsB2dgRknSAAa55LIKeHrW9uHePknSErKoT4om2ZpkKsnUsWPHFvOhJal5gwT9CLBm1vbq3r7TVlW7qmqiqibGxsZez11Ikk5hkKDvBdYluTTJCmAzsOfMjiVJOl19g15VJ4BtwMPA48B9VbU/yY4k1wEk+aUkh4EbgDuS7D+TQ0uSXmuQn3KhqiaByTn7bp91ey8zl2IkSR3xlaKS1AiDLkmNMOiS1AiDLkmNMOiS1AiDLkmNMOiS1AiDLkmNMOiS1AiDLkmNMOiS1AiDLkmNMOiS1IiB3m1R0ui5/4n7mTw02X/hGTb97NUA3PT5XZ3Oce1PXssNb7uh0xn6MeiS5jV5aJLpZ6e57ILLOp3jiiv+pdPHB5h+dhrAoEs6e112wWXcteGursfo3E2fv6nrEQbiNXRJaoRBl6RGeMlFM6bugsce6HaGZzbN/HrXx7qd4/LrYeLs+BZbms2ga8ZjD8Azj8FbLu9shHsvfrCzx37VM4/N/GrQdRYy6Pp/b7kcbnqo6ym6dde7u55Aet28hi5JjTDoktQIgy5JjTDoktQIgy5JjTDoktQIgy5JjTDoktQIgy5JjTDoktQIgy5JjTDoktQIgy5JjTDoktQIgy5JjTDoktQIgy5JjTDoktQIgy5JjRgo6Ek2JJlOcjDJ9nmO/2iSe3vHH0myduiTSpIW1DfoSZYBO4GNwDiwJcn4nGU3A89V1U8Bfw58fNiDSpIWNsgZ+pXAwao6VFWvALuBTXPWbAI+07v9APAbSTK8MSVJ/aSqFl6QXA9sqKoP9rbfD1xVVdtmrfn33prDve3/6K359pz72gps7W1eBkwP6zciSSPikqoam+/A8sWcoqp2AbsW8zElaVQMcsnlCLBm1vbq3r551yRZDvw48N/DGFCSNJhBgr4XWJfk0iQrgM3Anjlr9gAf6N2+Hvin6nctR5I0VH0vuVTViSTbgIeBZcCnq2p/kh3AVFXtAe4E7k5yEHiWmehLkhZR3ydFJUlnB18pKkmNMOiS1AiDLkmNGMmg99575s4k30zy3ST7kmzseq6uJLkgyd8leaH3Z3Jj1zN1Icm2JFNJXk7y113P06Ukn01yNMnxJE8k+WDXM3UtybokLyX5bNeznMqivrBoCVkOPA1cDTwFXAvcl+Tyqnqyy8E6shN4BXgzsB54KMnXqmp/p1Mtvv8CPga8C/ixjmfp2p8CN1fVy0l+Gvhikq9W1aNdD9ahncz8GPeSNZJn6FX1QlX9SVU9WVU/qKp/AP4T+MWuZ1tsSc4F3gP8cVU9X1VfYuZ1Be/vdrLFV1Wf
q6q/xxfFUVX7q+rlk5u9r7d2OFKnkmwG/gf4QsejLGgkgz5XkjcDbwNG7YwUZn7fJ6rqiVn7vgb8bEfzaIlI8hdJXgS+DhwFJjseqRNJVgI7gFu7nqWfkQ96kh8B/gb4TFV9vet5OnAecHzOvu8A53cwi5aQqvowM/8f/CrwOeDlhf+LZn0UuPPkmw8uZSMd9CTnAHczc/14W5/lrXoeWDln30rgux3MoiWmqv63dxluNfChrudZbEnWA9cw8zkPS96oPilK7/3a72TmicBrq+r7HY/UlSeA5UnWVdU3evt+gdG8/KRTW85oXkN/J7AWeKr3EQ/nAcuSjFfV2zuca16jfIb+l8DPAL9ZVd/repiuVNULzHw7vSPJuUl+hZkPLLm728kWX5LlSd7EzHsWLUvypt67h46UJD+RZHOS85IsS/IuYAtL/AnBM2QXM/+Qre99/RXwEDM/CbXkjGTQk1wC3MLMX9AzSZ7vff12t5N15sPM/Jjet4C/BT40gj+yCHAb8D1gO/C+3u3bOp2oG8XM5ZXDwHPAnwG/33sjvpFSVS9W1TMnv5i5RPlSVR3rerb5+OZcktSIkTxDl6QWGXRJaoRBl6RGGHRJaoRBl6RGGHRJaoRBl6RGGHRJasT/AZzJ3WkYG0dvAAAAAElFTkSuQmCC\n"},"metadata":{"needs_background":"light"}}]},{"cell_type":"markdown","metadata":{"id":"uGx-U97PVJPq"},"source":["Παρατηρούμε ότι οι προτάσεις που βρίσκονται κοντά μεταξύ τους καταλήγουν και σε κοινά cluster. Θα προσπαθήσουμε να εφαρμόσουμε την τεχνική αυτή και σε ένα πραγματικό πρόβλημα με αληθινά κείμενα.\n"]},{"cell_type":"markdown","metadata":{"id":"uHZHhc7iaZUi"},"source":["\n","### 20 Newsgroups dataset\n","\n","Ως πραγματικό παράδειγμα θα χρησιμοποιήσουμε το [20 Newsgroups](http://qwone.com/~jason/20Newsgroups/) dataset, το οποίο υπάρχει και μέσα στο [sklearn](http://scikit-learn.org/stable/datasets/twenty_newsgroups.html). Για το clustering θα χρησιμοποιήσουμε τον ιεραρχικό αλγόριθμο που μελετήσαμε προηγουμένως."]},{"cell_type":"code","metadata":{"id":"01oYTGVYVJPr","executionInfo":{"status":"ok","timestamp":1668433808321,"user_tz":-120,"elapsed":14782,"user":{"displayName":"Giorgos Siolas","userId":"10127542075805046236"}}},"source":["from sklearn.datasets import fetch_20newsgroups\n","newsgroups = fetch_20newsgroups(subset='all')"],"execution_count":30,"outputs":[]},{"cell_type":"markdown","metadata":{"id":"G_PUNTJkcU90"},"source":["To 20 newsgroups είναι και αυτό ένα dataset για κατηγοριοποίηση ή ομαδοποίηση κειμένων. 
We print the categories of the texts:"]},{"cell_type":"code","metadata":{"id":"hGjtK1xabZ1d","colab":{"base_uri":"https://localhost:8080/"},"executionInfo":{"status":"ok","timestamp":1668433808322,"user_tz":-120,"elapsed":21,"user":{"displayName":"Giorgos Siolas","userId":"10127542075805046236"}},"outputId":"63ea062e-2493-46df-9155-acc0fe4885a9"},"source":["print(newsgroups.target_names)"],"execution_count":31,"outputs":[{"output_type":"stream","name":"stdout","text":["['alt.atheism', 'comp.graphics', 'comp.os.ms-windows.misc', 'comp.sys.ibm.pc.hardware', 'comp.sys.mac.hardware', 'comp.windows.x', 'misc.forsale', 'rec.autos', 'rec.motorcycles', 'rec.sport.baseball', 'rec.sport.hockey', 'sci.crypt', 'sci.electronics', 'sci.med', 'sci.space', 'soc.religion.christian', 'talk.politics.guns', 'talk.politics.mideast', 'talk.politics.misc', 'talk.religion.misc']\n"]}]},{"cell_type":"markdown","metadata":{"id":"pquHlH5tVJPw"},"source":["We will take 3 categories from this dataset and build a corpus from the first 5 texts of each category. 
For simplicity we choose 3 categories that are clearly distinct from one another."]},{"cell_type":"code","metadata":{"id":"Tq6P3Xd3VJPy","colab":{"base_uri":"https://localhost:8080/"},"executionInfo":{"status":"ok","timestamp":1668433808872,"user_tz":-120,"elapsed":564,"user":{"displayName":"Giorgos Siolas","userId":"10127542075805046236"}},"outputId":"b3d96310-c9e5-4c55-d168-bc2809611404"},"source":["import functools\n","\n","categ = ['alt.atheism', 'comp.graphics', 'rec.sport.baseball']\n","data = functools.reduce(lambda x,y: x+y, [fetch_20newsgroups(categories=[x], remove=('headers', 'footers'))['data'][:5] for x in categ])\n","print('Input shape:', len(data), '\\n')\n","print(data[0][:1000])\n","print(\"-----\")\n","print(data[13][:1000])"],"execution_count":32,"outputs":[{"output_type":"stream","name":"stdout","text":["Input shape: 15 \n","\n","In <16BA7103C3.I3150101@dbstu1.rz.tu-bs.de> I3150101@dbstu1.rz.tu-bs.de (Benedikt Rosenau) writes:\n","\n",">In article <1993Apr5.091258.11830@monu6.cc.monash.edu.au>\n",">darice@yoyo.cc.monash.edu.au (Fred Rice) writes:\n","> \n",">(Deletion)\n",">>>>Of course people say what they think to be the religion, and that this\n",">>>>is not exactly the same coming from different people within the\n",">>>>religion. There is nothing with there existing different perspectives\n",">>>>within the religion -- perhaps one can say that they tend to converge on\n",">>>>the truth.\n",">>\n",">>>My point is that they are doing a lot of harm on the way in the meantime.\n",">>>\n",">>>And that they converge is counterfactual, religions appear to split and\n",">>>diverge. Even when there might be a 'True Religion' at the core, the layers\n",">>>above determine what happens in practise, and they are quite inhumane\n",">>>usually.\n",">>>\n","> \n",">What you post then is supposed to be an answer, but I don't see what is has\n",">got to do with what I say.\n","> \n",">I will repeat it. 
Religions\n","-----\n","Disclaimer -- This is for fun.\n","\n","In my computerized baseball game, I keep track of a category called\n","\"stolen hits\", defined as a play made that \"an average fielder would not\n","make with average effort.\" Using the 1992 Defensive Averages posted\n","by Sherri Nichols (Thanks Sherri!), I've figured out some defensive stats\n","for the second basemen. Hits Stolen have been redefined as \"Plays Kurt\n","Stillwell would not have made.\"\n","\n","OK, I realize that's unfair. Kurt's probably the victim of pitching staff,\n","fluke shots, and a monster park factor. But let's put it this way: If we\n","replaced every second baseman in the NL with someone with Kurt's 57.6% out\n","making ability, how many extra hits would go by?\n","\n","To try and correlate it to reality a little more, I've calculated Net\n","Hits Stolen, based on the number of outs made compared to what a league\n","average fielder would make. By the same method I've calculated Net Double\n","Plays, and Net Extra Bases (doubles and triples let by).\n","\n","Finally, I throw all this int\n"]}]},{"cell_type":"markdown","metadata":{"id":"UhkPMD3lVJP2"},"source":["Note that by discarding the very rare terms we get rid of various rare strings such as emails, typos, \"strange\" symbols, etc.\n","\n","\n","### TfidfVectorizer and reducing the dimensionality of the VSM\n","\n","To preprocess the documents we will use sklearn's [TfidfVectorizer](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html) class. It can also accommodate all of the [preprocessing](http://scikit-learn.org/stable/modules/feature_extraction.html#customizing-the-vectorizer-classes) we did earlier (stopwords, stemming, lemmatization, etc.). 
It also accepts many additional parameters, such as `max_df=x`, which ignores terms that appear in more than a proportion `x` of the documents (i.e. words that are very frequent in this particular collection), and `min_df=y`, which ignores terms that appear in less than a proportion `y` of the documents (i.e. very rare terms). Both parameters also accept an integer, which is then read as an absolute number of documents. The example below uses `max_df=0.5`, `min_df=2` and `stop_words='english'`."]},{"cell_type":"code","metadata":{"id":"eGUujy-TVJP3","colab":{"base_uri":"https://localhost:8080/","height":355},"executionInfo":{"status":"ok","timestamp":1668433809398,"user_tz":-120,"elapsed":535,"user":{"displayName":"Giorgos Siolas","userId":"10127542075805046236"}},"outputId":"a86952e1-1149-44b5-bace-3857587b75bb"},"source":["from sklearn.feature_extraction.text import TfidfVectorizer\n","\n","print(\"Dimensions before optimizing TfidfVectorizer parameters\")\n","vectorizer = TfidfVectorizer()\n","tf_idf_array = vectorizer.fit_transform(data).toarray() # fit_transform returns a sparse matrix, so we convert it with .toarray()\n","print('TF-IDF array shape:', tf_idf_array.shape)\n","\n","print(\"Dimensions after optimizing TfidfVectorizer parameters\")\n","vectorizer = TfidfVectorizer(max_df=0.5, min_df=2, stop_words='english')\n","tf_idf_array = vectorizer.fit_transform(data).toarray() # fit_transform returns a sparse matrix, so we convert it with .toarray()\n","print('TF-IDF array shape:', tf_idf_array.shape)\n","Z = hierarchy.linkage(tf_idf_array, 'ward')\n","\n","labels = ['a'] * 5 + ['g'] * 5 + ['b'] * 5 # 'a' = atheism, 'g' = graphics, 'b' = baseball \n","plt.figure()\n","dn = hierarchy.dendrogram(Z, labels=labels, color_threshold=0) # draws a dendrogram of the hierarchical clustering result\n","\n","# colour each leaf label according to its true category\n","colors = {'a':'r', 'g':'g', 'b':'b'}\n","for l in plt.gca().get_xticklabels():\n","    l.set_color(colors[l.get_text()])\n","print"],"execution_count":33,"outputs":[{"output_type":"stream","name":"stdout","text":["Dimensions before optimizing TfidfVectorizer parameters\n","TF-IDF array shape: (15, 
1232)\n","Dimensions after optimizing TfidfVectorizer parameters\n","TF-IDF array shape: (15, 124)\n"]},{"output_type":"execute_result","data":{"text/plain":[""]},"metadata":{},"execution_count":33},{"output_type":"display_data","data":{"text/plain":["
"],"image/png":"iVBORw0KGgoAAAANSUhEUgAAAXQAAAD7CAYAAAB68m/qAAAAOXRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjUuMywgaHR0cHM6Ly9tYXRwbG90bGliLm9yZy/NK7nSAAAACXBIWXMAAAsTAAALEwEAmpwYAAAQaElEQVR4nO3de4wdZ3nH8e+POGnVlpCCFyn4gqPioLpcCl0CFVQEQVWHSrFo0zbmVlDAVYVRJVDVVNC4Du0fNOpFqKHB0NQtFblAuVjUEKQWGglImo1I08RRwJiLbaJ6SbhIIAhunv6xx3Tj7O45a8/ZPfP6+5EinZl59ZwnUc5P7868M5OqQpLUf49b7QYkSd0w0CWpEQa6JDXCQJekRhjoktSINav1xWvXrq1Nmzat1tdLUi/deeed36yqqYWOrVqgb9q0iZmZmdX6eknqpSRfW+yYp1wkqREGuiQ1wkCXpEYY6JLUCANdkhphoEtSIwx0SWqEgS5JjVi1G4ta9oHbv87H7jq62m2oAdt+cR2vfP7G1W5DPeEMfQw+dtdRDjzw3dVuQz134IHvOjHQsjhDH5Mt55/LTb/3y6vdhnrsd97z+dVuQT1joKt5fT0FduKvvL4Gu6eLVp6nXNS8vp4C23L+uWw5/9zVbuOUeLpodThD1xnBU2Arq69/VfTd0Bl6kuuTHEtyzxJjLk5yV5J7k/xHty1KkkYxyimXvcDWxQ4mOQ94N3BpVf0C8FuddCZJWpahgV5VtwIPLTHklcCHq+rrg/HHOupNkrQMXVwUvRD42SSfSXJnktcuNjDJjiQzSWZmZ2c7+GpJ0gldBPoa4JeAXwd+DfiTJBcuNLCq9lTVdFVNT00t+Eo8SdIp6mKVyxHgwar6HvC9JLcCzwa+2EHtZZuENceTsn7YdcDSmaWLGfrHgBclWZPkp4DnA/d1UPfUmpmANceTsH7YdcDSmWfoDD3JDcDFwNokR4BdwNkAVXVdVd2X5JPA3cAjwPuqatEljivBNcer/9eBpJU3NNCravsIY64BrumkI0nSKfHWf0lqhIEuSY0w0CWpEQa6JDXCQJekRhjoktQIA12SGmGgS1IjDHRJaoSBLkmNMNAlqRG+JFrqkUl4PPQoJuUR0sO09ohpA32VjesHOu4fVGs/hL448Xjo1X488zCT3h/8/2+kpf+PDfRVNq4f6Dh/UC3+EPrEx0N3Y9L/ejgVBvoE6NsPtMUfgtQCL4pKUiOGBnqS65McS7LkW4iSPC/J8SSXddeeJGlUo8zQ9wJblxqQ5CzgncCnOuhJknQKhgZ6Vd0KPDRk2JuBfwGOddGUJGn5TvuiaJJ1wCuAlwDPGzJ2B7ADYONGV0hIGs6lvaPr4qLo3wB/VFWPDBtYVXuqarqqpqempjr4akmtO7G0t2tbzj93bMt7Dzzw3VW5AayLZYvTwI1JANYCL09yvKo+2kFtSXJp74hOO9Cr6oITn5PsBT5umEvSyhsa6EluAC4G1iY5AuwCzgaoquvG2p0kTYjlnMtf7vn5rs63Dw30qto+arGqet1pdSNJE2o5j+lYzrn5Lh+l4a3/0grpYrVGFyszfLDaqRvHufwuz7d767+0QrpYrXG6KzNWa/WFVoYzdGkFrfZqDR+s1jZn6JLUCANdkhphoEtSIwx0SWqEgS5JjTDQJakRLlsU0I/bmiUtzRm6gOXd9LKcm1u8kUVaOc7Q9WOTfluzpKU5Q5ekRhjoktQIA12SGmGgS1IjhgZ6kuuTHEtyzyLHX5Xk7iT/neRzSZ7dfZuSpGFGmaHvBbYucfwrwIur6pnAO4A9HfQlSVqmUV5Bd2uSTUsc/9y8zduA9R30Jalh3sg2Hl2vQ78C+MRiB5PsAHYAbNx4Zv4Hl7o2znCE8QRkH97P2UedBXqSlzAX6C9abExV7WFwSmZ6erq6+m7pTDaucITxBqQ3snWvk0BP8izgfcAlVfVgFzUljW5cr7Y70wOyb057
2WKSjcCHgddU1RdPvyVJ0qkYOkNPcgNwMbA2yRFgF3A2QFVdB1wFPAl4dxKA41U1Pa6GJUkLG2WVy/Yhx98AvKGzjiRJp8Q7RSWpERP/+NzlLMkC16xKOnNN/Ax9OS9eAF++IOnMNfEzdHBJliSNYuJn6JKk0RjoktQIA12SGtGLc+jSyfr4QCpp3Jyhq5eWs/ppOSufwNVP6i9n6OotVz9Jj+YMXZIaYaBLUiMMdElqhIEuSY0w0CWpEQa6JDViaKAnuT7JsST3LHI8Sd6V5GCSu5M8t/s2JUnDjDJD3wtsXeL4JcDmwT87gL87/bYkScs1NNCr6lbgoSWGbAP+qebcBpyX5PyuGpQkjaaLc+jrgMPzto8M9j1Gkh1JZpLMzM7OdvDVkqQTVvSiaFXtqarpqpqemppaya+WpOZ18SyXo8CGedvrB/vOWON8EqBPAZS0mC5m6PuA1w5Wu7wA+E5VPdBB3d4a15MAfQqgpKUMnaEnuQG4GFib5AiwCzgboKquA/YDLwcOAt8HXj+uZvtkHE8C9CmAkpYyNNCravuQ4wW8qbOOJEmnxDtFJakRBrokNcJAl6RGGOiS1AgDXZIaYaBLUiMMdElqhIEuSY0w0CWpEQa6JDXCQJekRhjoktQIA12SGmGgS1IjDHRJasRIgZ5ka5L7kxxMcuUCxzcm+XSSLyS5O8nLu29VkrSUoYGe5CzgWuASYAuwPcmWk4a9Hbi5qp4DXA68u+tGJUlLG2WGfhFwsKoOVdXDwI3AtpPGFHDixZhPAL7RXYuSpFGMEujrgMPzto8M9s33p8CrB+8c3Q+8eaFCSXYkmUkyMzs7ewrtSpIW09VF0e3A3qpaz9wLo9+f5DG1q2pPVU1X1fTU1FRHXy1JgtEC/SiwYd72+sG++a4Abgaoqs8DPwms7aJBSdJoRgn0O4DNSS5Icg5zFz33nTTm68BLAZL8PHOB7jkVSVpBQwO9qo4DO4FbgPuYW81yb5Krk1w6GPZW4I1J/gu4AXhdVdW4mpYkPdaaUQZV1X7mLnbO33fVvM8HgBd225okaTm8U1SSGmGgS1IjDHRJaoSBLkmNMNAlqREGuiQ1wkCXpEYY6JLUCANdkhphoEtSIwx0SWqEgS5JjTDQJakRBrokNcJAl6RGGOiS1IiRAj3J1iT3JzmY5MpFxvx2kgNJ7k3ygW7blCQNM/SNRUnOAq4FfhU4AtyRZN/gLUUnxmwG/hh4YVV9K8mTx9WwJGlho8zQLwIOVtWhqnoYuBHYdtKYNwLXVtW3AKrqWLdtSpKGGSXQ1wGH520fGeyb70LgwiSfTXJbkq0LFUqyI8lMkpnZ2dlT61iStKCuLoquATYDFwPbgfcmOe/kQVW1p6qmq2p6amqqo6+WJMFogX4U2DBve/1g33xHgH1V9aOq+grwReYCXpK0QkYJ9DuAzUkuSHIOcDmw76QxH2Vudk6StcydgjnUXZuSpGGGBnpVHQd2ArcA9wE3V9W9Sa5Oculg2C3Ag0kOAJ8G/rCqHhxX05Kkxxq6bBGgqvYD+0/ad9W8zwW8ZfCPJGkVeKeoJDXCQJekRhjoktQIA12SGmGgS1IjDHRJaoSBLkmNMNAlqREGuiQ1wkCXpEYY6JLUCANdkhphoEtSIwx0SWqEgS5JjTDQJakRIwV6kq1J7k9yMMmVS4z7zSSVZLq7FiVJoxga6EnOAq4FLgG2ANuTbFlg3OOBPwBu77pJSdJwo8zQLwIOVtWhqnoYuBHYtsC4dwDvBH7QYX+SpBGNEujrgMPzto8M9v1YkucCG6rqX5cqlGRHkpkkM7Ozs8tuVpK0uNO+KJrkccBfAW8dNraq9lTVdFVNT01Nne5XS5LmGSXQjwIb5m2vH+w74fHAM4DPJPkq8AJgnxdGJWlljRLodwCbk1yQ5BzgcmDfiYNV9Z2qWltVm6pqE3AbcGlVzYylY0nSgoYGelUdB3YCtwD3ATdX1b1Jrk5y6bgblCSNZs0og6pqP7D/
pH1XLTL24tNvS5K0XN4pKkmNMNAlqREGuiQ1wkCXpEYY6JLUCANdkhphoEtSIwx0SWqEgS5JjTDQJakRBrokNcJAl6RGGOiS1AgDXZIaYaBLUiMMdElqxEiBnmRrkvuTHExy5QLH35LkQJK7k/xbkqd236okaSlDAz3JWcC1wCXAFmB7ki0nDfsCMF1VzwI+BPxF141KkpY2ygz9IuBgVR2qqoeBG4Ft8wdU1aer6vuDzduA9d22KUkaZpRAXwccnrd9ZLBvMVcAn1joQJIdSWaSzMzOzo7epSRpqE4viiZ5NTANXLPQ8araU1XTVTU9NTXV5VdL0hlvzQhjjgIb5m2vH+x7lCQvA94GvLiqfthNe5KkUY0yQ78D2JzkgiTnAJcD++YPSPIc4D3ApVV1rPs2JUnDDA30qjoO7ARuAe4Dbq6qe5NcneTSwbBrgJ8BPpjkriT7FiknSRqTUU65UFX7gf0n7btq3ueXddyXJGmZvFNUkhphoEtSIwx0SWqEgS5JjTDQJakRBrokNcJAl6RGGOiS1AgDXZIaYaBLUiMMdElqhIEuSY0w0CWpEQa6JDXCQJekRhjoktSIkQI9ydYk9yc5mOTKBY7/RJKbBsdvT7Kp804lSUsaGuhJzgKuBS4BtgDbk2w5adgVwLeq6mnAXwPv7LpRSdLSRpmhXwQcrKpDVfUwcCOw7aQx24B/HHz+EPDSJOmuTUnSMKmqpQcklwFbq+oNg+3XAM+vqp3zxtwzGHNksP3lwZhvnlRrB7BjsPl04P6u/kUk6Qzx1KqaWujASC+J7kpV7QH2rOR3StKZYpRTLkeBDfO21w/2LTgmyRrgCcCDXTQoSRrNKIF+B7A5yQVJzgEuB/adNGYf8LuDz5cB/17DzuVIkjo19JRLVR1PshO4BTgLuL6q7k1yNTBTVfuAvwfen+Qg8BBzoS9JWkFDL4pKkvrBO0UlqREGuiQ1wkDXWCR8NeFlfardt7rj1MeeZaBLUjMMdElqxIreKbpsc092fCPwZOAw8DaqPjLRtftWd7y1n5fwLuB84KPA71fxgw7qjrN2r+pmd57L3LLhpwGfBB4BvlS76u2nW5ue9dy3uuOoPekz9C8Dv8Lcnae7gX8mOX/Ca/et7jhrvwr4NeDngAuBLkJm3LV7Uze7cw7wEWAv8ETgBuAVp1t3nt703Le646o92YFe9UGqvkHVI1TdBHyJuac/Tm7tvtUdb+2/reJwFQ8Bfw5s76DmuGv3qe4LmPsr+121q35Uu+rDwH92UPeEPvXct7pjqT3ZgZ68luQukm+TfBt4BrB2omv3re54ax+e9/lrwFM6qDnu2n2q+xTgaO161N2BhxcbfAr61HPf6o6l9uQGevJU4L3ATuBJVJ0H3AOc/nPWx1W7b3XHXfvRD3XbCHyjg5rjrt2nug8A67L7Ue8e2LDY4FPQp577VncstSc30OGngQJmAUhez9zMcZJr963uuGu/KWF9whOBtwE3dVR3nLX7VPfzwP8CO7M7a7I72+jqNNycPvXct7pjqT25gV51APhL5v6l/wd4JvDZia7dt7rjrg0fAD4FHGLuwuufdVR3nLV7U7d21cPAbzD3CshvA68GPg788HRrD/Sm577VHVdtH84lNSS7cztwXe2qf1jtXkY1rp77VreL2pO9Dl3SkrI7L2buVY7fZG6Z4bOYW888scbVc9/qjqO2gS7129OBm5m7FnIIuKx21QOr29JQ4+q5b3U7r+0pF0lqxOReFJUkLYuBLkmNMNAlqREGuiQ1wkCXpEb8HwWeTMfcGV1jAAAAAElFTkSuQmCC\n"},"metadata":{"needs_background":"light"}}]},{"cell_type":"markdown","metadata":{"id":"hbstBcfgVJP6"},"source":["Παρατηρούμε ότι όντως τοποθετεί με αρκετά καλή ακρίβεια τα κλαδιά που 
περιέχουν κείμενα από την ίδια κατηγορία. Επίσης ο αλγόριθμος αυτός μπορεί να εντοπίσει και ιεραρχίες εντός της κάθε ομάδας. (Σημ. σε ένα πραγματικό unsupervised πρόβλημα τα χρώματα και τα label στον άξονα x **δεν** είναι διαθέσιμα).\n"]},{"cell_type":"markdown","metadata":{"id":"19cZxmnnhtu3"},"source":["\n","Note also the huge effect that the TfidfVectorizer parameters have on the dimensionality of the vectors (reduced to 1/10). In practice we try to shrink the dimensionality as much as possible, up to the point where quality (of classification, clustering, etc.) starts to drop.\n"]},{"cell_type":"markdown","metadata":{"id":"axdk56-nhvyn"},"source":["\n","### k-Means and the number of clusters\n","\n","We will also try **k-means** on the same problem, with more documents. First we load the documents..."]},{"cell_type":"code","metadata":{"id":"yiUI7OoLVJP6","colab":{"base_uri":"https://localhost:8080/"},"executionInfo":{"status":"ok","timestamp":1668433810644,"user_tz":-120,"elapsed":1251,"user":{"displayName":"Giorgos Siolas","userId":"10127542075805046236"}},"outputId":"43584457-134a-4a88-de24-0cf63862db76"},"source":["categ = ['alt.atheism', 'comp.graphics', 'rec.sport.baseball']\n","# keep the first 100 documents of each category, concatenated in category order\n","data = functools.reduce(lambda x,y: x+y, [fetch_20newsgroups(categories=[x], remove=('headers', 'footers'))['data'][:100] for x in categ])\n","print('Total documents:', len(data))"],"execution_count":34,"outputs":[{"output_type":"stream","name":"stdout","text":["Total documents: 300\n"]}]},{"cell_type":"markdown","metadata":{"id":"PVXBumV-VJP9"},"source":["Next we apply the preprocessing and run k-means for several values of k to find the best one."]},{"cell_type":"code","metadata":{"id":"fgJO85EcVJP-","executionInfo":{"status":"ok","timestamp":1668433812067,"user_tz":-120,"elapsed":1426,"user":{"displayName":"Giorgos Siolas","userId":"10127542075805046236"}}},"source":["from sklearn.cluster import KMeans\n","from sklearn.metrics import 
silhouette_score\n","tf_idf_array = vectorizer.fit_transform(data)\n","\n","# average silhouette score for each candidate k in 2..9\n","silhouette_scores = []\n","for k in range(2, 10):\n","    km = KMeans(k)\n","    preds = km.fit_predict(tf_idf_array)\n","    silhouette_scores.append(silhouette_score(tf_idf_array, preds))"],"execution_count":35,"outputs":[]},{"cell_type":"markdown","metadata":{"id":"b4FUiTrfVJQA"},"source":["We plot the silhouette curve and pick the k with the highest average silhouette score. This k is our estimate of the number of categories the documents belong to."]},{"cell_type":"code","metadata":{"id":"vrnA-CPuVJQB","colab":{"base_uri":"https://localhost:8080/","height":284},"executionInfo":{"status":"ok","timestamp":1668433812070,"user_tz":-120,"elapsed":28,"user":{"displayName":"Giorgos Siolas","userId":"10127542075805046236"}},"outputId":"cca606d6-98ad-4caa-c024-66f22cc8d222"},"source":["plt.plot(range(2, 10), silhouette_scores)\n","best_k = np.argmax(silhouette_scores) + 2 # +2 because range() starts at k=2, whereas list indices start at 0\n","plt.scatter(best_k, silhouette_scores[best_k-2], color='r') # for the same reason, the score for best_k sits at list index best_k-2\n","plt.xlim([2,9])\n","plt.annotate(\"best k\", xy=(best_k, silhouette_scores[best_k-2]), xytext=(5, silhouette_scores[best_k-2]),arrowprops=dict(arrowstyle=\"->\")) # annotation\n","print('Maximum average silhouette score for k =', best_k)"],"execution_count":36,"outputs":[{"output_type":"stream","name":"stdout","text":["Maximum average silhouette score for k = 3\n"]},{"output_type":"display_data","data":{"text/plain":["
"],"image/png":"\n"},"metadata":{"needs_background":"light"}}]},{"cell_type":"markdown","metadata":{"id":"5PmC1zVwVJQG"},"source":["Με το κριτήριο silhouette βρήκαμε 3 cluster στα κείμενά μας, όσες κατηγορίες είχαμε και αρχικά.\n","Ας εκτυπώσουμε τις ετικέτες που μας δίνει ο k-means:"]},{"cell_type":"code","metadata":{"id":"00e7BytwVJQG","colab":{"base_uri":"https://localhost:8080/"},"executionInfo":{"status":"ok","timestamp":1668433812071,"user_tz":-120,"elapsed":22,"user":{"displayName":"Giorgos Siolas","userId":"10127542075805046236"}},"outputId":"867eccb0-1874-4f1e-d944-3060676499ae"},"source":["km = KMeans(best_k)\n","km.fit(tf_idf_array)\n","print(km.labels_)"],"execution_count":37,"outputs":[{"output_type":"stream","name":"stdout","text":["[2 2 2 2 2 2 2 2 2 0 2 2 2 2 2 2 2 2 2 1 1 2 2 2 1 2 2 2 1 2 2 2 2 2 2 2 2\n"," 1 2 2 2 2 2 2 2 2 1 2 1 2 2 2 2 1 2 2 2 1 2 2 2 1 1 1 2 2 2 1 2 2 2 2 1 0\n"," 1 2 2 2 2 2 2 2 1 1 2 2 2 1 2 2 2 2 1 2 2 2 2 2 2 2 1 1 1 1 1 1 1 1 1 1 1\n"," 1 1 1 1 1 1 1 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1\n"," 1 1 1 1 1 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1\n"," 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 0 0 0 0 0 2 0 1 0 2 0 0 0 1 1 1 0 0 0 0\n"," 0 0 0 0 0 1 0 0 0 0 1 0 0 0 1 1 1 0 1 0 1 0 1 1 0 0 1 0 0 1 0 0 0 0 0 0 0\n"," 0 0 1 0 0 0 0 1 1 0 1 1 0 0 0 1 0 0 0 0 0 1 0 0 0 0 1 0 0 1 2 0 1 0 1 0 0\n"," 1 1 0 0]\n"]}]},{"cell_type":"markdown","metadata":{"id":"qlwgP3UfVJQN"},"source":["Ξέρουμε ότι στο σύνολό μας, τα πρώτα 100 κείμενα ανήκουν στην 1η κατηγορία, τα επόμενα 100 στη δεύτερη, κτλ. Από τις παραπάνω προβλέψεις βλέπουμε ότι τα έχει πάει αρκετά καλά ο k-means. Σημειώστε ότι το label δεν έχει σημασία. 
\n","To see what each cluster is about, we can find the top terms of each group."]},{"cell_type":"code","metadata":{"id":"oDPxSPMmVJQO","colab":{"base_uri":"https://localhost:8080/"},"executionInfo":{"status":"ok","timestamp":1668433812073,"user_tz":-120,"elapsed":16,"user":{"displayName":"Giorgos Siolas","userId":"10127542075805046236"}},"outputId":"3720432f-7775-435a-da2c-6e1627046e96"},"source":["terms = vectorizer.get_feature_names_out()\n","# feature indices of each centroid, sorted by descending weight\n","order_centroids = km.cluster_centers_.argsort()[:, ::-1]\n","for i in range(best_k):\n","    out = \"Cluster %d:\" % i\n","    # the 20 highest-weighted terms of centroid i\n","    for ind in order_centroids[i, :20]:\n","        out += ' %s' % terms[ind]\n","    print(out)"],"execution_count":38,"outputs":[{"output_type":"stream","name":"stdout","text":["Cluster 0: year team game edu better games hitter like good hit baseball season play pitching cubs article clemens win alomar fans\n","Cluster 1: thanks image graphics know edu files hi does file mail program article help just information book card use looking com\n","Cluster 2: god edu people think don article atheism believe com say just does exist islam argument wrong bible moral objective morality\n"]}]},{"cell_type":"markdown","metadata":{"id":"Z-C8XiNYVJQR"},"source":["We see that the terms are indeed related to the content of the documents. 
We can also fit and print more clusters and check that they too are, for the most part, semantically coherent."]},{"cell_type":"code","metadata":{"id":"Wu2sqzOdVJQS","colab":{"base_uri":"https://localhost:8080/"},"executionInfo":{"status":"ok","timestamp":1668433812628,"user_tz":-120,"elapsed":18,"user":{"displayName":"Giorgos Siolas","userId":"10127542075805046236"}},"outputId":"9c69e460-5f2f-4ece-cd77-44ef3d5542bf"},"source":["km = KMeans(10)\n","km.fit(tf_idf_array)\n","order_centroids = km.cluster_centers_.argsort()[:, ::-1]\n","for i in range(10):\n","    out = \"Cluster %d:\" % i\n","    for ind in order_centroids[i, :20]:\n","        out += ' %s' % terms[ind]\n","    print(out)"],"execution_count":39,"outputs":[{"output_type":"stream","name":"stdout","text":["Cluster 0: book information copy baseball lost sent image mail appreciated steve imaging rules sun rule tiff number box appreciate mormon local\n","Cluster 1: graphics image hi thanks program looking know color vga code card file help driver use appreciated package windows points comp\n","Cluster 2: year edu game better season games clemens good article players team cubs sox hit league time know just alomar new\n","Cluster 3: islam jaeger bu islamic muslim rushdie gregg muslims edu uk god article law buphy laws khomeini true bcci book religion\n","Cluster 4: moral morality edu psuvm think natural uiuc species team immoral psu don win examples wrong saying humans fellow cso ico\n","Cluster 5: god atheism edu exist believe atheists don existence value people atheist think just mean say bible article ultb rit isc\n","Cluster 6: thanks does mike play position lansing advance texas bsa line colorado plus transform baseman address accepted algorithm super add cold\n","Cluster 7: come edu jewish john hitter article mary maine baseball joseph somebody got maybe knew stadium just does colorado gov msstate\n","Cluster 8: files com thanks article file read pd just look hdf edu gl newsgroup says faq people use viewer love 
earth\n","Cluster 9: people mathew don like com war say ellipse religion liar good uk mantis better commit bible koresh die yes crime\n"]}]},{"cell_type":"markdown","metadata":{"id":"fIMe-MMjjBXn"},"source":["Note that the presence of terms such as edu, use, thanks, which either appear everywhere or clearly carry little semantic value, shows that we could do even better preprocessing and vector representation. "]}]}