{"nbformat":4,"nbformat_minor":0,"metadata":{"kernelspec":{"display_name":"Python 3","language":"python","name":"python3"},"language_info":{"codemirror_mode":{"name":"ipython","version":3},"file_extension":".py","mimetype":"text/x-python","name":"python","nbconvert_exporter":"python","pygments_lexer":"ipython3","version":"3.6.6"},"colab":{"provenance":[]}},"cells":[{"cell_type":"markdown","metadata":{"_uuid":"84163f3ca19c0b7c9fda47121b3bc4cadfaf1fcc","id":"Y2IO-nbZH30V"},"source":["# Gensim Word2Vec με τους Simpsons"]},{"cell_type":"markdown","metadata":{"_uuid":"7d96105f0c90bf052b2afdb684bf31549e1e6c81","id":"9VyTOfrtH30b"},"source":["# Ξεκινώντας\n","\n","## Ρύθμιση του περιβάλλοντος\n","\n","Βιβλιοθήκες:\n"," * `xlrd` https://pypi.org/project/xlrd/\n"," * `spaCy` https://spacy.io/usage/\n"," * `gensim` https://radimrehurek.com/gensim/install.html\n"," * `scikit-learn` http://scikit-learn.org/stable/install.html\n"," * `seaborn` https://seaborn.pydata.org/installing.html"]},{"cell_type":"code","metadata":{"_uuid":"cc7b3e6ca62670ff13626705402f626778487204","id":"47mkSHRcH30c","executionInfo":{"status":"ok","timestamp":1668436902722,"user_tz":-120,"elapsed":8352,"user":{"displayName":"Giorgos Siolas","userId":"10127542075805046236"}}},"source":["import re # For preprocessing\n","import pandas as pd # For data handling\n","from time import time # To time our operations\n","from collections import defaultdict # For word frequency\n","\n","import spacy # For preprocessing\n","\n","import logging # Setting up the loggings to monitor gensim\n","logging.basicConfig(format=\"%(levelname)s - %(asctime)s: %(message)s\", datefmt= '%H:%M:%S', level=logging.INFO)"],"execution_count":1,"outputs":[]},{"cell_type":"markdown","metadata":{"_uuid":"8ec7ce784f6d9e7f71e2b5789b1e65ec4414628b","id":"ghy0vFnUH30e"},"source":["\"drawing\"\n","\n","## Τα δεδομένα\n","\n","Θα δουλέψουμε με το σύνολο δεδομένων των σεναρίων των \"The Simpsons\". 
Αυτό το σύνολο δεδομένων περιέχει τους χαρακτήρες, τις τοποθεσίες, τις λεπτομέρειες του επεισοδίου και τις γραμμές του σεναρίου για περίπου 600 επεισόδια των Simpsons, που χρονολογούνται από το 1989. Μπορείτε να το βρείτε [εδώ](https://www.kaggle.com/ambarish/fun-in-text-mining-with-simpsons/data) (~25MB).\n","\n","Κάντε το unzip. Θα χρειαστούμε το simpsons_script_lines.csv."]},{"cell_type":"markdown","metadata":{"_uuid":"0c36323d9aa62f74ab348cda5ee0f571aa1d4a96","id":"rb_RLTJbH30f"},"source":["# Προεπεξεργασία\n","\n","Θα κρατήσουμε μόνο δύο κολόνες:\n","* `raw_character_text`: the character who speaks (can be useful when monitoring the preprocessing steps)\n","* `spoken_words`: the raw text from the line of dialogue\n","\n","Δεν θα κρατήσουμε την κολόνα `normalized_text` γιατί θα κάνουμε τη δική μας προεπεξεργασία.\n","\n","Μπορείτε να βρείτε αυτή τη μορφή του dataset [εδώ](https://www.kaggle.com/pierremegret/dialogue-lines-of-the-simpsons).\n","\n","Κατεβάστε το csv και κάντε το upload στο colab από το αριστερό sidebar. 
Εναλλακτικά, δουλέψτε απευθείας στο Kaggle."]},{"cell_type":"code","metadata":{"_uuid":"6453b9c3f797e51923e030090ead659253f4e459","id":"s0tylY3QH30g","colab":{"base_uri":"https://localhost:8080/","height":331},"executionInfo":{"status":"error","timestamp":1668436903254,"user_tz":-120,"elapsed":540,"user":{"displayName":"Giorgos Siolas","userId":"10127542075805046236"}},"outputId":"5faaaac8-91b6-4148-e277-e72fa03c39a3"},"source":["df = pd.read_csv('simpsons_script_lines.csv')\n","df.shape"],"execution_count":2,"outputs":[]},{"cell_type":"code","metadata":{"_uuid":"c6c6bf4462fb4bc00c2abdbf65eced888219f364","id":"A0-SzXKgH30h","executionInfo":{"status":"aborted","timestamp":1668436903256,"user_tz":-120,"elapsed":16,"user":{"displayName":"Giorgos Siolas","userId":"10127542075805046236"}}},"source":["df.head()"],"execution_count":null,"outputs":[]},{"cell_type":"markdown","metadata":{"_uuid":"caf6838463d56f79e92d87d5a3827fcd5f04fc54","id":"TRtqYuuEH30i"},"source":["Οι απουσιάζουσες τιμές προέρχονται από μέρη του σεναρίου όπου συμβαίνει κάτι, αλλά χωρίς διάλογο. Για παράδειγμα \"(Δημοτικό σχολείο του Σπρίνγκφιλντ: ΕΞΩΤ. 
ΔΗΜΟΤΙΚΌ - ΑΥΛΉ ΣΧΟΛΕΊΟΥ - ΑΠΌΓΕΥΜΑ)\""]},{"cell_type":"code","metadata":{"_uuid":"3a15727caeba1c8d10573456640d0b8b9f2f2e2d","id":"19RwZMxXH30j","executionInfo":{"status":"aborted","timestamp":1668436903258,"user_tz":-120,"elapsed":18,"user":{"displayName":"Giorgos Siolas","userId":"10127542075805046236"}}},"source":["df.isnull().sum()"],"execution_count":null,"outputs":[]},{"cell_type":"markdown","metadata":{"_uuid":"885a555596d7484841ea54c94405d03d90572396","id":"RtFyLxMtH30k"},"source":["Αφαιρούμε τις απουσιάζουσες τιμές:"]},{"cell_type":"code","metadata":{"_uuid":"82cb38f176526679f66ee31e11cfe4f5eebdab51","id":"3cUnQpm0H30k","executionInfo":{"status":"aborted","timestamp":1668436903259,"user_tz":-120,"elapsed":19,"user":{"displayName":"Giorgos Siolas","userId":"10127542075805046236"}}},"source":["df = df.dropna().reset_index(drop=True)\n","df.isnull().sum()"],"execution_count":null,"outputs":[]},{"cell_type":"markdown","metadata":{"_uuid":"7f07dca2a2656dcd9e0c315afa36af32a992eef7","id":"N3SIfLl9H30l"},"source":["## Καθαρισμός\n","\n","Κάνουμε λημματοποίηση και αφαιρούμε τα stopwords και τους μη αλφαβητικούς χαρακτήρες για κάθε γραμμή διαλόγου."]},{"cell_type":"code","metadata":{"_uuid":"b26a0c01c5701630d3951cfc808a9d944eea6371","id":"MkGtPMRBH30l","executionInfo":{"status":"aborted","timestamp":1668436903261,"user_tz":-120,"elapsed":21,"user":{"displayName":"Giorgos Siolas","userId":"10127542075805046236"}}},"source":["#nlp = spacy.load('en', disable=['ner', 'parser']) # disabling Named Entity Recognition for speed\n","\n","nlp = spacy.load(\"en_core_web_sm\", disable=['ner', 'parser']) # disabling Named Entity Recognition for speed\n","\n","def cleaning(doc):\n"," # Lemmatizes and removes stopwords\n"," # doc needs to be a spacy Doc object\n"," txt = [token.lemma_ for token in doc if not token.is_stop]\n"," # Word2Vec uses context words to learn the vector representation of a target word,\n"," # if a sentence is only one or two words long,\n"," # the benefit 
for the training is very small\n"," if len(txt) > 2:\n"," return ' '.join(txt)"],"execution_count":null,"outputs":[]},{"cell_type":"markdown","metadata":{"_uuid":"f686964722eede40e5961cd232aee7b6dd587bd1","id":"9tipABmqH30m"},"source":["Αφαίρεση μη αλφαβητικών χαρακτήρων:"]},{"cell_type":"code","metadata":{"_uuid":"b45598934171607242ca7d50f8c5f7c91411aace","id":"uL2JU5iHH30m","executionInfo":{"status":"aborted","timestamp":1668436903263,"user_tz":-120,"elapsed":22,"user":{"displayName":"Giorgos Siolas","userId":"10127542075805046236"}}},"source":["brief_cleaning = (re.sub(\"[^A-Za-z']+\", ' ', str(row)).lower() for row in df['spoken_words'])"],"execution_count":null,"outputs":[]},{"cell_type":"markdown","metadata":{"_uuid":"2360160a7f326a32d56f2f18782d7ce2f4ac1def","id":"iRMy9QWLH30m"},"source":["Θα χρησιμοποιήσουμε τη μέθοδο spaCy .pipe() για να επιταχύνουμε τη διαδικασία καθαρισμού:"]},{"cell_type":"code","metadata":{"_uuid":"fa44ca458c970ca229426779e6ffcd46c2de313c","id":"QY4RfC_6H30n","executionInfo":{"status":"aborted","timestamp":1668436903268,"user_tz":-120,"elapsed":27,"user":{"displayName":"Giorgos Siolas","userId":"10127542075805046236"}}},"source":["t = time()\n","\n","txt = [cleaning(doc) for doc in nlp.pipe(brief_cleaning, batch_size=5000)]\n","\n","print('Time to clean up everything: {} mins'.format(round((time() - t) / 60, 2)))"],"execution_count":null,"outputs":[]},{"cell_type":"markdown","metadata":{"_uuid":"65ec76c60f9fe93bba6909f4e696f90e3e54710b","id":"XQvzAk3NH30n"},"source":["Βάζουμε τα αποτελέσματα σε ένα DataFrame για να αφαιρέσουμε απουσιάζουσες τιμές και διπλές εγγραφές:"]},{"cell_type":"code","metadata":{"_uuid":"57f1eb8382554bc592d48915a903230b5b6d6cf7","id":"A-zr33O3H30n","executionInfo":{"status":"aborted","timestamp":1668436903269,"user_tz":-120,"elapsed":12031,"user":{"displayName":"Giorgos Siolas","userId":"10127542075805046236"}}},"source":["df_clean = pd.DataFrame({'clean': txt})\n","df_clean = 
df_clean.dropna().drop_duplicates()\n","df_clean.shape"],"execution_count":null,"outputs":[]},{"cell_type":"markdown","metadata":{"_uuid":"31b4a744059df490ddb47ab6cdec008dc929ede3","id":"ckgULbT0H30o"},"source":["## N-grams\n","Χρησιμοποιούμε το πακέτο [Phrases](https://radimrehurek.com/gensim/models/phrases.html) του Gensim για να εξάγουμε αυτόματα συχνές φράσεις (n-grams).\n","\n","Για παράδειγμα, θέλουμε να μπορούμε να πιάνουμε οντότητες όπως \"mr_burns\" και \"bart_simpson\"."]},{"cell_type":"code","metadata":{"_uuid":"af6d420284a0ff7a7407d4c526754ffe850d6170","id":"kaJPs_izH30o","executionInfo":{"status":"aborted","timestamp":1668436903270,"user_tz":-120,"elapsed":12028,"user":{"displayName":"Giorgos Siolas","userId":"10127542075805046236"}}},"source":["from gensim.models.phrases import Phrases, Phraser"],"execution_count":null,"outputs":[]},{"cell_type":"markdown","metadata":{"_uuid":"788aec3c82788101db25d4ca6105ee133fecae7c","id":"M42FXbanH30o"},"source":["Επειδή το `Phrases()` λαμβάνει ως είσοδο μια λίστα λιστών:"]},{"cell_type":"code","metadata":{"_uuid":"f58487ff08d8812622fd7aef36139f1c850add18","id":"n-FT5VQKH30o","executionInfo":{"status":"aborted","timestamp":1668436904335,"user_tz":-120,"elapsed":32,"user":{"displayName":"Giorgos Siolas","userId":"10127542075805046236"}}},"source":["sent = [row.split() for row in df_clean['clean']]"],"execution_count":null,"outputs":[]},{"cell_type":"markdown","metadata":{"_uuid":"bb7766b322cbc1d3381912b890585eb249ac5304","id":"W70bRsiCH30p"},"source":["Δημιουργούμε τις φράσεις που μας ενδιαφέρουν από τη λίστα των προτάσεων:"]},{"cell_type":"code","metadata":{"_uuid":"8befad8c76c54bd2b831b0942a2f626f7d8a6dac","id":"UeJfSIcBH30p","executionInfo":{"status":"aborted","timestamp":1668436904336,"user_tz":-120,"elapsed":32,"user":{"displayName":"Giorgos Siolas","userId":"10127542075805046236"}}},"source":["phrases = Phrases(sent, min_count=30, 
progress_per=10000)"],"execution_count":null,"outputs":[]},{"cell_type":"markdown","metadata":{"_uuid":"45bae4a953f2ad8951e4efb234e1e357857a33b3","id":"nBiDcxDDH30p"},"source":["Ο στόχος της Phraser() είναι να περιορίσει τη χρήση μνήμης του Phrases(), απορρίπτοντας την κατάσταση του μοντέλου που δεν μας χρειάζεται στο συγκεκριμένο task:"]},{"cell_type":"code","metadata":{"_kg_hide-input":true,"_uuid":"b8ae81ba230013aefe7c584338de7376fedf6294","id":"NjLP_-HgH30q","executionInfo":{"status":"aborted","timestamp":1668436904338,"user_tz":-120,"elapsed":33,"user":{"displayName":"Giorgos Siolas","userId":"10127542075805046236"}}},"source":["ngram = Phraser(phrases)"],"execution_count":null,"outputs":[]},{"cell_type":"markdown","metadata":{"_uuid":"4a58380f19d159688aeee665d1afb96289fdd4b8","id":"HCjeXg_UH30q"},"source":["Μετατρέπουμε το σώμα κειμένου στα ngrams που έχουν ανιχνευθεί:"]},{"cell_type":"code","metadata":{"_uuid":"8051b56890c147119db3df529d3cfd3cf675fdca","id":"wHszRHuPH30q","executionInfo":{"status":"aborted","timestamp":1668436904339,"user_tz":-120,"elapsed":34,"user":{"displayName":"Giorgos Siolas","userId":"10127542075805046236"}}},"source":["sentences = ngram[sent]"],"execution_count":null,"outputs":[]},{"cell_type":"markdown","metadata":{"_uuid":"a4f81e8bb2c09a67b00cd24db28353eca8ae188c","id":"EBKMFAi-H30q"},"source":["## Πιο συχνές λέξεις:\n","\n","\n","Κυρίως ένας έλεγχος ορθότητας της αποτελεσματικότητας της λημματοποίησης, της αφαίρεσης των stopwords και της προσθήκης ngrams."]},{"cell_type":"code","metadata":{"_uuid":"eeb8afe1cfcb7ba65bd14d657455600acacf39ba","id":"ZQV5j6KVH30r","executionInfo":{"status":"aborted","timestamp":1668436904340,"user_tz":-120,"elapsed":34,"user":{"displayName":"Giorgos Siolas","userId":"10127542075805046236"}}},"source":["word_freq = defaultdict(int)\n","for sent in sentences:\n"," for i in sent:\n"," word_freq[i] += 
1\n","len(word_freq)"],"execution_count":null,"outputs":[]},{"cell_type":"code","metadata":{"_uuid":"5b010149150b2b2eaf332d79bcde0649b8a3c2b5","id":"scNkI6EIH30r","executionInfo":{"status":"aborted","timestamp":1668436904341,"user_tz":-120,"elapsed":35,"user":{"displayName":"Giorgos Siolas","userId":"10127542075805046236"}}},"source":["sorted(word_freq, key=word_freq.get, reverse=True)[:10]"],"execution_count":null,"outputs":[]},{"cell_type":"markdown","metadata":{"_uuid":"500ab7b5c84dc006d7945f339c40725a82856fdf","id":"yH86lYmPH30r"},"source":["# Εκπαίδευση του μοντέλου\n","## Υλοποίηση Word2Vec της Gensim\n","\n","Θα χρησιμοποιήσουμε την υλοποίηση [Word2Vec](https://radimrehurek.com/gensim/models/word2vec.html) του GenSim. "]},{"cell_type":"code","metadata":{"_uuid":"3269be205cadbad499aa87890893d92da6adc796","id":"H2wz8WQQH30r","executionInfo":{"status":"aborted","timestamp":1668436904342,"user_tz":-120,"elapsed":36,"user":{"displayName":"Giorgos Siolas","userId":"10127542075805046236"}}},"source":["import multiprocessing\n","\n","from gensim.models import Word2Vec"],"execution_count":null,"outputs":[]},{"cell_type":"markdown","metadata":{"_uuid":"7c524bc49c41a6c37f9e754a38797c9501202090","id":"t99YtMVQH30s"},"source":["## Διαχωρισμός της εκπαίδευσης του μοντέλου σε 3 βήματα\n","\n","Διαχωρίζουμε την εκπαίδευση σε 3 διακριτά βήματα για λόγους σαφήνειας και παρακολούθησης.\n","1. `Word2Vec()`: \n",">Σε αυτό το πρώτο βήμα, ρυθμίζουμε τις παραμέτρους του μοντέλου μία προς μία.
Δεν παρέχουμε την παράμετρο `sentences`, και επομένως αφήνουμε σκόπιμα το μοντέλο μη αρχικοποιημένο.\n","2. `.build_vocab()`: \n",">Εδώ κατασκευάζεται το λεξιλόγιο από μια ακολουθία προτάσεων και αρχικοποιείται το μοντέλο.
Με τα logs, μπορούμε να παρακολουθούμε την πρόοδο και, ακόμη πιο σημαντικό, την επίδραση των `min_count` και `sample` στο σώμα των λέξεων. Παρατηρούμε ότι αυτές οι δύο παράμετροι, και ειδικότερα το `sample`, έχουν μεγάλη επίδραση στην απόδοση ενός μοντέλου. Η εμφάνιση και των δύο επιτρέπει την ακριβέστερη και ευκολότερη διαχείριση της επιρροής τους.\n","3. `.train()`:\n",">Τελικά, εκπαιδεύουμε το μοντέλο.
\n","Οι καταγραφές εδώ είναι κυρίως χρήσιμες για την παρακολούθηση, διασφαλίζοντας ότι κανένα νήμα δεν ολοκληρώνεται ακαριαία, δηλαδή ότι η εκπαίδευση μοιράζεται πραγματικά σε όλα τα νήματα."]},{"cell_type":"code","metadata":{"_uuid":"03488d9b68963579c96094aca88a302c9f2753a7","id":"JokZeRgJH30s","executionInfo":{"status":"aborted","timestamp":1668436904343,"user_tz":-120,"elapsed":36,"user":{"displayName":"Giorgos Siolas","userId":"10127542075805046236"}}},"source":["cores = multiprocessing.cpu_count() # Count the number of cores in a computer"],"execution_count":null,"outputs":[]},{"cell_type":"markdown","metadata":{"_uuid":"89c305fcd163488441ac2ac6133678bd973b4419","id":"2cEbD630H30s"},"source":["## Οι παράμετροι\n","\n","* `min_count` = int - Αγνοεί όλες τις λέξεις με συνολική απόλυτη συχνότητα μικρότερη από αυτή - (2, 100)\n","\n","\n","* `window` = int - Η μέγιστη απόσταση μεταξύ της τρέχουσας και της προβλεπόμενης λέξης μέσα σε μια πρόταση. Π.χ. `window` λέξεις στα αριστερά και `window` λέξεις στα δεξιά του στόχου μας - (2, 10)\n","\n","\n","* `size` = int - Διαστατικότητα των διανυσμάτων χαρακτηριστικών. - (50, 300)\n","\n","\n","* `sample` = float - Το κατώφλι που καθορίζει ποιες λέξεις υψηλής συχνότητας θα υποβαθμίζονται τυχαία (downsampling). Έχει μεγάλη επιρροή. - (0, 1e-5)\n","\n","\n","* `alpha` = float - Ο αρχικός ρυθμός μάθησης - (0.01, 0.05)\n","\n","\n","* `min_alpha` = float - Ο ρυθμός μάθησης θα πέφτει γραμμικά στο `min_alpha` καθώς η εκπαίδευση προχωράει. Για να το ορίσετε: alpha - (min_alpha * epochs) ~ 0.00 (π.χ. με τις τιμές που χρησιμοποιούμε παρακάτω: 0.03 - 0.0007 * 30 = 0.009 ≈ 0)\n","\n","\n","* `negative` = int - Αν > 0, θα χρησιμοποιηθεί αρνητική δειγματοληψία· το int για το negative καθορίζει πόσες \"λέξεις θορύβου\" θα πρέπει να \"τραβηχτούν\". Αν τεθεί σε 0, δεν χρησιμοποιείται αρνητική δειγματοληψία. 
- (5, 20)\n","\n","\n","* `workers` = int - Χρησιμοποιήστε τόσα νήματα εργασίας για την εκπαίδευση του μοντέλου (=γρηγορότερη εκπαίδευση με πολυπύρηνες μηχανές)"]},{"cell_type":"code","metadata":{"_uuid":"ad619db82c219d6cb81fad516563feb0c4d474cd","id":"_lPMZA9KH30t","executionInfo":{"status":"aborted","timestamp":1668436904345,"user_tz":-120,"elapsed":37,"user":{"displayName":"Giorgos Siolas","userId":"10127542075805046236"}}},"source":["w2v_model = Word2Vec(min_count=20,\n"," window=2,\n"," size=300,\n"," sample=6e-5, \n"," alpha=0.03, \n"," min_alpha=0.0007, \n"," negative=20,\n"," workers=cores-1)"],"execution_count":null,"outputs":[]},{"cell_type":"markdown","metadata":{"_uuid":"d7e9f1bd338f9e15647b5209ffd8fbb131cd7ee5","id":"LdR8mX30H30t"},"source":["## Κατασκευή του πίνακα λεξιλογίου\n","\n","Το Word2Vec απαιτεί από εμάς να δημιουργήσουμε έναν πίνακα λεξιλογίου (απλά συγχωνεύοντας όλες τις λέξεις και φιλτράροντας τις μοναδικές λέξεις και κάνοντας κάποιες βασικές μετρήσεις σε αυτές):"]},{"cell_type":"code","metadata":{"_uuid":"66358ad743e05e17dfbed3899af9c41056143daa","id":"Rcr_Zl7YH30t","executionInfo":{"status":"aborted","timestamp":1668436904346,"user_tz":-120,"elapsed":38,"user":{"displayName":"Giorgos Siolas","userId":"10127542075805046236"}}},"source":["t = time()\n","\n","w2v_model.build_vocab(sentences, progress_per=10000)\n","\n","print('Time to build vocab: {} mins'.format(round((time() - t) / 60, 2)))"],"execution_count":null,"outputs":[]},{"cell_type":"markdown","metadata":{"_uuid":"63260d82061abb47db7f2f8b23e07ec629adf5a9","id":"HfcoUksLH30u"},"source":["## Εκπαίδευση του μοντέλου\n","_Παράμετροι της εκπαίδευσης:_\n","* `total_examples` = int - Πλήθος προτάσεων.\n","* `epochs` = int - Πλήθος επαναλήψεων (epochs) του σώματος των κειμένων - [10, 20, 
30]"]},{"cell_type":"code","metadata":{"_uuid":"07a2a047e701e512fd758edff186daadaeea6461","id":"5kxTMF89H30u","executionInfo":{"status":"aborted","timestamp":1668436904348,"user_tz":-120,"elapsed":39,"user":{"displayName":"Giorgos Siolas","userId":"10127542075805046236"}}},"source":["t = time()\n","\n","w2v_model.train(sentences, total_examples=w2v_model.corpus_count, epochs=30, report_delay=1)\n","\n","print('Time to train the model: {} mins'.format(round((time() - t) / 60, 2)))"],"execution_count":null,"outputs":[]},{"cell_type":"markdown","metadata":{"_uuid":"48e12768512b82c2d5cf6a543e3a9f2515699a22","id":"akE806J4H30u"},"source":["Καθώς δεν σκοπεύουμε να εκπαιδεύσουμε περαιτέρω το μοντέλο, καλούμε την init_sims(), η οποία θα κάνει το μοντέλο πολύ πιο αποδοτικό στη μνήμη:"]},{"cell_type":"code","metadata":{"_uuid":"34dd51c7f2f39d016b982ef81e4df576f6b31bcb","id":"r8RLWrw_H30u","executionInfo":{"status":"aborted","timestamp":1668436904349,"user_tz":-120,"elapsed":40,"user":{"displayName":"Giorgos Siolas","userId":"10127542075805046236"}}},"source":["w2v_model.init_sims(replace=True)"],"execution_count":null,"outputs":[]},{"cell_type":"markdown","metadata":{"_uuid":"a420d5a98eb860cff1f4bbac8cbe2054459b6200","id":"1ia3YrdGH30v"},"source":["# Εξερευνώντας το μοντέλο\n","## Most similar\n","\n","Εδώ, θα ζητήσουμε από το μοντέλο μας να βρει τη λέξη που μοιάζει περισσότερο με μερικούς από τους πιο εμβληματικούς χαρακτήρες των Simpsons."]},{"cell_type":"markdown","metadata":{"_uuid":"a8f3cfd8ac88978a4df31c90afa194bd6fa4f3f5","id":"kw8U8sw-H30v"},"source":["\"drawing\"\n","\n","Ας δούμε τι θα πάρουμε για τον κύριο χαρακτήρα της σειράς:"]},{"cell_type":"code","metadata":{"_uuid":"339207a733a1ac42fe60e32a29f9e5d5ca0a9275","id":"sbWP7zwsH30v","executionInfo":{"status":"aborted","timestamp":1668436904351,"user_tz":-120,"elapsed":41,"user":{"displayName":"Giorgos 
Siolas","userId":"10127542075805046236"}}},"source":["w2v_model.wv.most_similar(positive=[\"homer\"])"],"execution_count":null,"outputs":[]},{"cell_type":"markdown","metadata":{"_uuid":"3b6686e6fa956a98450259b063b4cf51019a6d0b","id":"DF57IrvgH30w"},"source":["_Μια διευκρίνιση:_
\n","Το σύνολο δεδομένων είναι οι ατάκες διαλόγου των Simpsons· επομένως, όταν εξετάζουμε τις πιο παρόμοιες λέξεις με το \"homer\", **δεν** είναι απαραίτητο να πάρουμε τα μέλη της οικογένειάς του, τα χαρακτηριστικά της προσωπικότητάς του ή ακόμη και τις πιο συχνές εκφράσεις του. Αντ' αυτού, παίρνουμε τι είπαν άλλοι χαρακτήρες (καθώς ο Homer δεν αναφέρεται συχνά στον εαυτό του στο 3ο πρόσωπο) μαζί με το \"homer\", όπως πώς αισθάνεται ή φαίνεται (\"depressed\"), πού βρίσκεται (\"hammock\"), ή με ποιον είναι (\"marge\").\n","\n","Ας δούμε τι μας δίνει συγκριτικά το bigram \"homer_simpson\":"]},{"cell_type":"code","metadata":{"_uuid":"23e5149b19f18f4f2f456d4c72afc5c188bcfba4","id":"IvC0qO7vH30w","executionInfo":{"status":"aborted","timestamp":1668436904352,"user_tz":-120,"elapsed":42,"user":{"displayName":"Giorgos Siolas","userId":"10127542075805046236"}}},"source":["w2v_model.wv.most_similar(positive=[\"homer_simpson\"])"],"execution_count":null,"outputs":[]},{"cell_type":"markdown","metadata":{"_uuid":"0e3e121e022f2f659cf97bba42cecd3f3c9afb01","id":"AIxSuvB9H30w"},"source":["\"drawing\"\n","\n","Τώρα η Marge:"]},{"cell_type":"code","metadata":{"_uuid":"22595f98c675a9697243b7e826b2840e5fc3e5f5","id":"Bu9QmHeJH30w","executionInfo":{"status":"aborted","timestamp":1668436904353,"user_tz":-120,"elapsed":43,"user":{"displayName":"Giorgos Siolas","userId":"10127542075805046236"}}},"source":["w2v_model.wv.most_similar(positive=[\"marge\"])"],"execution_count":null,"outputs":[]},{"cell_type":"markdown","metadata":{"_uuid":"2b4857ff1159695c72c22417cf52dc84e0dfc9ea","id":"Kq7ahggaH30x"},"source":["\"drawing\"\n","\n","Ο Bart:"]},{"cell_type":"code","metadata":{"_uuid":"ac9ba47738e596dce6552099e76f303f28577943","id":"osngbA-GH30x","executionInfo":{"status":"aborted","timestamp":1668436904354,"user_tz":-120,"elapsed":44,"user":{"displayName":"Giorgos 
Siolas","userId":"10127542075805046236"}}},"source":["w2v_model.wv.most_similar(positive=[\"bart\"])"],"execution_count":null,"outputs":[]},{"cell_type":"markdown","metadata":{"_uuid":"d8b5937dfd7584f168a33060c435036cad5b390b","id":"ELuGALXuH30x"},"source":["## Ομοιότητες:\n","\n","Εδώ εξετάζουμε πόσο όμοιες είναι δύο οντότητες (ngrams) μεταξύ τους:"]},{"cell_type":"markdown","metadata":{"_uuid":"d31383aa2f6310a38ec671f2ca4b0fcb195551dd","id":"KYKp4Cr6H30x"},"source":["\"drawing\""]},{"cell_type":"code","metadata":{"_uuid":"349828078b5a438d93e5494478e88095913dc58e","id":"nATnRjjyH30y","executionInfo":{"status":"aborted","timestamp":1668436904356,"user_tz":-120,"elapsed":45,"user":{"displayName":"Giorgos Siolas","userId":"10127542075805046236"}}},"source":["w2v_model.wv.similarity('maggie', 'baby')"],"execution_count":null,"outputs":[]},{"cell_type":"markdown","metadata":{"id":"zFdk3UnVH30y"},"source":["\"drawing\""]},{"cell_type":"code","metadata":{"_uuid":"9ee5e2532214b20fef0a597bc5ad355762fcc281","id":"oZFqwLaEH30y","executionInfo":{"status":"aborted","timestamp":1668436904357,"user_tz":-120,"elapsed":46,"user":{"displayName":"Giorgos Siolas","userId":"10127542075805046236"}}},"source":["w2v_model.wv.similarity('bart', 'nelson')"],"execution_count":null,"outputs":[]},{"cell_type":"markdown","metadata":{"_uuid":"08a999d758ac687d626b631a8ce393eaa26f41e7","id":"1dHzi-J8H30z"},"source":["## Αταίριαστη λέξη:\n","\n","Εδώ ζητάμε από το μοντέλο να μας δώσει από μια λίστα τη λέξη που δεν ταιριάζει!\n","\n","Μεταξύ του Jimbo, του Milhouse και του Kearney, ποιος είναι αυτός που δεν είναι νταής;"]},{"cell_type":"code","metadata":{"_uuid":"d982e44d9c212b5ee09bcaebd050a725ab5e508e","id":"UH0B60WmH30z","executionInfo":{"status":"aborted","timestamp":1668436904374,"user_tz":-120,"elapsed":62,"user":{"displayName":"Giorgos Siolas","userId":"10127542075805046236"}}},"source":["w2v_model.wv.doesnt_match(['jimbo', 'milhouse', 
'kearney'])"],"execution_count":null,"outputs":[]},{"cell_type":"markdown","metadata":{"_uuid":"12d3c035c89718e70c193f095b46e38dccef6b0d","id":"Bn4atCNQH30z"},"source":["\"drawing\"\n","\n","What if we compared the friendship between Nelson, Bart, and Milhouse?"]},{"cell_type":"code","metadata":{"_uuid":"cafd4a7bec6d6255ea3f5f06df951546c0d783a9","id":"KUj4BrGXH30z","executionInfo":{"status":"aborted","timestamp":1668436904375,"user_tz":-120,"elapsed":63,"user":{"displayName":"Giorgos Siolas","userId":"10127542075805046236"}}},"source":["w2v_model.wv.doesnt_match([\"nelson\", \"bart\", \"milhouse\"])"],"execution_count":null,"outputs":[]},{"cell_type":"markdown","metadata":{"_uuid":"01c0148dc758db74ed8078ca54e8393ada090c8c","id":"59aNZHV7H300"},"source":["Finally, what about the relationship between Homer and his two sisters-in-law?"]},{"cell_type":"code","metadata":{"_uuid":"445912f7d89b3cb1550926be161d134e6689f54f","id":"msBPLEZzH300","executionInfo":{"status":"aborted","timestamp":1668436904377,"user_tz":-120,"elapsed":66,"user":{"displayName":"Giorgos Siolas","userId":"10127542075805046236"}}},"source":["w2v_model.wv.doesnt_match(['homer', 'patty', 'selma'])"],"execution_count":null,"outputs":[]},{"cell_type":"markdown","metadata":{"_uuid":"df95cdff693e843ab4b4c174fea24029447573cd","id":"vftMtxw4H300"},"source":["## Analogical difference\n","\n","Which word is to woman what homer is to marge?"]},{"cell_type":"code","metadata":{"_uuid":"812961e79dde9f2032f708755ca287c0aef838d0","id":"OGIK2azjH300","executionInfo":{"status":"aborted","timestamp":1668436904381,"user_tz":-120,"elapsed":69,"user":{"displayName":"Giorgos Siolas","userId":"10127542075805046236"}}},"source":["w2v_model.wv.most_similar(positive=[\"woman\", \"homer\"], negative=[\"marge\"], topn=3)"],"execution_count":null,"outputs":[]},{"cell_type":"markdown","metadata":{"_uuid":"5a4ff4a8c8c582c6d9c704a042e8cc5d18b7bd6c","id":"bQ0Cx5t-H300"},"source":["Which word is to woman what bart is to man?"]},{"cell_type":"code","metadata":{"_uuid":"4cfef57b94b635abb58a4ff191785506c78ec9d9","id":"C4xm6gEQH300","executionInfo":{"status":"aborted","timestamp":1668436904383,"user_tz":-120,"elapsed":71,"user":{"displayName":"Giorgos Siolas","userId":"10127542075805046236"}}},"source":["w2v_model.wv.most_similar(positive=[\"woman\", \"bart\"], negative=[\"man\"], topn=3)"],"execution_count":null,"outputs":[]},{"cell_type":"markdown","metadata":{"_uuid":"ef520bd7dd974d14afcb8e69266579ac0b703714","id":"IhoqcU3nH301"},"source":["\"drawing\""]},{"cell_type":"markdown","metadata":{"_uuid":"773c0acc8750ba8e728ff261f2e9ec39694c245c","id":"obD8NABkH301"},"source":["### t-SNE visualization\n","[T-distributed Stochastic Neighbor Embedding (t-SNE)](https://scikit-learn.org/stable/modules/generated/sklearn.manifold.TSNE.html) is a non-linear dimensionality reduction algorithm that tries to represent high-dimensional data, and the underlying relationships between the data points, in a lower-dimensional space.
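"]},{"cell_type":"markdown","metadata":{"id":"tsneToyIntroSketch"},"source":["As a quick, self-contained illustration of this two-step recipe (a linear PCA pre-reduction followed by t-SNE), here is a minimal sketch on random toy data rather than the trained word vectors; the cluster sizes and dimensions below are arbitrary choices made purely for the example:"]},{"cell_type":"code","metadata":{"id":"tsneToyCodeSketch"},"source":["import numpy as np\n","from sklearn.decomposition import PCA\n","from sklearn.manifold import TSNE\n","\n","rng = np.random.default_rng(0)\n","# 40 toy 'word vectors' in 300 dimensions, forming two loose clusters\n","toy = np.vstack([rng.normal(0, 1, (20, 300)),\n","                 rng.normal(5, 1, (20, 300))])\n","\n","reduced = PCA(n_components=10).fit_transform(toy)  # linear pre-reduction\n","coords = TSNE(n_components=2, random_state=0,\n","              perplexity=15).fit_transform(reduced)  # nonlinear 2-D map\n","print(coords.shape)  # (40, 2)"],"execution_count":null,"outputs":[]},{"cell_type":"markdown","metadata":{"id":"tsneToyTailSketch"},"source":["The same pattern, applied to the real 300-dimensional word vectors, is what the plotting helper below performs before drawing the scatterplot.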
"]},{"cell_type":"code","metadata":{"_uuid":"27ec46110042fc28da900b1b344ae4e0692d5dc2","id":"Nw45tEfKH301","executionInfo":{"status":"aborted","timestamp":1668436904391,"user_tz":-120,"elapsed":78,"user":{"displayName":"Giorgos Siolas","userId":"10127542075805046236"}}},"source":["import numpy as np\n","import matplotlib.pyplot as plt\n","%matplotlib inline\n"," \n","import seaborn as sns\n","sns.set_style(\"darkgrid\")\n","\n","from sklearn.decomposition import PCA\n","from sklearn.manifold import TSNE"],"execution_count":null,"outputs":[]},{"cell_type":"markdown","metadata":{"_uuid":"22693eaa25253b38cee3c5cd5db6b6fdddb575a4","id":"s6rI-rPmH301"},"source":["Ο στόχος μας σε αυτή την ενότητα είναι να σχεδιάσουμε τα διανύσματα των 300 διαστάσεων σε δισδιάστατα γραφήματα και να δούμε αν μπορούμε να εντοπίσουμε ενδιαφέροντα μοτίβα.
\n","Για το σκοπό αυτό θα χρησιμοποιήσουμε την υλοποίηση t-SNE από το scikit-learn.\n","\n","Για να κάνουμε τις απεικονίσεις πιο ευδιάκριτες, θα εξετάσουμε τις σχέσεις μεταξύ μιας λέξης ερωτήματος (σε **κόκκινο**), των πιο παρόμοιων λέξεων της στο μοντέλο (σε **μπλε**) και άλλων λέξεων από το λεξιλόγιο (σε **πράσινο**)."]},{"cell_type":"code","metadata":{"_uuid":"489a7d160dcd92da0ce42a3b5b461368c9ffe5f1","id":"tMrz0UTKH301","executionInfo":{"status":"aborted","timestamp":1668436904392,"user_tz":-120,"elapsed":79,"user":{"displayName":"Giorgos Siolas","userId":"10127542075805046236"}}},"source":["def tsnescatterplot(model, word, list_names, n_components=50):\n"," \"\"\" Plot in seaborn the results from the t-SNE dimensionality reduction algorithm of the vectors of a query word,\n"," its list of most similar words, and a list of words.\n"," \"\"\"\n"," arrays = np.empty((0, 300), dtype='f')\n"," word_labels = [word]\n"," color_list = ['red']\n","\n"," # adds the vector of the query word\n"," arrays = np.append(arrays, model.wv.__getitem__([word]), axis=0)\n"," \n"," # gets list of most similar words\n"," close_words = model.wv.most_similar([word])\n"," \n"," # adds the vector for each of the closest words to the array\n"," for wrd_score in close_words:\n"," wrd_vector = model.wv.__getitem__([wrd_score[0]])\n"," word_labels.append(wrd_score[0])\n"," color_list.append('blue')\n"," arrays = np.append(arrays, wrd_vector, axis=0)\n"," \n"," # adds the vector for each of the words from list_names to the array\n"," for wrd in list_names:\n"," wrd_vector = model.wv.__getitem__([wrd])\n"," word_labels.append(wrd)\n"," color_list.append('green')\n"," arrays = np.append(arrays, wrd_vector, axis=0)\n"," \n"," # Reduces the dimensionality from 300 to 50 dimensions with PCA\n"," reduc = PCA(n_components).fit_transform(arrays)\n"," \n"," # Finds t-SNE coordinates for 2 dimensions\n"," np.set_printoptions(suppress=True)\n"," \n"," Y = TSNE(n_components=2, random_state=0, 
perplexity=15).fit_transform(reduc)\n"," \n"," # Sets everything up to plot\n"," df = pd.DataFrame({'x': [x for x in Y[:, 0]],\n"," 'y': [y for y in Y[:, 1]],\n"," 'words': word_labels,\n"," 'color': color_list})\n"," \n"," fig, _ = plt.subplots()\n"," fig.set_size_inches(9, 9)\n"," \n"," # Basic plot\n"," p1 = sns.regplot(data=df,\n"," x=\"x\",\n"," y=\"y\",\n"," fit_reg=False,\n"," marker=\"o\",\n"," scatter_kws={'s': 40,\n"," 'facecolors': df['color']\n"," }\n"," )\n"," \n"," # Adds annotations one by one with a loop\n"," for line in range(0, df.shape[0]):\n"," p1.text(df[\"x\"][line],\n"," df['y'][line],\n"," ' ' + df[\"words\"][line].title(),\n"," horizontalalignment='left',\n"," verticalalignment='bottom', size='medium',\n"," color=df['color'][line],\n"," weight='normal'\n"," ).set_size(15)\n","\n"," \n"," plt.xlim(Y[:, 0].min()-50, Y[:, 0].max()+50)\n"," plt.ylim(Y[:, 1].min()-50, Y[:, 1].max()+50)\n"," \n"," plt.title('t-SNE visualization for {}'.format(word.title()))\n"," "],"execution_count":null,"outputs":[]},{"cell_type":"markdown","metadata":{"_uuid":"3943c170a5f5f09974d90c89bdb9ec761e63a416","id":"9rTlMOK2H302"},"source":["## 10 πιο παρόμοιες λέξεις έναντι 8 τυχαίων λέξεων\n","\n","Ας συγκρίνουμε πού βρίσκεται η διανυσματική αναπαράσταση του homer, των 10 πιο παρόμοιων λέξεων του από το μοντέλο, καθώς και 8 τυχαίων λέξεων, σε ένα δισδιάστατο γράφημα:"]},{"cell_type":"code","metadata":{"_uuid":"18d788b2a92f94771a5f9485a885d44dfba62a94","id":"_lvPn-H5H302","executionInfo":{"status":"aborted","timestamp":1668436904393,"user_tz":-120,"elapsed":80,"user":{"displayName":"Giorgos Siolas","userId":"10127542075805046236"}}},"source":["tsnescatterplot(w2v_model, 'homer', ['dog', 'bird', 'ah', 'maude', 'bob', 'mel', 'apu', 'duff'], n_components=8)"],"execution_count":null,"outputs":[]},{"cell_type":"markdown","metadata":{"_uuid":"c73fc2faaf0baecc84f02a97b50cb9ccefa48686","id":"RmJ8Q9j5H302"},"source":["## 10 πιο παρόμοιες λέξεις vs. 
10 πιο ανόμοιες\n","\n","Αυτή τη φορά, ας συγκρίνουμε πού βρίσκεται η διανυσματική αναπαράσταση της Maggie και των 10 πιο όμοιων λέξεων της από το μοντέλο σε σύγκριση με τη διανυσματική αναπαράσταση των 10 πιο ανόμοιων λέξεων με τη Maggie:"]},{"cell_type":"code","metadata":{"_uuid":"10c77b072f7c281f2be919341be116565c20d8a8","id":"J6JH1CBlH302","executionInfo":{"status":"aborted","timestamp":1668436904394,"user_tz":-120,"elapsed":81,"user":{"displayName":"Giorgos Siolas","userId":"10127542075805046236"}}},"source":["tsnescatterplot(w2v_model, 'maggie', [i[0] for i in w2v_model.wv.most_similar(negative=[\"maggie\"])], n_components=10)"],"execution_count":null,"outputs":[]},{"cell_type":"markdown","metadata":{"_uuid":"87315bfbaceb3733bd7af035db6c59cfc4b1ba7f","id":"tF3SPzIjH302"},"source":["## 10 πιο παρόμοιες λέξεις vs. 11η έως 20η πιο παρόμοιες λέξεις\n","\n","Τέλος, θα παρουσιάσουμε τις πιο παρόμοιες λέξεις με τον κ. Burns που κατατάσσονται από την 1η έως τη 10η θέση σε σχέση με αυτές που κατατάσσονται από την 11η έως την 20η θέση:"]},{"cell_type":"code","metadata":{"_uuid":"e6f0bc598922f4f2cd17d2511560242a3c35fdd9","id":"Eb7dlifAH302","executionInfo":{"status":"aborted","timestamp":1668436904395,"user_tz":-120,"elapsed":82,"user":{"displayName":"Giorgos Siolas","userId":"10127542075805046236"}}},"source":["tsnescatterplot(w2v_model, \"mr_burn\", [t[0] for t in w2v_model.wv.most_similar(positive=[\"mr_burn\"], topn=20)][10:], n_components=10)"],"execution_count":null,"outputs":[]},{"cell_type":"markdown","metadata":{"id":"PGhAHymQH303"},"source":["\"drawing\""]}]}