EECS110 Course
Project Option 2, Spring 2018
This project offers you an opportunity to use Python to
develop a fun and useful web application.
A text cloud is a collection of the most commonly
used words within some body of text, usually with care taken to exclude
extremely common words (e.g., the)
and to unify different forms of a single word (e.g., aliens and alien count as the same word). Text clouds
give us a quick sense of the topic of a website by displaying the
most frequently occurring words on that website, each
with size proportional to its frequency. For example, here is what a text cloud
might look like for Harry Potter and the Sorcerer's Stone:
Although the Harry Potter books are under
copyright and thus aren't available online, you can find many interesting older
books at Project Gutenberg. Here is a
text cloud for Mark Twain's The Adventures of Tom
Sawyer:
This site seems to be one of the first to distinguish text clouds from the more
common tag clouds. The
kind of text cloud this project uses is what Wikipedia calls a data cloud.
Text clouds, in essence, are a word-by-word summary of
the contents of an article, book, or other work. Though certainly the structure
of the document is lost, the relative frequency of particular words may be
telling. Consider, for example, the text cloud for all presidential
state-of-the-union speeches in American history.
A Few Words on Web Pages, HTML, and URLs
Web pages are written in a language called HTML which
stands for Hyper Text Markup Language. HTML provides instructions that tell
your browser how to format the contents of the web page. If you use your web
browser to look at a web page, you can view the HTML that produced that page
(for example in Firefox, select the "View" menu and then the
"Page Source" option). If you haven't looked at HTML before, take a
few minutes to do so. Don't worry if you don't understand the HTML commands,
but it's instructive just to see how the text and formatting commands appear in
HTML.
You will notice that, in addition to text and some other
items, an HTML document often contains hyperlinks. These are just the
addresses of other web pages. In an HTML document, those hyperlinks are
specified using addresses called URLs (Uniform Resource Locators). For
example, the HTML code
<A HREF="http://www.northwestern.edu/">Northwestern University website</A>
tells the browser to display the words Northwestern University website (generally
in a special color, such as blue) and, when the user clicks on these words, to
go to the web page at the location given by the URL http://www.northwestern.edu.
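To see how a program can pick the link and the visible text out of the HTML above, here is a minimal sketch that uses Python's standard html.parser module. The class name LinkCollector is made up for this illustration, and this is not necessarily how the getHTML function provided for this project works internally.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collects the href values of <a> tags and the visible text of a page."""

    def __init__(self):
        super().__init__()
        self.links = []       # URLs found in <a href="..."> tags
        self.text_parts = []  # pieces of visible text

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names, so <A HREF=...> matches.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

    def handle_data(self, data):
        # Keep only text that is not just whitespace.
        if data.strip():
            self.text_parts.append(data.strip())


snippet = '<A HREF="http://www.northwestern.edu/">Northwestern University website</A>'
collector = LinkCollector()
collector.feed(snippet)
collector.close()

print(collector.links)       # the URL inside the HREF attribute
print(collector.text_parts)  # the words the browser would display
```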
This project comprises two parts. The first part, the
milestone, is due on Sunday, 06/03 at 11:59 PM.
For the milestone, you will write the core functionality of the text clouds
program but without some of the advanced features that you will add for the
final version.
The milestone program will run from the IDLE
environment. It will ask the user for a URL and will then compute the word
frequencies on the web page at that URL address. Finally, it will print the
most frequently occurring words on the screen along with the number of
occurrences of each word. You will not run this program through a web browser
at this point. That comes later!
In the final version of the project, due on Sunday, 06/10 at 11:59 PM, you will add several
additional features. First, when the user enters a URL, your program will not
only examine the words on that page, but it will also follow all of the
URLs linked from that page to examine the contents of those linked
pages. These linked pages may, in turn, have their own links, and your
program will explore those as well, and so on. This process of "fanning
out" from an initial website is called "web crawling". Your
program will be a text cloud web crawler! In addition, your program will run
from your own web page (details below) and will display the results on your web
browser. Once you have this working, your friends will be able to go to your
website to use your text cloud web crawler.
Quick Notes on Design, Implementation, and Test
Part 1: The Basic Program for the Milestone
This part is due on Sunday, 06/03 at 11:59 PM.
The basic program runs from IDLE and does the
following:
It asks the user to enter a URL (for example, http://www.northwestern.edu), complete with the http:// prefix. Don't forget that prefix when entering a URL.
It uses a global variable called MAXWORDS. This global variable dictates the number of words that are returned. For example, if MAXWORDS is set to 50, only the 50 most frequently occurring words will be returned.
It prints a list of the MAXWORDS most frequently occurring strings on the screen. The words appear in order from most frequent to least frequent, with each word followed by the number of times it occurs (in parentheses). This is a simple version of a text cloud!
For the milestone, you may have very simple and far-from-perfect "cleaning" and "stemming". You should have functions in place for those tasks, and some rudimentary "cleaning" and "stemming" should take place, but you can refine this substantially in the final version of your program.
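The rudimentary "cleaning" and "stemming" functions mentioned above could start out as simple as the sketch below. The names clean, stem, and STOP_WORDS, the contents of the stop list, and the single plural rule are all assumptions made for illustration; your own rules will differ. The text is assumed to be lowercase already, as getHTML returns it.

```python
import string

# Hypothetical stop list; a real one would be much longer.
STOP_WORDS = {"the", "a", "an", "and", "of", "to", "i", "is", "in"}


def clean(words):
    """Strip surrounding punctuation and drop stop words and empty strings."""
    cleaned = []
    for word in words:
        word = word.strip(string.punctuation)
        if word and word not in STOP_WORDS:
            cleaned.append(word)
    return cleaned


def stem(word):
    """Very rough stemming: drop a trailing "s" so "aliens" matches "alien"."""
    if len(word) > 3 and word.endswith("s") and not word.endswith("ss"):
        return word[:-1]
    return word


text = "i love spam, and the aliens love spam pages!"
words = [stem(w) for w in clean(text.split())]
print(words)  # ['love', 'spam', 'alien', 'love', 'spam', 'page']
```

A rule this crude also makes mistakes (it turns "this" into "thi", for example), which is the kind of thing you can refine for the final version.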
For example,
here is what a run of your program might look like on the web
page http://networks.cs.northwestern.edu/EECS110-s18/projects/project2/page2.htm
Enter a URL:
http://networks.cs.northwestern.edu/EECS110-s18/projects/project2/page2.htm
Here is the text cloud for your web page:
spam (8)
page (2)
love (1)
Your output might not look exactly like this. In
particular, you may have other strings in your output or slightly different
counts, depending on your stemming rules. Notice here that the word
"I" appeared on that web page but was not included in the text cloud
because it got "cleaned" out as the program dropped frequently
occurring words in its stop list.
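One way to get from a list of cleaned words to output in the "word (count)" form shown above is sketched here. The function names count_words and top_words are hypothetical, and the word list is made up to mirror the sample output rather than taken from the actual test page.

```python
MAXWORDS = 50  # global limit on how many words are reported


def count_words(words):
    """Return a dictionary mapping each word to its number of occurrences."""
    counts = {}
    for word in words:
        counts[word] = counts.get(word, 0) + 1
    return counts


def top_words(counts, limit):
    """Return up to `limit` (word, count) pairs, most frequent first."""
    # Sort by decreasing count; break ties alphabetically.
    pairs = sorted(counts.items(), key=lambda pair: (-pair[1], pair[0]))
    return pairs[:limit]


words = ["spam"] * 8 + ["page", "love", "page"]
for word, count in top_words(count_words(words), MAXWORDS):
    print(word + " (" + str(count) + ")")
```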
What You'll Need for the Basic Program
If you are NOT using Python version 3.5, download the file hmc_urllib
by right-clicking here: hmc_urllib.py and
saving that file in the directory where you are doing your programming.
If you ARE using Python version 3.5, instead download the file hmc_urllib_new
by right-clicking here: hmc_urllib_new.py and
saving that file in the directory where you are doing your programming. With
Python 3.5 the program won't work with the other file, so make sure to download
hmc_urllib_new.py.
Make sure that this file is in the same directory or
folder where you are developing your project. Now, you will need the following
line at the top of your program:
from hmc_urllib import getHTML
(For Python version 3.5, you would instead write: from hmc_urllib_new import getHTML)
This line gives you access to a Python library for
dealing with HTML. It provides the function getHTML; you can use help(getHTML) to see its documentation. It simply
takes a string as input, where that string is a URL. For this project, a URL always begins with http://, as in http://www.northwestern.edu. The function returns a
2-tuple that contains the text of the web page and the URLs that appear on that
web page. The text of the web page is returned in all lowercase letters. This
is done in order to avoid counting uppercase and lowercase spellings as
different words. Please use this function as-is.
Note that this library will only return pages at
Northwestern University (anything at northwestern.edu, etc.). We put this
restriction on the code because people who run web sites might get annoyed if
they suddenly had dozens of Northwestern programs disrespectfully crawling
their websites. (There are rules about how programs can
"respectfully" crawl web sites, which we are not going to get into
here. Furthermore, bugs in a program almost always render it
"disrespectful" :).)
For this milestone, you can ignore the URLs that are
returned. You will use those in an interesting way on the final version of the
project.
There are two other ingredients that will be useful. First,
you will need to manipulate strings in all sorts of ways. To that end, use
Python's many built-in string methods. While we've seen some string-handling
methods in class, the string sections in the Python
library reference are a great place to learn about the rest.
These resources will help with parsing, cleaning the input, and even
stemming the words.
Second, you will need to sort. Python has a built-in
method for sorting lists. If you have a list called foo, then the command foo.sort() will sort that list in increasing order. It
can sort lists of strings, lists of numbers, and even lists of lists and lists of
tuples! When sorting a list of lists (or tuples), it sorts first according to
the first element in each list or tuple, breaking ties by looking at the next
element, etc. Experiment with this!
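As a small experiment of the kind suggested above, the following sketch sorts a list of (count, word) tuples. The data is made up; putting the count first in each tuple is one possible design choice that makes the built-in ordering sort by frequency.

```python
foo = [(2, "page"), (8, "spam"), (1, "love"), (2, "alien")]

foo.sort()  # increasing: by count first, ties broken by the word
print(foo)  # [(1, 'love'), (2, 'alien'), (2, 'page'), (8, 'spam')]

foo.sort(reverse=True)  # decreasing: largest count first
print(foo)  # [(8, 'spam'), (2, 'page'), (2, 'alien'), (1, 'love')]
```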
Be sure to test each function as you write it. Writing
all of the functions and only then testing the whole program is a recipe for
headaches.
You may want to test your program on some very small
web pages. Here are three web pages that you may want to use to test your
program:
http://networks.cs.northwestern.edu/EECS110-s18/projects/project2/page1.htm
http://networks.cs.northwestern.edu/EECS110-s18/projects/project2/page2.htm
http://networks.cs.northwestern.edu/EECS110-s18/projects/project2/page3.htm
What To Submit for the First Milestone
By Sunday, 06/03 at 11:59 PM
you should submit your first milestone. This includes:
A file named pr2_milestone.txt that tells us which project you are submitting and your name (and your partner's name if it is a joint submission).
Your program, pr2_milestone.py.
Any supporting files your program needs (other than hmc_urllib.py). For example, pr2_milestone_sup1.py, pr2_milestone_sup2.py, etc.
The Final Project
The final project adds the following features to the
basic program and is due on Sunday, 06/10 at 11:59 PM.
When a user enters a URL into your program, the program
grabs the text from that URL as before. However, the program then looks at all
of the web pages referred to by URLs on that web page. Recall that the function getHTML that we have provided above returns this
list of URLs. This process of automatically going from one web page to other
web pages that are linked from that page is called "web crawling".
Web crawling can be very slow if the web page where you
begin your search has links to other pages that have links to other pages, etc.
Therefore, your program should have a global variable called DEPTH that enforces that your program never
crawls more than that number of links away from the starting web page. For
example, if DEPTH is set to 0, then only the starting page is searched. If DEPTH is set to 1, then only the starting page and
pages linked from that page are searched, etc. For debugging purposes, set DEPTH equal to 0 initially and then increase this
number later.
So, your program will now begin at the start URL
provided by the user and crawl to all web pages reachable within DEPTH "hops" of that web page. In the
end, it will display a single text cloud that aggregates all of the words found
during that "crawl".
You will need to be careful here. Imagine that you
start at a URL for a web page. Let's call that page X. Imagine that X has links to pages Y and Z. If DEPTH is one or more, your program will look at
pages Y and Z. Imagine now that Y has links to A and B but also to X and Z. You need to make sure that Y does not send us back to X, since we have already been there and presumably
we've already counted the words on that page. Counting them again would be a
mistake! Similarly, we need to make sure that Z is only visited once. If we're not careful, X will visit Z and also Y will visit Z, resulting in double-counting of the words
on page Z.
How do we handle this? You may wish to use imperative
or recursive constructs. That's up to you. In order to avoid visiting a page
twice, you may want to keep a list of the URLs that have already been visited.
Then, before exploring a URL, check to see if it is in the list of already
visited URLs. If so, don't visit it again. Similarly, when a URL is explored
for the first time, we'll need to place it on that list!
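The visited-list idea can be sketched on the X, Y, Z, A, B example from above. This is only one possible (imperative, level-by-level) approach, not the required one. FAKE_WEB and fake_get_html are hypothetical stand-ins for the real web and for getHTML, assumed here to return the same kind of (text, urls) 2-tuple, so that the sketch runs without a network connection.

```python
# Hypothetical stand-in for the web: page name -> (text, list of linked pages).
FAKE_WEB = {
    "X": ("spam spam", ["Y", "Z"]),
    "Y": ("love spam", ["A", "B", "X", "Z"]),
    "Z": ("page", []),
    "A": ("alien", []),
    "B": ("alien page", []),
}


def fake_get_html(url):
    """Stand-in for getHTML: returns a (text, urls) 2-tuple for a page."""
    return FAKE_WEB[url]


def crawl(start_url, max_depth):
    """Return (words, visited) for all pages within max_depth links of start_url."""
    visited = [start_url]   # every URL we have read or scheduled to read
    frontier = [start_url]  # pages exactly `depth` links away from the start
    words = []
    for depth in range(max_depth + 1):
        next_frontier = []
        for url in frontier:
            text, links = fake_get_html(url)
            words += text.split()
            if depth < max_depth:  # only follow links if we may go deeper
                for link in links:
                    if link not in visited:
                        visited.append(link)
                        next_frontier.append(link)
        frontier = next_frontier
    return words, visited


print(crawl("X", 0))  # only page X
print(crawl("X", 1))  # X, then Y and Z
print(crawl("X", 2))  # X, Y, Z, then A and B; X and Z are not read twice
```

Working level by level means each page is first reached by its shortest path from the start, so a page is never skipped just because it was first seen at the depth limit along a longer path.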
You can test your code using the same test pages listed
above. First test from page1.htm with DEPTH set to 0. This will give you a text cloud
only for the words on that page. Now try again with DEPTH set to 1. This will give you a text cloud
for the words on page1.htm and page2.htm (because there is a link from page1.htm to page2.htm). Now try again with DEPTH set to 2.
Your final submission should have the following
features:
A global DEPTH variable that controls the depth of web crawling. Please have this set to 2 for your
final submission, although we'll change its value when we test your code.
Making Your Text Clouds Web-Based
The last part of this project is to make your text cloud
program run on its own website. In other words, the user will now enter URLs on
a web page rather than from IDLE. The resulting text cloud will be displayed on
a web page as well. For example, you can take a look at a very basic example at
http://blue.cs.northwestern.edu/~emirhan/
Here's what you'll need to do to make this work:
Name your Python program textcloud.py.
Remove all print statements from textcloud.py. This is important for the
correct functionality of your program on the web.
Add a function to your program called mtcURL that takes as input a URL
(a string) and returns (not prints!) a string of all
of the words that you want to display in your text cloud. The code that we have
provided will call mtcURL exactly once, so it is
imperative that mtcURL return the totality of the
words accumulated during the web crawl. The string should be returned in the
following format: for each of the words in your text cloud (recall that there
are at most MAXWORDS of them and they appear in sorted order from most frequent to least frequent)
there is a string of the form:
<abbr title = COUNT style = "font-size:NUMBER%">WORD</abbr>
where COUNT is the number of occurrences of the WORD and NUMBER is the relative size of the font used to
display the WORD. You can choose the NUMBER to be anything that you like, but it should
be proportional to the number of occurrences (COUNT) of the WORD. You may want to experiment
with these sizes to see what looks good to you.
For example, for the Harry Potter text cloud, the first
few lines might look like:
<abbr title =1267 style="font-size:800%">harry</abbr>
<abbr title =417 style="font-size:263%">ron</abbr>
<abbr title =335 style="font-size:211%">hagrid</abbr>
<abbr title =254 style="font-size:160%">hermione</abbr>
<abbr title =174 style="font-size:109%">down</abbr>
Notice that this tells us that the word
"Harry" appears 1267 times in the file and its font size is 800% of
the normal font size. The word "Ron" appeared 417 times and its font
size is 263% of the normal font size. The numbers 800 and 263 are not
important. What is important is that the ratio of these percentages (in this
case 800/263) is equal to the ratio of the word counts (in this case 1267/417).
This funny format is HTML! It will tell a web browser
the font size for each of these words. The nbsp's are spacers that put a bit of
separation between the words on the web page. Test your program to make sure
that it produces output in exactly this format. Any mistake in the format will cause an error when you move on to
the next step.
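The proportional font sizes can be computed along the lines of the sketch below. The function name format_cloud and the choice of 800% for the most frequent word are assumptions for illustration. The spacers mentioned above do not appear in the example lines as shown, so the sketch leaves them out; add them in whatever form the required format calls for.

```python
MAX_FONT = 800  # assumed font size (in percent) for the most frequent word


def format_cloud(pairs):
    """Turn (word, count) pairs, most frequent first, into abbr markup."""
    if not pairs:
        return ""
    top_count = pairs[0][1]
    pieces = []
    for word, count in pairs:
        # Integer scaling keeps the size proportional to the count.
        size = count * MAX_FONT // top_count
        pieces.append('<abbr title =' + str(count)
                      + ' style="font-size:' + str(size) + '%">'
                      + word + '</abbr>')
    return "\n".join(pieces)


pairs = [("harry", 1267), ("ron", 417), ("hagrid", 335)]
print(format_cloud(pairs))
```

With these counts the sketch reproduces the sizes 800, 263, and 211 from the Harry Potter example. Note that very rare words can end up with a size of 0 under this scaling, so you may want a minimum size.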
IMPORTANT NOTE: Please thoroughly test your mtcURL
function in IDLE before moving on to the final steps below.
You can test it by giving it a URL string (for simple test web pages such as
those that we've provided for you) and checking that it produces correct output in the
format described above. Once you've done this testing, you can be quite sure
that you are almost done. However, failing to test this code now may result in
severe headaches on the very last part described below.
In order to set up your own web pages, you
will need a remote account on the Blue machine (blue.cs.northwestern.edu). To
get one, e-mail emirhan@u.northwestern.edu
requesting a username and password.
On Windows or Mac machines, you may want to download SSH Secure Shell,
a program that allows you to move files between your machine and other machines
(e.g., the Blue machine). You can use this program to connect and log in to blue.cs.northwestern.edu with
your provided username and password, and you will then see your remote files.
Note that an SSH client is built into Unix-family machines (e.g., Macs), where
you can run it from the terminal. PuTTY is another SSH client for connecting to remote
servers and is widely used on Windows machines. There are many instructions online for installing and running both of them on your local machine (your laptop).
On the Blue machine, you have a home directory. This
is where you land when you log in to Blue.
Within that directory, first create a folder named public_html. Any documents you put in the public_html folder, with the correct
permissions, will be publicly
viewable. There are four files needed to set up a website that uses Python.
They are
1. index.html
2. index.py
3. Your textcloud.py file
4. The hmc_urllib.py file
The instructions below explain how to download these
files, place them in the right directory, and set their permissions
appropriately. These
files must be installed on the Blue web server. To minimize
headaches, you may wish to do this during the scheduled lab time so that the
lab instructors and TAs can help you. However, you can also do it on your own,
even from your own computer.
Now that you are logged on to the Blue machine (by
connecting remotely as discussed above), here's what you will do:
Get the index.html page. First open this page. Then right-click or
control-click on the page and choose "view page source".
You will see the HTML source of the page. Copy all of the text and save it in a file named index.html on your machine. Then, move
that file into your public_html directory on Blue. If you wish, you can open
the file for editing (for example, with IDLE or any other text editor).
Next, copy the text below into a file called index.py and save that file on your machine. Then, move that file
into your public_html directory.
#!/usr/local/bin/python3
import cgi
import cgitb; cgitb.enable()   # show Python errors in the browser
import textcloud

def htmlFormat( body = 'No text supplied', title = 'EECS 110 project page' ):
    """ takes the title and body of your webpage (as a string)
        and adds the html formatting, returning the resulting
        string. If you want to use some features, you may have to
        change this function (or not use it at all...)
    """
    # the blank line after the Content-Type header is required:
    # it separates the HTTP header from the page itself
    startString = """\
Content-Type: text/html;

<html>
<head>
<title>
"""
    afterTitle = """\
</title>
</head>
<body>
"""
    afterBody = """\
</body>
</html>
"""
    return startString + title + afterTitle + body + afterBody

form = cgi.FieldStorage()
if 'inputurl' in form and len(form['inputurl'].value.strip()) > 0:
    # strip() returns a new string, so keep its result
    url = form['inputurl'].value.strip()
    textcloudbody = textcloud.mtcURL( url )
else:
    text = 'I don\'t know what\'s going on.'
    textcloudbody = text

# link back to the input page (defined here but not added to the output)
originalURL = "<h3><a href=\"./index.html\">Back to text-cloud creation</a></h3>\n"
htmlout = htmlFormat(textcloudbody)
print(htmlout)  # this renders the page
Next, copy your textcloud.py
and the supplied hmc_urllib.py
file into your public_html directory.
Once the files are in your
public_html folder, you need to set permissions to make them world-readable. Open
a terminal window. Type cd ~/public_html to
put yourself in the public_html directory. Then, type chmod ugo+rx * to change the access permissions of the files
so that outside users can use them. (You can learn much, much
more about permissions by typing man chmod.)
Once everything is in place, make sure that you can see
your base page (which is the index.html one). It will be at
http://blue.cs.northwestern.edu/~yourusername
Go to that URL with a web browser and your program will
be running from the web! Congratulations!
You are welcome to add customizations and
embellishments to your project. For example, you might wish to add
colors to the fonts that are used to display the text clouds. The colors might
be used to represent some additional information. Please explain any such added
features in the textcloud.txt file that you submit along
with your Python program.
What to Submit for your Final Project
Here's what you should submit for your final project.
A file named textcloud.txt that tells us
your name and the name of your partner (if any). Please also explain and describe
any additional features that you have added.
Your program, textcloud.py.
Submit your solutions on Canvas.