Endreport




Name: Karel van Os
Student number: 1515779
Date: January 2009
Table of Contents
Preface
Internship Work
  Research and Startup
  Improving the Scraper
  Availability Checker
  Graph drawer
  Crawler
Term summary
Reflection

Preface
My name is Karel van Os. I study Computer Science at the Hogeschool Utrecht and I'm in
the third year. I am currently working as a trainee at Premie Adviseurs, where I work on
the BookMover project. The goal of the project is to create a program with which you
can easily analyze the (antique) book market.

I'm writing this report because my school wants to know what I did and learned during
the internship. This document is also meant for the people who will work on this project
after me, so that they understand how my programs work and can maintain them.

There are three people working on this project: Jasm Sison (my project leader),
Jeroen Braak (another intern like me) and me. In the beginning phase we all did the same
work, so the results written there were found by all of us working together. In the rest of
my work I'll say when one of the other two was helping me.

I'm writing this report in English, because this company also takes foreigners as
interns, and all documentation we made during the internship was written in English
too.

The self-reflection part is not really relevant to anyone but me and my school.




Internship Work

Research and Startup
When I started here there were two programs:
- A (web) crawler
- A scraper
Unfortunately there was no written documentation about them, so the first couple of
weeks we were busy trying to get everything working and documenting how you can
set everything up.

After doing that we had to test everything and brainstorm a lot, because
we didn't know what still needed to be done, how well the two products were working
and what else we had to make to reach the goal of the project.

Results:
- We found out that the database on one of the two servers was old and
   apparently not used anymore.
- Both programs write to the same database.
- The crawler uses a beta version of Heritrix which has a memory leak and no
   working interface.
- The web interface for the crawler, created by the previous group, consists of JSP
   pages which need to be converted to a war file using Maven.
- To use the crawler web interface you need to deploy the war file in Jetty.
- To make the crawler crawl a webpage you need to create a complicated Heritrix
   job.
- The crawler can't crawl more than 3 jobs at a time.
- The scraper is written in Java and uses Hibernate to connect to the database.
- The scraper uses XML pattern files to determine what should be scraped out of a
   webpage and what not.
- Prices for books were updated when the scraper tried to scrape a URL that was
   already scraped once before.

 Things that needed our attention:
 - We needed to optimize the database models. There were a lot of unnecessary
     things in them, like a book_id in the webpage table and a webpage_id in the book
     table.
 - Their database was slowly filling with webpages which actually weren't used
     anymore. After the scraper scrapes the necessary data out of the body of a
     webpage, the webpage becomes useless, but we couldn't throw them away,
     because they had a foreign key to the book that was scraped out of them.
 - The server's hard disk was only 30 gigabytes and was almost full, because of
     our huge database. This can be reduced a lot by throwing away all processed
     webpages, but that would only postpone the problem, so we needed a new hard
     disk.
 - Both programs could only be run from one machine. Nothing was distributed.
 - The scraper's pattern files weren't complete and sometimes scraped wrong data
     or no data.
 - There were a lot of books and authors that had nearly the same name but not
     precisely the same name, so they weren't recognized as the same while they
     should be.
 - Books in the database were never marked as gone or sold.
 - There are unstructured webpages where second-hand books are sold. We have
     no way of scraping those.
 - A lot of the written code was messy and unreadable, with weird function names,
     functions that weren't used anymore, functions that were too long and things that
     were coded twice.


 Solutions:
 - Improve the scraper by making it distributed and improving the pattern files.
 - Improve the database, making webpages deletable.
 - Create a program that checks if books are still available.
 - Create a clustering program that clusters book titles, authors, publishers and
     illustrators.
 - Create a program that is able to scrape unstructured webpages.
 - See if we can fix the memory leak in the crawler, and otherwise create it from
     scratch.

An option for clustering is using dynamic programming. Jasm wanted us to do a little
test and make an example for future use. He wanted us to calculate the longest
common subsequence of two given strings. In my test I used GCCCTAGCG and
GCGCAATG. This gave the following result:




This shows that the longest common subsequence is five characters long, which is the
result we wanted. It was optional to write a traceback. I tried it, but mine takes some
time, and if the two strings you give are long and have a lot of common subsequences it
results in an out-of-memory error. The traceback should be rewritten using a
recursive algorithm.
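
As an illustration (not the original test code), the dynamic-programming table and a small recursive traceback can be written in Python like this:

# Minimal sketch of the longest-common-subsequence calculation with dynamic
# programming, plus a recursive traceback over the filled table.
def lcs_table(a, b):
    """Fill the (len(a)+1) x (len(b)+1) table of LCS lengths."""
    table = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                table[i][j] = max(table[i - 1][j], table[i][j - 1])
    return table

def traceback(table, a, b, i, j):
    """Recursively walk back through the table to recover one LCS."""
    if i == 0 or j == 0:
        return ""
    if a[i - 1] == b[j - 1]:
        return traceback(table, a, b, i - 1, j - 1) + a[i - 1]
    if table[i - 1][j] >= table[i][j - 1]:
        return traceback(table, a, b, i - 1, j)
    return traceback(table, a, b, i, j - 1)

a, b = "GCCCTAGCG", "GCGCAATG"
table = lcs_table(a, b)
print(table[len(a)][len(b)])                    # 5, the length mentioned above
print(traceback(table, a, b, len(a), len(b)))   # one subsequence of that length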




Improving the Scraper
Problem:
At the moment the scraper can only scrape one webpage at a time. If you run two of
them, there is a very high chance that webpages will be scraped twice.
The current pattern files need to be looked at too, because they are incomplete.
The scraper also needs to be adjusted to the new database models.

Assignment:
Things that need to be done:
 - Create a scraper that can process multiple webpages at a time. (Done by me)
 - Improve the pattern files. (Done by Jeroen)
 - Implement the database changes into the scraper. (Done by Jeroen)
 - Properly test everything. (Done by me)
 - Document everything. (Both)


Create a scraper that can process multiple webpages at a time:
We had two ways to do this: one was to make the scraper multi-threaded, the other
was to make a client/server application.

After discussing this for a bit we decided to go for the client/server application. A
multi-threaded scraper could put too much stress on one computer, while with a
client/server application you can divide the work over a lot of computers.

I'll be doing this using Java, sockets and Hibernate.
Java: because the already existing scraper was written in Java too.
Sockets: because that is, in my opinion, the easiest way to do it.
Hibernate is used to connect to the database. The models for this already exist and
it's easy to use.

How the scraper works:
The old scraper got the oldest crawled webpage from the database, ran the html body
of the webpage through an XML parser, then ran the pattern file matching the domain
the webpage came from through another XML parser, and finally matched the two and
filtered out the right data.

Method of working:
I started with making a working socket connection example, which could send and
receive messages.

Over the socket connection I only need to send simple strings in which the first part is
always a number and the rest extra data. The first part is always a number because I
gave both the server and the client a Constants.java class; the two copies are exactly
the same. It contains only integers, with names like WEBPAGEREQUEST and
NEWWEBPAGE, which all have a different value. This is done to make it easy to
identify what kind of message is being sent.
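
Purely as an illustration of that message format (the project keeps these constants in a shared Java Constants class; the helper functions below are made up):

# Every message is a plain string whose first token is a numeric message type;
# the rest of the string is optional payload.
WEBPAGEREQUEST = 1   # client -> server: "give me a webpage to scrape"
NEWWEBPAGE = 2       # server -> client: "scrape webpage <id>", payload is the id

def build_message(msg_type, payload=""):
    return "%d %s" % (msg_type, payload) if payload else str(msg_type)

def parse_message(line):
    parts = line.strip().split(" ", 1)
    return int(parts[0]), (parts[1] if len(parts) > 1 else "")

print(build_message(NEWWEBPAGE, "12345"))   # "2 12345"
print(parse_message("2 12345"))             # (2, '12345')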

After that I needed to implement the client side into the scraper, making
the client connect to the server on startup instead of directly starting to scrape.


Next I had to move the part where the webpage is selected from the database from
the client to the server, and make it so that when the client sends a
WEBPAGEREQUEST, the server queries the database and replies with
NEWWEBPAGE <web id>. When doing this I also needed to make sure that the
selected webpage wasn't already being scraped by another client. Luckily
that was easy with HQL, because you can add lists to a query and make sure that
the object you are querying for isn't in that list.
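
The scraper itself expresses this in HQL through Hibernate; purely as an illustration of the "not in that list" idea, the same filter written with SQLAlchemy (the ORM used for the Python crawler later in this report) and a made-up Webpage model could look like this:

# Sketch of the "don't hand out a webpage another client is already scraping"
# query. Model, column names and values are illustrative, not the project's schema.
from sqlalchemy import create_engine, Column, Integer, String
from sqlalchemy.orm import declarative_base, sessionmaker

Base = declarative_base()

class Webpage(Base):
    __tablename__ = "webpage"
    id = Column(Integer, primary_key=True)
    state = Column(String)

engine = create_engine("sqlite:///:memory:")
Base.metadata.create_all(engine)
session = sessionmaker(bind=engine)()

in_progress_ids = [3, 7]   # webpages already handed to other clients

next_pages = (
    session.query(Webpage)
    .filter(Webpage.state == "unprocessed")
    .filter(~Webpage.id.in_(in_progress_ids))   # the "not in that list" part
    .limit(10)
    .all()
)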

Lastly I needed to make the scraper client process the webpage with the given
webpage id and make it send another WEBPAGEREQUEST after finishing it.

I now had a working server client application where multiple clients could connect to
my server.

Next all I had to do was refactor, test and optimize.

Scraper server:
The scraper server I created has a couple of features:
 - You can decide in the config file what port it should listen to.
 - You can decide the maximum number of clients that can connect.
 - It has a webpage buffer of twice the size of the maximum number of clients that
    can connect.
 - The webpage buffer only holds webpages with the state 'unprocessed'.
 - A client can connect to the server and send messages.
 - When a client sends a WEBPAGEREQUEST, the server sends a simple
    NEWWEBPAGE with the id of a new webpage from the buffer.
 - The server won't send a webpage that is already being scraped by one
    scraper to a different scraper.
 - When a scraper requests a new page or causes a socket error, the webpage it
    was scraping is released, so that, if its state is for some reason still
    'unprocessed', it can be scraped again.

Scraper client:
Client scraper features:
 - When starting up it will try to connect to the server specified in the config file.
 - If it fails to connect it will try again until the number of connection attempts from
     the config file is reached.
 - If it connects successfully it will send a WEBPAGEREQUEST to the server and
     start listening for input from the server.
 - When the client receives a NEWWEBPAGE it will get the id out of the line and
     query the database for the webpage with that id.
 - After it gets the webpage, the webpage goes through the XML parser just like in
     the old scraper.
 - When it finishes scraping a webpage it will send a new WEBPAGEREQUEST to
     the server.
 - If the server disconnects it will finish scraping and then try to reconnect.




Testing/fixing/optimizing:
By this time Jeroen was done with his part, so I could start testing.

Because of the size of the program, the simplest way to test is to just make it process
as many different webpages as possible.

Things I encountered:
The first thing I encountered pretty soon was that the scraper wasn't ready to scrape
languages from webpages, which made the whole application crash. We got this
problem because of the improved pattern files: the code did exist but it was still
incomplete. The database expected an enum with either eng, fra or nld, while our
scraper extracted "English" from webpages and tried to write that to the database. This
was easy to fix by changing the enum to a varchar.

The second thing I encountered were some deadlocks in my server. This was easy to
fix by making the functions synchronized. That is Java's way of making sure that the
parts you made synchronized can't be called by different threads simultaneously.

The next problem I encountered was another database problem: the title of a book
couldn't be longer than 125 characters, which also crashed the whole application. After
googling for the longest book title, I saw that the longest book title at the moment is
3999 characters, so I made the database column take 5000 characters.

After some more testing the scraper got an error because it tried to parse an empty
string as a publication year. There should have been a check for that but there wasn't
one, so I just added one.

The scraper could now run for days without crashing, but then we got problems with
our database: the maximum number of connections was reached for some strange
reason. This was partly because of Hibernate and partly because of faulty code. The
Hibernate config file had a min_connections attribute. I've got no idea why it was set to
a minimum of 15 connections, but it didn't seem to affect the scraper when we changed
it to 0. Also, in the code, sessions weren't closed properly at some points; that was
fixed by closing them. Just to be safe I changed the connection timeout in Hibernate
to 5 seconds so that this wouldn't happen again.

After running the scraper for 3 days we suddenly got an out-of-memory error. This was
because, with the changes I made, the set with the pattern files kept growing. I fixed it
by adding a check to see if the set already contained the pattern file before adding it.

By randomly checking pages that were scraped I noticed that the ISBN was not being
written to the database everywhere. I first checked if the pattern files were correct, but
they weren't the cause of it; it was in the code. When it gets "ISBN" or "isbn10" the
length of the ISBN has to be 10, and when it gets "ISBN 13" the length has to be 13,
but on those pages the ISBN was skipped because the label was plain "ISBN" while the
value had 13 characters. After finding that out it wasn't hard to fix: I just added an extra
check for an ISBN with 13 characters.
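
As a rough illustration of that check (in Python for brevity; the scraper itself is written in Java and the label handling here is simplified):

# Hypothetical sketch of the fixed check: the expected length follows from the
# label, plus the extra rule that a plain "ISBN" label may also carry 13 characters.
def isbn_length_ok(label, value):
    value = value.replace("-", "").replace(" ", "")
    label = label.lower()
    if label == "isbn":
        return len(value) in (10, 13)   # the extra check that was added
    if label == "isbn10":
        return len(value) == 10
    if label in ("isbn 13", "isbn13"):
        return len(value) == 13
    return False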




Availability Checker
Assignment:
Create a program that can check whether a book is still being sold on the webpage
it was scraped from. This is a prototype product. (This product is needed for
making the next product, because that one needs to know when books are available.)

Method of working:
I started with making a book_availability table in the database. This table contains 4
columns: id, begin date, end date and book_id.

Id, begin date and book_id get filled when a book is scraped for the first time. To
make this happen I simply had to adjust the scraper and make it add a new row
when it writes a new book to the database. Begin date gets filled with the date the
page was crawled, not the date it was scraped.

I then download the webpage from which the book was originally scraped and re-
scrape it. When downloading the page fails, or when the scraped title is different from
the book's current title, the end date is filled in.

Getting the webpage was easy. I used the Java URL class for this. You can give this
class a url; if you try to open an input stream and you get a FileNotFoundException, it
means that the page does not exist anymore. When that happens I fill in the end date
in the database.

When opening the input stream to the webpage doesn't fail, I make it go through a
mini scraper. I got most of the code from the scraper we already have. The scraper
uses a ParserDelegator; this Java class can directly take the input stream and, with the
classes, methods and pattern files I've taken from the scraper, it can directly get all
the info out of it.

When the checker has re-scraped the webpage it simply compares the newly
scraped title with the title of the book it's checking. When they aren't the same, or the
title failed to be scraped from the page, I fill the end date column with the date of the check.

Finally I added a new column last_checked_date to the book table in the
database. To decide which book to check next, I select the book with the oldest
last_checked_date. When the book has been checked I update that value to
the current date. This means the program will never stop checking books, until of
course none of the books are available anymore.
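
In outline, one check cycle looks roughly like this (a Python sketch of the logic only; the real checker is written in Java and reuses the scraper's parsing code, and scrape_title(), book and session are hypothetical stand-ins):

import datetime
import urllib.error
import urllib.request

def check_book(book, session):
    """Re-scrape the book's source page and close its availability interval if needed."""
    today = datetime.date.today()
    try:
        html = urllib.request.urlopen(book.source_url).read()
        title = scrape_title(html, book.domain)   # stand-in for the mini scraper
        still_available = (title == book.title)
    except urllib.error.URLError:                 # page gone (e.g. 404) or unreachable
        still_available = False
    if not still_available:
        book.availability.end_date = today        # fill in the end date
    book.last_checked_date = today                # so the oldest book gets checked next
    session.commit()
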
Flaws:
The availability checker has a couple of flaws:
 - When just a field on the webpage changes from 'available' to 'sold', it won't be
    noticed by the checker. I checked all webpages we are scraping and none of them
    has such a field. Most of them serve a page saying the book is no longer
    available; some just return a 404 error, resulting in a FileNotFoundException.
 - The checker makes extra connections to a web domain on top of the crawler.
    This could lead to a ban on the domain. So when running the checker you have
    to do it from a different IP than the crawler, or just make sure the crawler is
    off.


Graph drawer
Assignment:
The assignment was to make a program that can graphically show information about
our books in the form of charts.

Planning:
Since there was only a brief feature list, I had to think of the other features myself. The
planning had three phases. In the first phase I would work on the brief feature
list, and at the end of it we would have a meeting about it and discuss more features.
The same was done in the second and third phase.

Techniques used:
Jython, Hibernate, JFreeChart and SWT.
We went from using Java to Jython, because Jasm wanted to try it out and had
reason to believe we would be able to program faster if we used it. It would go
easier because the syntax is simpler and you have to type less code for most
things.
SWT was used because it's fast and easy to work with.

Method of working:
Jasm and I started with making a rough sketch of how the GUI layout would be and
what functions it would get.




In the picture above you can see what our initial thoughts were.

I started with making the 5 overviews. Jeroen and I made the overviews together to
get familiar with Jython. Every overview has a table with information in it, and all
overviews are in a tab folder so you can easily switch between them. Our first
thought was to show all items on one page. This proved impossible, because we ran
out of memory and it took ages to fill the table.
I fixed this by only showing 100 items per page and adding navigation buttons to go
through the pages easily.

Next, Jasm wanted the tables to be sortable. Because we didn't only want to sort the
100 items you're viewing but all of them, the tables get sorted by changing the
Hibernate query, adding 'order by <columns you're sorting on>', and then refilling the
table with the items you get back.

Because we have so many items in our overviews, a search function was also
necessary.




In our overviews you can also right-click items. When you right-click a book you get:




If you press Show Authors you will go to the Authors overview and see the authors
belonging to that book. The same goes for Show Book Sellers, Publishers and
Illustrators. Goto Webpage will open your default web browser on the page the book
was scraped from. Info will open an info chart.

In the other overviews you get:



Show Books will make you go to the book overview and will only display the books
from the item you clicked. Info will also open an info chart (see legend info).

With the table options button you can choose what information you want to show. In
the picture below you can see what options you get. (Made by Jeroen.)




With the reset table button you can reset the table to the default values.
The overviews get filled with items from a query. When you sort on a column, an order
by clause gets added to it. Hibernate has functions to set from which result you want to
fetch and how many results you want, so if you are on page 5 it's setFirstResult(500)
and setMaxResults(100). This made making the navigation buttons really easy. Doing it
this way also made navigating through search results easy.
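
In Jython this boils down to something like the following sketch (the HQL string, page size and session variable are illustrative rather than the project's exact code):

# 'session' is an open Hibernate Session; one call fetches one overview page.
PAGE_SIZE = 100

def fetch_page(session, page, order_by=None):
    hql = "from Book"
    if order_by:
        hql += " order by " + order_by      # e.g. "title asc" when sorting a column
    query = session.createQuery(hql)
    query.setFirstResult(page * PAGE_SIZE)  # skip the earlier pages
    query.setMaxResults(PAGE_SIZE)          # fetch one page worth of rows
    return query.list()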

Next was making the charts. This is done using JFreeChart. JFreeChart is an easy
and free-to-use library; it's released under the LGPL. The only downside is that
it's built on Swing, and since I'm using SWT the two didn't go well together in the
beginning, but I got the hang of it after a while. With JFreeChart you can make around
20 different kinds of charts. The charts I used are: a time series chart for the line chart,
a bar chart for the normal bar chart and a Gantt chart for the availability bar chart.




Timeseries chart:




When drawing a single book the chart gets all prices for that book, converts them into
the currency you want and draws them. You can choose which currency you want to
draw them in by going to Options > Line Chart Options. You'll get the following frame:




There you can choose in what currency you want to draw your lines and whether or
not you want to chart currency market changes. With Chart Currency Market Changes
enabled, points between the price points will be drawn too. When a price is drawn in
the chart and the currency to draw in is different from its own currency, I first have to
convert it to euro and then to the chosen currency. This is because our currency
updater (which Jeroen made) scrapes a page that contains the values compared to
the euro. When converting to a different currency it takes the currency values left and
right in time, then calculates what the value would be, at the time of the price, if a
straight line were drawn between them.

When drawing the average price of an author, publisher, bookseller or illustrator you
are faced with a problem: none of the prices for the books are scraped at the same
time and none of them form a simple straight line. So how do you calculate the
average? I worked around this by taking 10 snapshots, from when a book of theirs
was first scraped until now, and calculating what the average price would be at those
10 points in time. Because the books from an author can be sold in different
currencies, you have to set the currency option to something other than their own
currency, and the Chart Currency Market Changes option will be ignored.
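
As a sketch of the two calculations just described, linear interpolation between the surrounding points in time and averaging over 10 evenly spaced snapshots, with made-up price data:

# Exchange rates and prices are (time, value) pairs sorted by time.
def interpolate(points, t):
    """Value at time t on the straight line between the two surrounding points."""
    for (t0, v0), (t1, v1) in zip(points, points[1:]):
        if t0 <= t <= t1:
            return v0 if t1 == t0 else v0 + (v1 - v0) * (t - t0) / (t1 - t0)
    # before the first or after the last known point: clamp to the nearest value
    return points[0][1] if t < points[0][0] else points[-1][1]

def average_price_snapshots(price_series, first_scrape, now, snapshots=10):
    """Average price of several books at `snapshots` evenly spaced moments."""
    step = (now - first_scrape) / (snapshots - 1)
    averages = []
    for i in range(snapshots):
        t = first_scrape + i * step
        values = [interpolate(series, t) for series in price_series]
        averages.append((t, sum(values) / len(values)))
    return averages

# e.g. two books' prices as (day number, price in euro):
books = [[(0, 10.0), (30, 12.0)], [(5, 20.0), (25, 18.0)]]
print(average_price_snapshots(books, 0, 30))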

The line chart also has a zooming function. If you want to zoom in on a specific
period of time, you can select it at the bottom of the chart and press zoom.




Bar chart:




In the bar chart you can see how many books of a unique title were available
over a specified period of time.

When you create a new bar chart, the first thing you'll see is:




Here you select the time period over which you want to check.

Now if you drag a book, author, illustrator, publisher or bookseller into the chart, it will
get all unique titles and draw how many of them there were during the period you
selected.




Available chart:




The available chart is done using a Gantt chart; that was the easiest. I first tried
making a normal bar chart horizontal, but that didn't work because it also turned the
axes. You can add a task with a begin and an end date to a Gantt chart and then it
will draw the horizontal bar.

When you drag a book, author, illustrator, publisher or bookseller onto the chart, it will
take all individual books and draw from when until when each was available. If a book
has no end date yet, the end date will be today.

At the bottom of the chart you have a zooming function again: fill in a begin and an end
date and press zoom, and the chart will zoom in on the period of time specified.

Legend:




Every chart has a legend. In the legend you see the items you have drawn.
When you right-click on one you get two options: remove and info. If you press
remove, the line/bar will be removed from the chart. When you press info, information
about the item you clicked will be shown in a new tab item. This is the same as when
you click info in one of the overviews.




This is what the final product looks like:




Feature list:
It can show overviews of all:
 - Authors
 - Books
 - Booksellers
 - Publishers
 - Illustrators

You can adjust which columns you want to see in the overview, by pressing options.

You can search in every separate overview for:
- Author names
- Publisher names
- Book titles
- Bookseller names
- Illustrator names

You can make 3 different charts:
- Line Charts
- Horizontal Bar Charts
- Vertical Bar Charts

You can make a line chart over time of:
- Book prices
- The average book price of an author, bookseller, publisher or illustrator.

You can make a vertical bar chart over time of:
- The number of books with a unique title from an author, bookseller, publisher or
   illustrator.
- The number of books from a unique title.

You can flip through the overview pages with easy-to-use navigation buttons.
Every page shows 500 books.

You can drag a row from any of the overviews to one of the charts you have created.

Line chart:
- When you drag a row from the book overview to the line chart, a single line will
    be drawn: a line with points of crawl time set out against the price at that
    time.
- When you drag a row from one of the other overviews to the line chart, you get
    a choice: you either draw the average book price of the author, or you draw
    all books of that author.

Vertical bar chart:
- When you create a new vertical bar chart you first have to choose a time
    period.
- When you drag a book onto the chart, it will draw how many books of that title
    were available.
- If you drag an author, illustrator, publisher or bookseller onto the chart, it will take
    all unique book titles from it and make bars showing how many books with those
    titles there were during the time period you've chosen.




Horizontal bar chart:
- When you drag a book onto the chart, it will draw from when till when it was
    available.
- If you drag an author, illustrator, publisher or bookseller onto the chart, it will get
    all their books and draw how long each was available.

You can choose in what currency you want to plot lines and whether or not you want
to plot the currency changes between the prices.

All charts have a legend which you can hide. You can right click the items in the
legend and choose if you want to delete it from the chart or show extra information
about it.

Known bugs:
-   When you drag an object with a lot of books to a chart and plot them, the
    progress bar sometimes stops showing if you click on something, and
    everything becomes white. This doesn't mean the program has crashed; it's still
    plotting.
-   When you sort on price you get a book priced in yen at the top.

Todo:
Because we never got to the second testing round it's not complete yet. Some more
things that could be implemented:
- Options for the bar charts.
- A better way of drawing the average price of authors etc.
- When sorting on price in the book overview, put the book which is worth the most
    at the top. We tried this once, but the query took way too much time to complete;
    you might have to add a column with the euro value to the db table, or something
    similar.


Problems I encountered:
-   The first problem I encountered was combining SWT and Swing. This is fixed by
    using the SWT_AWT bridge, which can make a frame that holds the Swing
    components.
-   Calculating the average price was difficult. I decided to take 10 points and my
    boss found that a good solution.
-   Filling the overviews caused out-of-memory errors. Fixed by making them
    navigable per page.
-   JFreeChart has a built-in legend, and separating it proved very difficult. It wasn't
    clear where the shape and color were stored; after a lot of searching I finally
    found it.
-   Not all book titles and names are precisely the same: some have a comma,
    others a dot, and some names are misspelled on webpages. This can't be fixed
    by me; it will be done in the clustering project.
-   The director not testing my product was a problem too, but it didn't really matter,
    because there were other things that needed priority, so instead of working
    on this until the end of my internship I went on to write the new crawler.




Crawler
Assignment:
Create a crawler that crawls web domains and can store the webpages we need.
The crawler server must be able to handle jobs with the following parameters:
 - webpage-size-limit : int
       o This parameter defines the maximum size allowed for accepted
         webpages.
 - crawl-roots : [url]
       o This parameter defines where a crawler should start crawling. There
         could be multiple starting points.
 - max-crawl-depth : int
       o This parameter should prevent infinite crawl loops.
 - max-crawl-breadth : int
       o This parameter should also prevent infinite crawl loops.
 - crawl-domains : [url]
       o This parameter should prevent or allow certain out-of-bounds crawling
         of certain domains.
 - accept-url-patterns : [regexp]
       o If a crawled url matches one of these patterns, the page behind it is
         stored.
 - download-delay : int (millisec)
       o This parameter defines how long a crawler client is to wait between
         downloads.
 - random-download-delay : int (millisec)
       o This random parameter value is added to download-delay.
 - request-job-delay : int (millisec)
       o This defines how long a crawler is to wait before requesting a new job
         when it is idle.
 - random-request-job-delay : int (millisec)
       o Idem, but random; it is added to request-job-delay if set.
 - comply-robot-rules : boolean
       o There is a robots.txt at the '/' path of a domain. Comply or not.
 - inspect-pages : [url]
       o This parameter contains the urls that are to be checked on whether they
         exist or not.
And a crawler client that actually crawls the jobs.

Techniques used:
I will make the new crawler in Python. For the connection between the server and
the client I will use sockets. The programming environment will be the PyDev plugin for
Eclipse. The connection with the database is made with SQLAlchemy, which works
similarly to Hibernate.

Method of working:
Since I had already been programming in Jython, mastering Python wasn't very hard. I
started with making a hello-world server/client example using sockets in Python. This
wasn't very different from making sockets in Java. Sending objects over the connection,
however, was: in Python you use pickle for that. Pickle lets you dump and load an
object, making it possible to send it over the socket connection.
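
For example, sending a small job dictionary over a socket with pickle works roughly like this (the job contents and port are made up):

import pickle
import socket

job = {"crawl-roots": ["http://www.example.com/"], "download-delay": 2000}

server = socket.socket()
server.bind(("localhost", 5000))
server.listen(1)

client = socket.socket()
client.connect(("localhost", 5000))
conn, _ = server.accept()

conn.sendall(pickle.dumps(job))               # dump the object to bytes and send it
received = pickle.loads(client.recv(65536))   # load it back on the other side
print(received["crawl-roots"])                # one recv is enough for a small message;
                                              # a real client reads in a loop
conn.close(); client.close(); server.close()
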
First I had to evaluate whether we missed some job attributes and whether we
needed all the attributes we had. The following were unnecessary:
    o request-job-delay, random-request-job-delay. A delay when getting a job isn't
        necessary because one job takes about 2 days to complete.
    o comply-robot-rules. Most webpages we want have a robots.txt, and those
        robots.txt files specify that the pages we want to crawl are denied. Ignoring
        robots.txt could result in a ban, but because we don't crawl the whole domain,
        only specified pages, it shouldn't matter that much.
    o inspect-pages. This is totally unnecessary because there are thousands
        of pages that need to be checked on availability.
After looking at the Heritrix jobs I decided to add:
    o database-username, database-password, database-host, database-path.
        These are necessary for the database connection.
    o reject-url-patterns. For when you have one webpage with 2 slightly different
        links to it, and they both got past the accept-url-patterns.

Next was changing the old Heritrix jobs to the new format. Basically this was copying
the regexes they used and the database connection.

Then I made an allUrls set and a urlsToCrawl queue and made it possible to send a job
over the socket connection. Next I got all the html code from the root pages
and scraped all urls out of it using regexes. I then check if a link is already in allUrls: if
it is, I ignore the link and continue to the next one; if not, I add it and check it against
the crawl-url-pattern regexes. If the url passes the crawl-url-pattern, I put it at the end
of the urlsToCrawl queue.

I added the crawl-url-pattern because some sites sell more than only books, and to
make sure the crawler only crawls the pages you want you can add a regex for that. If
you leave the list empty it will just accept everything.

Next I just repeated what I did with the roots, only now I also check whether the link I'm
processing gets past the accept-url-patterns and not past the reject-url-patterns. If
it gets past them, the crawler saves the html code of that webpage to the database.
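
Put together, the crawl loop has roughly this shape (a simplified sketch: fetch_html() and store_page() are placeholders, and the real crawler also applies crawl-domains, the depth and breadth limits and the download delay):

import re
from collections import deque

HREF_RE = re.compile(r"""href\s*=\s*["']([^"']+)["']""", re.IGNORECASE)

def matches_any(url, patterns):
    return any(re.search(p, url) for p in patterns)

def crawl(job, fetch_html, store_page):
    all_urls = set(job["crawl-roots"])
    urls_to_crawl = deque(job["crawl-roots"])
    while urls_to_crawl:
        url = urls_to_crawl.popleft()
        html = fetch_html(url)
        if html is None:                    # download failed
            continue
        # pages whose url passes accept- and reject-url-patterns are stored
        if matches_any(url, job["accept-url-patterns"]) and \
                not matches_any(url, job.get("reject-url-patterns", [])):
            store_page(url, html)
        for link in HREF_RE.findall(html):
            if link in all_urls:            # already seen, ignore it
                continue
            all_urls.add(link)
            # an empty crawl-url-pattern list accepts everything
            if not job.get("crawl-url-pattern") or matches_any(link, job["crawl-url-pattern"]):
                urls_to_crawl.append(link)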

Next it was the download delay's turn. Because you don't want too many connections
to a webpage in too short a time, I built in a delay. If the delay minus the time it took to
process a webpage is smaller than zero, the delay gets ignored. Note that downloading
the webpage also takes time; that download time is not counted as time spent
processing the webpage. Because of this, your average time per webpage is always
bigger than your download delay.
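
A sketch of that delay rule (parameter names as in the job format, the helper itself is illustrative):

import random
import time

def wait_between_downloads(job, processing_seconds):
    delay_ms = job["download-delay"] + random.uniform(0, job.get("random-download-delay", 0))
    remaining = delay_ms / 1000.0 - processing_seconds
    if remaining > 0:        # if processing took longer than the delay, don't wait at all
        time.sleep(remaining)

# e.g. a 2000 ms delay when handling the page took 0.4 s -> sleep about 1.6 s
wait_between_downloads({"download-delay": 2000, "random-download-delay": 500}, 0.4)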

The crawler was now ready for testing. After making it run for a bit I found some
mistakes I had made:
   o some links use a ' and others a ": fixed by changing the url-scraping regex
   o some webpages used A HREF, others a href: fixed by making the url-scraping
     regex ignore case
   o not all links look like http://www.somesite.com/somelink/bla.html; there are also
     /somelink/bla.html and somelink/bla.html, which do different things. Fixed this
     by making the links complete: when a link has no domain and starts with a / it
     points to the root of the domain; if it has no domain and doesn't start with a / it
     is relative to where the current page is (see the sketch after this list).
   o Lots of pages were crawled twice. Fixed by adding a remove-url-attributes list to
     the job, containing attributes in a link that can be removed with no effect on the
     page it points to. (Note: don't be too reckless; test, test, test when you add an
     attribute.)
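
A sketch of those two link fixes using Python's urllib.parse (illustrative only; the crawler's own implementation may differ):

from urllib.parse import urljoin, urlparse, urlunparse, parse_qsl, urlencode

def normalize(base_url, link, remove_attributes):
    absolute = urljoin(base_url, link)    # handles /x/y.html as well as x/y.html
    parts = urlparse(absolute)
    # drop the query parameters listed in remove-url-attributes
    kept = [(k, v) for k, v in parse_qsl(parts.query) if k not in remove_attributes]
    return urlunparse(parts._replace(query=urlencode(kept)))

print(normalize("http://www.somesite.com/books/list.html", "/somelink/bla.html", []))
# http://www.somesite.com/somelink/bla.html
print(normalize("http://www.somesite.com/books/list.html", "bla.html?tab=1&id=9", ["tab"]))
# http://www.somesite.com/books/bla.html?id=9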

Jasm mentioned that it might happen that we try to crawl an image instead of html
code. That's why he wanted a webpage-type parameter in the job file, so that you
can specify what content type you want the page to have. It is matched against the
Content-Type header of the webpage; if it doesn't match, the crawler ignores
the webpage.

I made it so that when the server starts up it reads all files in the /jobs/ directory,
dynamically imports them, extracts the jobs from them and stores them in a queue.
When a client asks for a job it sends the first item from the queue. When a client
disconnects or stops, it puts the job it was handling back at the end of the queue.
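
A sketch of that startup step (the /jobs/ layout and the JOBS list inside each job file are assumptions made for this example):

import glob
import importlib.util
from collections import deque

def load_jobs(directory="jobs"):
    queue = deque()
    for path in sorted(glob.glob(directory + "/*.py")):
        spec = importlib.util.spec_from_file_location("job_module", path)
        module = importlib.util.module_from_spec(spec)
        spec.loader.exec_module(module)   # import the job file dynamically
        queue.extend(module.JOBS)         # each job file exposes a list of job dicts
    return queue

jobs = load_jobs()
# hand the first job in the queue to a client:
#   job = jobs.popleft()
# and put it back at the end if that client disconnects or stops:
#   jobs.append(job)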

I could now run multiple crawlers at a time. The next day we got a lot of complaints from
the building saying the internet was slow: if I run too many crawlers I take up all the
bandwidth in the building. This limited my testing a bit.

Next I noticed I was getting timeout errors and html errors, so I decided to add 2
parameters to the existing config file (which contains the server address and port
number): downloadErrorTries and downloadErrorSleepTime. The error tries is how
many times you want to retry downloading the page and the error sleep time is how
long you want to wait between the tries.

I also added an option for whether or not you want the client to start a new job when a
job is finished.


Final job attribute list:
webpage-size-limit
   o int
   o (-1 is size unlimited)
       This parameter defines the maximum size allowed for accepted webpages.
webpage-type
   o list with strings
   o (if you give an empty list all types are accepted) A webpage is only accepted if
     its Content-Type header contains one of the given strings.
crawl-roots
   o list with urls
   o This parameter defines where a crawler should start crawling. There could be
     multiple starting points.
max-crawl-depth
   o int
   o (-1 is unlimited)
     This parameter should prevent infinite crawl loops.
max-crawl-breadth
   o int
   o (-1 is unlimited)
     This parameter should also prevent infinite crawl loops.
crawl-url-pattern
   o list with regex
   o New urls will only be scraped out of pages whose url matches these regexes.
     This is used to prevent the crawling of unnecessary pages, which makes the
     crawler faster. It is also used for sites that sell more than only books (e.g.
     www.powell.com).
crawl-domains
   o list with regex
   o This parameter should prevent or allow certain out-of-bounds crawling of
     certain domains.
accept-url-patterns
   o list with regex
   o If a crawled url matches one of these patterns, the page behind it is
     stored.
reject-url-patterns
   o list with regex
   o If a url gets through the accept-url-patterns it is still checked against these
     regexes. This is done because some urls had a tab=1 in them while the rest of
     the page was the same.
remove-url-attributes
   o list with parameters
   o All parameters specified here are removed from a url. This is done to prevent
     pages being crawled twice. You might want to use this instead of reject-url-
     patterns if the tab or other thing the attribute makes visible doesn't contain any
     links to other pages.
download-delay
   o int in millisec
   o This parameter defines how long a crawler client is to wait between
     downloads.
random-download-delay
   o int in millisec
   o This random parameter value is added to download-delay.
database-username
    o string
    o Database username
database-password
    o string
    o Database password
database-host
   o string
   o Database host ip/address
database-path
    o string
    o Database name




Todo:
-   Create an "allowed" parameter in the jobs.
          o This parameter is a dict with as value a list of parameters and as key
             a list of regexes. If a link matches the regex it should remove all
             attributes except the ones in the value list.
          o E.g. http://www.abebooks.co.uk/servlet/SearchResults?bsi=20&vci=3185100
          o   http://www.abebooks.co.uk/servlet/SearchResults?bsi=40&vci=3185100
          o   http://www.abebooks.co.uk/servlet/SearchResults?bsi=60&vci=3185100
          o   http://www.abebooks.co.uk/servlet/SearchResults?bsi=80&vci=3185100
          o   http://www.abebooks.co.uk/servlet/SearchResults?n=100121501&vci=3185100
          o   http://www.abebooks.co.uk/servlet/SearchResults?fe=on&vci=3185100
          o   http://www.abebooks.co.uk/servlet/SearchResults?sgnd=on&vci=3185100
          o   http://www.abebooks.co.uk/servlet/SearchResults?dj=on&vci=3185100
          o   http://www.abebooks.co.uk/servlet/SearchResults?bi=h&vci=3185100
          o   http://www.abebooks.co.uk/servlet/SearchResults?bi=s&vci=3185100
          o   http://www.abebooks.co.uk/servlet/SearchResults?pics=on&vci=3185100
          o   The first 4 are allowed, but the other 7 give an overview with the same
              items as the first couple. You could put n, fe, sgnd, dj, bi and pics in the
              remove-url-attributes list, but an allowed parameter with bsi and vci and
              the regex .*/SearchResult.* would be easier.

-   Create an error handler which keeps a log of errors.
           o Should probably be done on the server.
-   When a url gives an error it could be handy to be able to find out where the link
    came from.
-   Make it possible for a crawler to continue where it previously was, for
    example when a crawler crashed or was just stopped.
           o The allUrls set and the urlsToCrawl queue have to be saved.
-   Put a limit on how much bandwidth the crawler uses.
-   Combine the availability checker and the crawler.
           o When a job is finished, do a query getting all books that haven't ended
               yet. From those, subtract all pages that were just crawled and check the
               books that are left.
-   Create a priority system, making it possible to give priorities to jobs.


Problems encountered:
One of the web domains we crawl gives a different kind of structure when you don't
send a user agent when accessing the webpage. Fixed by making it possible to
define a user agent in the config file.




Term summary:
Scraper:
Our scraper is a web scraper. A web scraper extracts data/information from
webpages. This is called web scraping. Basically you give it html code and the
scraper extracts the data you want out of it.

Crawler:
The crawler we made is a web crawler or web robot. It browses the World Wide Web
automatically. This is called web crawling or web spidering. Many sites, in particular
search engines, use this to provide up-to-date data.

Java:
Java is a programming language developed by Sun. Java programs are compiled to
bytecode and then run on the JVM (Java Virtual Machine). Java programs can run on
most operating systems.

Python:
Python is a programming language maintained by the Python Software Foundation.
It's a very dynamic language and is used for a lot of different applications, from
scripting to web servers.

Jython:
Jython is a combination of Python and Java. It uses the same syntax as Python
but can also call Java classes. Jython runs on the JVM just like Java.

Hibernate:
Hibernate is an ORM (object-relational mapping) framework for Java. This basically
means it returns objects representing rows in the database, which makes them easy to
work with. E.g. if you want to make a new row in the database:

Book book = new Book();          // maps to a new row in the book table
book.setTitle("Harry Potter");   // sets the title column
session.save(book);              // session is an open Hibernate Session

This creates a new row in the book table with 'Harry Potter' as its title.

SQLAlchemy:
Like Hibernate, this is also an ORM, but for Python.

Heritrix:
Heritrix is an open-source web crawler written in Java. For more information visit its
homepage at http://crawler.archive.org/

SWT/Swing:
SWT(Standard Widget Toolkit) is a java GUI toolkit created by IBM and eclipse.
Swing is also a java GUI toolkit, created by Sun. According to the Eclipse FAQs,
"SWT and Swing are different tools that were built with different goals in mind. The
purpose of SWT is to provide a common API for accessing native widgets across a
spectrum of platforms. The primary design goals are high performance, native look
and feel, and deep platform integration. Swing, on the other hand, is designed to
allow for a highly customizable look and feel that is common across all platforms."



Eclipse:
We use Eclipse Classic as our IDE (integrated development environment). It's free and
easy-to-use Java development software. It's an open-source program, mainly
developed by the Eclipse Foundation.

Regex:
Regex is short for regular expression. Regexes are used to locate patterns in a string.




Reflection:

I wanted to do an internship where I could go deeper into programming, and not just
build small websites but really build a program. That worked out well. I was able to go
deeper into Java and did all kinds of new things with it. I also learned a new language
(Python) and learned how to make two languages work together (using Jython).
Furthermore I got to dig deeper into ORMs, which make working with databases a lot
easier. Finally, I learned to make efficient use of free libraries, which save you a lot of
time.

My second learning goal, besides going deeper into programming, was working
together in a project setting. At school you do get a bit of project-based teamwork, but
it doesn't really amount to much. It has now become clear to me that meetings are
important, and that setting concrete goals and keeping agreements are too. We worked
in a project of 3 people, and at one point we had a prototype of a product finished and
the 'boss' had to test it and give his comments on it, about things that were missing or
things that didn't work well. It took 3 weeks before he finally tested it, while we had
agreed that it would be tested within 1 week. This meant that nothing of the planning
was right anymore. Luckily we still had plenty of other things to do, so we didn't really
run into delays because of it, but it was still quite annoying.

I also learned that it is important to document your work well. Firstly because it cost us
2 weeks to get everything working that the group before us had made. Besides that,
our work within the project sometimes overlapped, and then it saves a lot of time if it is
well documented how you made something or why you did it that way. It was also very
handy while learning Python, because if things are well documented you don't make
mistakes that a fellow project member has already made.

In the rest of my studies I hope to learn more programming languages. Now that I have
learned a new one during my internship, I am interested in learning more of them. That
is why I plan to do Game Technology in my elective semester, because you get C++
there.



