Arale, a Java web spider

Arale can download entire web sites or specific resources from the web. Arale can also render dynamic sites to static pages. I wrote this utility in 2001 to familiarize myself with the java.net.* package. I’m not actively maintaining it anymore; the code is rather messy, but the spider works fine.

Areas of interest

  • Web development
  • Advanced web browsing

Features

  • Download and scan user-defined file types.
  • Rename dynamic resources. Encode query strings into filenames.
  • Set the number of simultaneous connections.
  • Options for minimum and maximum file size.
  • Domain depth support.
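
The file type and file size options can be pictured as a simple accept/reject check. The following is only a rough sketch of that idea, not Arale's actual code: the class `DownloadFilter`, its constructor arguments and the `accepts` method are hypothetical names made up for illustration, and the real settings live in Arale's properties files.

```java
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper, not part of Arale: accepts a resource only if its
// extension is in a user-defined set and its size is within the given bounds.
final class DownloadFilter {
    private final Set<String> extensions = new HashSet<>();
    private final long minBytes;
    private final long maxBytes;

    DownloadFilter(long minBytes, long maxBytes, String... extensions) {
        this.minBytes = minBytes;
        this.maxBytes = maxBytes;
        for (String e : extensions) {
            // Compare extensions case-insensitively (an assumption of this sketch).
            this.extensions.add(e.toLowerCase(Locale.ROOT));
        }
    }

    boolean accepts(String path, long sizeInBytes) {
        int dot = path.lastIndexOf('.');
        if (dot < 0) {
            return false; // no extension: rejected in this sketch
        }
        String ext = path.substring(dot).toLowerCase(Locale.ROOT);
        return extensions.contains(ext)
                && sizeInBytes >= minBytes
                && sizeInBytes <= maxBytes;
    }

    public static void main(String[] args) {
        DownloadFilter filter = new DownloadFilter(1024, 10_000_000, ".mp3", ".zip");
        System.out.println(filter.accepts("song.MP3", 5000));  // prints true
        System.out.println(filter.accepts("page.html", 5000)); // prints false
    }
}
```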

While many bots around are focused on page indexing, Arale is primarily designed for personal use. It fits the needs of advanced web surfers and web developers. Some real-life cases are:

  • downloading only images, videos, MP3 or ZIP files from a site.
  • downloading manuals, articles or ebooks that are split across many files to discourage download.
  • getting resources from user-unfriendly sites, where popups, banners and tricky scripts annoy you before you can download anything.

Arale is multithreaded, meaning it can download more than one file simultaneously. It can easily saturate your bandwidth, thus providing the fastest possible download speed for your internet connection.
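
To give an idea of what a capped number of simultaneous connections looks like in Java, here is a minimal sketch. It uses the standard `java.util.concurrent` executor, which is not necessarily how Arale does it internally. The class `ParallelFetch`, the method `fetchAll` and the `fetcher` parameter are hypothetical names; the fetcher stands in for whatever actually downloads a URL.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

// Hypothetical sketch, not Arale's code: runs at most maxConnections
// fetches at the same time and returns the results in input order.
final class ParallelFetch {
    static <T> List<T> fetchAll(List<String> urls, int maxConnections,
                                Function<String, T> fetcher) {
        // A fixed pool caps the number of fetches in flight.
        ExecutorService pool = Executors.newFixedThreadPool(maxConnections);
        try {
            List<Future<T>> pending = new ArrayList<>();
            for (String url : urls) {
                pending.add(pool.submit(() -> fetcher.apply(url)));
            }
            List<T> results = new ArrayList<>();
            for (Future<T> future : pending) {
                results.add(future.get()); // waits for this fetch to finish
            }
            return results;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while fetching", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("a fetch failed", e.getCause());
        } finally {
            pool.shutdown();
        }
    }
}
```

Passing the fetcher in as a function keeps the sketch free of real network calls, so the concurrency cap can be tried out with a dummy function first.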

If you’re developing dynamic sites using technologies such as JSP, PHP or ASP, you may be interested in rendering dynamic pages to static files.
Arale supports URL renaming: the query string is encoded in the static filename and an .html extension is appended. Here is an example:

  • original URL: mypage.jsp?myparam=myvalue
  • static filename: mypage.jsp!myparam=myvalue.html
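
The renaming rule can be sketched in a few lines of Java. This is an illustration under assumptions, not Arale's implementation: it assumes the original URL carries its query string after a `?`, that the `?` becomes `!` and that `.html` is appended, as in the static filename above. How other characters (such as `&`) are treated is not stated here, so the sketch leaves them untouched. `StaticNames` and `toStaticFilename` are hypothetical names.

```java
// Hypothetical sketch, not Arale's code: encodes the query string into
// the filename and appends ".html".
final class StaticNames {
    private StaticNames() {}

    static String toStaticFilename(String resource) {
        int q = resource.indexOf('?');
        if (q < 0) {
            // Assumption: resources without a query string keep their name.
            return resource;
        }
        return resource.substring(0, q) + "!" + resource.substring(q + 1) + ".html";
    }

    public static void main(String[] args) {
        // prints mypage.jsp!myparam=myvalue.html
        System.out.println(toStaticFilename("mypage.jsp?myparam=myvalue"));
    }
}
```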

Existing links to renamed URLs are substituted with modified links. This preserves navigation among static files. Once a dynamic site is transformed into a set of static files, it can be deployed on a server that does not support dynamic pages. For example, you may deploy a JSP site on free web space.
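
Link substitution could look roughly like the sketch below. It is a simplified illustration, not what Arale actually does: it only looks at double-quoted `href` attributes, uses a regular expression instead of a real HTML parser, assumes the `?` to `!` plus `.html` renaming rule, and would also rewrite links pointing to other sites. `LinkRewriter`, `rename` and `rewriteLinks` are hypothetical names.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch, not Arale's code: rewrites href values so that
// links to dynamic pages point at their renamed static files.
final class LinkRewriter {
    private static final Pattern HREF = Pattern.compile("href=\"([^\"]*)\"");

    private LinkRewriter() {}

    // Assumed renaming rule: '?' becomes '!' and ".html" is appended.
    static String rename(String url) {
        int q = url.indexOf('?');
        return q < 0 ? url : url.substring(0, q) + "!" + url.substring(q + 1) + ".html";
    }

    static String rewriteLinks(String html) {
        Matcher m = HREF.matcher(html);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String replacement = "href=\"" + rename(m.group(1)) + "\"";
            // quoteReplacement keeps '$' and '\' in URLs from being
            // interpreted as group references.
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }
}
```

With this sketch, `<a href="mypage.jsp?myparam=myvalue">` becomes `<a href="mypage.jsp!myparam=myvalue.html">`, while links without a query string are left alone.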

Currently Arale is a command-line tool. It would be nice to develop a GUI for it. I’d like to have some feedback from users, so if you think it’s worth it, send me an email and tell me what you think. ;)

Requirements

Download

  1. Greg Kellum says:

    I’ve been using arale for quite some time now. It’s a really nice and easy way to capture an entire website. I’ve used it to capture documentation from websites when this documentation isn’t available offline, to capture sound samples from websites of how to pronounce foreign words, and so on… It comes in handy quite often…

  2. roy says:

    Hi,
    can I use Arale to download web pages for offline viewing, for example to download
    news.bbc.co.uk to local disk?

    How do I install it on Linux? When I download arale.zip there is no executable file.

  3. cyleft says:

    I’ve used Arale for a few years now, very useful, and easy to use. Great work.

  4. cyleft says:

    one more thing, the url was broken the last time I tried to download this, the one that worked for me was: http://flavio.tordini.org/download/arale.zip

  5. guangfei du says:

    I’ve been using arale for a long time. It is simple, but powerful. Great work!

  6. Barry LaLone says:

    Is there a document that explains what the settings in the properties files do?

  7. alxqa says:

    Are there any instructions on how to use arale?

  8. crone says:

    I had to modify arale.bat so it would run on my machine.

    if not "%JAVA_HOME%" == "" goto gotJavaHome
    ->SET JAVA_HOME="C:\Program Files\Java\jre6"
    ->goto gotJavaHome
    echo You must set JAVA_HOME to point at your Java Development Kit installation

    echo Using ARALE_HOME: %ARALE_HOME%

    -> is a line inserted and <- is a line removed.

  9. GALVIN says:

    Help pls, how do I run Arale on Windows Vista? I added the home paths for Java and Arale like this ……

    if not "%JAVA_HOME%" == "" goto gotJavaHome
    set JAVA_HOME="C:\Program Files\Java\jre6"
    echo You must set JAVA_HOME to point at your Java Development Kit installation
    goto cleanup
    :gotJavaHome

    if not "%ARALE_HOME%" == "" goto gotAraleHome
    echo ARALE_HOME is not set. Setting ARALE_HOME to current directory
    set ARALE_HOME="C:\Users\Desktop\arale"
    :gotAraleHome

    ….and it still won’t run

  10. study says:

    Thanks!

  11. robert says:

    thanks for your great gift, I will try it soon!

  12. Ram says:

    I tried using the arale webcrawler, but when I run the application from the Eclipse IDE I get an error while the connection is being established. Please let me know if you have come across this issue and, if so, what the fix is.

    ERROR:
    Using properties: arale.properties
    globalTokens: [.htm, .shtml, .php, .jsp, .jpg, .asp]
    Setting log file to arale.log
    Output directory: D:\….output
    [Thread-0] Thread-0 start
    [Thread-0] connecting to http:
    Exception in thread "Thread-0" java.lang.IllegalArgumentException: protocol = http host = null
    at sun.net.spi.DefaultProxySelector.select(DefaultProxySelector.java:146)
    at sun.net.www.protocol.http.HttpURLConnection.plainConnect(HttpURLConnection.java:724)
    at sun.net.www.protocol.http.HttpURLConnection.connect(HttpURLConnection.java:654)
    at org.flaviotordini.arale.AraleUtilities.getValidConnection(AraleUtilities.java:25)
    at org.flaviotordini.arale.AraleThread.process(AraleThread.java:86)
    at org.flaviotordini.arale.AraleThread.run(AraleThread.java:42)
    at java.lang.Thread.run(Thread.java:619)

  13. bachelor says:

    Hello, I tried it and ran into an OutOfMemory exception.