Arale, a Java web spider
Arale can download entire web sites or specific resources from the web. Arale can also render dynamic sites to static pages. I wrote this utility in 2001 to familiarize myself with the java.net.* package. I’m not actively maintaining it anymore and the code is rather messy, but the spider works fine.
Areas of interest
- Web development
- Advanced web browsing
Features
- Download and scan user-defined file types.
- Rename dynamic resources. Encode query strings into filenames.
- Set the number of simultaneous connections.
- Options for minimum and maximum file size.
- Domain depth support.
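As a rough illustration of how the file-type and size options above might combine, here is a minimal Java sketch. The class name DownloadFilterSketch, its constructor parameters and the byte limits are hypothetical; they are not taken from Arale's source or its properties files.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

/** Hypothetical filter; names and parameters are illustrative, not Arale's own. */
final class DownloadFilterSketch {
    private final List<String> extensions; // lower-case, e.g. ".jpg", ".zip"
    private final long minBytes;           // smallest file size to accept
    private final long maxBytes;           // largest file size to accept

    DownloadFilterSketch(List<String> extensions, long minBytes, long maxBytes) {
        this.extensions = new ArrayList<>();
        for (String ext : extensions) {
            this.extensions.add(ext.toLowerCase(Locale.ROOT));
        }
        this.minBytes = minBytes;
        this.maxBytes = maxBytes;
    }

    /** Returns true when the path has a wanted extension and the size is within limits. */
    boolean accepts(String path, long sizeBytes) {
        String lower = path.toLowerCase(Locale.ROOT);
        boolean typeWanted = false;
        for (String ext : extensions) {
            if (lower.endsWith(ext)) {
                typeWanted = true;
                break;
            }
        }
        return typeWanted && sizeBytes >= minBytes && sizeBytes <= maxBytes;
    }
}
```

With a filter built from the extensions ".jpg" and ".zip" and limits of 1024 and 1000000 bytes, a 2048-byte photo.JPG would be accepted, while index.html or a 10-byte thumbnail would be skipped.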
While many bots around are focused on page indexing, Arale is primarily designed for personal use. It fits the needs of advanced web surfers and web developers. Some real-life cases are:
- downloading only images, videos, mp3 or zip files from a site.
- retrieving manuals, articles or ebooks that are split across many files to discourage downloading.
- getting past user-unfriendly sites, where popups, banners and tricky scripts annoy you before you can download a resource.
Multithreaded means that Arale can download more than one file simultaneously. Arale can easily saturate your bandwidth, thus providing the fastest possible download speed for your internet connection.
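The idea of a fixed number of simultaneous connections can be sketched with a thread pool. This is not Arale's actual threading code: it uses java.util.concurrent, which was added to Java after Arale was written, and the class and method names (ParallelDownloadSketch, runAll, maxConnections) are made up for this example. Each task stands in for one download.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

/** Hypothetical sketch of bounded parallel downloads; not taken from Arale. */
final class ParallelDownloadSketch {

    /** Runs all tasks with at most maxConnections running at once; results keep task order. */
    static <T> List<T> runAll(List<Callable<T>> tasks, int maxConnections) {
        ExecutorService pool = Executors.newFixedThreadPool(maxConnections);
        try {
            List<T> results = new ArrayList<>();
            // invokeAll blocks until every task has finished.
            for (Future<T> done : pool.invokeAll(tasks)) {
                results.add(done.get());
            }
            return results;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while downloading", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("a download task failed", e.getCause());
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        List<Callable<String>> tasks = new ArrayList<>();
        // Placeholder tasks; a real task would open a connection and save the file.
        tasks.add(() -> "first file done");
        tasks.add(() -> "second file done");
        tasks.add(() -> "third file done");
        for (String line : runAll(tasks, 2)) {
            System.out.println(line);
        }
    }
}
```

A larger pool only helps until the connection is saturated, so the best value for maxConnections depends on your bandwidth and on the server being downloaded from.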
If you’re developing dynamic sites using technologies such as JSP, PHP, ASP or whatever, you may be interested in rendering dynamic pages to static files.
Arale supports URL renaming: the query string is encoded in the static filename and an .html extension is appended. For example:
- original URL:
mypage.jsp?myparam=myvalue
- static filename:
mypage.jsp!myparam=myvalue.html
Existing links to renamed URLs are replaced with the modified links, which preserves navigation among the static files. Once a dynamic site has been transformed into a set of static files, it can be deployed on a server that does not support dynamic pages. For example, you could deploy a JSP site on free web space.
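The renaming and link substitution described above could look roughly like the following Java sketch. It assumes the original URL in the example is mypage.jsp?myparam=myvalue, that only the '?' needs to be replaced with '!', and that links appear as double-quoted href attributes. How Arale treats other characters such as '&' or '/' is not covered here, and the names StaticNameSketch, toStaticName and rewriteLinks are invented for this example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper; illustrates the renaming idea, not Arale's real code. */
final class StaticNameSketch {

    // Simplified: matches href="..." values that contain a query string and no fragment.
    private static final Pattern HREF_WITH_QUERY =
            Pattern.compile("href=\"([^\"?#]+\\?[^\"#]*)\"");

    /** Encodes the query string into the name: '?' becomes '!' and ".html" is appended. */
    static String toStaticName(String url) {
        int q = url.indexOf('?');
        if (q < 0) {
            return url; // no query string, nothing to rename
        }
        return url.substring(0, q) + "!" + url.substring(q + 1) + ".html";
    }

    /** Rewrites matching links in a page so they point at the renamed static files. */
    static String rewriteLinks(String html) {
        Matcher m = HREF_WITH_QUERY.matcher(html);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String replacement = "href=\"" + toStaticName(m.group(1)) + "\"";
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        // prints mypage.jsp!myparam=myvalue.html
        System.out.println(toStaticName("mypage.jsp?myparam=myvalue"));
        System.out.println(rewriteLinks("<a href=\"mypage.jsp?myparam=myvalue\">next</a>"));
    }
}
```

Links without a query string are left untouched, so pages that were already static keep their original names.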
Currently Arale is a command-line tool. It would be nice to develop a GUI for it. I’d like to have some feedback from users, so if you think it’s worth it, send me an email and tell me what you think. ;)
I’ve been using arale for quite some time now. It’s a really nice and easy way to capture an entire website. I’ve used it to capture documentation from websites when this documentation isn’t available offline, to capture sound samples from websites of how to pronounce foreign words, and so on… It comes in handy quite often…
Hi,
Can I use Arale to download web pages for offline viewing, for example to download news.bbc.co.uk to local disk?
How do I install it on Linux? When I download arale.zip there is no executable file.
I’ve used Arale for a few years now, very useful, and easy to use. Great work.
one more thing, the url was broken the last time I tried to download this, the one that worked for me was: http://flavio.tordini.org/download/arale.zip
I’ve been using Arale for a long time. It is simple but powerful. Great work!
Is there a document that explains what the settings in the properties files do?
Are there any instructions on how to use Arale?
I had to modify arale.bat so it would run on my machine.
…
if not "%JAVA_HOME%" == "" goto gotJavaHome
->SET JAVA_HOME="C:\Program Files\Java\jre6"
->goto gotJavaHome
echo You must set JAVA_HOME to point at your Java Development Kit installation
…
echo Using ARALE_HOME: %ARALE_HOME%
-> is a line inserted and <- is a line removed.
Help please, how do I run Arale on Windows Vista? I set the home directories for Java and Arale like this ……
if not "%JAVA_HOME%" == "" goto gotJavaHome
set JAVA_HOME="C:\Program Files\Java\jre6"
echo You must set JAVA_HOME to point at your Java Development Kit installation
goto cleanup
:gotJavaHome
if not "%ARALE_HOME%" == "" goto gotAraleHome
echo ARALE_HOME is not set. Setting ARALE_HOME to current directory
set ARALE_HOME="C:\Users\Desktop\arale"
:gotAraleHome
…and it still won’t run
Thanks!
thanks for your great gift, I will try it soon!
I tried using the Arale web crawler, but when I run the application from the Eclipse IDE I get an error while the connection is being established. Please let me know if you have come across this issue and, if so, how to fix it.
ERROR:
Using properties: arale.properties
globalTokens: [.htm, .shtml, .php, .jsp, .jpg, .asp]
Setting log file to arale.log
Output directory: D:\….output
[Thread-0] Thread-0 start
[Thread-0] connecting to http:
Exception in thread "Thread-0" java.lang.IllegalArgumentException: protocol = http host = null
at sun.net.spi.DefaultProxySelector.select(DefaultProxySelector.java:146)
at sun.net.www.protocol.http.HttpURLConnection.plainConnect(HttpURLConnection.java:724)
at sun.net.www.protocol.http.HttpURLConnection.connect(HttpURLConnection.java:654)
at org.flaviotordini.arale.AraleUtilities.getValidConnection(AraleUtilities.java:25)
at org.flaviotordini.arale.AraleThread.process(AraleThread.java:86)
at org.flaviotordini.arale.AraleThread.run(AraleThread.java:42)
at java.lang.Thread.run(Thread.java:619)
Hello, I tried it and encountered an OutOfMemory exception.