I often analyze sites with thousands of pages " per folder.
Large sites.
With nested directories.
Just to make it " you know " more complicated fun.
Although I might have a pretty good mental image of how the site looks, knowing how it looks and showing how it is structured are much better.
Heres how you can do that with Excel 2007 " and of course it works for small(er) sites too.
Get A List Of URLs
Crawl the site, download its XML sitemap, scrape Google: do anything to get as many URLs covering as much ground as you can.
Dont forget that you can import or paste almost anything into Excel and it will make (a lot of) sense of it.
For the purpose of this post Ive downloaded the sitemap of Allied American University.
Split Into Folders
Select the column with the URLs.
On the data tab, select Text-to-Columns.
This brings up the first screen of the convert wizard.
Make sure delimited is selected and click Next.
Select other as delimiter and enter the backslash. Click Finish.
Click OK.
Youll now several columns, some containing directories, some containing filenames.
Remove Filenames
Select a column and click Filter.
Select the text filter "> ends with.
In this example case were filtering filenames with .aspx
Select and delete the filtered results in that column.
Repeat for as many columns as you have.
Remove Duplicates
View #1: To get a view of the site structure, select all columns and click Remove Duplicates.
View #2: Or, to get a view just of the different levels/folders, select one column at a time and when you click Remove Duplicates, dont expand the selection: opt to continue with the current selection.
Click Remove Duplicates, confirm your selection and Excel leaves you with unique directory names only.
For view #1 youre done at this point. For view #2, repeat for as many columns as you have.
The Results.
View #1
A clear view on the site structure. You can apply various sorts or filters at this point.
View #2
Having sorted them alphabetically for cleaner presentation as well as having sized the columns for readability, the end result is a clean overview of the different levels of the site.
View #2 can give you a better idea of the many different levels there are to a site. Its a great view for large sites like directories as a lot of the permutations are removed.
View #1 on the other hand is more true to the actual site structure.