Posted on: 12:00 am Sep 5 2010 By: Jesse Smith
In the E-Mail scraper, editing
$pageContent = $this->getContents($this->startURL);
lets you do it, BUT it keeps trying to get the email addresses over and over!
I changed the code to
$pageContent = $this->getContents($_GET['url']);
to try to let the URL, looking like
domain.com/implementation.php?url=http://www.domain.com
tell the script what startURL is. But it keeps trying to get the same E-mails over and over and over... (Tested on an HTML file with about 75 E-Mails and no links to other files.)
By Aziz on 8:21 pm Feb 25 2010:
Can you provide the test files (both the class/html)?
By Jesse Smith on 1:16 am Feb 26 2010:
http://www.cmgscc.com/mailscraper/ (I've merged both implementation.php and scraper.class.php into one script, so now there's just scraper.class.php.)
http://www.cmgscc.com/mailscraper/scraper.class.php?url=http://www.domain.com
for trying to select the crawl URL. If you actually download it and run it, don't have a heart attack when you see the results!! I've made a lot of changes to it!!
emails.shtml has the test E-Mail addresses.
scraper.class.php.txt is both scripts combined, remove the .txt.
email.shtml is where the E-Mail addresses also go, for when it takes too long and I stop the script before it finishes and spits out the results.
email.php.txt empties the email.shtml file.
By Aziz on 10:52 pm Feb 26 2010:
You need to change the start path to the GET variable:
$new->setStartPath($url);
Then revert the following line within the class to its original form:
$pageContent = $this->getContents($this->startURL);
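For anyone following along, here is a rough sketch of how the two changes above might fit together. It is not the edited file itself. Only setStartPath(), startURL and the url query parameter come from this thread; the Scraper class body, getStartPath() and readStartUrl() are hypothetical stand-ins, and the http/https check is an added assumption rather than something the original script is known to do.

```php
<?php
// Hypothetical stand-in for the scraper class. Only setStartPath() and
// $startURL are names taken from the thread; the rest is illustrative.
class Scraper
{
    private $startURL = '';

    // Store the crawl start URL on the object, so the class can keep using
    // $this->startURL internally instead of reading $_GET directly.
    public function setStartPath($url)
    {
        $this->startURL = $url;
    }

    // Hypothetical accessor, added only so the sketch can be checked.
    public function getStartPath()
    {
        return $this->startURL;
    }
}

// Hypothetical helper: pull ?url=... out of a query array and accept it
// only if it is a well-formed http or https URL. Returns null otherwise.
function readStartUrl(array $query)
{
    $url = isset($query['url']) ? trim($query['url']) : '';
    if (filter_var($url, FILTER_VALIDATE_URL) === false) {
        return null;
    }
    $scheme = strtolower((string) parse_url($url, PHP_URL_SCHEME));
    return in_array($scheme, array('http', 'https'), true) ? $url : null;
}

// Implementation side: read the GET variable once and pass it to the class.
$url = readStartUrl($_GET);
if ($url !== null) {
    $new = new Scraper();
    $new->setStartPath($url);
}
```

The idea is that the query string is read in one place, in the implementation part, and the class only ever sees the value through setStartPath().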
Edited file included.
PHP Code: