Question: E-Mail scraper: Letting the URL pick the startURL.
Posted on: 12:00 am Sep 5 2010 By: Jesse Smith
In the E-Mail scraper, editing

$pageContent = $this->getContents($this->startURL);

lets you do it, BUT it keeps trying to get the email addresses over and over!

I changed the code to

$pageContent = $this->getContents($_GET['url']);

to try to let the URL, looking like

domain.com/implementation.php?url=http://www.domain.com

tell the script what startURL is. But it keeps trying to get the same E-Mails over and over. (Tested on an HTML file with about 75 E-Mails and no links to other files.)
3 Replies
By Aziz on 8:21 pm Feb 25 2010:
Can you provide the test files (both the class/html)?
By Jesse Smith on 1:16 am Feb 26 2010:
http://www.cmgscc.com/mailscraper/ (I've merged both implementation.php and scraper.class.php into one script, so now there's just scraper.class.php.)

http://www.cmgscc.com/mailscraper/scraper.class.php?url=http://www.domain.com

is for trying to select the crawl URL. If you actually download it and run it, don't have a heart attack when you see the results!! I've made a lot of changes to it!!

emails.shtml has the test E-Mail addresses.

scraper.class.php.txt is both scripts combined; remove the .txt.

email.shtml is where the E-Mail addresses also go, for when it takes too long and I stop the script before it finishes and spits out the results.

email.php.txt empties the email.shtml file.
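
For readers following along, the "write results as you go, then empty the file" setup described above could look roughly like the sketch below. The function names appendEmail and emptyFile are hypothetical and are not taken from the posted script; only the file name email.shtml comes from the post, and the demo writes to the system temp directory rather than the real file.

```php
<?php
// Hypothetical helpers; not the code from scraper.class.php or email.php.

// Append one address to the results file as soon as it is found, so the
// file still holds partial results if the script is stopped early.
function appendEmail($file, $email)
{
    // FILE_APPEND adds to the end instead of overwriting;
    // LOCK_EX takes an exclusive lock while writing.
    return file_put_contents($file, $email . "<br>\n", FILE_APPEND | LOCK_EX) !== false;
}

// Empty the results file by writing an empty string over it.
function emptyFile($file)
{
    return file_put_contents($file, '', LOCK_EX) !== false;
}

// Demo against a throwaway file in the temp directory.
$demoFile = sys_get_temp_dir() . '/email_sketch_' . uniqid() . '.shtml';
appendEmail($demoFile, 'first@example.com');
appendEmail($demoFile, 'second@example.com');
echo file_get_contents($demoFile);   // both addresses, one per line
emptyFile($demoFile);
unlink($demoFile);
```

Writing each address immediately costs a little speed, but nothing found so far is lost when the script is interrupted.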
By Aziz on 10:52 pm Feb 26 2010:
You need to change the start path to the GET variable:

$new->setStartPath($url);

Then revert the following line within the class:

$pageContent = $this->getContents($this->startURL);

Edited file included.
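
A minimal sketch of that change, under stated assumptions: ScraperSketch is a hypothetical stand-in for the real class (only setStartPath and startURL are names from this thread), and pickStartUrl and getStartPath are hypothetical helpers added for illustration. The idea is that the implementation script reads the url parameter, checks that it is an http or https URL, and hands it to setStartPath, so the class itself keeps using $this->startURL.

```php
<?php
// Hypothetical stand-in for the scraper class. Only setStartPath and
// startURL are names from the thread; everything else is assumed.
class ScraperSketch
{
    private $startURL = '';

    public function setStartPath($url)
    {
        $this->startURL = $url;
    }

    // Hypothetical accessor, here only so the result can be inspected.
    public function getStartPath()
    {
        return $this->startURL;
    }
}

// Hypothetical helper: return the "url" query parameter when it is a
// valid http/https URL, otherwise return the fallback.
function pickStartUrl(array $query, $fallback)
{
    if (!isset($query['url']) || !is_string($query['url'])) {
        return $fallback;
    }
    $url = trim($query['url']);
    if (filter_var($url, FILTER_VALIDATE_URL) === false) {
        return $fallback;
    }
    $scheme = strtolower((string) parse_url($url, PHP_URL_SCHEME));
    return in_array($scheme, array('http', 'https'), true) ? $url : $fallback;
}

// In implementation.php: take the start URL from ?url=..., falling back
// to a default when the parameter is missing or not usable.
$url = pickStartUrl($_GET, 'http://www.domain.com');
$new = new ScraperSketch();
$new->setStartPath($url);
```

Validating the parameter before use is an addition of this sketch, not part of the original fix; it keeps the script from being pointed at non-web schemes through the query string.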
