I need someone to write some software that will archive every listing posted on a particular website and use that information as described in the features section of this post.
Basic logic of program:
1. Send a request to a website that returns listings in XML format
2. Check each listing against a MySQL database
3. Send a web request to each new listing individually to get all of its information
4. Run features 1, 2 and 3 (explained in detail below)
5. Upload images from the listings to Amazon S3
6. Add the information for each listing to a MySQL database
7. Sleep before looping back to step 1 (see feature 4)
Limitations:
The website returns at most 20 listings per request (step 1). If every listing on a page is new, keep requesting the next page of listings until a previously archived listing is found, so that no listings are missed. (During peak times, more than 20 listings can be posted within the minimum sleep period of 2 minutes.)
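A minimal sketch of the paging rule described above, written in Python purely for illustration (PHP is still preferred for the real job). The names `collect_new_listings`, `fetch_page`, `known_ids` and `max_pages` are hypothetical: `fetch_page` stands in for the XML request and returns listing IDs newest first, and `known_ids` stands in for the MySQL lookup.

```python
from typing import Callable, List, Set

PAGE_SIZE = 20  # the site returns at most 20 listings per request


def collect_new_listings(
    fetch_page: Callable[[int], List[str]],
    known_ids: Set[str],
    max_pages: int = 50,  # hypothetical safety cap so a site fault cannot loop forever
) -> List[str]:
    """Request pages until one contains a listing that is already archived."""
    new_ids: List[str] = []
    seen_this_run: Set[str] = set()
    for page in range(1, max_pages + 1):
        ids = fetch_page(page)
        for listing_id in ids:
            # Skip archived listings and duplicates caused by listings
            # shifting between pages while we are paging.
            if listing_id not in known_ids and listing_id not in seen_this_run:
                seen_this_run.add(listing_id)
                new_ids.append(listing_id)
        reached_archive = any(listing_id in known_ids for listing_id in ids)
        # Stop once we overlap with the archive, or the site runs out of listings.
        if reached_archive or len(ids) < PAGE_SIZE:
            break
    return new_ids
```

Stopping only when a page overlaps with the archive (rather than after a fixed number of pages) is what guarantees that nothing is missed during peak times.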
Features:
1. Create a table that tracks listings that are from the same user (identified by two values found in the listing). Keep a tally of how many listings that user has posted and a tally of how many of those listings are unique. (I suggest this is done on a separate thread so as not to slow down the scraping.)
2. If enabled, check each new listing's price against comparable listings on another website (web request to an API), and calculate the average value of comparable listings using the archive of listings in my database. Compare the listing's price against these figures to decide whether it is undervalued by a configurable amount or percentage, and if so send an alert (Amazon SNS and database entry). (This must be done on a separate thread so as not to slow down the scraping.)
3. Check each listing against search criteria, which can be configured by adding rows of criteria to a MySQL database, and send an alert (Amazon SNS and database entry) if a new listing satisfies those criteria. (These will be simple criteria, such as whether the listing's price is >100, or whether the listing is a specific model, etc.) (This must be done on a separate thread so as not to slow down the scraping.)
4. Automatically adjust the sleep time to minimize the number of pages requested before previous listings are found (explained in limitations). The sleep time has a minimum of 2 minutes, a maximum of 15 minutes from 7AM to 11PM, and a maximum of 2 hours from 11PM to 7AM.
5. Once daily, check each active listing in the database against the website to see whether the listing has been updated or deleted. If it has been updated, save the changes to the database as a new row. If it has been deleted, change its status in the database so the listing will not be checked again. (I suggest this be a separate script run by a cron job.)
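To make features 2, 3 and 4 more concrete, here is a rough sketch of the decision logic only, in Python for illustration. Everything named here is an assumption, not a required design: `is_undervalued` uses a plain mean and a percentage threshold, `matches` assumes each criteria row has `field`, `op` and `value` columns, and `next_sleep_seconds` assumes a simple rule that aims for about 15 new listings per cycle so one page is usually enough. Only the 2 minute, 15 minute and 2 hour bounds come from the description above.

```python
import operator
from datetime import time
from statistics import mean
from typing import Any, Dict, Sequence

MIN_SLEEP = 2 * 60         # 2 minutes
DAY_MAX = 15 * 60          # 15 minutes, 7AM to 11PM
NIGHT_MAX = 2 * 60 * 60    # 2 hours, 11PM to 7AM

# Hypothetical operator column values for the criteria table.
OPERATORS = {">": operator.gt, "<": operator.lt, "=": operator.eq,
             ">=": operator.ge, "<=": operator.le}


def is_undervalued(price: float, comparable_prices: Sequence[float],
                   threshold_pct: float) -> bool:
    """Feature 2: price is at least threshold_pct percent below the mean."""
    if not comparable_prices:
        return False  # nothing to compare against, so never alert
    return price <= mean(comparable_prices) * (1 - threshold_pct / 100)


def matches(listing: Dict[str, Any], criterion: Dict[str, Any]) -> bool:
    """Feature 3: test one criteria row against one listing."""
    value = listing.get(criterion["field"])
    if value is None:
        return False
    return OPERATORS[criterion["op"]](value, criterion["value"])


def next_sleep_seconds(last_sleep: int, new_count: int, now: time,
                       target: int = 15) -> int:
    """Feature 4: scale the sleep so about `target` listings arrive per cycle."""
    ceiling = DAY_MAX if time(7, 0) <= now < time(23, 0) else NIGHT_MAX
    if new_count <= 0:
        return ceiling  # nothing new was posted, so wait as long as allowed
    wanted = target * last_sleep / new_count
    return int(min(max(wanted, MIN_SLEEP), ceiling))
```

For example, if 30 new listings arrived during a 600 second sleep, this rule would halve the next sleep to 300 seconds. The alerting itself (Amazon SNS and the database entry) and the threading are left out of this sketch.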
Requirements:
1. Must run on a Linux server
2. Error handling (website down, website responds with unexpected data, etc.)
3. Log activity/errors in a text file. Send an alert if errors occur (Amazon SNS and entry into database)
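One possible shape for the logging and alerting requirement, sketched in Python with the standard `logging` module. The names `AlertHandler`, `build_logger` and `send_alert` are hypothetical; `send_alert` is a placeholder callback where the Amazon SNS publish and the database insert would go.

```python
import logging
from typing import Callable


class AlertHandler(logging.Handler):
    """Forwards ERROR (and higher) records to an alert callback."""

    def __init__(self, send_alert: Callable[[str], None]) -> None:
        super().__init__(level=logging.ERROR)
        self._send_alert = send_alert

    def emit(self, record: logging.LogRecord) -> None:
        try:
            self._send_alert(self.format(record))
        except Exception:
            # A failing alert must not take the scraper down with it.
            self.handleError(record)


def build_logger(log_path: str, send_alert: Callable[[str], None],
                 name: str = "scraper") -> logging.Logger:
    """Log all activity to a text file and raise an alert on errors."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    logger.handlers.clear()   # avoid duplicate handlers if called twice
    logger.propagate = False

    file_handler = logging.FileHandler(log_path)
    file_handler.setFormatter(
        logging.Formatter("%(asctime)s %(levelname)s %(message)s"))
    logger.addHandler(file_handler)
    logger.addHandler(AlertHandler(send_alert))
    return logger
```

With this arrangement, routine activity goes only to the text file, while anything logged at error level goes to the file and also triggers an alert.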
The program can be coded in any language that can run on a Linux VPS and take advantage of the multiple IP addresses the server has. PHP would be preferred.
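As an illustration of using the server's multiple IP addresses, the sketch below rotates outgoing requests across a list of local addresses by binding each connection to one of them. It is in Python and uses the standard `http.client` module's `source_address` parameter; the helper names, the host and the example IPs are made up, and HTTPS is an assumption about the target site.

```python
import http.client
from itertools import cycle
from typing import Iterator, Sequence


def source_ip_cycle(ips: Sequence[str]) -> Iterator[str]:
    """Endlessly rotate through the server's local IP addresses."""
    if not ips:
        raise ValueError("at least one source IP is required")
    return cycle(ips)


def open_connection(host: str, source_ip: str,
                    timeout: float = 30.0) -> http.client.HTTPSConnection:
    """Prepare a connection that will go out from the given local IP.

    Port 0 lets the operating system pick the outgoing port. No network
    traffic happens until a request is actually sent.
    """
    return http.client.HTTPSConnection(
        host, timeout=timeout, source_address=(source_ip, 0))
```

A scraper loop would call `next()` on the cycle before each request so that consecutive requests leave from different addresses.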
12 freelancers are bidding on average $380 for this job
Hi, I have read the description and would like to discuss it. I have good web scraping experience and reviews, and I can develop web scraping scripts in Python and C#. I hope we can discuss the details.
I have great expertise in web scraping in PHP. I have built up a personal library that lets me accomplish every request easily. I can handle sessions, proxies and avoid anti-scraping controls.
I have a strong background in web scraping, API client development and similar things in PHP. I have developed various website monitoring tools for a big ISP in the past.