language web crawler

$100-500 USD

Completed
Posted over 20 years ago

Paid on delivery
We want to crawl the web to get: 1) lists of the words used in different languages on the web, and 2) a count of the number of times each word is found in each language, UNTIL WE HAVE A STATISTICALLY SIGNIFICANT SAMPLE. Maybe 1000 pages of each language?

We do not have a list of URLs we want to use. All that matters is that we do not count the same page twice. Other than that, ANY 1000 pages of each language will be fine.

I imagine that the program will crawl pages by charset, CHECK to be sure the page is the "correct" language (per the charset tag) by comparing the simplest words in that language (see CHECK below), count the words on the page, note which page it is so it does not get counted again, and move on.

CHECK: Because charset tags are not always reliable, we would pick 20 (or so) common words that are unique (and really common) to each language. E.g., an English example: the, an, in, are, is, and, to, on, this, a, by, that, were, have, been, will, of ... and then look for a meaningful subset of them to appear on a page before deciding what language it is. (Obviously, we would test the search mechanism "by hand" first to be sure it worked in each language.)

Note: I will identify the "check" words for each language, and will accordingly be responsible for the quality of this language filter.

The app will place the words and counts into an Excel spreadsheet (one sheet per language). As an example, after using this tool in English (and sorting by frequency within Excel) there would be a VERY long list of words, each with a number next to it (indicating how many times it was found), like:

the 9,323,343
of 9,028,282
and 9,003,939
a 8,757,232
etc.

The languages of interest are: Afrikaans, Arabic, Bulgarian, Catalan, Pinyin (Chinese), Croatian, Czech, Dutch, English, Estonian, Finnish, French, German, Greek, Hebrew, Hindi, Hungarian, Icelandic, Indonesian, Italian, Japanese, Korean, Latvian, Lithuanian, Malay, Norwegian, Polish, Portuguese, Romanian, Serbian, Slovak, Slovenian, Spanish, Swahili, Swedish, Tagalog, Thai, Turkish, Ukrainian and Vietnamese.

## Deliverables

1) Complete and fully-functional working program(s) in executable form as well as complete source code of all work done.
2) Installation package that will install the software (in ready-to-run condition) on the platform(s) specified in this bid request.
3) Exclusive and complete copyrights to all work purchased. (No GPL, 3rd party components, etc. unless all copyright ramifications are explained AND AGREED TO by the buyer on the site).

## Platform

We are running Windows 2000, IE 6, and Excel 2002.
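
For illustration only, the CHECK step, the per-page word count, and the "do not count the same page twice" rule from the description could be sketched roughly as below. This is not part of the original request: the names `tokenize`, `passes_check`, and `LanguageTally`, the threshold of 5 distinct check words, and the "run of letters" definition of a word are all assumptions. Fetching pages and writing the Excel sheets are left out.

```python
import re
from collections import Counter

# Check words copied from the English example in the posting.
ENGLISH_CHECK_WORDS = {
    "the", "an", "in", "are", "is", "and", "to", "on", "this", "a",
    "by", "that", "were", "have", "been", "will", "of",
}

# Assumption: a "word" is a run of letters (Unicode-aware), lowercased.
WORD_RE = re.compile(r"[^\W\d_]+")


def tokenize(text):
    """Split page text into lowercase words."""
    return [w.lower() for w in WORD_RE.findall(text)]


def passes_check(words, check_words, min_distinct=5):
    """True if at least min_distinct different check words appear.

    min_distinct=5 is a hypothetical stand-in for the posting's
    "meaningful subset"; the real threshold would be tuned by hand.
    """
    return len(set(check_words).intersection(words)) >= min_distinct


class LanguageTally:
    """Running word counts for one language."""

    def __init__(self, check_words, min_distinct=5):
        self.check_words = set(check_words)
        self.min_distinct = min_distinct
        self.seen_urls = set()    # pages already visited
        self.counts = Counter()   # word -> occurrences
        self.pages_counted = 0

    def add_page(self, url, text):
        """Count the page's words once; return True if it was counted."""
        if url in self.seen_urls:
            return False          # never count the same page twice
        self.seen_urls.add(url)
        words = tokenize(text)
        if not passes_check(words, self.check_words, self.min_distinct):
            return False          # page failed the language CHECK
        self.counts.update(words)
        self.pages_counted += 1
        return True


if __name__ == "__main__":
    tally = LanguageTally(ENGLISH_CHECK_WORDS)
    tally.add_page("http://example.com/a",
                   "The cat is on the mat and this is a test of the counter.")
    # most_common() yields (word, count) pairs sorted by frequency,
    # which is the order the posting wants in the spreadsheet.
    print(tally.counts.most_common(3))
```

A crawler built on this would keep calling `add_page` until `pages_counted` reaches the target (the posting suggests about 1000 pages per language), then write `counts.most_common()` to that language's sheet.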
Project ID: 3000944

About the project

5 proposals
Remote project
Active 20 yrs ago

Awarded to:
See private message.
$85 USD in 25 days
5.0 (644 reviews)
7.9
5 freelancers are bidding on average $357 USD for this job
See private message.
$425 USD in 25 days
5.0 (103 reviews)
7.0
See private message.
$425 USD in 25 days
5.0 (10 reviews)
4.7
See private message.
$425 USD in 25 days
2.4 (6 reviews)
3.7
See private message.
$425 USD in 25 days
0.0 (0 reviews)
0.0

About the client

United States
5.0
9
Member since Nov 2, 2003

Freelancer ® is a registered Trademark of Freelancer Technology Pty Limited (ACN 142 189 759)
Copyright © 2024 Freelancer Technology Pty Limited (ACN 142 189 759)