Crawling a website for data, converting data to standardized datafile


A website presents some numerical data. I want a crawler to extract that data and transform it into one or more standardized data files, preferably in Excel format but, if necessary, in CSV or another similar format.

## Deliverables

(Please note: this request is related to, but not the same as, my other work request #1188594. Those who bid on that project may bid on this one, and bidders on this project may also want to bid on the other one.)

The website [login to view URL] has a bunch of numerical information on over 80,000 Facebook applications that I want extracted and transformed into numerical format. The final deliverable will be a data file containing all of that data, preferably a single file, but multiple files if necessary. I prefer the file(s) to be in Excel format, but they can also be in CSV or another nonproprietary data format.

You can obtain a standard login username/password from the site for free. Using that standard login, you would crawl all of the apps, starting from the following anchor page:

[login to view URL]

Note: I have been informed that the website has an unfortunate pagination bug, which you can see on the anchor page. While the anchor page claims to display apps 1-25, it does not actually display all of them. Hence, the spider cannot simply click "Next", because doing so would skip some apps. (If you click "Next" manually, you will see what I mean.) Furthermore, the number of apps skipped seems to be unpredictable, so you cannot crawl using a fixed increment in the search query. The spider should therefore be smart enough to see how many apps were actually displayed and then construct the proper query to display the true next set of apps, without skipping any.

For example, if apps 1-17 are actually displayed (as opposed to apps 1-25 that the site claims to display), then the next query could be:

[login to view URL]

Basically, you would append the string ?0=x, where x is the number of the last application on the previous search page. If you have a better idea, feel free to use it. What is important is that the crawler does not skip any apps. Again, if that is not clear, playing with the site should clarify the matter.
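The paging rule described above could be sketched roughly as follows. This is only an illustration under assumptions: `BASE_URL` is a placeholder (the real anchor URL is hidden in this post), and `fetch_app_numbers` is a hypothetical callback that downloads one listing page and returns the app numbers actually shown on it.

```python
from typing import Callable, Iterator, List

# Placeholder only: the real anchor URL is not visible in the post.
BASE_URL = "https://example.com/apps"


def next_page_url(base_url: str, last_app_number: int) -> str:
    """Build the query that starts right after the last app actually displayed."""
    return f"{base_url}?0={last_app_number}"


def crawl_listing(fetch_app_numbers: Callable[[str], List[int]],
                  base_url: str = BASE_URL) -> Iterator[int]:
    """Yield every app number, advancing by what each page really showed.

    fetch_app_numbers is a hypothetical callback: given a listing URL, it
    returns the app numbers displayed on that page (possibly fewer than 25).
    """
    url = base_url
    last_seen = 0
    while True:
        shown = fetch_app_numbers(url)
        # Keep only apps we have not emitted yet, in case pages overlap.
        new = [n for n in shown if n > last_seen]
        if not new:
            break  # nothing new on this page: assume the listing is exhausted
        yield from new
        last_seen = max(new)
        url = next_page_url(base_url, last_seen)
```

Because the next URL is derived from the last app number that was actually seen, rather than from a fixed step of 25, a short page (say, apps 1-17) leads to a query for app 18 onward instead of app 26 onward.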

### The Final Output Data

I want all data fields and, importantly, all the information from the JavaScript graphs that the crawler can see. For example, let's consider the Top Friends app:

[login to view URL]

With the free standard login, you will see that the information in the Summary, Reach, and Audience Profile tabs is available (the info in the Engagement and Growth tabs will be grayed out).

From the Summary tab, I want the data regarding:

By Company Name (for example, RockYou, Slide, etc.)

Rank

DAU

Social Graph Influence

MAU

Categories

Description

The entire Unique Active Users (UAU) graphs for daily, weekly, and monthly views (where x = date, y = UAU). Note: while the graph is rendered in Adobe Flash, all of the data is viewable in the page source.
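Since the graph data is said to be readable in the page source, one way to pull it out is a pattern match over the raw HTML. The sketch below is only an illustration: the post does not show how the site actually embeds the points, so the `"YYYY-MM-DD", number` pair format and the function name `extract_uau_series` are assumptions that would need adjusting against the real page.

```python
import re
from typing import List, Tuple

# Assumed shape of one data point in the page source: "2009-07-01", 12345
# The real site may use a different layout; adapt the pattern accordingly.
POINT_RE = re.compile(r'"(\d{4}-\d{2}-\d{2})"\s*,\s*(\d+)')


def extract_uau_series(page_source: str) -> List[Tuple[str, int]]:
    """Return (date, unique active users) pairs found in the page source."""
    return [(date, int(value)) for date, value in POINT_RE.findall(page_source)]
```

An empty list means no points were found, which the crawler can treat as a missing graph for that app.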

From the Reach tab, I want

DAU

MAU

The entire UAU graph (just like above)

From the Audience Profile tab, I want:

Male/Female

Average Age

Average Number of Friends

Gender

App User Overlap (all of the fields)

App User Affinity (all of the fields)

Age (all of the categories in the histogram)

Social Graph Influence (all of the categories in the histogram)

Note, some of the data will be repetitive. I don't care - I just want to make sure that the data is complete, even if some of it is repetitive.

Important: many of the apps won't have all of these tabs or all of the fields. If the crawler can't find a tab or field for a particular app, it should just input a "-" string into the data file.
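The "-" rule for missing tabs or fields could be handled at write time, for instance with Python's `csv.DictWriter` and its `restval` parameter. This is a minimal sketch: the column list is an illustrative subset of the fields requested above, the `App` column is a hypothetical key for the app's name, and `write_apps` is a made-up helper name.

```python
import csv
from typing import Dict, Iterable, Optional, TextIO

# Illustrative subset of columns; "App" is a hypothetical key column.
FIELDS = ["App", "Company Name", "Rank", "DAU", "MAU", "Average Age"]
MISSING = "-"


def write_apps(rows: Iterable[Dict[str, Optional[object]]], out: TextIO) -> None:
    """Write one CSV row per app, putting "-" wherever a field is absent."""
    writer = csv.DictWriter(out, fieldnames=FIELDS, restval=MISSING,
                            extrasaction="ignore")
    writer.writeheader()
    for row in rows:
        # Drop empty values so that restval fills them with "-" as well.
        cleaned = {k: v for k, v in row.items() if v not in (None, "")}
        writer.writerow(cleaned)
```

When writing to a real file, open it with `newline=""` as the `csv` module documentation recommends, for example `open("apps.csv", "w", newline="")`.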

Engineering, MySQL, PHP, Project Management, Software Architecture, Software Testing

Project ID: #2798068

About the project

14 bids, Remote project, Last active Jul 18, 2009

Awarded to:

khalidsafwatvw

See private message.

$127.5 USD in 14 days
(64 reviews)
5.5

14 freelancers are bidding an average of $243 for this job

hwanghendra

See private message.

$191.25 USD in 14 days
(468 reviews)
7.5

wangpretty

See private message.

$127.5 USD in 14 days
(27 reviews)
5.3

rxhector2k5

See private message.

$85 USD in 14 days
(67 reviews)
5.0

Robotapps

See private message.

$65.45 USD in 14 days
(91 reviews)
5.4

codelabsl

See private message.

$552.5 USD in 14 days
(58 reviews)
5.0

anurag7vw

See private message.

$59.5 USD in 14 days
(63 reviews)
4.9

jvavadiya

See private message.

$102 USD in 14 days
(19 reviews)
4.1

webconsultantvw

See private message.

$51 USD in 14 days
(4 reviews)
2.7

liaisonsolu

See private message.

$170 USD in 14 days
(3 reviews)
1.3

rkvermavw

See private message.

$425 USD in 14 days
(0 reviews)
0.0

rizwanofuk

See private message.

$850 USD in 14 days
(0 reviews)
0.0

noumanhanif

See private message.

$170 USD in 14 days
(0 reviews)
0.0

g33kwesley

See private message.

$425 USD in 14 days
(0 reviews)
0.0