geocode scrubber (repost)

Cancelled. Paid on delivery.

I have a database of 78,000 addresses in Chicago that I need geocoded to 100% accuracy (within 100 to 1,000 feet, as defined below). The addresses have already been geocoded from a combination of free sources, but the accuracy is currently around 90% or less: 8,400 of the records fail a uniqueness test (e.g., different addresses with the same coordinates, or no reverse-geocode match).

I need a RAC coder to RESEARCH, IDENTIFY, DESIGN, DEVELOP, TEST, and RUN a set of heuristic algorithms to help find and resolve discrepancies. An "auto-manual" process (automatic detection + manual fixing) may be required for some of the records.

The minimum required algorithms and procedures expected of the coder are:

- Linear Street Modeling (e.g., try to curve-fit addresses on the same street to see if they lie on a straight line).
- Linear House Numbering System modeling (same as above, but using the numbering system; e.g., addresses on different streets but with the same number could lie on a straight line).
- For the above linear models, the coder needs to manually eliminate any streets which are not straight lines.
- Complex curve-fitting (e.g., spline or neural net) for streets which are not straight. (NOTE: This is very rare in Chicago; I myself only know of 2 streets which are not on a straight line.)
- Proximity checks: e.g., different addresses resolving to the same or very nearby coordinates, or addresses in the same zipcode resolving to very distant coordinates, usually mean that one or both are wrong.
- Confidence estimator: using the coordinates from multiple sources, plus the heuristics above, there should be an estimate of how accurate each record is.
- The coder would manually check and fix any records that have a low confidence estimate.

The current list of geocoded addresses is attached twice (in both CSV and SQL formats). You may disregard the extra ID fields; these are for our internal process.
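As a rough illustration of the Linear Street Modeling idea, the sketch below fits latitude and longitude as straight-line functions of the house number for one street and reports the house numbers that fall far from the fit. Everything here is an assumption made for illustration: the function names, the `(house_number, lat, lon)` record layout, the 0.0005-degree tolerance, and the choice of a Theil-Sen fit (median of pairwise slopes), which was picked only because a single bad point does not drag the line toward itself. The sample coordinates are made up.

```python
from itertools import combinations
from statistics import median


def theil_sen(xs, ys):
    """Robust line fit y = a + b*x using the median of pairwise slopes.

    Needs at least two distinct x values, otherwise median() raises.
    """
    slopes = [
        (y2 - y1) / (x2 - x1)
        for (x1, y1), (x2, y2) in combinations(zip(xs, ys), 2)
        if x2 != x1
    ]
    b = median(slopes)
    a = median(y - b * x for x, y in zip(xs, ys))
    return a, b


def street_outliers(records, tol_deg=0.0005):
    """Return house numbers whose lat or lon is more than tol_deg off the fit.

    records: list of (house_number, lat, lon) for ONE street (hypothetical layout).
    tol_deg: illustrative tolerance in degrees, not a value from the posting.
    """
    nums = [r[0] for r in records]
    a_lat, b_lat = theil_sen(nums, [r[1] for r in records])
    a_lon, b_lon = theil_sen(nums, [r[2] for r in records])
    bad = []
    for n, lat, lon in records:
        lat_err = abs(lat - (a_lat + b_lat * n))
        lon_err = abs(lon - (a_lon + b_lon * n))
        if lat_err > tol_deg or lon_err > tol_deg:
            bad.append(n)
    return bad


# Made-up north-south street: latitude grows with the house number,
# except for number 500, which was geocoded somewhere else.
records = [
    (100, 41.8800, -87.65),
    (200, 41.8818, -87.65),
    (300, 41.8836, -87.65),
    (400, 41.8854, -87.65),
    (500, 41.8700, -87.65),
    (600, 41.8890, -87.65),
]
print(street_outliers(records))  # [500]
```

The same fit could be reused for the house-numbering model by grouping records by house number across parallel streets instead of by street name. Flagged records would still go to the manual check described in the list above.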

## Deliverables

**[The current posting is for 2 hours of analysis of the project described herein; if I like your analysis, I will then expand the posting to include time for actual implementation]**

* * *

1) Complete and fully-functional working program(s) in executable form as well as complete source code of all work done.

2) Deliverables must be in ready-to-run condition, as follows (depending on the nature of the deliverables):

a) For web sites or other server-side deliverables intended to only ever exist in one place in the Buyer's environment--Deliverables must be installed by the Seller in ready-to-run condition in the Buyer's environment.

b) For all others including desktop software or software the buyer intends to distribute: A software installation package that will install the software in ready-to-run condition on the platform(s) specified in this bid request.

3) All deliverables will be considered "work made for hire" under U.S. Copyright law. Buyer will receive exclusive and complete copyrights to all work purchased. (No GPL, GNU, 3rd party components, etc. unless all copyright ramifications are explained AND AGREED TO by the buyer on the site per the coder's Seller Legal Agreement).

* * *

This broadcast message was sent to all bidders on Monday Sep 14, 2009 10:56:58 AM:

Attention all bidders: One coder has asked how I would verify if a lat/lon pair is correct... I would use Google Earth (satellite and street views) to check what structures exist at the given and neighboring coordinates, as well as at the given and neighboring addresses. Obviously I would not do this for every address. This would be done to spot-check the correctness of your algorithms.

* * *

This broadcast message was sent to all bidders on Wednesday Sep 16, 2009 8:43:39 PM:

Attention all bidders: I have decided to redefine the ACCURACY REQUIREMENT for this project as follows, to make it LESS ACCURATE and therefore much EASIER to implement. The coordinates for a given address will be considered accurate if they are 1) within 1,000 feet (304 meters) along the directional axis of the street, measured from the street entrance of the property identified by the given address, 2) within 100 feet (30 meters) of the street's nearest edge, measured perpendicularly, and 3) linearly coherent, meaning that the deltas between any two coordinates must always at least be in the correct directions along both axes of the street grid, even if the magnitude is wrong. In the case of curved streets, similar requirements apply, subject to the appropriate geometric adjustments.
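Conditions 1 and 2 above could be checked roughly as in the following sketch, which splits the offset between a reference point and a candidate coordinate into an along-street part and a perpendicular part. It is a simplification built on assumptions: a single reference point stands in for both the street entrance and the street edge, the street bearing is supplied by the caller, the conversion uses a flat-earth approximation of about 364,000 feet per degree of latitude, and the function names are invented for this example. Condition 3 (linear coherence) is not covered.

```python
import math

# Approximate feet per degree of latitude (assumed constant for this sketch).
FEET_PER_DEG_LAT = 364_000.0


def offset_feet(ref, cand):
    """East and north offset in feet from ref to cand, both (lat, lon) pairs.

    Uses a flat-earth approximation, reasonable over a few thousand feet.
    """
    lat0, lon0 = ref
    lat1, lon1 = cand
    north = (lat1 - lat0) * FEET_PER_DEG_LAT
    east = (lon1 - lon0) * FEET_PER_DEG_LAT * math.cos(math.radians(lat0))
    return east, north


def within_tolerance(ref, cand, street_bearing_deg,
                     along_max=1000.0, perp_max=100.0):
    """True if cand is within along_max feet of ref along the street
    and within perp_max feet of ref across it.

    street_bearing_deg: street direction, degrees clockwise from north
    (0 for a north-south street, 90 for an east-west street).
    """
    east, north = offset_feet(ref, cand)
    theta = math.radians(street_bearing_deg)
    ux, uy = math.sin(theta), math.cos(theta)  # unit vector along the street
    along = east * ux + north * uy
    perp = -east * uy + north * ux
    return abs(along) <= along_max and abs(perp) <= perp_max


ref = (41.8800, -87.6500)  # made-up reference point
# About 364 ft north and 27 ft east of ref, on a north-south street.
print(within_tolerance(ref, (41.8810, -87.6499), street_bearing_deg=0))
# About 271 ft east of the same street line, so it fails the 100 ft limit.
print(within_tolerance(ref, (41.8810, -87.6490), street_bearing_deg=0))
```

For a curved street the fixed bearing would have to be replaced by the local tangent direction at the reference point.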

* * *

This broadcast message was sent to all bidders on Friday Sep 18, 2009 10:47:36 AM:

I have just updated the ZIP file in the project posting area. It now contains 5 tables: geo_address is the original customer address report table; geo_buildings is the result of USPS Address Normalization; geo_reference contains reverse-geocoding results; geo_master contains ALL the source geocoding results (google, geocoder, and vendor); geo_exceptions contains the results of a simple query to detect duplicates and missing info. The data is not fully complete yet; the rest will take another 8 days to obtain, and I will post it here too.
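The kind of duplicate and missing-info detection mentioned for geo_exceptions might look like the sketch below. It is not the actual query behind that table, which is not shown here. The `(address, lat, lon)` row layout, the function names, and the rounding to 5 decimal places are all assumptions made for illustration.

```python
from collections import defaultdict


def shared_coordinates(rows, precision=5):
    """Find coordinates that more than one distinct address resolves to.

    rows: iterable of (address, lat, lon); rows with missing coordinates
    are skipped. Coordinates are rounded to `precision` decimal places
    before comparison (an illustrative choice).
    Returns {(lat, lon): sorted list of addresses}.
    """
    groups = defaultdict(set)
    for address, lat, lon in rows:
        if lat is None or lon is None:
            continue
        groups[(round(lat, precision), round(lon, precision))].add(address)
    return {k: sorted(v) for k, v in groups.items() if len(v) > 1}


def missing_coordinates(rows):
    """Return the addresses that have no latitude or no longitude."""
    return [a for a, lat, lon in rows if lat is None or lon is None]


rows = [
    ("100 N Main St", 41.88, -87.65),
    ("102 N Main St", 41.88, -87.65),   # different address, same point
    ("200 W Lake St", 41.8857, -87.6341),
    ("300 S Wood St", None, None),      # never geocoded
]
print(shared_coordinates(rows))
print(missing_coordinates(rows))
```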

* * *

This broadcast message was sent to all bidders on Saturday Sep 19, 2009 3:59:02 PM:

Attention all bidders. The budget for this project is around $500, but I prefer to start lower in case some additional pieces are needed in order to integrate it into my system. Some bidders have asked if it's OK to use commercial GIS software which can do this work very easily. The answer is no, unfortunately: because I need to re-run the solution on a daily basis, I need full source code and data. Also, one bidder mentioned TIGER/Line, the free geo-database from the U.S. Government; this is fine to use for this project.

* * *

This broadcast message was sent to all bidders on Sunday Sep 20, 2009 3:09:23 AM:

The following query provides a quick view of the worst offenders:

    CREATE TEMPORARY TABLE t1 AS
    SELECT count(*) AS n, address,
           min(latitude) AS min_lat, max(latitude) AS max_lat,
           avg(latitude) AS avg_lat, std(latitude) AS std_lat,
           min(longitude) AS min_lon, max(longitude) AS max_lon,
           avg(longitude) AS avg_lon, std(longitude) AS std_lon
    FROM `geo_master`
    WHERE latitude > 0
    GROUP BY address;

    SELECT * FROM t1 ORDER BY (std_lat + std_lon) DESC;

* * *

This broadcast message was sent to all bidders on Sunday Sep 20, 2009 3:15:07 AM:

According to the above query, there are only 1,100 addresses where the coordinates have a combined Standard Deviation of over 0.001 across services. This shows most of the data is fairly consistent.
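For anyone working outside MySQL, the same spread measure can be computed as in this sketch, which could also serve as one input to the confidence estimator. It assumes rows of `(address, lat, lon)` with one row per source, uses the population standard deviation (which is what MySQL's `STD()` returns), and applies the 0.001 cutoff from the message above. The function names and sample rows are invented for illustration.

```python
from collections import defaultdict
from statistics import pstdev


def spread_by_address(rows):
    """Combined std-dev of latitude and longitude per address.

    rows: iterable of (address, lat, lon), one row per geocoding source.
    Rows with a missing or non-positive latitude are skipped, mirroring
    the "latitude > 0" filter in the SQL query.
    """
    by_addr = defaultdict(list)
    for address, lat, lon in rows:
        if lat is not None and lat > 0:
            by_addr[address].append((lat, lon))
    spread = {}
    for address, pts in by_addr.items():
        lats, lons = zip(*pts)
        spread[address] = pstdev(lats) + pstdev(lons)
    return spread


def inconsistent(rows, threshold=0.001):
    """Addresses whose combined spread exceeds threshold, worst first."""
    spread = spread_by_address(rows)
    flagged = [a for a, s in spread.items() if s > threshold]
    return sorted(flagged, key=lambda a: -spread[a])


rows = [
    ("A", 41.8800, -87.6500), ("A", 41.8801, -87.6501), ("A", 41.8800, -87.6500),
    ("B", 41.8800, -87.6500), ("B", 41.9000, -87.6500),  # sources disagree
    ("C", 0.0, 0.0),                                     # filtered out
]
print(inconsistent(rows))  # ['B']
```

A low spread only shows that the sources agree with each other. They can still all be wrong in the same way, so this works better as one signal among several than as a confidence score on its own.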

* * *

This broadcast message was sent to all bidders on Sunday Sep 20, 2009 2:17:05 PM:

One bidder has asked the following question, which is very important to clarify. He wrote { I don't know the names constructions in Chicago, so I can't recognize the "alternate spellings" ... you mean that "Alexander Ave","Alexander Cres","Alexander Ct", "Alexander Pl","Alexander St","Alexander Ter" are the same street? }. The answer is that I don't know either, without some research. Do the addresses on these different spellings fall on the same straight line? If you change the last token and resubmit the geocode request to Google, does it return the same value? This is the kind of analysis I expect the coder to do on their own initiative in order to be able to clean up this database.
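A first step toward that research could be to list the street names that appear with more than one final token, as in the sketch below. It only finds candidates. Whether two spellings are really the same street still has to be decided by the straight-line and re-geocoding tests described above. The function name and the simple "house number, street name, last token" address layout are assumptions for illustration, and addresses with unit numbers or other trailing text would need extra parsing.

```python
from collections import defaultdict


def suffix_variants(addresses):
    """Group addresses by street name with the last token removed.

    Assumes each address looks like "<house number> <street name> <suffix>".
    Returns {street base: sorted list of final tokens} for every base
    seen with more than one distinct final token.
    """
    variants = defaultdict(set)
    for addr in addresses:
        tokens = addr.upper().split()
        if len(tokens) < 3:
            continue  # too short to have number, name, and suffix
        base = " ".join(tokens[1:-1])  # drop house number and last token
        variants[base].add(tokens[-1])
    return {b: sorted(s) for b, s in variants.items() if len(s) > 1}


addresses = [
    "120 Alexander Ave",
    "300 Alexander St",
    "45 Alexander Ct",
    "10 State St",
]
print(suffix_variants(addresses))  # {'ALEXANDER': ['AVE', 'CT', 'ST']}
```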

* * *

This broadcast message was sent to all bidders on Sunday Sep 20, 2009 3:35:20 PM:

I have posted a small sub-set of this project as a separate project. You may bid on this or the subset or both. The URL: [url removed, login to view]

* * *

This broadcast message was sent to all bidders on Tuesday Sep 22, 2009 11:24:58 AM:

Here is a helpful piece of SQL for analysing this data:

SELECT * FROM `geo_address` NATURAL LEFT JOIN geo_buildings;

* * *

This broadcast message was sent to all bidders on Thursday Oct 1, 2009 9:05:41 AM:

I have changed this project from fixed-bid to HOURLY PAY. Please re-post your bid. Thanks

## Platform

Any programming language used must be free (e.g. Java, Scala, Perl, C++, etc). Any services, software, or data used must be free also, without any restrictions that would prevent me from using the data (this is important, if you are not sure, please check before you start). I will not accept any deliverable which would be prohibited for me to use for free, or which I cannot open on my own computer.

C Programming, Data Entry, Engineering, Geolocation, Java, MySQL, Perl, PHP, Python, Ruby on Rails, Testing / QA, Usability Testing, Website Testing

Project ID: #2921876

About the project

Remote project. Last activity: Nov 7, 2009