## Overview
The purpose of this utility is to extract the complete sentences from the pages linked to by the results of a given web search.

## Inputs
The user will input a search term or search phrase equivalent to what someone might typically type into the search box on Google.com. The user will also input an integer for the maximum number of search results desired.

## Outputs
The utility will output a list of complete sentences (defined below) from the pages returned in the search results.
## Deliverables
1) Complete and fully-functional working program(s) in executable form as well as complete source code of all work done.
2) Deliverables must be in ready-to-run condition, as follows (depending on the nature of the deliverables):
a) For web sites or other server-side deliverables intended to only ever exist in one place in the Buyer's environment: Deliverables must be installed by the Seller in ready-to-run condition in the Buyer's environment.
b) For all others including desktop software or software the buyer intends to distribute: A software installation package that will install the software in ready-to-run condition on the platform(s) specified in this bid request.
3) All deliverables will be considered "work made for hire" under U.S. Copyright law. Buyer will receive exclusive and complete copyrights to all work purchased. (No GPL, GNU, 3rd party components, etc. unless all copyright ramifications are explained AND AGREED TO by the buyer on the site per the coder's Seller Legal Agreement).
## Specifications
### Executing the Search
• The utility will accept the inputs from the user (a search phrase and a maximum number of results) and will execute a search using AWIS Web Search.
• If the user's requested number of results is greater than the number of results supported by a single call of the AWIS API, multiple searches may be issued. Otherwise, the number of calls to the AWIS API should be kept to a minimum.
• If a call to the AWIS API returns fewer results than the user requested, the utility should use all of the results that are returned and not issue further searches for that search phrase.
• If the user does not include a search phrase, the utility should not run at all.
• If the user does not include a maximum number of results, the utility should run and use as a default the maximum number of results that the AWIS API can return on a single call. The Amazon/Alexa documentation is unclear on this: that number may be 20 or 1000. Please let me know what the actual limit is as you are working on the code.

### Retrieving the Results Pages
• For every link returned in the search(es), the utility should download the HTML for that link. For example, if "[login to view URL]" is one of the results returned in the search, the utility should download the HTML webpage at that URL.
• If the page retrieval times out or fails for another reason for a given URL, it is acceptable to skip that URL and move on to the rest. It would be nice (though not a requirement) for the utility to notify the user if many of the pages are irretrievable, which might indicate a local internet connection problem, among other things.
• For every HTML page that is downloaded, the utility should parse the HTML and extract the "complete sentences" (defined below) from the text of the page.
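The paging and retrieval rules above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the required implementation: `plan_page_sizes`, `collect_links`, `fetch_pages`, `warn_ratio`, and the `search_page(phrase, start, count)` callable are hypothetical names, the actual AWIS call is deliberately left out, and `PER_CALL_LIMIT = 20` is a placeholder until the real per-call limit is confirmed.

```python
import sys
import urllib.request
from typing import Callable, List, Optional, Tuple

# Placeholder: the spec says the real AWIS per-call limit may be 20 or 1000.
PER_CALL_LIMIT = 20


def plan_page_sizes(max_results: Optional[int],
                    per_call_limit: int = PER_CALL_LIMIT) -> List[int]:
    """Return how many results to request on each search call."""
    if max_results is None:
        # No maximum supplied: default to a single call at the per-call limit.
        return [per_call_limit]
    sizes = []
    remaining = max_results
    while remaining > 0:
        size = min(remaining, per_call_limit)
        sizes.append(size)
        remaining -= size
    return sizes


def collect_links(phrase: str,
                  max_results: Optional[int],
                  search_page: Callable[[str, int, int], List[str]],
                  per_call_limit: int = PER_CALL_LIMIT) -> List[str]:
    """Gather result links, stopping early when the service runs out.

    `search_page` is a hypothetical stand-in for the real search call.
    """
    if not phrase or not phrase.strip():
        raise ValueError("a search phrase is required")
    links: List[str] = []
    for size in plan_page_sizes(max_results, per_call_limit):
        batch = search_page(phrase, len(links), size)
        links.extend(batch[:size])
        if len(batch) < size:
            break  # fewer results than requested: do not search again
    return links


def default_fetch(url: str, timeout: float = 10.0) -> str:
    """Download one page as text using the standard library."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")


def fetch_pages(urls: List[str],
                fetch: Callable[[str], str] = default_fetch,
                warn_ratio: float = 0.5) -> Tuple[List[str], List[str]]:
    """Download every URL, skipping failures and warning if many fail."""
    pages: List[str] = []
    failed: List[str] = []
    for url in urls:
        try:
            pages.append(fetch(url))
        except Exception:
            failed.append(url)  # timeout or other error: skip and move on
    if urls and len(failed) / len(urls) >= warn_ratio:
        print("Warning: %d of %d pages could not be retrieved; "
              "check the local internet connection."
              % (len(failed), len(urls)), file=sys.stderr)
    return pages, failed
```

Passing the search and fetch functions in as parameters keeps the paging and skip-on-failure logic testable without network access. The 50% warning threshold is an arbitrary choice.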
### Extracting the Complete Sentences
• A "complete sentence" is defined as a string of plain text that starts with whitespace (space, tab, carriage return, etc.) followed by an upper-case letter (A-Z) or a number (0-9) and ends with a period (.), an exclamation mark (!), or a question mark (?) followed by whitespace. The middle of the sentence may additionally contain other characters including lower-case letters, other punctuation, etc.
• A "complete sentence" must be at least 3 words long, where a word is a string of non-whitespace characters delimited by whitespace characters.
• Prior to extracting the sentences, you may find it helpful to strip all HTML markup from the page. You may also choose to remove text that appears between square brackets ([]) or between curly braces ({}), since these constructs rarely contain actual English sentences.
• You may assume that this program will only be used for English-language pages.

### Storing the Data
• The utility should take all of the sentences found in all of the pages returned in the search results and concatenate them together with each sentence on a separate line.
• There will be a single list of sentences for each search phrase entered. (In other words, there should not be a separate list of sentences for each page downloaded. Instead, the sentences from all of the pages for a given search should be concatenated together.)
• If you choose to store this list of sentences in a Plone "Page" (preferred), the list should be stored in the "Body Text" field. You may use the user's search phrase as the "Title", which is the only other required field in the Plone "Page".
• If you choose to store this list of sentences in a MySQL database (not preferred, but accepted), the list should be stored in a single table that includes, at a minimum, the list of sentences and the search phrase used. You may, at your discretion, add other helpful supporting information such as a timestamp of the search or a unique identifier field.
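The sentence definition and the one-sentence-per-line rule above can be sketched with the Python standard library as follows. This is a minimal illustration, not the required implementation: `extract_sentences`, `build_sentence_list`, and the helper class are hypothetical names, and collapsing whitespace before matching and skipping `<script>`/`<style>` content are assumptions added for the example. Writing the result to Plone or MySQL is not shown.

```python
import re
from html.parser import HTMLParser
from typing import List


class _TextExtractor(HTMLParser):
    """Collect page text, ignoring markup and script/style content."""

    def __init__(self):
        super().__init__()
        self._chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self._chunks.append(data)

    def text(self) -> str:
        return " ".join(self._chunks)


# Starts after whitespace with A-Z or 0-9; ends at the first '.', '!' or '?'
# that is followed by whitespace.
_SENTENCE = re.compile(r"(?<=\s)[A-Z0-9].*?[.!?](?=\s)", re.DOTALL)
# Optional cleanup: text inside [...] or {...} rarely holds real sentences.
_BRACKETED = re.compile(r"\[[^\]]*\]|\{[^}]*\}")


def extract_sentences(html: str) -> List[str]:
    """Return the "complete sentences" found in one HTML page."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    text = _BRACKETED.sub(" ", parser.text())
    # Collapse whitespace so each sentence fits on one line, and pad both
    # ends so the first and last sentences have surrounding whitespace.
    text = " " + re.sub(r"\s+", " ", text) + " "
    sentences = [m.group(0) for m in _SENTENCE.finditer(text)]
    # Keep only sentences that are at least three words long.
    return [s for s in sentences if len(s.split()) >= 3]


def build_sentence_list(pages: List[str]) -> str:
    """Concatenate the sentences from all pages, one sentence per line."""
    lines: List[str] = []
    for html in pages:
        lines.extend(extract_sentences(html))
    return "\n".join(lines)
```

Because the definition is purely pattern-based, abbreviations such as "U.S." will end a match early. The three-word minimum filters out some of these fragments but not all of them.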
## Platform
### Technology
• The ideal solution will be written in Python with the output stored in a "Page" content item within a Plone environment.
• An acceptable (but less preferred) alternative to the Plone data storage format is a MySQL database.
• Acceptable (but less preferred) alternatives to Python for the programming language are Perl and PHP. Solutions based on ASP, Java, .Net, and others will not be accepted.
• Searches should be executed using the "Web Search" operation from the Alexa Web Information Service (AWIS) [login to view URL]. Solutions based on the Google or Yahoo Search APIs or solutions based on "scraping" search engine results pages will not be accepted.
• You may use the method of your choice for calling AWIS from your preferred programming language. According to Amazon/Alexa, both SOAP and REST calls are supported.