Web Harvesting - Humanitarian Shelters
$30-5000 USD
Paid on delivery
IADDIC Shelters, LLC is a startup company looking to serve a portion of the world's 1.4 billion houseless people.
We are interested in contracting with a provider who can deliver intelligence from a robust data mining expedition. As we research our global market, we are finding it to be very disorganized, and we hope to harvest a wealth of information (not just data) through this process.
Scope of project: Preferred application: WebQL. Results files are to be delivered in CSV and Microsoft Access formats, and each table is to maintain the host URL as the index field. Additionally, we require each script to be delivered in a .txt document.
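As an illustration of the "host URL as the index field" requirement, here is a minimal Python sketch (not WebQL). It assumes that "host URL" means the scheme plus host name of each source URL, and the column names `host_url`, `source_url`, `description`, and `keywords` are hypothetical, chosen only for this example.

```python
import csv
import io
from urllib.parse import urlparse

# Hypothetical column layout; the posting only requires host URL as the index field.
FIELDS = ["host_url", "source_url", "description", "keywords"]

def host_url(source_url):
    """Assumption: the host URL is the scheme plus host name of the source URL."""
    parts = urlparse(source_url)
    return f"{parts.scheme}://{parts.netloc}"

def to_csv(rows):
    """Return CSV text with host_url as the first (index) column of every row."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=FIELDS, lineterminator="\n")
    writer.writeheader()
    for row in rows:
        writer.writerow({"host_url": host_url(row["source_url"]), **row})
    return buf.getvalue()
```

The same rows could then be imported into a Microsoft Access table with `host_url` set as the indexed field.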
## Deliverables
We are seeking to capture the following:
Phase 1
UNIQUE source URLs
Metadata from the HTML of the source page
<META name="description" content="Competitive Intelligence">
<META name="keywords" content="Competitive Intelligence">
Country of origin
And
Perform word analytics on the metadata. Keyword analytics results are to include:
All keywords and frequency of use
Top 100 most frequently occurring words or phrases
Most frequent combination of key words using
2 words, 3 words, 5 words, 10 words
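The word and phrase counts listed above could be prototyped roughly as follows. This is a sketch in Python, not the preferred WebQL; the tokenizer is an assumption (lowercase, punctuation ignored, so phrases may span the commas in a keywords tag), and the sample strings are made up for illustration.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into word tokens (assumed tokenization)."""
    return re.findall(r"[a-z0-9']+", text.lower())

def ngram_counts(texts, n):
    """Count every run of n consecutive words across a list of meta strings."""
    counts = Counter()
    for text in texts:
        words = tokenize(text)
        for i in range(len(words) - n + 1):
            counts[" ".join(words[i:i + n])] += 1
    return counts

if __name__ == "__main__":
    # Hypothetical sample meta data, not real harvest results.
    sample = [
        "emergency shelter, disaster relief, transitional shelter",
        "Emergency shelter for disaster relief operations",
    ]
    for n in (1, 2, 3, 5, 10):
        # most_common(100) gives the "top 100" list for each phrase length.
        print(n, ngram_counts(sample, n).most_common(100)[:3])
```

Whether a phrase should be allowed to cross a comma in the keywords tag is a decision for the client; splitting on commas first would be a small change to `tokenize`.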
Deliverables:
Unique URLs, description, and keywords in one CSV file or Microsoft Access database, AND analytics results in separate CSV files and Microsoft Access databases (or tables within one database)
Scripts used to harvest the data
Report search statistics: number of pages scraped, number harvested, number failed, number blocked.
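For reference, the Phase 1 capture of the description and keywords META tags can be sketched with Python's standard `html.parser` module. This is only an illustration under the assumption that pages follow the META pattern shown above; the class and function names are hypothetical, and fetching pages, detecting country of origin, and the search statistics are left out.

```python
from html.parser import HTMLParser

class MetaExtractor(HTMLParser):
    """Collect the content of <meta name="description"> and <meta name="keywords">."""

    def __init__(self):
        super().__init__()
        self.meta = {}

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names, so <META ...> matches too.
        if tag != "meta":
            return
        attr = {key: (value or "") for key, value in attrs}
        name = attr.get("name", "").lower()
        if name in ("description", "keywords"):
            # Keep the first occurrence if a page repeats the tag.
            self.meta.setdefault(name, attr.get("content", "").strip())

def extract_meta(html):
    """Return a dict with whichever of 'description' and 'keywords' the page has."""
    parser = MetaExtractor()
    parser.feed(html)
    return parser.meta
```

Pages with neither tag return an empty dict, which a harvesting script could count toward the "failed" statistic if the client wants it defined that way.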
Phase 2
IADDIC Shelters will review the results of the analytics and provide an exclusion list of terms to leave out of the research.
Provider is to remove all source URLs with content matching the exclusion list.
Deliverable: Sanitized URLs, description, and keyword database (CSV and Microsoft Access)
Refreshed Analytics csv and Microsoft Access database (or tables in one database)
All key words and frequency of use
Top 100 most frequently occurring words or phrases
Most frequent combination of key words using
2 words, 3 words, 5 words, 10 words
Deliverables:
Unique URLs, description, and keywords in one CSV file and Microsoft Access database, AND analytics results in separate CSV files and Microsoft Access databases (or tables within one database)
Scripts used to create work product
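The Phase 2 sanitizing step might look like the sketch below. It assumes "content matching the exclusion listing" means a case-insensitive substring match against the harvested description and keywords; the posting does not define the matching rule, and the row keys are hypothetical.

```python
def sanitize(rows, exclusions):
    """Drop rows whose description or keywords contain any exclusion term.

    Assumption: matching is a case-insensitive substring test.
    """
    terms = [term.strip().lower() for term in exclusions if term.strip()]
    kept = []
    for row in rows:
        haystack = " ".join(
            [row.get("description", ""), row.get("keywords", "")]
        ).lower()
        if not any(term in haystack for term in terms):
            kept.append(row)
    return kept
```

The refreshed analytics would then be rerun on the rows that `sanitize` returns.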
Phase 3
Using the sanitized list of URLs, the contractor is to identify whether each web site contains:
RSS/XML feeds
Blog(s)
Database(s)
Events
Awards
Competitions
Exhibits
Deliverable: CSV and Microsoft Access database table containing host URL, RSS/XML feed, Blog, and Database fields. (Additional columns may be required if more than one of each field is identified.)
It is further required to understand the principal business of the organization.
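One common way to detect RSS/XML feeds is to look for feed autodiscovery `<link>` tags in a page's HTML. The Python sketch below assumes sites advertise their feeds this way; feeds that are only linked from the page body, as well as blogs, databases, events, and the other items, would need separate checks. The names `FeedFinder` and `find_feeds` are hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

# MIME types commonly used for feed autodiscovery.
FEED_TYPES = {"application/rss+xml", "application/atom+xml"}

class FeedFinder(HTMLParser):
    """Collect feed URLs advertised through <link rel="alternate" ...> tags."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.feeds = []

    def handle_starttag(self, tag, attrs):
        if tag != "link":
            return
        attr = {key: (value or "") for key, value in attrs}
        rel = attr.get("rel", "").lower().split()
        is_feed = attr.get("type", "").lower() in FEED_TYPES
        if "alternate" in rel and is_feed and attr.get("href"):
            # Resolve relative links against the page URL.
            self.feeds.append(urljoin(self.base_url, attr["href"]))

def find_feeds(html, base_url):
    """Return the list of feed URLs found in one page."""
    finder = FeedFinder(base_url)
    finder.feed(html)
    return finder.feeds
```

Because a site may advertise several feeds, the returned list maps naturally onto the "additional columns" mentioned in the deliverable.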
The host url should be categorized in one of the following categories:
Manufacturer
Trade organization
Foundation
NGO (non-governmental organization)
Non Profit
Consortium
Government
Bank
Association
Symposium
Research
Trade Shows
Additionally, we want to know how many files of each of the following types are available on each site: PDF, XLS, DOC, PPT
Provide contact information from each host URL:
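The file type count can be sketched as below, assuming the links on each site have already been harvested into a list and that a file's type is judged by the extension in the URL path. Both are assumptions; files served without an extension would be missed, and newer extensions such as `.docx` are deliberately not counted because the posting names only PDF, XLS, DOC, and PPT.

```python
import posixpath
from urllib.parse import urlparse

FILE_TYPES = ("pdf", "xls", "doc", "ppt")

def count_file_types(links):
    """Return (counts per file type, listing of (type, link)) for a list of URLs."""
    counts = {ext: 0 for ext in FILE_TYPES}
    listing = []
    for link in links:
        # Use only the path so query strings such as ?id=1 are ignored.
        path = urlparse(link).path
        ext = posixpath.splitext(path)[1].lower().lstrip(".")
        if ext in counts:
            counts[ext] += 1
            listing.append((ext, link))
    return counts, listing
```

The `listing` half of the result covers the request for a listing of each file alongside the counts.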
Company/Organization name
Mailing Address (convention by country of origin)
Phone contact (convention by country of origin)
Email addresses
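Of the contact fields above, email addresses are the simplest to pull out automatically. The Python sketch below uses a deliberately simple regular expression, which is an assumption and will not cover every valid address; names, mailing addresses, and phone numbers vary by country and would need their own rules.

```python
import re

# Simplified pattern (assumption): local part, "@", domain with at least one dot.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")

def extract_emails(text):
    """Return unique email addresses in order of first appearance, lowercased."""
    seen = []
    for match in EMAIL_RE.findall(text):
        email = match.lower()
        if email not in seen:
            seen.append(email)
    return seen
```

Addresses obfuscated on the page (for example written as "info at example dot org") would not be found by this pattern.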
Deliverables:
“Contains” result files containing host URLs and fields identified above in CSV and Microsoft Access database.
“Categorized” result files containing host URLs and fields identified above in CSV and Microsoft Access database.
“File types” count of PDF, XLS, DOC, PPT and listing of each in a CSV and Microsoft Access database.
“Contacts” result files containing URL and contact fields identified above in CSV and Microsoft Access database.
Scripts used to generate each result
## Platform
Windows XP, Office 2003, WebQL (optional)
Project No.: #3775482