Build a Large Scale Web Crawler System -- 2

$1500-3000 CAD

Closed
Posted more than 7 years ago

Paid on delivery
Large Scale Crawler

We are looking for a developer (or company) to build a robust web crawler system. There are approximately 20,000+ websites that we want to crawl and extract data from. We want to be able to extract this data within 3-6 months.

1. Design the architecture of the crawler, or use an existing open source crawler as a template. Because we're dealing with a large volume of data, the architecture needs to:
• Be robust and scalable
• Be efficient and fast
• Support proxies (to bypass anti-scraping systems)

2. Create an Admin dashboard where the Admin can:
a. Add, Edit, View, Delete, Stop, and Search crawlers
b. Input the URL to crawl
c. Specify the data that needs to be extracted (i.e. Title, Title URL, etc.)
d. View, Edit, and Delete extracted data
e. Download the data in JSON, XML, or CSV
f. Access the data through an API (either via Authorization Tokens or other means) for upload and integration
h. Manage users with an ACL (Access Control List): Create, Edit, View, and Delete users

3. Data normalization and clean-up. The incoming data is unformatted and unstructured. An example is the location or city: some sites list it as Houston, TX, while others list it as Houston, Texas or USA-TX-Houston. Therefore, the location or city data needs to be formatted consistently; we use Google Location.

4. Because the data changes daily on these 20,000+ websites, notifications need to be put in place to notify the system of the changes (i.e. what has been added and what has been removed) and update the data automatically.

5. Once the data is verified and cleansed, it should be available for search via Solr, ElasticSearch, or any other recommended option.
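To make the location example in item 3 concrete, here is a minimal Python sketch that maps the three spellings from the posting onto one canonical "City, ST" form. It is only an illustration under stated assumptions: the `US_STATES` table and the `normalize_location` function are hypothetical names, the table holds just a few states, and the sketch does not call the Google service mentioned in the posting.

```python
import re

# Hypothetical lookup table for illustration only; a real system would need
# a complete gazetteer or a geocoding service.
US_STATES = {"texas": "TX", "california": "CA", "new york": "NY"}
STATE_CODES = set(US_STATES.values())


def normalize_location(raw: str) -> str:
    """Return 'City, ST' for a few common US spellings.

    Unrecognized input is returned stripped but otherwise unchanged.
    """
    text = raw.strip()

    # Spelling such as 'USA-TX-Houston'
    match = re.fullmatch(r"(?:USA|US)-([A-Za-z]{2})-(.+)", text)
    if match and match.group(1).upper() in STATE_CODES:
        return f"{match.group(2).strip().title()}, {match.group(1).upper()}"

    # Spellings such as 'Houston, TX' or 'Houston, Texas'
    if "," in text:
        city, _, region = text.partition(",")
        region = region.strip()
        if region.upper() in STATE_CODES:
            code = region.upper()
        else:
            code = US_STATES.get(region.lower())
        if code:
            return f"{city.strip().title()}, {code}"

    return text


if __name__ == "__main__":
    for sample in ("Houston, TX", "Houston, Texas", "USA-TX-Houston"):
        print(sample, "->", normalize_location(sample))
```

Keeping unrecognized values unchanged, rather than guessing, lets a later clean-up step or a human reviewer handle them.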
Some of the technical challenges that need to be addressed from the beginning:
• Make sure that the crawler compresses the data before fetching it; otherwise it will use a huge amount of storage
• No need to re-crawl a website every 1-2 days, because it would be a waste of resources; however, we do want the data every 1-2 days
• Ways to prevent the crawler from causing a DoS (Denial of Service)
• Ways to prevent the system from crashing and overloading because there are so many crawlers running
• The system should be scalable to handle crawling 100,000 – 200,000 websites
• Queuing: does the crawler start right away, or does it run in batches at a certain time? How does it scale when we start adding more sites to crawl? Example:
Day 1: Admin adds 100 sites to crawl
Day 2: Admin adds 200 sites to crawl
Day 3: Admin adds 500 sites to crawl
Day 4: etc.
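One common way to approach the DoS-prevention and queuing questions above is a per-host "politeness" queue: URLs are queued per host, and a host is only handed out again after a minimum delay. The Python sketch below shows this idea under stated assumptions; the class name `PoliteScheduler`, the `min_delay` parameter, and the example URLs are hypothetical, and the posting does not prescribe this design. The current time is passed in explicitly so the behaviour is easy to test.

```python
import heapq
from collections import defaultdict, deque
from urllib.parse import urlparse


class PoliteScheduler:
    """Hand out URLs so each host is fetched at most once per `min_delay` seconds."""

    def __init__(self, min_delay: float = 5.0):
        self.min_delay = min_delay          # assumed politeness interval, in seconds
        self._queues = defaultdict(deque)   # host -> pending URLs
        self._ready = []                    # heap of (earliest_fetch_time, host)
        self._scheduled = set()             # hosts that currently have a heap entry
        self._next_ok = {}                  # host -> earliest time of its next fetch

    def add(self, url: str, now: float) -> None:
        """Queue a URL; new sites can be added at any time."""
        host = urlparse(url).netloc
        self._queues[host].append(url)
        if host not in self._scheduled:
            earliest = max(now, self._next_ok.get(host, now))
            heapq.heappush(self._ready, (earliest, host))
            self._scheduled.add(host)

    def next_url(self, now: float):
        """Return a URL whose host may be fetched at `now`, or None if all hosts are cooling down."""
        if not self._ready or self._ready[0][0] > now:
            return None
        _, host = heapq.heappop(self._ready)
        url = self._queues[host].popleft()
        self._next_ok[host] = now + self.min_delay
        if self._queues[host]:
            # More work for this host: reschedule it after the delay.
            heapq.heappush(self._ready, (self._next_ok[host], host))
        else:
            self._scheduled.discard(host)
            del self._queues[host]
        return url


if __name__ == "__main__":
    scheduler = PoliteScheduler(min_delay=5.0)
    scheduler.add("http://a.example/1", now=0.0)
    scheduler.add("http://a.example/2", now=0.0)
    scheduler.add("http://b.example/1", now=0.0)
    print(scheduler.next_url(0.0))  # a URL from one host
    print(scheduler.next_url(0.0))  # a URL from the other host
    print(scheduler.next_url(1.0))  # None: both hosts are cooling down or empty
```

With this shape, adding 100, 200, or 500 sites on successive days only grows the per-host queues; the request rate seen by any single website stays bounded by `min_delay`.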
Project ID: 11964255

About the project

18 proposals
Remote project
Active 7 years ago

18 freelancers are bidding an average of $2,740 CAD for this job
Hello, We have a team of skilled Java/J2EE professionals with up to 8 years of experience.
===== Our expertise in Java / J2EE:
* Desktop Applications: Swing, Eclipse Rich Client Platform, AWT, SWT, RMI
* Frameworks: Spring, Spring Security, Spring Social, Struts, Hibernate, JPA, Lucene, Quartz, Ant, JUnit, DbUnit, MyBatis
* Web Technology: JSP, JSTL, JSF, jQuery, Ajax, JavaScript, DWR, FCKeditor, Ext JS
* Application Servers: JBoss, WebLogic, WebSphere, Apache Tomcat, GlassFish
* Databases: MySQL 4.x/5.x, Oracle 8i/9i/10g/11g, PostgreSQL
* Web Services: SOAP, WSDL, RESTful Web Services, Apache Axis
* IDE: Eclipse, NetBeans, WebRatio (Model Based Application Development IDE), Spring IDE
* Payment Gateway: PayPal Integration [experienced in integrating other payment gateways too]
* Project Management: SDLC, Agile
=====
We are available Monday to Friday, 9 hours a day. Our timezone is GMT+5:30. Please initiate a chat so we can check our understanding and raise our queries. You will be able to communicate directly with the expert working on your project. We look forward to a long-term engagement based on the quality of our work on this project. Thanks.
$3,000 CAD in 30 days
4.9 (42 reviews)
7.6

Hi, this is Anshuman. I have 6 years of experience in scraping, crawling, processing, and mining data. My previous projects include:
[1.] Automated crawling of Google for SEO keywords
[2.] E-commerce crawling: websites like Amazon, eBay, Alibaba, etc.
[3.] E-mail list scraping and phone number scraping for targeted users
[4.] Scraping data from within Android apps
[5.] Dynamic data crawling through JS manipulation
[6.] Automated form filling and scraping
[7.] Proxy emulation and authentication in order to prevent server blocking
[8.] Mobile site emulation and crawling mobile sites specifically
[9.] Scraping data from desktop apps, PDFs, etc.
[10.] Artificial intelligence to emulate human behaviour while crawling and scraping sites
Experience
[Programming languages] Python, PHP, NodeJS, jQuery, and Rails.
[Frameworks] Python Scrapy, Apache Nutch, Selenium, DOM manipulation using Chrome Extensions, urllib2, Python Requests, PHP Symfony, etc.
Data can be exported to: Excel files, MySQL, MongoDB, CouchDB, Cassandra, Redis, .docx files, Amazon S3, HDFS, Oracle, MSSQL, etc.
$3,000 CAD in 50 days
4.6 (17 reviews)
6.4

Dear Client, Greetings from Flowgica technologies. I have experience with these skills, and we have worked on a project similar to yours, so I am looking forward to discussing it and moving ahead. Please check our freelancer portfolio at https://www.freelancer.com/u/mmadi.html?page=portfolio. I am ready to work with you and am waiting for your response. Thanks & Regards, Mmadi
$1,800 CAD in 36 days
5.0 (6 reviews)
6.0

A proposal has not yet been provided
$2,777 CAD in 90 days
4.9 (66 reviews)
5.1

Hello, I have read exactly what you need; however, I would like to ask you a few questions. I work smart and do not rest until I get the job done. Please feel free to ping me anytime so we can have a detailed discussion and finalize our budget and timeline. I will deliver in the best possible way. Thank you.
$2,500 CAD in 30 days
5.0 (6 reviews)
4.8

Hello. 30% of the employers who hired me once hired me again. I have experience with this kind of work. I CAN do this job, and do it well!
$1,500 CAD in 30 days
5.0 (9 reviews)
4.0

Hi, I can develop that robust web crawler system. Please contact me for more details and samples of my work. I'm an expert web developer, with over 10 years of experience in PHP, WordPress, HTML5, PostCSS, CSS Modules, LESS, NodeJS, AngularJS, ElasticSearch, ReactJS, Gulp, AWS, Webpack, MongoDB, Socket.io. Best regards, Dmitry, Miami
$3,500 CAD in 30 days
5.0 (4 reviews)
3.5

I am an IITK graduate, 9 year experienced software professional and I have got top notch developers in my team, who have got experience across a span of technologies. The members in my team have worked with top notch tech organization such as Amazon, Cisco, Oracle etc. We have been involved in similar projects in the past and our track record has been excellent.
$2,500 CAD in 30 days
3.5 (18 reviews)
5.1

My name is Mike and I’m from UK. I work with individual clients and also provide outsourcing services for a number of UK and USA based agencies. Your project description sounds interesting to me and I do have skills & experience that are required to complete this project. I can show you some examples of my work. Please contact me to discuss your project.
$2,500 CAD in 30 days
5.0 (1 review)
3.2

Hello, This is a big project. I will briefly describe what I plan to do, and can share more ideas in a chat or a Skype call. First, I will create a platform that uses a master-slave relationship: a master with large storage and database management (possibly a clustered, replicated, or sharded database, depending on the stored data), and slaves for crawling. The slaves will be deployed on the go using an API from your cloud provider or one of my preferred cloud providers. A big problem will be the proxy list: public proxy lists change a lot, and this will require maintenance. Also, so that the crawlers do not effectively DDoS the websites, I will use a dynamic capping solution to ensure this does not happen. I already have a lot of ideas for this project, and it will be my pleasure to share them with you. Dan.
$5,555 CAD in 60 days
5.0 (7 reviews)
3.1
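
The "dynamic capping" idea in the proposal above is not specified further. One possible reading, offered purely as an assumption, is a per-site delay that doubles when the site shows signs of strain and relaxes slowly when responses look healthy. In the Python sketch below, the class name `AdaptiveDelay`, the thresholds, and the back-off factors are all hypothetical illustration values, not part of the proposal.

```python
class AdaptiveDelay:
    """Per-site crawl delay that backs off under strain and recovers slowly.

    Hypothetical sketch: the thresholds and factors are illustrative
    assumptions and would need tuning per site.
    """

    def __init__(self, base: float = 1.0, max_delay: float = 60.0,
                 slow_latency: float = 2.0):
        self.base = base                  # minimum delay between requests (seconds)
        self.max_delay = max_delay        # upper cap on the delay (seconds)
        self.slow_latency = slow_latency  # latency treated as a sign of strain
        self.delay = base

    def record(self, status_code: int, latency: float) -> float:
        """Update the delay from the last response and return the new value."""
        strained = status_code in (429, 503) or latency > self.slow_latency
        if strained:
            # Back off quickly: double the delay, up to the cap.
            self.delay = min(self.delay * 2, self.max_delay)
        else:
            # Recover slowly: shrink by 10%, never below the base delay.
            self.delay = max(self.delay * 0.9, self.base)
        return self.delay


if __name__ == "__main__":
    limiter = AdaptiveDelay(base=1.0, max_delay=8.0)
    for status, latency in [(200, 0.1), (429, 0.1), (503, 0.1), (200, 0.2)]:
        print(status, latency, "->", limiter.record(status, latency))
```

A crawler worker would sleep for the returned delay before sending its next request to the same site.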
Dear Sir, I hope you are doing well. I have read your job description and I am willing to work with you. I have already done a similar job and earned a 5-star rating with a wonderful review. Key responsibilities: - I will complete all of your requirements - I will make additional tweaks for you as well. Sir, I assure you that I am the best fit for this job. Please open a chat with me so we can discuss in more detail. Looking forward to hearing from you. Best Regards, Waheed Gondal
$2,468 CAD in 30 days
0.0 (0 reviews)
0.0

About this client

Canada
0.0
0
Member since Feb 23, 2015

Client verification

Freelancer ® is a registered Trademark of Freelancer Technology Pty Limited (ACN 142 189 759)
Copyright © 2024 Freelancer Technology Pty Limited (ACN 142 189 759)