Shockingly inefficient Perl script on Google n-grams
$30-250 USD
Completed
Posted over 10 years ago
$30-250 USD
Paid on delivery
Some colleagues developed a Perl script that compares the similarity of two sentences using Google n-grams. The n-gram files are huge, and without knowing Perl, we believe they have done nothing to optimize retrieval from the n-gram files. Each sentence comparison now takes an average of 7 minutes, and since we have about 500,000 sentence pairs to compare, this task would take almost 7 years to run. We need the speed improved by two orders of magnitude, to an average of 4.2 seconds per comparison.
We suspect that a simple initial indexing of the n-gram files at the start of the process may take care of the problem. It would be acceptable for the system to spend up to an hour at startup doing any indexing and loading into memory. Up to 20 GB of memory may be used to store the indexed data.
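A minimal sketch of that startup-indexing idea, written in Python for illustration (the same structure maps directly to a Perl hash; the tab-separated `ngram<TAB>count` file layout is an assumption about the data files):

```python
def build_ngram_index(path):
    """Load an n-gram count file (assumed format: ngram<TAB>count per line)
    into an in-memory dict once at startup, so that every later lookup is a
    single O(1) hash probe instead of a linear scan of the file."""
    index = {}
    with open(path, encoding="utf-8") as fh:
        for line in fh:
            ngram, count = line.rstrip("\n").split("\t")
            index[ngram] = int(count)
    return index

# One-time load at startup, then cheap lookups for every comparison:
# index = build_ngram_index("2gram-shard-a.txt")   # hypothetical filename
# freq = index.get("new york", 0)
```

Paying the file-read cost once and amortising it over all 500,000 comparisons is exactly the kind of change that yields the requested two orders of magnitude.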
My name is Elias Hamaz, a Perl Coder based in London UK.
I can load the n-gram files into a reference tree, so that queries are answered from RAM.
I can then modify the code to query the tree.
My initial assessment is that:
1: The 20GB limit means that a file can be in memory only while its data is being queried.
2: A maximum of 2 files will be in memory at one time.
3: The order of the list of comparisons can be optimised so that queries on a particular file are performed sequentially, so as to minimise the number of disk read operations.
Please get in touch to discuss the details of the comparison process.
Regards,
Elias Hamaz
$164 USD in 1 day
5.0 (1 review)
2.7
10 freelancers are bidding an average of $181 USD for this job
Definitely an interesting issue, I'd be glad to take the challenge and work on it :) Thank you. Is it a Linux system you're working on?
(P.S. Good that you aren't in a Tom Sawyer mood right now: you'd reverse the bid and award the job to the highest bidder :) )
I'm interested in this project. I'm an experienced Perl developer (15+ years) and Linux administrator. The bid is just for 2-3 hours of work, which may or may not be enough to solve the problem; I cannot guarantee it without seeing the code. Regards.
I am new to Freelancer but have extensive experience working with Perl.
I have executed many automation/optimization projects in Perl for my employer.
I want to understand your full requirements and will then present my approach; only if you are satisfied need you award me this project.
I assure you I will meet your expectations.
I have extensive knowledge of Perl and of creating indexed data structures to allow for efficient data comparisons/manipulations; based on the project description, I propose using a nested hash structure to first load the n-gram data (actual implementation details depend on your data files, such as your "n-" number and how many files are being used) before reading in your sentences for comparison.
Given sample data (n-gram files and comparison-sentence input files) and your output requirements, I am confident I can deliver an efficient solution that achieves your goal in a timely manner.
I look forward to discussing this in detail at your earliest convenience.
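A nested-hash index of the kind this bid describes might look like the following sketch (Python used for illustration; it is a direct analogue of a Perl hash-of-hashes, and the two-word n-grams with tab-separated counts are illustrative assumptions about the data):

```python
def build_nested_index(lines):
    """Two-level ('nested hash') index: the outer key is the first word of
    the n-gram, the inner key is the remainder. Mirrors a Perl
    hash-of-hashes such as $index{$first}{$rest} = $count."""
    index = {}
    for line in lines:
        ngram, count = line.split("\t")
        first, _, rest = ngram.partition(" ")
        index.setdefault(first, {})[rest] = int(count)
    return index

def lookup(index, ngram):
    """Return the stored count for an n-gram, or 0 if it is absent."""
    first, _, rest = ngram.partition(" ")
    return index.get(first, {}).get(rest, 0)

idx = build_nested_index(["new york\t42", "new jersey\t13"])
# lookup(idx, "new york") → 42
```

Keying on the first word also makes it easy to load or evict one prefix group at a time, which fits the memory constraints stated in the project description.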
I have 4 years of experience in Unix and Perl.
I can modify the perl script.
My bid is low only to gain experience in freelancer.com , not because I am inefficient.
If you send the Perl script, I can tell you exactly how long it will take to modify.
You pay only if the end result is satisfying.
Thanks,
Santhanalekshmi