Friday, September 27, 2013

Python + Hadoop: Real Python in Pig trunk:



For a long time, data scientists and engineers had to choose between leveraging the power of Hadoop and using Python’s amazing data science libraries (like NLTK, NumPy, and SciPy). It’s a painful decision, and one we thought should be eliminated.
So about a year ago, we solved this problem by extending Pig to work with CPython, allowing our users to take advantage of Hadoop with real Python (see our presentation here). To say Mortar users have loved that combination would be an understatement.
However, only Mortar users could use Pig and real Python together…until now.
As of this week, our work with Pig and CPython has now been committed into Apache Pig trunk. We’ve always been deeply dedicated to open source and have contributed as much as possible back to the community, so this is just one more example of that commitment.
Why is CPython support so exciting? To fully understand, you need to know a little bit about the previous options for writing Python code with Hadoop.
One common option is for people to use a Python-specific Hadoop framework like mrjob, Pydoop, or Dumbo. While these frameworks make it easy to write Python, you’re stuck writing low-level MapReduce jobs and thus miss out on most of Pig’s benefits as compared to MapReduce.
So what about Python in Pig? Before CPython support, you had two options: Jython User Defined Functions (UDFs) or Pig streaming.
Jython UDFs are really easy to write in Pig and work well for a lot of common use cases. Unfortunately, they also have a couple of limitations. For serious data science work in Python, you often want to turn to libraries like NLTK, NumPy, and SciPy. However, using Jython means that all of these libraries that rely on C implementations are out of reach and unusable. Jython also lags behind CPython in support for new Python features, so porting any of your existing Python code to Jython isn’t always a pleasant experience.
Streaming is a powerful and flexible tool that is Pig’s way of working with any external process. Unfortunately, streaming is difficult to use for all but the most trivial of Python scripts. It’s up to the user to write Python code to manage reading from the input stream and writing to the output stream, and that user needs to understand Pig’s serialization formats and write their own deserialization/serialization code in Python. Moreover, the serialization code lacks support for many common cases like data containing newline characters, parentheses, commas, etc. Errors in the Python code are hard to capture and send back to Pig, and even harder to diagnose and debug. It’s not a process for the faint of heart.
Looking at these alternatives, people who want to use Python and CPython libraries in Pig are stuck. But with CPython UDFs, users can leverage Pig to get the power and flexibility of streaming directly to a CPython process without the headaches associated with Pig streaming.
Here’s a quick example: Let’s say you want to use NLTK to find the 5 most common bigrams by place name in some Twitter data. Here’s how you can do that (using data from the Twitter gardenhose we provide as a public convenience):
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
REGISTER ‘<python_file>’ USING streaming_python AS nltk_udfs;
tweets = LOAD 's3n://twitter-gardenhose-mortar/tweets'
'text: chararray, place:tuple(name:chararray)');
-- Group the tweets by place name and use a CPython UDF to find the top 5 bigrams
-- for each of these places.
bigrams_by_place = FOREACH (GROUP tweets BY GENERATE
group AS place:chararray,
COUNT(tweets) AS sample_size;
top_100_places = LIMIT (ORDER bigrams_by_place BY sample_size DESC) 100;
STORE top_100_places INTO '<your_output_path>';
view raw nltk.pig hosted with ❤ by GitHub
1 2 3 4 5 6 7 8 9 10 11 12
from pig_util import outputSchema
import nltk
def top_5_bigrams(tweets):
tokenized_tweets = [ nltk.tokenize.WhitespaceTokenizer().tokenize(t[0]) for t in tweets ]
bgm = nltk.collocations.BigramAssocMeasures()
finder = nltk.collocations.BigramCollocationFinder.from_documents(tokenized_tweets)
top_5 = finder.nbest(bgm.likelihood_ratio, 5)
return [ ("%s %s" % (s[0], s[1]),) for s in top_5 ]
view raw hosted with ❤ by GitHub
And that’s it.  You get to focus just on the logic you need, and streaming Python takes care of all the plumbing.
To run this yourself, you’ll need a Pig 0.12 build and a Hadoop cluster with Python and NLTK installed on it. If that’s too much hassle, you can run it locally with the Mortar framework or at scale on the Mortar platform for free.

No comments:

Post a Comment

Note: Only a member of this blog may post a comment.