hadoop - Do something to the entire Reducer values list based on one element


I have an interesting problem that I'm struggling to fit into MapReduce. I have a bunch of log entries, and what I basically have to do is this:

Check whether any entry for an IP has a specific flag set. If so, apply a transformation to all entries with that IP; otherwise, do not transform them.

The easiest way to do this would be to key on the IP, then iterate over the values once to check whether any flag is set, and a second time to transform them (if needed). Unfortunately, it seems that I can only iterate once over the Iterable that is passed into reduce.

The possible solutions I see are:

  1. In the reducer, spill the values to disk so that I can iterate over them a second time. That means serializing and deserializing a lot of data, and it seems like a hack.
  2. Run a job beforehand that builds a list of IPs to transform and stores it in HBase or something similar. This obviously requires HBase and a lot of network communication.

I want to stick with standard MapReduce so that the job can easily be set up on Amazon Elastic MapReduce. I think there should be a way to do this through chained jobs, but I'm not seeing it. Does anyone have any suggestions on how I can do this?

One possibility: your mapper can produce a composite key that includes both the IP address and the presence of this particular flag. Then you need to make sure that the records sent to the reducer are sorted so that records where flag = true appear first, while still being grouped by IP address alone. Since the flagged records appear first, looking at the first record of a group tells you whether all records in that IP address group need the transformation applied.
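The idea above can be sketched in plain Java, without the Hadoop classes, to show why a single pass is enough once the flagged records sort first. This is only an illustration under assumptions: `LogEntry`, `transform`, and `process` are hypothetical names, and upper-casing the payload stands in for whatever the real transformation is. In an actual Hadoop job the sort order would live in a custom `WritableComparable` key (or a sort comparator), with a partitioner and a grouping comparator (`Job.setPartitionerClass`, `Job.setGroupingComparatorClass`) that look at the IP only.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Locale;

public class SecondarySortSketch {

    /** Hypothetical log record: IP, whether the special flag is set, and a payload. */
    static final class LogEntry {
        final String ip;
        final boolean flagged;
        final String payload;

        LogEntry(String ip, boolean flagged, String payload) {
            this.ip = ip;
            this.flagged = flagged;
            this.payload = payload;
        }
    }

    /** Placeholder transformation (an assumption): upper-case the payload. */
    static String transform(String payload) {
        return payload.toUpperCase(Locale.ROOT);
    }

    /** Mimics the shuffle sort on the composite key: by IP, then flagged entries first. */
    static final Comparator<LogEntry> SORT_ORDER = (a, b) -> {
        int byIp = a.ip.compareTo(b.ip);
        if (byIp != 0) {
            return byIp;
        }
        // Swapped arguments so that flagged == true sorts before false.
        return Boolean.compare(b.flagged, a.flagged);
    };

    /** Single pass over the sorted records, as a reducer grouped by IP would see them. */
    static List<String> process(List<LogEntry> entries) {
        List<LogEntry> sorted = new ArrayList<>(entries);
        sorted.sort(SORT_ORDER);

        List<String> out = new ArrayList<>();
        String currentIp = null;
        boolean transformGroup = false;
        for (LogEntry e : sorted) {
            if (!e.ip.equals(currentIp)) {
                // First record of a new IP group: its flag decides for the whole group.
                currentIp = e.ip;
                transformGroup = e.flagged;
            }
            String value = transformGroup ? transform(e.payload) : e.payload;
            out.add(e.ip + "\t" + value);
        }
        return out;
    }

    public static void main(String[] args) {
        List<LogEntry> entries = new ArrayList<>();
        entries.add(new LogEntry("10.0.0.1", false, "a"));
        entries.add(new LogEntry("10.0.0.2", false, "b"));
        entries.add(new LogEntry("10.0.0.1", true, "c"));
        entries.add(new LogEntry("10.0.0.2", false, "d"));
        for (String line : process(entries)) {
            System.out.println(line);
        }
    }
}
```

The point of the design is that the reducer never has to look back: because the flag is part of the sort key but not of the grouping, the decision for the whole group is already known when the first value arrives.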

Here is a blog posting that walks through this:


