How to scrape content from Meetup.com using PHP

One of my recent projects involved creating WordPress posts from Meetup.com events. As the API is in beta and its eligibility conditions are rather restrictive, an easy solution was to scrape the content and cache it (to avoid being blacklisted). The code is simple and requires only file_get_contents() and the DOMDocument library.

Creating posts (or custom post types) in WordPress is outside the scope of this tutorial, so I’ll only cover the scraping part.

First of all, we need a scraper class that returns a multi-dimensional array of events for a given group. Like all HTML scrapers, it is fragile: it depends on specific class names in Meetup's markup (such as eventCard), and the document structure may change at any time.
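
Before the full class, here is a minimal sketch of the core technique it relies on: loading HTML into DOMDocument and selecting elements by a single class name with DOMXPath. The helper name find_text_by_class() and the sample markup are hypothetical and only serve as an illustration; the real Meetup markup is more complex.

```php
<?php
// Hypothetical helper for illustration, not part of the class in this post.
// Returns the trimmed text of every element whose class list contains $classname.
function find_text_by_class(string $html, string $classname): array {
    $dom = new DOMDocument();
    libxml_use_internal_errors(true);   // real-world HTML is rarely valid; silence parser warnings
    $dom->loadHTML($html);
    libxml_clear_errors();

    $xpath = new DOMXPath($dom);
    // Pad the class attribute with spaces so that 'eventCard' matches as a whole
    // token and does not match longer names such as 'eventCardExtra'.
    $query = "//*[contains(concat(' ', normalize-space(@class), ' '), ' $classname ')]";

    $texts = array();
    foreach ($xpath->query($query) as $node) {
        $texts[] = trim($node->textContent);
    }
    return $texts;
}

// Made-up sample markup (an assumption, not copied from Meetup.com)
$html = '<div class="eventCard featured"><a href="/g/events/1/">First</a></div>'
      . '<div class="eventCardExtra">Not an event</div>'
      . '<div class="eventCard"><a href="/g/events/2/">Second</a></div>';

print_r(find_text_by_class($html, 'eventCard'));
```

Only the two elements that carry the eventCard token are returned. If Meetup renames that class, the query simply returns nothing, which is why the scraper is fragile.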

Here’s the class, commented and explained:

<?php
class MeetupEvents {
    public $cachePath;
    public $cacheAge;
    public $meetupBase;
    public $eventsUri;

    /**
     * MeetupEvents constructor
     * 
     * @param string $cacheAge argument to pass to strtotime() - defaults to '-1 hour'
     * @param string $cachePath path to where cache files will be kept - defaults to '/tmp/'
     */
    public function __construct($cacheAge = null, $cachePath = null) {
        if ($cachePath != null){
            $this->cachePath = $cachePath;
        } else {
            $this->cachePath = '/tmp/';
        }
        if($cacheAge !== null){
            $this->cacheAge = strtotime($cacheAge);
        } else {
            $this->cacheAge = strtotime('-1 hour');
        }

        $this->meetupBase = 'https://www.meetup.com'; // no trailing slash
        $this->eventsUri = '/events/'; // leading and trailing slash
    }

    /**
     * Fetches a URL and caches the response body on disk; returns the cached copy while it is still fresh
     * 
     * @param string $url
     * @param int $cacheAge in the form of epoch. to use 1 hour, do: strtotime('-1 hour')
     * @param string $cachePath directory for cache files, with trailing slash
     * @return string or boolean false
     */
    function getAndCacheUrl($url, $cacheAge, $cachePath) {
        $cacheFile = $cachePath . 'MeetupEvents_cache_' .  md5($url);

        if (!is_file($cacheFile) || filectime($cacheFile) < $cacheAge) {
            // todo - switch this to use cURL call and check for 200s and such
            $result = file_get_contents($url);
            if ($result === false) {
                error_log("getAndCacheUrl() can't fetch from $url - this is bad!");
                return false; // never cache a failed fetch
            }
            if (!is_writable($cachePath) || file_put_contents($cacheFile, serialize($result)) === false) {
                error_log("getAndCacheUrl() can't write to $cacheFile - this is bad!");
            }
            return $result;
        }

        // Cache is fresh: read the stored copy instead of hitting Meetup.com again
        $fetched = file_get_contents($cacheFile);
        if ($fetched === false) {
            error_log("getAndCacheUrl() can't retrieve data from $cacheFile - this is bad!");
            return false;
        }
        return unserialize($fetched);
    }

    /**
     * Get future events for a group
     * 
     * @param $group string group name to fetch events for
     * @return array $events with each member being an array with keys of link, title, epoch, human_date, status and description
     */
    function get_future_meetup_events($group) {
        $url = $this->meetupBase . '/' . $group . $this->eventsUri;
        return $this->get_meetup_events($group, $url);
    }


    /**
     * Get past events for a group
     * 
     * @param $group string group name to fetch events for
     * @return array $events with each member being an array with keys of link, title, epoch, human_date, status and description
     */
    function get_past_meetup_events($group) {
        $url = $this->meetupBase . '/' . $group . $this->eventsUri . 'past/';
        return $this->get_meetup_events($group, $url);
    }

    /**
     * Get past/future events in a multi-dimensional array. uses getAndCacheUrl()
     * 
     * @param $group string group name to use in URL
     * @param $url string URL as derived from past or future events functions
     * @return array
     */
    private function get_meetup_events($group, $url) {
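        $meetupHtml = $this->getAndCacheUrl($url, $this->cacheAge, $this->cachePath);

        $events = array();
        if (!is_string($meetupHtml) || trim($meetupHtml) === '') {
            // Nothing to parse: the fetch failed or the page was empty
            return $events;
        }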
        $dom = new DOMDocument;
        libxml_use_internal_errors(true);
        $dom->loadHTML($meetupHtml);
        libxml_clear_errors();

        $finder = new DOMXPath($dom);
        $classname = "eventCard";
        $nodes = $finder->query("//*[contains(concat(' ', normalize-space(@class), ' '), ' $classname ')]");

        foreach ($nodes as $node) {
            $event = array();
            foreach ($node->getElementsByTagName('a') as $link) {
                if ($link->getAttribute('class') == 'eventCardHead--title') {
                    $event['link'] = $this->meetupBase . $link->getAttribute('href');
                    $event['title'] = $link->nodeValue;
                    break;
                }
            }
            foreach ($node->getElementsByTagName('time') as $link) {
                if ($link->getAttribute('datetime')) {
                    $event['epoch'] = $link->getAttribute('datetime');
                    $event['human_date'] = $link->nodeValue;
                }
            }
            // An event is active unless one of its divs is struck through
            $event['status'] = 'active';
            foreach ($node->getElementsByTagName('div') as $link) {
                if (strstr($link->getAttribute('class'), 'text--strikethrough')) {
                    $event['status'] = 'cancelled';
                    break;
                }
            }
            // Default to null, then keep the first visible paragraph that carries the description classes
            $event['description'] = null;
            foreach ($node->getElementsByTagName('p') as $link) {
                if ($link->getAttribute('class') == 'text--small padding--top margin--halfBottom' &&
                    stristr($link->getAttribute('style'), "visibility:hidden") === false
                ) {
                    $event['description'] = $link->nodeValue;
                    break;
                }
            }
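            if (!isset($event['epoch'])) {
                continue; // skip cards without a <time datetime="..."> element
            }
            // The random suffix keeps events with the same start time from overwriting each other
            $events[$event['epoch'] . '-' . rand(100000, 888888)] = $event;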
        }
        ksort($events);
        return $events;
    }
}

The get_future_meetup_events() and get_past_meetup_events() functions return an array of events, sorted by start time. Each event is itself an array with the following details:

Array
(
    [link] => https://www.meetup.com/wordpress-cambridge/events/281545076/
    [title] => WordPress Cambridge: My Favourite Plugin at Maypole & Zoom
    [epoch] => 1636398000000
    [human_date] => Mon, Nov 8, 2021, 7:00 PM GMT
    [status] => active
    [description] => Tonight we're meeting at the Maypole and on Zoom. We'll demonstrate, share and discuss our favourite plugins. Please bring login details for a site on which you used the plugin, so our Zoom friends can see it as well. All plugins are welcome, especially the lesser known plugins. I'll hopefully talk around Security Headers for about 10 minutes. For our Zoom friends, I'll send the Zoom link on Monday to those that register. Meetup only allows virtual or real meetings not both. We're at the Maypole, 20a Portugal Place. Cambridge, CB5 8AF. Get to 4 Lamps roundabout, then Jesus Lane, Right into Park Street, and the large multistorey car park is virtually next to the Maypole. The owner, Vincent, asks everyone to buy something. Soft drinks are fine. Food is great. Tap water isn't. He's offering us a lovely room with BT Guest Wifi, a projector & screen, so it needs to work for him. Given that many wonderful members are in America, Europe, and all across the UK, we'll be on Zoom. There will be a few improvements after last month, which seemed to work pretty well for our far flung friends. Thanks to Chris Cox of https://www.hihub.info/ for providing Zoom. I'll send an email out for the Zoom connection for our far flung friends on Monday. This format seems to work. 7:00pm: Welcome. Depending on numbers, everyone gets to introduce themselves, perhaps 10 seconds each, mainly describing their involvement with WordPress. If we have many attendees, then introductions via the chat. 7:10 pm: Start of sharing & discussion. 9.00 There its a desire to wrap up the meeting by about 9pm, and hopefully purchase more from Vincent's bar, then chat about non-wordpress stuff.
)
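
Judging by the sample above, the epoch value is a Unix timestamp in milliseconds, while PHP's date functions expect seconds. The sketch below shows one way to convert it. The helper name meetup_epoch_to_datetime() is hypothetical, and the millisecond format is an assumption based on this one sample rather than on any documented guarantee.

```php
<?php
// Hypothetical helper: converts a millisecond timestamp to a UTC DateTimeImmutable.
// Assumes 64-bit PHP 7 or later (intdiv(), and integers large enough for millisecond values).
function meetup_epoch_to_datetime($epochMs) {
    $seconds = intdiv((int) $epochMs, 1000);
    // The '@' prefix makes PHP read the number as a Unix timestamp in UTC
    return new DateTimeImmutable('@' . $seconds);
}

$when = meetup_epoch_to_datetime('1636398000000');
echo $when->format('D, M j, Y, g:i A'), " UTC\n";
```

For the sample event this corresponds to the human_date shown above (7:00 PM on Nov 8, 2021, in GMT). A DateTimeImmutable is also convenient when a WordPress post date has to be derived from the event.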

The following script prints the first five future and past events for a group:

<?php
// Require the MeetupEvents class
require_once 'MeetupEvents.php';

// Instantiate the class
$meetup = new MeetupEvents();

// Get the first 5 future events for an example WordPress group
$events = $meetup->get_future_meetup_events('wordpress-cambridge');
$count = 1;

foreach ($events as $event) {
    print "<div class='date {$event['status']}'>{$event['human_date']}</div>";
    print "<div class='event'>{$event['title']}</div>";

    echo '<pre>';
    print_r($event);
    echo '</pre>';

    // Do whatever you want with the $event array

    $count++;
    if ($count > 5) {
        break;
    }
}



// Get the first 5 past events for an example WordPress group
$events = $meetup->get_past_meetup_events('wordpress-cambridge');
$count = 1;

foreach ($events as $event) {
    print "<div class='date {$event['status']}'>{$event['human_date']}</div>";
    print "<div class='event'>{$event['title']}</div>";

    echo '<pre>';
    print_r($event);
    echo '</pre>';

    // Do whatever you want with the $event array

    $count++;
    if ($count > 5) {
        break;
    }
}

The ideal way to query Meetup.com is either manually or using a weekly cron job, depending on how active the groups are. This way, you’ll stay on the safe side and still have a fresh update every week.
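
For a weekly schedule, the class's first constructor argument can be set to '-1 week' instead of the default '-1 hour', since it is passed straight to strtotime(). The sketch below isolates that freshness check so it can be tried on its own. The helper name cache_is_stale() is hypothetical, and it uses filemtime() where the class uses filectime().

```php
<?php
// Hypothetical helper, not part of the MeetupEvents class.
// A cache file is stale when it is missing or was last written before the cutoff.
function cache_is_stale($cacheFile, $maxAge = '-1 week') {
    clearstatcache(true, $cacheFile);   // PHP caches stat() results; drop them first
    if (!is_file($cacheFile)) {
        return true;
    }
    // strtotime('-1 week') is the epoch of one week ago, the same idea as $cacheAge in the class
    return filemtime($cacheFile) < strtotime($maxAge);
}

// Demo with a temporary file standing in for a cached page
$file = tempnam(sys_get_temp_dir(), 'meetup_demo_');
touch($file, time() - 8 * 86400);       // pretend it was written 8 days ago
var_dump(cache_is_stale($file));        // older than a week, so a refetch is due
touch($file);                           // pretend it was just refreshed
var_dump(cache_is_stale($file));        // fresh, so the cached copy would be served
unlink($file);
```

With a weekly cron job and a one-week cache age, each run refetches the page at most once, and any page views in between are served from the cache.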
