ufXtract’s portable social network parser

After 14 months of talking about portable social networks at various events from SXSWi to a small geek dinner, I have finally found the time to build a working example.

ufXtract’s portable social network parser is a combination of the ufXtract microformats parser and a spider which follows rel="me" links. It has been designed to extract profiles and friends lists from social networks and other sites which have microformats support. The parser returns two main collections of data: all the rel="me" links and any hCard-XFN patterns.
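To make the rel="me" idea concrete, here is a minimal Python sketch of how such links might be pulled out of a page. It is not the ufXtract code (which is written in C#); the class and function names are made up for illustration, and it only looks at `a` and `link` elements using the standard library.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class RelMeCollector(HTMLParser):
    """Collect the href of every <a> or <link> whose rel list contains "me".

    Hypothetical helper for illustration; not part of ufXtract.
    """

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag not in ("a", "link"):
            return
        attr = dict(attrs)
        # rel is a space-separated list, so rel="me nofollow" still counts.
        rel_values = (attr.get("rel") or "").lower().split()
        href = attr.get("href")
        if "me" in rel_values and href:
            # Resolve relative links against the page they were found on.
            self.links.append(urljoin(self.base_url, href))


def extract_rel_me(html, base_url):
    """Return absolute URLs of all rel="me" links in the given HTML."""
    collector = RelMeCollector(base_url)
    collector.feed(html)
    return collector.links
```

Calling `extract_rel_me(page_html, page_url)` on a profile page would give the list of URLs a spider could visit next.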

The parser API: http://lab.backnetwork.com/ufXtract-psn/

A demo using JavaScript and JSON: http://lab.backnetwork.com/ufXtract-psn/demo01.htm

The Parser

You can restrict the parser to a single domain or spider across the whole web. Currently, there are limits on the number of pages that will be parsed. Each collection item is given an additional source-url attribute to identify its origin.
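The behaviour described above could be sketched roughly as follows, again in Python rather than the parser's own C#. The function name, the `same_domain_only` and `max_pages` parameters, and the caller-supplied `fetch_links` function are all assumptions for illustration; only the source-url idea comes from the parser itself.

```python
from collections import deque
from urllib.parse import urlparse


def spider(start_url, fetch_links, same_domain_only=True, max_pages=10):
    """Breadth-first walk over rel="me" links with a page limit.

    fetch_links(url) is a caller-supplied function returning the rel="me"
    URLs found on that page (hypothetical; not part of ufXtract).
    """
    start_host = urlparse(start_url).netloc.lower()
    queue = deque([start_url])
    seen = {start_url}
    results = []
    pages_parsed = 0

    while queue and pages_parsed < max_pages:
        url = queue.popleft()
        pages_parsed += 1
        for link in fetch_links(url):
            # Record where each item was found, like the source-url attribute.
            results.append({"url": link, "source-url": url})
            if link in seen:
                continue
            # Optionally stay on the domain the crawl started from.
            if same_domain_only and urlparse(link).netloc.lower() != start_host:
                continue
            seen.add(link)
            queue.append(link)
    return results
```

Passing `same_domain_only=False` lets the walk cross to other sites, while `max_pages` keeps it from running away.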

There is support for both XML and JSON output, for both client and server-side development.

The parser also uses a version of the representative hCard concept, which tries to identify the hCard representing the profile owner. The implementation is a little more complex than described on the microformats wiki as it extends over multiple pages and domains. This means you may find multiple representative hCards from one call to the API, but there should only ever be one per URL.
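As a rough illustration of the "one per URL" idea, the sketch below picks at most one hCard for a given page. It is a simplified approximation and not ufXtract's actual rules, which may differ: the dictionary keys (`uid`, `url`, `url_is_rel_me`), the fallback to a lone hCard, and the crude URL normalisation are all assumptions.

```python
def pick_representative(cards, page_url):
    """Return at most one card for page_url, or None if none qualifies.

    cards is a list of dicts from some earlier parsing step (hypothetical
    shape): "uid", "url", and a boolean "url_is_rel_me".
    """
    def norm(u):
        # Crude normalisation for comparison only; real code would need more care.
        return (u or "").rstrip("/").lower()

    page = norm(page_url)

    # 1. A card whose uid and url both point at the page itself.
    for card in cards:
        if norm(card.get("uid")) == page and norm(card.get("url")) == page:
            return card

    # 2. A card whose url is also marked up with rel="me".
    for card in cards:
        if card.get("url") and card.get("url_is_rel_me"):
            return card

    # 3. Assumed fallback: a page with exactly one card.
    if len(cards) == 1:
        return cards[0]
    return None
```

Running this once per parsed page gives at most one result per URL, so a crawl across several pages can still return several representative hCards.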

The Demo

I believe there are a number of different ways that this functionality could be designed into web sites, so I have provided a simple interface design to demonstrate one possibility. It’s a bit of a homage to the getsatisfaction.com registration page with a few extra twists. I would like to thank my co-worker James Wragg, who created the JavaScript for the demo.

Of the sites listed on the demo, last.fm and ma.gnolia.com return the best results. The other sites have differing levels of portable social network support. It also works well against blogs such as adactio.com or tantek.com that are marked up with rel="me". It’s worth trying out the two depth search levels.

Pages not parsing

On some sites, such as twitter.com, you may find that only certain pages are parsed. These sites often have good microformats support, but parts of their functionality are locked behind logins. The parser does not support authenticated sessions, as this would mean asking users to pass me their log-in details, which is a really bad idea. If I can lay my hands on good OpenID and/or OAuth C# libraries, I will try to implement some different types of authentication.

Research

This is all research work and still under development. I placed it on the web for others to experiment with and to help foster discussion. I hope you enjoy playing with it.
