This project is developing TwitterEcho, a research platform built around a focused crawler for the twittosphere with a modular, distributed architecture. The crawler lets researchers continuously collect data from particular user communities while respecting the limits imposed by Twitter. The platform currently includes modules for crawling the Portuguese twittosphere. Additional modules can be easily integrated, which makes it possible to shift the focus to a different community or to perform a topic-focused crawl.
The platform is being developed at the Faculty of Engineering of the University of Porto, in the scope of the REACTION project and in collaboration with SAPO Labs.
The crawler is available as open source, strictly for academic research purposes. Download site: http://robinson.fe.up.pt/~projects/twitter_crawler/
Crawler Architecture
The TwitterEcho architecture is depicted in the diagram below.
The crawling platform includes a back-end server that manages a distributed crawling process and several thin clients that use the Twitter API to collect data.
The server sends lists of usernames to the various lookup clients, which collect the last tweet posted by each of the listed users. The server includes a scheduler that continuously monitors each user's level of activity and prioritizes the crawling of their tweets accordingly. Thus, the more active a user is, the more frequently their tweets are crawled.
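The activity-based prioritization described above can be illustrated with a small sketch. It is not the TwitterEcho scheduler itself: the class name `ActivityScheduler`, the interval formula, and the default values are all assumptions made for illustration. The idea is only that a user's revisit interval shrinks as their estimated tweet rate grows.

```python
import heapq


class ActivityScheduler:
    """Toy scheduler (hypothetical): more active users come due sooner."""

    def __init__(self, base_interval=3600.0, min_interval=60.0):
        # Both defaults are assumed values, in seconds.
        self.base_interval = base_interval  # revisit interval for an inactive user
        self.min_interval = min_interval    # lower bound, to respect API limits
        self._queue = []                    # heap of (next_due_time, username)
        self._activity = {}                 # username -> estimated tweets per hour

    def interval_for(self, username):
        # Assumed formula: the interval is inversely related to the tweet rate.
        rate = self._activity.get(username, 0.0)
        return max(self.min_interval, self.base_interval / (1.0 + rate))

    def add_user(self, username, now=0.0, activity=0.0):
        self._activity[username] = activity
        heapq.heappush(self._queue, (now, username))

    def report_activity(self, username, tweets_per_hour):
        # Called when a lookup client returns fresh data for the user.
        self._activity[username] = tweets_per_hour

    def next_batch(self, now, size):
        """Pop up to `size` users that are due at `now` and reschedule them."""
        batch = []
        while self._queue and len(batch) < size and self._queue[0][0] <= now:
            _, username = heapq.heappop(self._queue)
            batch.append(username)
        for username in batch:
            next_due = now + self.interval_for(username)
            heapq.heappush(self._queue, (next_due, username))
        return batch
```

In this sketch the server would call `next_batch` whenever a lookup client asks for work, and `report_activity` when results come back, so that busy accounts drift toward the front of the queue.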
The platform also includes links clients that crawl information about the followers and friends of a given list of users. The new users module expands the list of users in two ways: i) it extracts screen names that are mentioned (@) or retweeted (RT @) in the crawled tweets; ii) it obtains user IDs from the lists of followers.
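As a rough illustration of step i), the following sketch pulls mentioned and retweeted screen names out of tweet text with regular expressions. The function names and patterns are assumptions, not the platform's actual code; the patterns rely on Twitter screen names being made of letters, digits, and underscores, up to 15 characters long.

```python
import re

# A screen name preceded by "@", but not inside an e-mail address or "@@".
MENTION_RE = re.compile(r"(?<![A-Za-z0-9_@])@([A-Za-z0-9_]{1,15})")
# The conventional retweet marker "RT @name".
RETWEET_RE = re.compile(r"\bRT\s+@([A-Za-z0-9_]{1,15})")


def extract_candidates(tweet_text):
    """Return screen names found in a tweet, split by how they were referenced."""
    retweeted = {name.lower() for name in RETWEET_RE.findall(tweet_text)}
    mentioned = {name.lower() for name in MENTION_RE.findall(tweet_text)}
    # A retweeted name also matches the mention pattern; report it only once.
    return {"retweeted": retweeted, "mentioned": mentioned - retweeted}


def new_screen_names(known_users, tweets):
    """Hypothetical helper: screen names seen in `tweets` but not yet known."""
    known = {name.lower() for name in known_users}
    found = set()
    for text in tweets:
        candidates = extract_candidates(text)
        found |= candidates["retweeted"] | candidates["mentioned"]
    return found - known
```

Screen names are lowercased here because they are case-insensitive on Twitter; candidates found this way would still need to pass the filtering modules before being crawled.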
The server includes two modules that filter users based on their nationality: a profile module and a language module. The current modules were specifically designed to identify Portuguese users, but they can be replaced and/or augmented by other filtering modules, e.g. modules focused on other communities or on specific topics.
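One way to picture replaceable filtering modules is as plain functions that take a user record and return a yes/no decision. The sketch below is a toy under several assumptions: the user record layout, the tiny stopword and location lists, the threshold, and the rule that a user is kept when any filter accepts them are all invented for illustration and do not describe the platform's real heuristics.

```python
# Tiny illustrative lists; a real module would need far richer resources.
PT_STOPWORDS = {"de", "que", "não", "para", "com", "uma", "os", "em", "um", "por"}
PT_LOCATIONS = ("portugal", "lisboa", "porto")


def profile_filter(user):
    """Accept users whose free-text profile location names a known place."""
    location = user.get("location", "").lower()
    # Naive substring match: it would also accept e.g. "Porto Alegre".
    return any(place in location for place in PT_LOCATIONS)


def language_filter(user, threshold=0.2):
    """Accept users whose tweets contain enough Portuguese stopwords."""
    tokens = [tok for tweet in user.get("tweets", []) for tok in tweet.lower().split()]
    if not tokens:
        return False
    hits = sum(1 for tok in tokens if tok in PT_STOPWORDS)
    return hits / len(tokens) >= threshold


def select_users(users, filters):
    """Keep users accepted by at least one filter (an assumed combination rule)."""
    return [user for user in users if any(f(user) for f in filters)]
```

Under this design, refocusing the crawl on another community or on a topic would amount to passing a different list of filter functions to `select_users`.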
The platform also includes data-processing modules (social network and text parsers) that parse the tweet text and the lists of followers and generate:
1) network representations of explicit social networks (i.e., networks of followers and friends) and implicit social networks (i.e., networks representing reply-to, mention and retweet activity);
2) network representations of #hashtags and URLs usage patterns.
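The two kinds of output listed above can be sketched as edge lists. The functions below are illustrative assumptions, not the platform's parsers: tweets are assumed to arrive as `(author, text)` pairs and follower lists as a dictionary, and edges are kept in plain sets and counters rather than in any particular graph format.

```python
import re
from collections import Counter

MENTION_RE = re.compile(r"(?<![A-Za-z0-9_@])@([A-Za-z0-9_]{1,15})")
HASHTAG_RE = re.compile(r"#(\w+)")


def follower_edges(followers_by_user):
    """Explicit network: one directed edge (follower, followed) per relation."""
    return {
        (follower, user)
        for user, followers in followers_by_user.items()
        for follower in followers
    }


def interaction_edges(tweets):
    """Implicit network: weighted edges (author, referenced user).

    Replies, mentions and retweets all surface as "@name" in the text,
    so this sketch counts them together.
    """
    edges = Counter()
    for author, text in tweets:
        for target in MENTION_RE.findall(text):
            edges[(author.lower(), target.lower())] += 1
    return edges


def hashtag_usage(tweets):
    """Usage pattern: how often each author used each hashtag."""
    usage = Counter()
    for author, text in tweets:
        for tag in HASHTAG_RE.findall(text):
            usage[(author.lower(), tag.lower())] += 1
    return usage
```

A URL usage network could be built the same way as `hashtag_usage`, with a URL pattern in place of the hashtag pattern.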
