Recently, I did a lot of work with Nokogiri to scrape web pages. However, I also found Nokogiri very useful for generating XML documents.

In this post, I will show how to wrap data in a <![CDATA[]]> section when generating an XML document using Nokogiri. This is pretty simple to do, but I didn’t find any documentation explaining it, so hopefully this will save someone else a few hours of hacking on this simple problem.

As a simple example, let’s say that we want to describe a sandwich as an XML document that we generate with Nokogiri. Here’s the code:

require 'rubygems'
require 'nokogiri'

builder = Nokogiri::XML::Builder.new(:encoding => 'utf-8') do |xml|

  xml.send(:sandwich, :xmlns => "http://www.davidrenz.net"){

    xml.item(:type => :bread){
      xml.attribute(){
        xml.cdata "Whole Grain Wheat"
      }
    }

    xml.item(:type => :meat){
      xml.attribute(){
        xml.cdata "Mesquite Tukey Breast"
      }
    } 

    xml.item(:type => :cheese){
      xml.attribute(){
        xml.cdata "Munster"
      }
    }

    xml.item(:type => :condiment){
      xml.attribute(){
        xml.cdata "Tabasco Chipotle Sauce"
      }
    }     

    xml.item(:type => :topping){
      xml.attribute(){
        xml.cdata "Baby Spinach"
      }
    }

    xml.item(:type => :topping){
      xml.attribute{
        xml.cdata "Sliced Avocado"
      }
    }    

    xml.item(:type => :topping){
      xml.attribute("type"){
        xml.cdata "Tomato"
      }
    }

  }
end

Notice that all you need to do to create a CDATA section is to pass a string to the cdata method on the Builder object.

This is the XML generated by calling to_xml on the Builder object:


<?xml version="1.0" encoding="utf-8"?>
<sandwich xmlns="http://www.davidrenz.net">
  <item type="bread">
    <attribute><![CDATA[Whole Grain Wheat]]></attribute>
  </item>
  <item type="meat">
    <attribute><![CDATA[Mesquite Tukey Breast]]></attribute>
  </item>
  <item type="cheese">
    <attribute><![CDATA[Munster]]></attribute>
  </item>
  <item type="condiment">
    <attribute><![CDATA[Tabasco Chipotle Sauce]]></attribute>
  </item>
  <item type="topping">
    <attribute><![CDATA[Baby Spinach]]></attribute>
  </item>
  <item type="topping">
    <attribute><![CDATA[Sliced Avocado]]></attribute>
  </item>
  <item type="topping">
    <attribute>type<![CDATA[Tomato]]></attribute>
  </item>
</sandwich>


In this post, I provide an example of how to use the Twitter4J library to pull tweets that contain specific keywords or that come from specific users. While this is a very simple example, I hope it helps newcomers to Twitter application development, because the simple functionality provided here is the basis for creating many interesting Twitter-related applications.

This example can also be run as-is by a non-developer to simply pull Twitter data into a text file.

The code for this example is posted on GitHub at https://github.com/drenz/TwitterFilterStreamParser

Running the Example

Install the required software to download and run the example

Clone the git repository

  • git clone git@github.com:drenz/TwitterFilterStreamParser.git

Update property files

  • Update properties/runtime.properties file to supply your Twitter screen name and password. Twitter requires authentication before opening a data stream.
  • Update the properties/keywords file to specify which keywords you’d like to receive tweets about. For each keyword you would like to track on Twitter, add the keyword on its own line. NOTE: keywords cannot have spaces.
  • Update the properties/users file to specify which Twitter users you’d like to receive tweets from. For each user you would like to track on Twitter, add the user’s Twitter ID number on its own line. You can find a user’s Twitter ID given their screen name at http://www.idfromuser.com/

Run the application

  • In the project’s base directory, execute: ant run

Next Steps

While this example is functional, there are many things that should be done to create a solid application based on Twitter data.  For instance, DontTweetThat (http://www.donttweetthat.com) uses the same basic principle to collect the tweets that it displays, but is refined to be robust, scalable and flexible.

Here is a brief list of things you may want to do while expanding on this example:

  • Handle exceptions that arise from potential network issues, Twitter hiccups or other problems that would stop the parser from collecting data
  • Store tweets in a database rather than a simple text file
  • Once a tweet is received, do some processing to filter out tweets that may not be of any value
  • Do data analysis on tweets you’ve stored
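
As one illustration of the first item, the usual way to survive a dropped connection is to retry with an increasing delay. The example project is written in Java, so treat the following as a language-neutral sketch written in Ruby for brevity. The with_retries helper, its parameters, and the simulated failure are all hypothetical and do not come from Twitter4J or the example repository.

```ruby
# Retry a block with exponential backoff. `sleeper` is injectable so the
# helper can be exercised without actually waiting.
def with_retries(max_attempts: 5, base_delay: 1.0, sleeper: ->(seconds) { sleep(seconds) })
  attempt = 0
  begin
    attempt += 1
    yield attempt
  rescue StandardError => e
    raise if attempt >= max_attempts          # give up and re-raise the last error
    delay = base_delay * (2**(attempt - 1))   # 1s, 2s, 4s, ...
    warn "attempt #{attempt} failed (#{e.class}: #{e.message}); retrying in #{delay}s"
    sleeper.call(delay)
    retry
  end
end

# Simulated stream that fails twice before connecting.
delays = []
result = with_retries(max_attempts: 4, sleeper: ->(seconds) { delays << seconds }) do |attempt|
  raise IOError, 'stream dropped' if attempt < 3
  "connected on attempt #{attempt}"
end
puts result
puts delays.inspect
```

A real stream reader would also want an upper bound on the delay and would follow Twitter's own reconnection guidance, which this sketch does not attempt to encode.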

I intend to do posts on some of these improvements, so stay tuned!

Helpful Resources

Twitter4J: Great documentation and more examples for using the Twitter4J library
Twitter Developers site: Lots of info on the various ways to access Twitter’s data and best practices for implementing Twitter-based applications.


Displaying a random selection of items from a large set is common in many web applications. For example, a site may show a random selection of 10 tweets from a database of 10 million, or Wikipedia may display a list of 5 random articles from their collection of 20 million. There are multiple ways to accomplish this, but I will walk through one relatively efficient way to solve this problem in Ruby on Rails.

Consider a web application which displays tweets collected over some period of time. A user should be able to browse these tweets, 10 at a time, in some random order.

With MySQL, placing RAND() in the ORDER BY clause returns rows in a random order. However, this is extremely inefficient in applications with large data sets and lots of users, because the database must generate a random value for every row and sort the whole result before returning even a few rows.

A more time-efficient solution is to add a field to each row that contains a random number, and to index that field. When selecting tweets from the database, first generate a random number to use as a starting point in the random column. Then select tweets where the random field is greater than or equal to the chosen random number, ordered by the random field. This select statement efficiently returns tweets in a random order.
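
Before moving to Rails, the idea can be sketched in plain Ruby with an array standing in for the table. Everything here (the Row struct, RANDOM_RANGE, and random_batch) is a hypothetical stand-in for illustration. In the real application the database index does the filtering and ordering.

```ruby
# In-memory stand-in for a table row; `randomize` plays the role of the
# indexed random column.
Row = Struct.new(:id, :randomize)

RANDOM_RANGE = 1_000_000 # assumed range of the random column

# Pick a random pivot, keep rows at or above it, order by the random column,
# and take the first batch_size rows. Retry when the pivot lands too close to
# the top of the range to fill a batch.
def random_batch(rows, batch_size, rng = Random.new)
  raise ArgumentError, 'not enough rows' if rows.size < batch_size

  loop do
    pivot = rng.rand(RANDOM_RANGE) + 1
    batch = rows.select { |row| row.randomize >= pivot }
                .sort_by(&:randomize)
                .first(batch_size)
    return batch if batch.size == batch_size
  end
end

rng = Random.new(42)
rows = (1..1_000).map { |id| Row.new(id, rng.rand(RANDOM_RANGE) + 1) }
puts random_batch(rows, 10, rng).map(&:id).inspect
```

Each call lands on a different stretch of the random ordering, so successive batches look unrelated even though the stored random values do not change between calls.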

Below, I show the steps to create a very simple Ruby on Rails application that exhibits this technique. A working demonstration can be found in my GitHub repository: drenz/RandomRow

First create a new RoR project:

$ rails new RandomRow

Configure the project to use MySQL. (This is outside the scope of this article)

Generate a Tweet model and the associated controller and views:

$ rails g scaffold Tweet tweet_id:decimal body:string user_id:integer user_screen_name:string profile_image_url:string randomize:integer

Modify the migration file to add an index for randomize:

class CreateTweets < ActiveRecord::Migration
  def self.up
    create_table :tweets, :id => false do |t|
      t.decimal :tweet_id, :scale => 0, :null => false, :precision => 20, :primary_key => true
      t.string :body
      t.integer :user_id
      t.string :user_screen_name
      t.string :profile_image_url
      t.integer :randomize, :default => 0

      t.timestamps
    end
    add_index :tweets, :randomize
  end

  def self.down
    remove_index :tweets, :randomize
    drop_table :tweets
  end
end

Modify the Tweet model to use tweet_id as the primary key:

class Tweet < ActiveRecord::Base
    self.primary_key = :tweet_id
end

Create the database:

$ rake db:setup RAILS_ENV="development"

At this point, you could populate this skeleton application with tweets by inserting them directly into the database. (See dev.twitter.com for more info.)

I’ve included an sql file (db/tweets.sql) containing sample tweets to test with. Load them with the following command:

$ mysql -u root -p -D RandomRow_development < db/tweets.sql

Modify the tweets_controller index action:

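num_to_display = 10

# Repeat until the random starting point leaves at least num_to_display rows
# at or above it. Note: this never terminates if the table holds fewer than
# num_to_display tweets.
begin
  rand_number = rand(1000000) + 1

  # Bind rand_number as a query parameter instead of concatenating it into SQL
  @tweets = Tweet.order('randomize').limit(num_to_display).where('randomize >= ?', rand_number)

end while @tweets.size < num_to_display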

Modify the view to add a pagination link above the “New Tweet” link:

<%= link_to 'More >>', tweets_path %>

Now when a user clicks “More,” 10 randomly selected tweets are displayed.

When operationalizing this technique, there are a few things that must also be done:

  • On insert, generate a random number within the correct range and insert it with the row.
  • Periodically, update the random field on each row with a new random number. This will make sure that users don’t start to see patterns in your randomness.
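
Both chores can share one small helper. The sketch below is only an assumption about how this might be wired up. The RandomKey module is hypothetical, and the Rails lines are shown as comments because they need a running application and database.

```ruby
# Hypothetical helper for the two maintenance tasks above.
module RandomKey
  RANGE = 1_000_000 # assumed to match the range used by the index action

  # Value for the randomize column, between 1 and RANGE inclusive.
  def self.generate(rng = Random)
    rng.rand(RANGE) + 1
  end
end

puts RandomKey.generate

# Possible wiring in the Rails app (assumptions, not run here):
#
#   class Tweet < ActiveRecord::Base
#     self.primary_key = :tweet_id
#     before_create { |tweet| tweet.randomize = RandomKey.generate }
#   end
#
#   # Periodic reshuffle in a single statement; MySQL evaluates RAND() per row.
#   Tweet.update_all('randomize = FLOOR(1 + RAND() * 1000000)')
```

Running the reshuffle from a scheduled job keeps it out of the request path. How often to run it depends on how quickly users would notice repeated sequences.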


Memory Resource Management in VMware

October 27, 2010

Advanced memory resource management is one of the key features that sets VMware apart from its competitors.  However, understanding these techniques can be like piecing together a jigsaw puzzle. Fortunately, the information that you need is already available on the web.  In this post, I link to some of the best resources to help you quickly understand VMware memory management [...]


Opt-in Telepathy

August 6, 2010

It’s definitely cliché to say that the Internet has changed everything, but it’s still fascinating to think about how these changes manifest themselves in the real world and are driving different behaviors and changing personal relationships. For example, the gradual improvements in communication technology have given us real-time access to the intentionally (and sometimes unintentionally) [...]


Beginning iPhone Development

July 5, 2010

I recently started doing iPhone development and thought I would share some of the resources that I’ve found to be useful so far: Apple Development Center http://developer.apple.com Download the Apple development tools and sign up for an Apple Developer ID. Stanford course iPhone Development search “iphone application development” on iTunes This is the complete series [...]


Commoditize the Stack, Create More Value

July 2, 2010

Before the 1980s, computer hardware manufacturers controlled the computing industry; hardware was valuable and the software that ran on it was typically included for free. Given this environment, companies like IBM made all of the money because they produced the hardware and created the most value. When IBM released the “personal computer,” in 1981, a [...]
