Testing Oracle RAC connections with Erlang

I had an interesting task the other day. There is a new database installation where I work: two equally balanced servers with Oracle RAC installed on them.

The DBA did the installation with some Oracle consultants, but the system seemed to behave rather bizarrely. Some connections appeared to get lost while others went through. We wanted to gather more detailed data in order to work out what percentage of connections were getting lost.

I chose to write an Erlang application that made concurrent connections to the RAC. First I tried running many simultaneous Erlang processes that connected to the RAC through ODBC. There was only one Erlang node, and I soon realised that it wasn't parallelising the connections. Apparently, there is only one ODBC server running per Windows process (or per Erlang node, to put it another way). Thus, even if I spawned many functions inside one node, there was only one ODBC server answering all of them.

To overcome this restriction I wrote a function that started many Erlang nodes using Windows system calls (I was forced to use Windows). Each of these nodes would then run several connections. I observed an increase in the number of sessions and in the number of ODBC servers being created. Below is the piece of code that starts the nodes on a Windows box.

handle_cast({start_nodes,N},State) when is_integer(N)->
    {ok,Host}=inet:gethostname(),
    lists:foreach(
      fun(I)->
          %% Launch a detached Erlang VM named stress<I>, in its own process
          %% so this loop does not wait on os:cmd/1.
          spawn(fun()-> os:cmd("\"C:/path/to/erlang/erl.exe\" -sname stress"++integer_to_list(I)++" -setcookie einw -detached") end),
          Node="stress"++integer_to_list(I)++"@"++Host,
          %% Give the new VM 4 seconds to boot before calling it.
          receive
          after 4000->ok
          end,
          rpc:call(list_to_atom(Node),stress_server,start_link,[])
      end, lists:seq(1,N)),
    io:format("Nodes started!~n",[]),
    {noreply, State};

Note the 4-second wait after launching each node and before calling it. That is enough time for the Erlang VM to start.
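A fixed wait is simple, but it could also be replaced by polling the new node until it answers. The following is only a sketch of that alternative, not the code I ran: the module name node_wait and the function wait_for_node/3 are made up for illustration, and it relies only on net_adm:ping/1 and timer:sleep/1 from the standard library.

```erlang
-module(node_wait).
-export([wait_for_node/3]).

%% Hypothetical helper (not part of the original application).
%% Polls Node with net_adm:ping/1 until it answers pong, trying at most
%% Retries times and sleeping IntervalMs milliseconds after each failure.
wait_for_node(_Node, 0, _IntervalMs) ->
    {error, timeout};
wait_for_node(Node, Retries, IntervalMs) when is_atom(Node), Retries > 0 ->
    case net_adm:ping(Node) of
        pong ->
            ok;
        pang ->
            timer:sleep(IntervalMs),
            wait_for_node(Node, Retries - 1, IntervalMs)
    end.
```

With something like this, the receive ... after 4000 block could become a call such as node_wait:wait_for_node(list_to_atom(Node), 20, 200), which gives up after roughly the same 4 seconds but returns earlier when the VM boots quickly.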

Each of these nodes has a gen_server waiting to receive requests. For every request, it executes a SQL query, waits a bit, executes another query and then returns.

handle_cast({test1,From,Number}, State) when is_integer(Number)->
    %% Run connect/0 Number times, timing each call
    %% (timer:tc/3 reports the elapsed time in microseconds).
    L=lists:map(
        fun(_)->
            {Time,{Value1,Value2}}=timer:tc(?MODULE,connect,[]),
            {Time,Value1,Value2,node()}
        end,lists:seq(1,Number)),
    From ! L,
    {noreply, State};
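To make the shape of each result tuple clearer, here is a small self-contained sketch of the timing step. The real connect/0 talks to the database and is not shown in this post, so fake_connect/0 below is a stand-in that just sleeps and returns a made-up {Instance, Sessions} pair; the module and function names are hypothetical.

```erlang
-module(timing_sketch).
-export([timed_sample/1, fake_connect/0]).

%% Stand-in for the real connect/0 (assumption: it returns a pair).
%% It only sleeps for 20 ms and returns invented values.
fake_connect() ->
    timer:sleep(20),
    {"instance1", 42}.

%% Time one call of Fun with timer:tc/1 and tag the result with the
%% node that ran it. The elapsed time is in microseconds.
timed_sample(Fun) when is_function(Fun, 0) ->
    {Micros, {Instance, Sessions}} = timer:tc(Fun),
    {Micros, Instance, Sessions, node()}.
```

Calling timing_sketch:timed_sample(fun timing_sketch:fake_connect/0) yields a four-element tuple in the same layout as the entries of L above, with the first element being the elapsed time in microseconds.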

Finally, I just needed another node that called all the started nodes and gathered the results. I started the nodes on two servers, and the code below is customised for that specific scenario.

test2()->
    NumberOfNodes=10, %per machine
    NumberOfTrials=80,
    SleepTime=1000,
    ListOfNode1=lists:map(fun(I)-> "stress"++integer_to_list(I)++"@server1" end,lists:seq(1,NumberOfNodes)),
    ListOfNode2=lists:map(fun(I)-> "stress"++integer_to_list(I)++"@server2" end,lists:seq(1,NumberOfNodes)),
    ListOfNode=lists:append(ListOfNode1,ListOfNode2),
    FileName="name of the file to gather results",
    %% Send one request to every node, NumberOfTrials times over.
    lists:foreach(fun(_)->
        lists:foreach(fun(Node)->gen_server:cast({stress_server,list_to_atom(Node)},{test2,self(),SleepTime}) end,ListOfNode)
    end,lists:seq(1,NumberOfTrials)),
    {ok,File}=file:open(FileName,[write]),
    io:format(File,"time\tinstance\tsessions\tnode~n",[]),
    %% Wait for one reply per request sent, then write them all out.
    Result=loop(length(ListOfNode) * NumberOfTrials,[]),
    lists:foreach(fun({T,I,S,N})->
        io:format(File,"~p\t~s\t\t~p\t~p~n",[T,I,S,N])
    end,Result),
    file:close(File).

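%% Collect one reply per outstanding request. Give up and return what
%% has arrived so far if nothing comes in for 5 minutes.
loop(0,Acc)->
    lists:flatten(Acc);
loop(Processes,Acc)->
    receive
        L->
            loop(Processes-1,[L|Acc])
    after 300000->
        lists:flatten(Acc)
    end.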

Sadly, the results were very confusing. Depending on the Oracle TNS settings, we were losing different percentages of connections. However, there was a point at about 500 connections per node at which the error rate increased. I suspect this was due to my machine not being able to handle any more connections. The only way to test this properly is with many computers, though. For once I feel the limit was imposed by the number of computers rather than by the specs of one machine.
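For what it's worth, the percentage of lost connections can be computed from the gathered result list with something like the sketch below. It assumes a failed attempt is recorded with the atom error in the instance position of the result tuple; that convention is my assumption for this example, as are the module and function names.

```erlang
-module(loss_stats).
-export([loss_percentage/1]).

%% Percentage of result tuples {Time, Instance, Sessions, Node} whose
%% Instance field is the atom error (assumed marker for a lost
%% connection). An empty list counts as no loss.
loss_percentage([]) ->
    0.0;
loss_percentage(Results) ->
    Failed = length([1 || {_Time, error, _Sessions, _Node} <- Results]),
    100 * Failed / length(Results).
```

For example, a list of four results in which one has error as its instance gives a loss of 25 percent.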
