# Mbox Parsing
```elixir
Mix.install([
  {:explorer, "~> 0.10.1"},
  {:kino, "~> 0.15.3"},
  {:kino_explorer, "~> 0.1.24"}
])
```
## Analyze your mailbox to find the junk!
[Run in Livebook](https://livebook.dev/run?url=https%3A%2F%2Fgist.github.com%2Fpetermueller%2Fa664ef33f38cb2726bf3e0239798beb7)
Go to Google Takeout and start an export of your mail.
The export takes a while to prepare.
Once you have downloaded and unzipped/untarred the archive, update the `path` variable below so it points at the `.mbox` file.
This Livebook is not particularly efficient, and a large `.mbox` file can cause timeouts, so start with a small count in the `Enum.take` calls below and increase it gradually.
```elixir
import Kino.Shorts
alias Explorer.DataFrame, as: DF
:ok
```
```elixir
path =
  "~/Documents/Takeout/Mail/All mail Including Spam and Trash.mbox"
  |> Path.expand()

# Just to confirm it's working :)
first_few =
  path
  |> File.stream!()
  |> Stream.map(&String.trim/1)
  |> Enum.take(10)

tree(first_few)
```
```elixir
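# Each message in an mbox file starts with a separator line beginning with "From ".
chunk_fun = fn
  # First separator seen: start collecting lines for the first message.
  <<"From ", _rest::binary>> = line, [] ->
    {:cont, [line]}

  # Next separator: emit the finished message and start a new one.
  <<"From ", _rest::binary>> = line, acc ->
    {:cont, Enum.reverse(acc), [line]}

  # Any other line belongs to the current message (prepended, reversed on emit).
  line, acc ->
    {:cont, [line | acc]}
end

# Called once the file ends, to flush the last message.
after_fun = fn
  [] -> raise "Won't happen, but let's not hang if we mess up"
  # Only a separator line is left: nothing useful to emit.
  [<<"From ", _rest::binary>> = line] -> {:cont, [line]}
  acc -> {:cont, Enum.reverse(acc), []}
end

stream =
  File.stream!(path)
  |> Stream.map(&String.trim_trailing(&1, "\n"))
  |> Stream.chunk_while([], chunk_fun, after_fun)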
```
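To see what the chunking step produces without touching the real export, here is a minimal sketch that runs the same idea over an in-memory list. The sample lines and the `demo_*` names are made up for illustration, and the end-of-input function is simplified to flush whatever is left rather than raise.

```elixir
# Hypothetical sample: two messages, as they would look after trimming newlines.
sample_lines = [
  "From 111@xxx Mon Jan 01 00:00:00 +0000 2024",
  "From: Alice <alice@example.com>",
  "Subject: Hello",
  "",
  "Body of the first message",
  "From 222@xxx Tue Jan 02 00:00:00 +0000 2024",
  "From: Bob <bob@example.com>",
  "Subject: Re: Hello"
]

# Start a new chunk at every "From " separator; emit the previous chunk if there is one.
demo_chunk_fun = fn
  "From " <> _ = line, [] -> {:cont, [line]}
  "From " <> _ = line, acc -> {:cont, Enum.reverse(acc), [line]}
  line, acc -> {:cont, [line | acc]}
end

# Simplified flush (an assumption, not the notebook's version): emit what is left, if anything.
demo_after_fun = fn
  [] -> {:cont, []}
  acc -> {:cont, Enum.reverse(acc), []}
end

demo_messages =
  sample_lines
  |> Stream.chunk_while([], demo_chunk_fun, demo_after_fun)
  |> Enum.to_list()

length(demo_messages)
```

Note that `"From: Alice ..."` does not match the separator pattern, because the separator requires a space right after `From`. Each element of `demo_messages` is the list of lines for one message, in their original order.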
```elixir
# Every row gets all four keys, so messages missing a header still line up as columns.
empty_msg_map = Map.from_keys([:delivered_to, :from, :to, :subject], nil)

# Turn a line into a one-element keyword list if it is a header we care about.
lines_to_keep = fn
  <<"From ", _rest::binary>> -> []
  <<"Delivered-To: ", rest::binary>> -> [delivered_to: rest]
  <<"From: ", rest::binary>> -> [from: rest]
  <<"To: ", rest::binary>> -> [to: rest]
  <<"Subject: ", rest::binary>> -> [subject: rest]
  _ -> []
end

formatted_stream =
  stream
  # One keyword list of matched headers per message.
  |> Stream.map(&Enum.flat_map(&1, lines_to_keep))
  |> Stream.map(&Enum.into(&1, empty_msg_map))

df =
  formatted_stream
  # Raise this number gradually; large values can time out.
  |> Enum.take(4000)
  |> DF.new()
```
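The cell above checks every line of a message, so a body line that happens to start with `Subject: ` or `From: ` replaces the real header value (later keys win in `Enum.into/2`). One possible variation, sketched here on a made-up message with hypothetical `demo_*` names, is to stop scanning at the first blank line, which is where the headers end.

```elixir
# Hypothetical message, already split into lines.
sample_message = [
  "From 111@xxx Mon Jan 01 00:00:00 +0000 2024",
  "Delivered-To: me@example.com",
  "From: Shop <news@shop.example>",
  "To: me@example.com",
  "Subject: Weekly deals",
  "",
  "Subject: this line is body text, not a header"
]

# Same header matching idea as above, written with string-prefix patterns.
demo_header = fn
  "Delivered-To: " <> rest -> [delivered_to: rest]
  "From: " <> rest -> [from: rest]
  "To: " <> rest -> [to: rest]
  "Subject: " <> rest -> [subject: rest]
  _ -> []
end

demo_empty = Map.from_keys([:delivered_to, :from, :to, :subject], nil)

demo_row =
  sample_message
  # Headers end at the first blank line, so ignore everything after it.
  |> Enum.take_while(&(&1 != ""))
  |> Enum.flat_map(demo_header)
  |> Enum.into(demo_empty)
```

This sketch still ignores folded (multi-line) headers and header names in a different letter case, so treat it as a starting point rather than a complete parser.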
<!-- livebook:{"attrs":"eyJhc3NpZ25fdG8iOm51bGwsImNvbGxlY3QiOmZhbHNlLCJkYXRhX2ZyYW1lIjoiZGYiLCJkYXRhX2ZyYW1lX2FsaWFzIjoiRWxpeGlyLkRGIiwiaXNfZGF0YV9mcmFtZSI6dHJ1ZSwibWlzc2luZ19yZXF1aXJlIjoiRWxpeGlyLkV4cGxvcmVyLkRhdGFGcmFtZSIsIm9wZXJhdGlvbnMiOlt7ImFjdGl2ZSI6dHJ1ZSwiY29sdW1ucyI6WyJmcm9tIl0sImRhdGFfb3B0aW9ucyI6eyJkZWxpdmVyZWRfdG8iOiJzdHJpbmciLCJmcm9tIjoic3RyaW5nIiwic3ViamVjdCI6InN0cmluZyIsInRvIjoic3RyaW5nIn0sIm9wZXJhdGlvbl90eXBlIjoiZ3JvdXBfYnkifSx7ImFjdGl2ZSI6dHJ1ZSwiY29sdW1ucyI6WyJmcm9tIl0sImRhdGFfb3B0aW9ucyI6eyJkZWxpdmVyZWRfdG8iOiJzdHJpbmciLCJmcm9tIjoic3RyaW5nIiwic3ViamVjdCI6InN0cmluZyIsInRvIjoic3RyaW5nIn0sIm9wZXJhdGlvbl90eXBlIjoic3VtbWFyaXNlIiwicXVlcnkiOiJjb3VudCJ9LHsiYWN0aXZlIjp0cnVlLCJkYXRhX29wdGlvbnMiOnsiZnJvbSI6InN0cmluZyIsImZyb21fY291bnQiOiJpbnRlZ2VyIn0sImRpcmVjdGlvbiI6ImRlc2MiLCJvcGVyYXRpb25fdHlwZSI6InNvcnRpbmciLCJzb3J0X2J5IjoiZnJvbV9jb3VudCJ9XX0","chunks":null,"kind":"Elixir.KinoExplorer.DataTransformCell","livebook_object":"smart_cell"} -->
```elixir
require Explorer.DataFrame
df
|> DF.lazy()
|> DF.group_by("from")
|> DF.summarise(from_count: count(from))
|> DF.sort_by(desc: from_count)
```
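Grouping on the raw `from` value treats `Shop <news@shop.example>` and `Shop Deals <news@shop.example>` as different senders. As a rough sketch of one way around that, the cell below counts by address only, using plain `Enum` functions on made-up rows. The `demo_*` names and the regular expression are illustrative assumptions, not part of the smart cell's generated code.

```elixir
# Hypothetical rows shaped like the maps passed to DF.new/1 earlier.
demo_rows = [
  %{from: "Shop <news@shop.example>"},
  %{from: "Shop Deals <news@shop.example>"},
  %{from: "alice@example.com"},
  %{from: nil}
]

# Pull the address out of "Name <address>"; fall back to the trimmed raw value.
demo_address = fn
  nil ->
    nil

  from ->
    case Regex.run(~r/<([^>]+)>/, from) do
      [_, address] -> String.downcase(address)
      nil -> from |> String.trim() |> String.downcase()
    end
end

demo_counts =
  demo_rows
  |> Enum.map(&demo_address.(&1.from))
  # Messages without a From header carry no sender to count.
  |> Enum.reject(&is_nil/1)
  |> Enum.frequencies()
  |> Enum.sort_by(fn {_address, count} -> count end, :desc)
```

The result is a list of `{address, count}` tuples, most frequent sender first.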